24 Mar, 2019

1 commit

  • commit fc8efd2ddfed3f343c11b693e87140ff358d7ff5 upstream.

    LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
    This is a stress test, where one thread mmaps/writes/munmaps a memory
    area while another thread tries to read from it:

    CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
    Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
    Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
    Call Trace:
     lock_acquire+0xec/0x258
     _raw_spin_lock_bh+0x5c/0x98
     page_table_free+0x48/0x1a8
     do_fault+0xdc/0x670
     __handle_mm_fault+0x416/0x5f0
     handle_mm_fault+0x1b0/0x320
     do_dat_exception+0x19c/0x2c8
     pgm_check_handler+0x19e/0x200

    page_table_free() is called with NULL mm parameter, but because "0" is a
    valid address on s390 (see S390_lowcore), it keeps going until it
    eventually crashes in lockdep's lock_acquire. This crash is
    reproducible at least since 4.14.

    The problem is that "vmf->vma" used in do_fault() can become stale.
    Because mmap_sem may be released, other threads can come in, call
    munmap() and cause "vma" to be returned to the kmem cache, where it gets
    zeroed/re-initialized and re-used:

    handle_mm_fault                   |
      __handle_mm_fault               |
        do_fault                      |
          vma = vmf->vma              |
          do_read_fault               |
            __do_fault                |
              vma->vm_ops->fault(vmf);|
              mmap_sem is released    |
                                      |
                                      | do_munmap()
                                      |   remove_vma_list()
                                      |     remove_vma()
                                      |       vm_area_free()
                                      |         # vma is released
                                      | ...
                                      | # same vma is allocated
                                      | # from kmem cache
                                      | do_mmap()
                                      |   vm_area_alloc()
                                      |     memset(vma, 0, ...)
                                      |
          pte_free(vma->vm_mm, ...);  |
            page_table_free           |
              spin_lock_bh(&mm->context.lock);
                                      |

    Cache mm_struct to avoid using potentially stale "vma".

    [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
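
    A minimal sketch of the fix pattern described above (hedged; the real
    do_fault() in mm/memory.c dispatches to do_read_fault()/do_cow_fault()/
    do_shared_fault(), which is elided here): snapshot vma->vm_mm before the
    fault handler can drop mmap_sem, and only use the cached pointer
    afterwards.

        static vm_fault_t do_fault(struct vm_fault *vmf)
        {
                struct vm_area_struct *vma = vmf->vma;
                struct mm_struct *vm_mm = vma->vm_mm; /* cached before mmap_sem may be dropped */
                vm_fault_t ret;

                /* dispatch (read fault shown as the example path) */
                ret = do_read_fault(vmf);

                /* The preallocated page table was not used: free it. */
                if (vmf->prealloc_pte) {
                        pte_free(vm_mm, vmf->prealloc_pte); /* not vma->vm_mm: vma may be stale */
                        vmf->prealloc_pte = NULL;
                }
                return ret;
        }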

    Link: http://lkml.kernel.org/r/5b3fdf19e2a5be460a384b936f5b56e13733f1b8.1551595137.git.jstancek@redhat.com
    Signed-off-by: Jan Stancek
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Matthew Wilcox
    Acked-by: Rafael Aquini
    Reviewed-by: Minchan Kim
    Acked-by: Kirill A. Shutemov
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Huang Ying
    Cc: Souptick Joarder
    Cc: Jerome Glisse
    Cc: Aneesh Kumar K.V
    Cc: David Hildenbrand
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jan Stancek
     

17 Jan, 2019

1 commit

  • commit 63f3655f950186752236bb88a22f8252c11ce394 upstream.

    Liu Bo has experienced a deadlock between memcg (legacy) reclaim and
    ext4 writeback:

    task1:
    wait_on_page_bit+0x82/0xa0
    shrink_page_list+0x907/0x960
    shrink_inactive_list+0x2c7/0x680
    shrink_node_memcg+0x404/0x830
    shrink_node+0xd8/0x300
    do_try_to_free_pages+0x10d/0x330
    try_to_free_mem_cgroup_pages+0xd5/0x1b0
    try_charge+0x14d/0x720
    memcg_kmem_charge_memcg+0x3c/0xa0
    memcg_kmem_charge+0x7e/0xd0
    __alloc_pages_nodemask+0x178/0x260
    alloc_pages_current+0x95/0x140
    pte_alloc_one+0x17/0x40
    __pte_alloc+0x1e/0x110
    alloc_set_pte+0x5fe/0xc20
    do_fault+0x103/0x970
    handle_mm_fault+0x61e/0xd10
    __do_page_fault+0x252/0x4d0
    do_page_fault+0x30/0x80
    page_fault+0x28/0x30

    task2:
    __lock_page+0x86/0xa0
    mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
    ext4_writepages+0x479/0xd60
    do_writepages+0x1e/0x30
    __writeback_single_inode+0x45/0x320
    writeback_sb_inodes+0x272/0x600
    __writeback_inodes_wb+0x92/0xc0
    wb_writeback+0x268/0x300
    wb_workfn+0xb4/0x390
    process_one_work+0x189/0x420
    worker_thread+0x4e/0x4b0
    kthread+0xe6/0x100
    ret_from_fork+0x41/0x50

    He adds:
    "task1 is waiting for the PageWriteback bit of the page that task2 has
    collected in mpd->io_submit->io_bio, and task2 is waiting for the
    LOCKED bit of the page which task1 has locked"

    More precisely, task1 is handling a page fault and has a page locked
    while it charges a new page table to a memcg. That in turn hits a
    memory limit reclaim, and the memcg reclaim for the legacy controller
    waits on the writeback, but that is never going to finish because the
    writeback itself is waiting for the page locked in the #PF path. So
    this is essentially an ABBA deadlock:

    lock_page(A)
    SetPageWriteback(A)
    unlock_page(A)
                              lock_page(B)
    lock_page(B)
    pte_alloc_pne
      shrink_page_list
        wait_on_page_writeback(A)
                              SetPageWriteback(B)
                              unlock_page(B)

                              # flush A, B to clear the writeback

    This accumulation of more pages to flush is used by several filesystems
    to generate more optimal IO patterns.

    Waiting for the writeback in the legacy memcg controller is a workaround
    for premature OOM killer invocations because there is no dirty IO
    throttling available for the controller. There is no easy way around
    that unfortunately. Therefore fix this specific issue by pre-allocating
    the page table outside of the page lock. We already have handy
    infrastructure for that, so simply reuse the fault-around pattern, which
    already does this.
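
    A hedged sketch of the resulting shape of __do_fault() (names and
    placement follow mm/memory.c of that era; error handling elided): the
    page table is preallocated before ->fault() takes the page lock, so the
    __GFP_ACCOUNT allocation can no longer wait on writeback while a locked
    page blocks the flusher.

        static vm_fault_t __do_fault(struct vm_fault *vmf)
        {
                struct vm_area_struct *vma = vmf->vma;
                vm_fault_t ret;

                /*
                 * Preallocate the pte before taking the page lock inside
                 * ->fault(), so any memcg reclaim triggered by this charged
                 * allocation cannot wait for writeback we would deadlock on.
                 */
                if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
                        vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm,
                                                          vmf->address);
                        if (!vmf->prealloc_pte)
                                return VM_FAULT_OOM;
                        smp_wmb(); /* See comment in __pte_alloc() */
                }

                ret = vma->vm_ops->fault(vmf);
                /* ... retry/locked-page checks elided ... */
                return ret;
        }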

    There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
    made from under a locked fs page, but they should be really rare. I am
    not aware of a better solution unfortunately.

    [akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
    [akpm@linux-foundation.org: coding-style fixes]
    [mhocko@kernel.org: enhance comment, per Johannes]
    Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
    Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
    Signed-off-by: Michal Hocko
    Reported-by: Liu Bo
    Debugged-by: Liu Bo
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Reviewed-by: Liu Bo
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Cc: Vladimir Davydov
    Cc: Shakeel Butt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

01 Dec, 2018

1 commit

  • commit ff09d7ec9786be4ad7589aa987d7dc66e2dd9160 upstream.

    We clear the pte temporarily during read/modify/write update of the pte.
    If we take a page fault while the pte is cleared, the application can get
    SIGBUS. One such case is with remap_pfn_range without a backing
    vm_ops->fault callback. do_fault will return SIGBUS in that case.

    cpu 0                                    cpu 1
    mprotect()
    ptep_modify_prot_start()/pte cleared.
    .
    .                                        page fault.
    .
    .
    ptep_modify_prot_commit()

    Fix this by taking page table lock and rechecking for pte_none.
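
    A hedged sketch of the recheck, as done in do_fault() for a VMA without
    a ->fault handler (trimmed from the upstream change):

        if (!vma->vm_ops->fault) {
                if (unlikely(!pmd_present(*vmf->pmd))) {
                        ret = VM_FAULT_SIGBUS;
                } else {
                        vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm,
                                                       vmf->pmd, vmf->address,
                                                       &vmf->ptl);
                        /*
                         * Recheck under the ptl: a pte_none seen without the
                         * lock may just be the transient clear done by a
                         * concurrent R/M/W update such as mprotect's
                         * ptep_modify_prot_start().
                         */
                        if (unlikely(pte_none(*vmf->pte)))
                                ret = VM_FAULT_SIGBUS;
                        else
                                ret = VM_FAULT_NOPAGE;
                        pte_unmap_unlock(vmf->pte, vmf->ptl);
                }
        }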

    [aneesh.kumar@linux.ibm.com: fix crash observed with syzkaller run]
    Link: http://lkml.kernel.org/r/87va6bwlfg.fsf@linux.ibm.com
    Link: http://lkml.kernel.org/r/20180926031858.9692-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Willem de Bruijn
    Cc: Eric Dumazet
    Cc: Ido Schimmel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     

26 Aug, 2018

1 commit

  • This is not normally noticeable, but repeated forks are unnecessarily
    expensive because they repeatedly dirty the parent page tables during
    the page table copy operation.

    It's trivial to just avoid write protecting the page table entry if it
    was already not writable.
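
    A sketch of the change in the fork-time PTE copy (copy_one_pte() in
    mm/memory.c at the time; the added pte_write() test is the whole fix):

        /*
         * If it's a COW mapping, write protect it both in the parent and
         * the child -- but only if the source pte is actually writable,
         * so repeated forks don't keep re-dirtying already write-protected
         * parent page tables.
         */
        if (is_cow_mapping(vm_flags) && pte_write(pte)) {
                ptep_set_wrprotect(src_mm, addr, src_pte);
                pte = pte_wrprotect(pte);
        }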

    This patch was inspired by

    https://bugzilla.kernel.org/show_bug.cgi?id=200447

    which points to an ancient "waste time re-doing fork" issue in the
    presence of lots of signals.

    That bug was fixed by Eric Biederman's signal handling series
    culminating in commit c3ad2c3b02e9 ("signal: Don't restart fork when
    signals come in"), but the unnecessary work for repeated forks is still
    worth just fixing, particularly since the fix is trivial.

    Cc: Eric Biederman
    Cc: Oleg Nesterov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Aug, 2018

6 commits

  • Merge yet more updates from Andrew Morton:

    - the rest of MM

    - various misc fixes and tweaks

    * emailed patches from Andrew Morton : (22 commits)
    mm: Change return type int to vm_fault_t for fault handlers
    lib/fonts: convert comments to utf-8
    s390: ebcdic: convert comments to UTF-8
    treewide: convert ISO_8859-1 text comments to utf-8
    drivers/gpu/drm/gma500/: change return type to vm_fault_t
    docs/core-api: mm-api: add section about GFP flags
    docs/mm: make GFP flags descriptions usable as kernel-doc
    docs/core-api: split memory management API to a separate file
    docs/core-api: move *{str,mem}dup* to "String Manipulation"
    docs/core-api: kill trailing whitespace in kernel-api.rst
    mm/util: add kernel-doc for kvfree
    mm/util: make strndup_user description a kernel-doc comment
    fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
    treewide: correct "differenciate" and "instanciate" typos
    fs/afs: use new return type vm_fault_t
    drivers/hwtracing/intel_th/msu.c: change return type to vm_fault_t
    mm: soft-offline: close the race against page allocation
    mm: fix race on soft-offlining free huge pages
    namei: allow restricted O_CREAT of FIFOs and regular files
    hfs: prevent crash on exit from failed search
    ...

    Linus Torvalds
     
    Use the new return type vm_fault_t for fault handlers. For now, this is
    just documenting that the function returns a VM_FAULT value rather than
    an errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    Ref: commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    The aim is to change the return type of finish_fault() and
    handle_mm_fault() to vm_fault_t. As part of that cleanup, the return
    types of all other recursively called functions have been changed to
    vm_fault_t as well.

    The places from which handle_mm_fault() is invoked will be changed to
    vm_fault_t in a separate patch.

    vmf_error() is a newly introduced inline function in 4.17-rc6.
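
    For reference, a sketch of that helper as it appears in
    include/linux/mm.h (reproduced from memory, so treat it as hedged):

        static inline vm_fault_t vmf_error(int err)
        {
                if (err == -ENOMEM)
                        return VM_FAULT_OOM;
                return VM_FAULT_SIGBUS;
        }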

    [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
    Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
    The generic tlb_end_vma does not call the invalidate_range mmu notifier,
    and it resets the mmu_gather range, which means the notifier won't be
    called on part of the range in case of an unmap that spans multiple
    vmas.

    ARM64 seems to be the only arch I could see that has notifiers and uses
    the generic tlb_end_vma. I have not actually tested it.

    [ Catalin and Will point out that ARM64 currently only uses the
    notifiers for KVM, which doesn't use the ->invalidate_range()
    callback right now, so it's a bug, but one that happens to
    not affect them. So not necessary for stable. - Linus ]
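
    A hedged sketch of the resulting generic code: the mmu notifier
    invalidate is tied to the TLB flush itself rather than to tlb_end_vma(),
    so the whole gathered range is notified even when an unmap spans
    multiple vmas.

        static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
        {
                if (!tlb->end)
                        return;

                tlb_flush(tlb);
                /* Notify on the full gathered range, not per-vma pieces. */
                mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
                __tlb_reset_range(tlb);
        }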

    Signed-off-by: Nicholas Piggin
    Acked-by: Catalin Marinas
    Acked-by: Will Deacon
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • Jann reported that x86 was missing required TLB invalidates when he
    hit the !*batch slow path in tlb_remove_table().

    This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
    invalidates, the PowerPC-hash where this code originated and the
    Sparc-hash where this was subsequently used did not need that. ARM
    which later used this put an explicit TLB invalidate in their
    __p*_free_tlb() functions, and PowerPC-radix followed that example.

    But when we hooked up x86 we failed to consider this. Fix this by
    (optionally) hooking tlb_remove_table() into the TLB invalidate code.

    NOTE: s390 was also needing something like this and might now
    be able to use the generic code again.

    [ Modified to be on top of Nick's cleanups, which simplified this patch
    now that tlb_flush_mmu_tlbonly() really only flushes the TLB - Linus ]
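
    A hedged sketch of the hook-up (close to the upstream shape, details
    trimmed): a CONFIG_HAVE_RCU_TABLE_INVALIDATE helper forces a TLB flush
    in the paths that hand page-table pages back, including the !*batch
    slow path that was missing it.

        static inline void tlb_table_invalidate(struct mmu_gather *tlb)
        {
        #ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
                /*
                 * Invalidate page-table caches used by hardware walkers;
                 * the RCU-sched wait is still needed for software walkers.
                 */
                tlb_flush_mmu_tlbonly(tlb);
        #endif
        }

        void tlb_remove_table(struct mmu_gather *tlb, void *table)
        {
                struct mmu_table_batch **batch = &tlb->batch;

                if (*batch == NULL) {
                        *batch = (struct mmu_table_batch *)
                                __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
                        if (*batch == NULL) {
                                tlb_table_invalidate(tlb);  /* previously missing */
                                tlb_remove_table_one(table);
                                return;
                        }
                        (*batch)->nr = 0;
                }
                (*batch)->tables[(*batch)->nr++] = table;
                if ((*batch)->nr == MAX_TABLE_BATCH)
                        tlb_table_flush(tlb);
        }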

    Fixes: 9e52fc2b50de ("x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)")
    Reported-by: Jann Horn
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Cc: Nicholas Piggin
    Cc: David Miller
    Cc: Will Deacon
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Will noted that only checking mm_users is incorrect; we should also
    check mm_count in order to cover CPUs that have a lazy reference to
    this mm (and could do speculative TLB operations).

    If removing this turns out to be a performance issue, we can
    re-instate a more complete check, but in tlb_table_flush() eliding the
    call_rcu_sched().
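
    For context, a hedged sketch of the shortcut being removed from
    tlb_remove_table():

        /*
         * Removed: checking mm_users alone misses lazy-TLB CPUs that still
         * hold an mm_count reference and may walk these tables
         * speculatively, so the table cannot be freed immediately.
         */
        if (atomic_read(&tlb->mm->mm_users) < 2) {
                __tlb_remove_table(table);
                return;
        }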

    Fixes: 267239116987 ("mm, powerpc: move the RCU page-table freeing into generic code")
    Reported-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Acked-by: Will Deacon
    Cc: Nicholas Piggin
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • There is no need to call this from tlb_flush_mmu_tlbonly, it logically
    belongs with tlb_flush_mmu_free. This makes future fixes simpler.

    [ This was originally done to allow code consolidation for the
    mmu_notifier fix, but it also ends up helping simplify the
    HAVE_RCU_TABLE_INVALIDATE fix. - Linus ]

    Signed-off-by: Nicholas Piggin
    Acked-by: Will Deacon
    Cc: Peter Zijlstra
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

23 Aug, 2018

1 commit

  • Revert commits:

    95b0e6357d3e x86/mm/tlb: Always use lazy TLB mode
    64482aafe55f x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
    ac0315896970 x86/mm/tlb: Make lazy TLB mode lazier
    61d0beb5796a x86/mm/tlb: Restructure switch_mm_irqs_off()
    2ff6ddf19c0e x86/mm/tlb: Leave lazy TLB mode at page table free time

    In order to simplify the TLB invalidate fixes for x86 and unify the
    parts that need backporting. We'll try again later.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

18 Aug, 2018

6 commits

  • There was a bug in Linux that could cause madvise (and mprotect?) system
    calls to return to userspace without the TLB having been flushed for all
    the pages involved.

    This could happen when multiple threads of a process made simultaneous
    madvise and/or mprotect calls.

    This was noticed in the summer of 2017, at which time two solutions
    were created:

    56236a59556c ("mm: refactor TLB gathering API")
    99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
    and
    4647706ebeee ("mm: always flush VMA ranges affected by zap_page_range")

    We need only one of these solutions, and the former appears to be a
    little more efficient than the latter, so revert that one.

    This reverts 4647706ebeee6e50 ("mm: always flush VMA ranges affected by
    zap_page_range")

    Link: http://lkml.kernel.org/r/20180706131019.51e3a5f0@imladris.surriel.com
    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Cc: Nicholas Piggin
    Cc: Nadav Amit
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
    Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full
    callstack on OOM") has changed the ENOMEM semantic of memcg charges.
    Rather than invoking the oom killer from the charging context it delays
    the oom killer to the page fault path (pagefault_out_of_memory). This
    in turn means that many users (e.g. slab or g-u-p) will get ENOMEM when
    the corresponding memcg hits the hard limit and the memcg is OOM.
    This behavior is inconsistent with the !memcg case, where the oom killer
    is invoked from the allocation context and the allocator keeps retrying
    until it succeeds.

    The difference in the behavior is user visible. mmap(MAP_POPULATE)
    might result in not fully populated ranges while the mmap return code
    doesn't tell that to the userspace. Random syscalls might fail with
    ENOMEM etc.

    The primary motivation of the different memcg oom semantic was the
    deadlock avoidance. Things have changed since then, though. We have an
    async oom teardown by the oom reaper now and so we do not have to rely
    on the victim to tear down its memory anymore. Therefore we can return
    to the original semantic as long as the memcg oom killer is not handed
    over to the users space.

    There is still one thing to be careful about here though. If the oom
    killer is not able to make any forward progress - e.g. because there is
    no eligible task to kill - then we have to bail out of the charge path
    to prevent the same class of deadlocks. We have basically two options
    here. Either we fail the charge with ENOMEM or force the charge and
    allow overcharge. The first option has been considered more harmful
    than useful because rare inconsistencies in the ENOMEM behavior are hard
    to test for and error prone - basically the same reason why the page
    allocator doesn't fail allocations under such conditions. The latter
    might allow runaways, but those should be really unlikely unless
    somebody misconfigures the system, e.g. by allowing tasks to migrate
    away from the memcg to a different unlimited memcg with
    move_charge_at_immigrate disabled.

    Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Huge pages help to reduce the TLB miss rate, but they have a higher
    cache footprint, which can sometimes cause problems. For example, when
    copying a huge page on an x86_64 platform, the cache footprint is 4M.
    But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only
    45M LLC (last level cache). That is, on average, there is 2.5M of LLC
    for each core and 1.25M for each thread.

    If cache contention is heavy while copying the huge page, and we copy
    the huge page from beginning to end, it is possible that the beginning
    of the huge page has been evicted from the cache by the time we finish
    copying its end. And it is possible for the application to access the
    beginning of the huge page right after the copy.

    In c79b57e462b5d ("mm: hugetlb: clear target sub-page last when clearing
    huge page"), to keep the cache lines of the target subpage hot, the
    order in which clear_huge_page() clears the subpages of the huge page
    was changed to clear the subpage furthest from the target subpage first
    and the target subpage last. A similar ordering change helps huge page
    copying too, and that is what this patch implements. Because the order
    algorithm has been put into a separate function, the implementation is
    quite simple.

    The patch is a generic optimization which should benefit quite a few
    workloads rather than a specific use case. To demonstrate the
    performance benefit of the patch, we tested it with vm-scalability run
    on transparent huge pages.

    With this patch, the throughput increases ~16.6% in the vm-scalability
    anon-cow-seq test case with 36 processes on a 2-socket Xeon E5 v3 2699
    system (36 cores, 72 threads). The test case sets
    /sys/kernel/mm/transparent_hugepage/enabled to always, mmaps a big
    anonymous memory area and populates it, then forks 36 child processes,
    each of which writes to the anonymous memory area from beginning to
    end, causing copy-on-write. For each child process, the other child
    processes can be seen as other workloads generating heavy cache
    pressure. At the same time, the IPC (instructions per cycle) increased
    from 0.63 to 0.78, and the time spent in user space is reduced ~7.2%.

    Link: http://lkml.kernel.org/r/20180524005851.4079-3-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Mike Kravetz
    Cc: Andi Kleen
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Christopher Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Patch series "mm, huge page: Copy target sub-page last when copy huge
    page", v2.

    Huge pages help to reduce the TLB miss rate, but they have a higher
    cache footprint, which can sometimes cause problems. For example, when
    copying a huge page on an x86_64 platform, the cache footprint is 4M.
    But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only
    45M LLC (last level cache). That is, on average, there is 2.5M of LLC
    for each core and 1.25M for each thread.

    If cache contention is heavy while copying the huge page, and we copy
    the huge page from beginning to end, it is possible that the beginning
    of the huge page has been evicted from the cache by the time we finish
    copying its end. And it is possible for the application to access the
    beginning of the huge page right after the copy.

    In c79b57e462b5d ("mm: hugetlb: clear target sub-page last when clearing
    huge page"), to keep the cache lines of the target subpage hot, the
    order in which clear_huge_page() clears the subpages of the huge page
    was changed to clear the subpage furthest from the target subpage first
    and the target subpage last. A similar ordering change helps huge page
    copying too, and that is what this patchset implements.

    The patchset is a generic optimization which should benefit quite a few
    workloads rather than a specific use case. To demonstrate the
    performance benefit of the patchset, we have tested it with
    vm-scalability run on transparent huge pages.

    With this patchset, the throughput increases ~16.6% in the
    vm-scalability anon-cow-seq test case with 36 processes on a 2-socket
    Xeon E5 v3 2699 system (36 cores, 72 threads). The test case sets
    /sys/kernel/mm/transparent_hugepage/enabled to always, mmaps a big
    anonymous memory area and populates it, then forks 36 child processes,
    each of which writes to the anonymous memory area from beginning to
    end, causing copy-on-write. For each child process, the other child
    processes can be seen as other workloads generating heavy cache
    pressure. At the same time, the IPC (instructions per cycle) increased
    from 0.63 to 0.78, and the time spent in user space is reduced ~7.2%.

    This patch (of 4):

    In c79b57e462b5d ("mm: hugetlb: clear target sub-page last when clearing
    huge page"), to keep the cache lines of the target subpage hot, the
    order in which clear_huge_page() clears the subpages of the huge page
    was changed to clear the subpage furthest from the target subpage first
    and the target subpage last. This optimization can be applied to huge
    page copying too, with the same order algorithm. To avoid code
    duplication and reduce maintenance overhead, this patch moves the order
    algorithm out of clear_huge_page() into a separate function,
    process_huge_page(), so that we can use it for copying huge pages too.
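
    A much-simplified sketch of the ordering idea (hypothetical helper name;
    the real process_huge_page() works on chunks and differs in detail):
    visit the sub-pages in order of decreasing distance from the target
    sub-page, and the target last, so its cache lines stay hot.

        /* Apply @op to each sub-page: furthest from target first, target last. */
        static void process_huge_page_sketch(unsigned long addr_hint,
                                             unsigned long huge_base,
                                             unsigned int nr_subpages,
                                             void (*op)(unsigned long subpage_addr))
        {
                unsigned int target = (addr_hint - huge_base) / PAGE_SIZE;
                unsigned int d, dmax;

                dmax = target > nr_subpages - 1 - target ?
                        target : nr_subpages - 1 - target;

                for (d = dmax; d >= 1; d--) {
                        if (target >= d)                /* left neighbour exists */
                                op(huge_base + PAGE_SIZE * (target - d));
                        if (target + d < nr_subpages)   /* right neighbour exists */
                                op(huge_base + PAGE_SIZE * (target + d));
                }
                op(huge_base + PAGE_SIZE * target);     /* target last */
        }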

    This changes the direct calls to clear_user_highpage() into indirect
    calls, but with proper inlining support from the compiler the indirect
    calls are optimized back into direct calls. Our tests show no
    performance change with the patch.

    This patch is a code cleanup without functionality change.

    Link: http://lkml.kernel.org/r/20180524005851.4079-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-by: Mike Kravetz
    Reviewed-by: Mike Kravetz
    Cc: Andi Kleen
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Christopher Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    Since commit eca56ff906bd ("mm, shmem: add internal shmem resident
    memory accounting"), MM_SHMEMPAGES has been added to separate shmem
    accounting from regular files. So, all shmem pages should be accounted
    to MM_SHMEMPAGES instead of MM_FILEPAGES.

    Normal 4K shmem pages are already accounted to MM_SHMEMPAGES, so shmem
    THP pages should not be treated differently. Account them to
    MM_SHMEMPAGES via mm_counter_file() - shmem pages are swap backed - to
    keep them consistent with normal 4K shmem pages.
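
    A hedged sketch of the accounting change in the THP fault path
    (mm/memory.c of that era):

        /* Before: every file-backed THP was charged to MM_FILEPAGES. */
        add_mm_counter(vma->vm_mm, MM_FILEPAGES, HPAGE_PMD_NR);

        /* After: shmem THPs are charged to MM_SHMEMPAGES, like 4K shmem. */
        add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR);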

    This will not change the rss counter of processes since shmem pages are
    still a part of it.

    The /proc/pid/status and /proc/pid/statm counters will however be more
    accurate wrt shmem usage, as originally intended. And as eca56ff906bd
    ("mm, shmem: add internal shmem resident memory accounting") mentioned,
    oom also could report more accurate "shmem-rss".

    Link: http://lkml.kernel.org/r/1529442518-17398-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • This patch is reworked from an earlier patch that Dan has posted:
    https://patchwork.kernel.org/patch/10131727/

    VM_MIXEDMAP is used by dax to tell mm paths like vm_normal_page() that
    the memory page it is dealing with is not typical memory from the linear
    map. The get_user_pages_fast() path, since it does not resolve the vma,
    is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
    use that as a VM_MIXEDMAP replacement in some locations. In the cases
    where there is no pte to consult we fall back to using vma_is_dax() to
    detect the VM_MIXEDMAP special case.

    Now that we have explicit driver pfn_t-flag opt-in/opt-out for
    get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This
    also means we no longer need to worry about safely manipulating vm_flags
    in a future where we support dynamically changing the dax mode of a
    file.

    DAX should also now be supported with madvise_behavior(), vma_merge(),
    and copy_page_range().

    This patch has been tested against the ndctl unit tests. It has also
    been tested against xfstests commit 625515d, using fake pmem created via
    memmap, and no additional issues have been observed.

    Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Acked-by: Dan Williams
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

15 Aug, 2018

2 commits

  • Pull block updates from Jens Axboe:
    "First pull request for this merge window, there will also be a
    followup request with some stragglers.

    This pull request contains:

    - Fix for a thundering herd issue in the wbt block code (Anchal
    Agarwal)

    - A few NVMe pull requests:
    * Improved tracepoints (Keith)
    * Larger inline data support for RDMA (Steve Wise)
    * RDMA setup/teardown fixes (Sagi)
    * Effects log support for NVMe target (Chaitanya Kulkarni)
    * Buffered IO support for NVMe target (Chaitanya Kulkarni)
    * TP4004 (ANA) support (Christoph)
    * Various NVMe fixes

    - Block io-latency controller support. Much needed support for
    properly containing block devices. (Josef)

    - Series improving how we handle sense information on the stack
    (Kees)

    - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

    - Zoned device support for null_blk (Matias)

    - AIX partition fixes (Mauricio Faria de Oliveira)

    - DIF checksum code made generic (Max Gurtovoy)

    - Add support for discard in iostats (Michael Callahan / Tejun)

    - Set of updates for BFQ (Paolo)

    - Removal of async write support for bsg (Christoph)

    - Bio page dirtying and clone fixups (Christoph)

    - Set of bcache fix/changes (via Coly)

    - Series improving blk-mq queue setup/teardown speed (Ming)

    - Series improving merging performance on blk-mq (Ming)

    - Lots of other fixes and cleanups from a slew of folks"

    * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
    blkcg: Make blkg_root_lookup() work for queues in bypass mode
    bcache: fix error setting writeback_rate through sysfs interface
    null_blk: add lock drop/acquire annotation
    Blk-throttle: reduce tail io latency when iops limit is enforced
    block: paride: pd: mark expected switch fall-throughs
    block: Ensure that a request queue is dissociated from the cgroup controller
    block: Introduce blk_exit_queue()
    blkcg: Introduce blkg_root_lookup()
    block: Remove two superfluous #include directives
    blk-mq: count the hctx as active before allocating tag
    block: bvec_nr_vecs() returns value for wrong slab
    bcache: trivial - remove tailing backslash in macro BTREE_FLAG
    bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
    bcache: set max writeback rate when I/O request is idle
    bcache: add code comments for bset.c
    bcache: fix mistaken comments in request.c
    bcache: fix mistaken code comments in bcache.h
    bcache: add a comment in super.c
    bcache: avoid unncessary cache prefetch bch_btree_node_get()
    bcache: display rate debug parameters to 0 when writeback is not running
    ...

    Linus Torvalds
     
  • Merge L1 Terminal Fault fixes from Thomas Gleixner:
    "L1TF, aka L1 Terminal Fault, is yet another speculative hardware
    engineering trainwreck. It's a hardware vulnerability which allows
    unprivileged speculative access to data which is available in the
    Level 1 Data Cache when the page table entry controlling the virtual
    address, which is used for the access, has the Present bit cleared or
    other reserved bits set.

    If an instruction accesses a virtual address for which the relevant
    page table entry (PTE) has the Present bit cleared or other reserved
    bits set, then speculative execution ignores the invalid PTE and loads
    the referenced data if it is present in the Level 1 Data Cache, as if
    the page referenced by the address bits in the PTE was still present
    and accessible.

    While this is a purely speculative mechanism and the instruction will
    raise a page fault when it is retired eventually, the pure act of
    loading the data and making it available to other speculative
    instructions opens up the opportunity for side channel attacks to
    unprivileged malicious code, similar to the Meltdown attack.

    While Meltdown breaks the user space to kernel space protection, L1TF
    allows to attack any physical memory address in the system and the
    attack works across all protection domains. It allows an attack of SGX
    and also works from inside virtual machines because the speculation
    bypasses the extended page table (EPT) protection mechanism.

    The associated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646

    The mitigations provided by this pull request include:

    - Host side protection by inverting the upper address bits of a non
    present page table entry so the entry points to uncacheable memory.

    - Hypervisor protection by flushing L1 Data Cache on VMENTER.

    - SMT (HyperThreading) control knobs, which allow to 'turn off' SMT
    by offlining the sibling CPU threads. The knobs are available on
    the kernel command line and at runtime via sysfs

    - Control knobs for the hypervisor mitigation, related to L1D flush
    and SMT control. The knobs are available on the kernel command line
    and at runtime via sysfs

    - Extensive documentation about L1TF including various degrees of
    mitigations.

    Thanks to all people who have contributed to this in various ways -
    patches, review, testing, backporting - and the fruitful, sometimes
    heated, but at the end constructive discussions.

    There is work in progress to provide other forms of mitigations, which
    might be less horrible performance wise for a particular kind of
    workloads, but this is not yet ready for consumption due to their
    complexity and limitations"

    * 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
    x86/microcode: Allow late microcode loading with SMT disabled
    tools headers: Synchronise x86 cpufeatures.h for L1TF additions
    x86/mm/kmmio: Make the tracer robust against L1TF
    x86/mm/pat: Make set_memory_np() L1TF safe
    x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
    x86/speculation/l1tf: Invert all not present mappings
    cpu/hotplug: Fix SMT supported evaluation
    KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
    x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
    x86/speculation: Simplify sysfs report of VMX L1TF vulnerability
    Documentation/l1tf: Remove Yonah processors from not vulnerable list
    x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()
    x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
    x86: Don't include linux/irq.h from asm/hardirq.h
    x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
    x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
    x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
    x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
    x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()
    cpu/hotplug: detect SMT disabled by BIOS
    ...

    Linus Torvalds
     

14 Aug, 2018

1 commit

  • Pull x86 mm updates from Thomas Gleixner:

    - Make lazy TLB mode even lazier to avoid pointless switch_mm()
    operations, which reduces CPU load by 1-2% for memcache workloads

    - Small cleanups and improvements all over the place

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/mm: Remove redundant check for kmem_cache_create()
    arm/asm/tlb.h: Fix build error implicit func declaration
    x86/mm/tlb: Make clear_asid_other() static
    x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off()
    x86/mm/tlb: Always use lazy TLB mode
    x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
    x86/mm/tlb: Make lazy TLB mode lazier
    x86/mm/tlb: Restructure switch_mm_irqs_off()
    x86/mm/tlb: Leave lazy TLB mode at page table free time
    mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids
    x86/mm: Add TLB purge to free pmd/pte page interfaces
    ioremap: Update pgtable free interfaces with addr
    x86/mm: Disable ioremap free page handling on x86-PAE

    Linus Torvalds
     

11 Aug, 2018

1 commit

  • ioremap_prot() can return NULL which could lead to an oops.

    Link: http://lkml.kernel.org/r/1533195441-58594-1-git-send-email-chenjie6@huawei.com
    Signed-off-by: chen jie
    Reviewed-by: Andrew Morton
    Cc: Li Zefan
    Cc: chenjie
    Cc: Yang Shi
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jie@chenjie6@huwei.com
     

02 Aug, 2018

1 commit

  • Delete the old VM_BUG_ON_VMA() from zap_pmd_range(), which asserted
    that mmap_sem must be held when splitting an "anonymous" vma there.
    Whether that's still strictly true nowadays is not entirely clear,
    but the danger of sometimes crashing on the BUG is now fairly clear.

    Even with the new stricter rules for anonymous vma marking, the
    condition it checks for can possibly trigger. Commit 44960f2a7b63
    ("staging: ashmem: Fix SIGBUS crash when traversing mmaped ashmem
    pages") is good, and originally I thought it was safe from that
    VM_BUG_ON_VMA(), because the /dev/ashmem fd exposed to the user is
    disconnected from the vm_file in the vma, and madvise(,,MADV_REMOVE)
    insists on VM_SHARED.

    But after I read John's earlier mail, drawing attention to the
    vfs_fallocate() in there: I may be wrong, and I don't know if Android
    has THP in the config anyway, but it looks to me like an
    unmap_mapping_range() from ashmem's vfs_fallocate() could hit precisely
    the VM_BUG_ON_VMA(), once it's vma_is_anonymous().
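
    For context, a hedged reconstruction of the assertion being deleted from
    zap_pmd_range():

        /* Removed: */
        VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
                      !rwsem_is_locked(&tlb->mm->mmap_sem), vma);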

    Signed-off-by: Hugh Dickins
    Cc: John Stultz
    Cc: Kirill Shutemov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Jul, 2018

1 commit

  • Andy discovered that speculative memory accesses while in lazy
    TLB mode can crash a system, when a CPU tries to dereference a
    speculative access using memory contents that used to be valid
    page table memory, but have since been reused for something else
    and point into la-la land.

    The latter problem can be prevented in two ways. The first is to
    always send a TLB shootdown IPI to CPUs in lazy TLB mode, while
    the second one is to only send the TLB shootdown at page table
    freeing time.

    The second should result in fewer IPIs, since operations like
    mprotect and madvise are very common with some workloads, but
    do not involve page table freeing. Also, on munmap, batching
    of page table freeing covers much larger ranges of virtual
    memory than the batching of unmapped user pages.

    Tested-by: Song Liu
    Signed-off-by: Rik van Riel
    Acked-by: Dave Hansen
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: efault@gmx.de
    Cc: kernel-team@fb.com
    Cc: luto@kernel.org
    Link: http://lkml.kernel.org/r/20180716190337.26133-3-riel@surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

09 Jul, 2018

1 commit

  • Memory allocations can induce swapping via kswapd or direct reclaim. If
    we are having IO done for us by kswapd and don't actually go into direct
    reclaim we may never get scheduled for throttling. So instead check to
    see if our cgroup is congested, and if so schedule the throttling.
    Before we return to user space the throttling stuff will only throttle
    if we actually required it.

    Signed-off-by: Tejun Heo
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Acked-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Tejun Heo
     

21 Jun, 2018

1 commit

    For L1TF, PROT_NONE mappings are protected by inverting the PFN in the
    page table entry. This sets the high bits in the CPU's address space,
    thus making sure an unmapped entry does not point to valid cached memory.

    Some server system BIOSes put the MMIO mappings high up in the physical
    address space. If such a high mapping were exposed to unprivileged users,
    they could attack low memory by setting such a mapping to PROT_NONE. This
    could happen through a special device driver which is not access
    protected. Normal /dev/mem is of course access protected.

    To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.

    Valid page mappings are allowed because the system is then unsafe anyways.

    It's not expected that users commonly use PROT_NONE on MMIO. But to
    minimize any impact this is only enforced if the mapping actually refers to
    a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip
    the check for root.

    For mmaps this is straightforward and can be handled in vm_insert_pfn()
    and in remap_pfn_range().
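
    A hedged sketch of the mmap-side check (pfn_modify_allowed() is the
    architecture hook added by the x86 L1TF series; its placement here is
    illustrative):

        /* In vm_insert_pfn_prot() / the remap_pfn_range() PTE loop: */
        if (!pfn_modify_allowed(pfn, pgprot))
                return -EACCES;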

    For mprotect it's a bit trickier. At the point where the actual PTEs are
    accessed a lot of state has been changed and it would be difficult to undo
    on an error. Since this is an uncommon case, use a separate early page
    table walk pass for MMIO PROT_NONE mappings that checks for this condition
    early. For non-MMIO and non-PROT_NONE mappings there are no changes.

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Dave Hansen

    Andi Kleen
     

08 Jun, 2018

4 commits

  • Merge updates from Andrew Morton:

    - a few misc things

    - ocfs2 updates

    - v9fs updates

    - MM

    - procfs updates

    - lib/ updates

    - autofs updates

    * emailed patches from Andrew Morton : (118 commits)
    autofs: small cleanup in autofs_getpath()
    autofs: clean up includes
    autofs: comment on selinux changes needed for module autoload
    autofs: update MAINTAINERS entry for autofs
    autofs: use autofs instead of autofs4 in documentation
    autofs: rename autofs documentation files
    autofs: create autofs Kconfig and Makefile
    autofs: delete fs/autofs4 source files
    autofs: update fs/autofs4/Makefile
    autofs: update fs/autofs4/Kconfig
    autofs: copy autofs4 to autofs
    autofs4: use autofs instead of autofs4 everywhere
    autofs4: merge auto_fs.h and auto_fs4.h
    fs/binfmt_misc.c: do not allow offset overflow
    checkpatch: improve patch recognition
    lib/ucs2_string.c: add MODULE_LICENSE()
    lib/mpi: headers cleanup
    lib/percpu_ida.c: use _irqsave() instead of local_irq_save() + spin_lock
    lib/idr.c: remove simple_ida_lock
    lib/bitmap.c: micro-optimization for __bitmap_complement()
    ...

    Linus Torvalds
     
  • Remove the additional define HAVE_PTE_SPECIAL and rely directly on
    CONFIG_ARCH_HAS_PTE_SPECIAL.

    There is no functional change introduced by this patch

    Link: http://lkml.kernel.org/r/1523533733-25437-1-git-send-email-ldufour@linux.vnet.ibm.com
    Signed-off-by: Laurent Dufour
    Acked-by: David Rientjes
    Reviewed-by: Andrew Morton
    Cc: Jerome Glisse
    Cc: Michal Hocko
    Cc: Christophe LEROY
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
    Currently, PTE special support is turned on in per-architecture header
    files. Most of the time, it is defined in
    arch/*/include/asm/pgtable.h, depending or not on some other
    per-architecture static definition.

    This patch introduces a new configuration variable to manage this
    directly in the Kconfig files. It would later replace
    __HAVE_ARCH_PTE_SPECIAL.

    Here are notes for some architectures where the definition of
    __HAVE_ARCH_PTE_SPECIAL is not obvious:

    arm
    __HAVE_ARCH_PTE_SPECIAL which is currently defined in
    arch/arm/include/asm/pgtable-3level.h which is included by
    arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
    So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.

    powerpc
    __HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
    - arch/powerpc/include/asm/book3s/64/pgtable.h
    - arch/powerpc/include/asm/pte-common.h
    The first one is included if (PPC_BOOK3S & PPC64) while the second is
    included in all the other cases.
    So select ARCH_HAS_PTE_SPECIAL all the time.

    sparc:
    __HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
    defined(__arch64__) which are defined through the compiler in
    sparc/Makefile if !SPARC32 which I assume to be if SPARC64.
    So select ARCH_HAS_PTE_SPECIAL if SPARC64

    There is no functional change introduced by this patch.

    Link: http://lkml.kernel.org/r/1523433816-14460-2-git-send-email-ldufour@linux.vnet.ibm.com
    Signed-off-by: Laurent Dufour
    Suggested-by: Jerome Glisse
    Reviewed-by: Jerome Glisse
    Acked-by: David Rientjes
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K . V"
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: David S. Miller
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Vineet Gupta
    Cc: Palmer Dabbelt
    Cc: Albert Ou
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: David Rientjes
    Cc: Robin Murphy
    Cc: Christophe LEROY
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • Use new return type vm_fault_t for fault handler. For now, this is just
    documenting that the function returns a VM_FAULT value rather than an
    errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    There was an existing bug inside dax_load_hole(): if vm_insert_mixed had
    failed to allocate a page table, we'd return VM_FAULT_NOPAGE instead of
    VM_FAULT_OOM. With the new vmf_insert_mixed() this issue is addressed.

    vm_insert_mixed_mkwrite() is inefficient when it returns an error value,
    since the driver has to convert it to the vm_fault_t type. With the new
    vmf_insert_mixed_mkwrite() this limitation will be addressed.

    Link: http://lkml.kernel.org/r/20180510181121.GA15239@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Jan Kara
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Alexander Viro
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

01 Jun, 2018

1 commit


06 Apr, 2018

2 commits

    This patch makes do_swap_page() no longer need to be aware of the two
    different swap readahead algorithms. Just unify the cluster-based and
    vma-based readahead function calls.
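
    A hedged sketch of the unified entry point: do_swap_page() only calls
    swapin_readahead(), which picks the policy internally.

        struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
                                      struct vm_fault *vmf)
        {
                /* The fault path no longer cares which readahead policy runs. */
                return swap_use_vma_readahead() ?
                        swap_vma_readahead(entry, gfp_mask, vmf) :
                        swap_cluster_readahead(entry, gfp_mask, vmf);
        }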

    Link: http://lkml.kernel.org/r/1509520520-32367-3-git-send-email-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20180220085249.151400-3-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Looking at the recent swap readahead changes, I am very unhappy about
    the current code structure, which diverges into two swap readahead
    algorithms in do_swap_page(). This patch is to clean it up.

    The main motivation is that the fault handler doesn't need to be aware
    of the readahead algorithms; it should just call swapin_readahead().

    As a first step, this patch cleans up a little bit, but is not perfect
    (it is split out to make review easier); the next patch will make the
    goal complete.

    [minchan@kernel.org: do not check readahead flag with THP anon]
    Link: http://lkml.kernel.org/r/874lm83zho.fsf@yhuang-dev.intel.com
    Link: http://lkml.kernel.org/r/20180227232611.169883-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/1509520520-32367-2-git-send-email-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20180220085249.151400-2-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

18 Mar, 2018

1 commit

  • If a processor supports special metadata for a page, for example ADI
    version tags on SPARC M7, this metadata must be saved when the page is
    swapped out. The same metadata must be restored when the page is swapped
    back in. This patch adds two new architecture specific functions -
    arch_do_swap_page() to be called when a page is swapped in, and
    arch_unmap_one() to be called when a page is being unmapped for swap
    out. These architecture hooks allow page metadata to be saved if the
    architecture supports it.
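
    A hedged sketch of the default, no-op hooks for architectures that do
    not provide their own (as added to the generic pgtable header):

        #ifndef __HAVE_ARCH_DO_SWAP_PAGE
        /* Restore per-page metadata (e.g. SPARC ADI tags) when swapping in. */
        static inline void arch_do_swap_page(struct mm_struct *mm,
                                             struct vm_area_struct *vma,
                                             unsigned long addr,
                                             pte_t pte, pte_t oldpte)
        {
        }
        #endif

        #ifndef __HAVE_ARCH_UNMAP_ONE
        /* Save per-page metadata when a page is unmapped for swap out. */
        static inline int arch_unmap_one(struct mm_struct *mm,
                                         struct vm_area_struct *vma,
                                         unsigned long addr, pte_t orig_pte)
        {
                return 0;
        }
        #endif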

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Acked-by: Jerome Marchand
    Reviewed-by: Anthony Yznaga
    Acked-by: Andrew Morton
    Signed-off-by: David S. Miller

    Khalid Aziz
     

17 Feb, 2018

1 commit

  • We get a warning about some slow configurations in randconfig kernels:

    mm/memory.c:83:2: error: #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid. [-Werror=cpp]

    The warning is reasonable by itself, but gets in the way of randconfig
    build testing, so I'm hiding it whenever CONFIG_COMPILE_TEST is set.

    The warning was added in 2013 in commit 75980e97dacc ("mm: fold
    page->_last_nid into page->flags where possible").

    Cc: stable@vger.kernel.org
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

07 Feb, 2018

3 commits

  • Merge misc updates from Andrew Morton:

    - kasan updates

    - procfs

    - lib/bitmap updates

    - other lib/ updates

    - checkpatch tweaks

    - rapidio

    - ubsan

    - pipe fixes and cleanups

    - lots of other misc bits

    * emailed patches from Andrew Morton : (114 commits)
    Documentation/sysctl/user.txt: fix typo
    MAINTAINERS: update ARM/QUALCOMM SUPPORT patterns
    MAINTAINERS: update various PALM patterns
    MAINTAINERS: update "ARM/OXNAS platform support" patterns
    MAINTAINERS: update Cortina/Gemini patterns
    MAINTAINERS: remove ARM/CLKDEV SUPPORT file pattern
    MAINTAINERS: remove ANDROID ION pattern
    mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors
    mm: docs: fix parameter names mismatch
    mm: docs: fixup punctuation
    pipe: read buffer limits atomically
    pipe: simplify round_pipe_size()
    pipe: reject F_SETPIPE_SZ with size over UINT_MAX
    pipe: fix off-by-one error when checking buffer limits
    pipe: actually allow root to exceed the pipe buffer limits
    pipe, sysctl: remove pipe_proc_fn()
    pipe, sysctl: drop 'min' parameter from pipe-max-size converter
    kasan: rework Kconfig settings
    crash_dump: is_kdump_kernel can be boolean
    kernel/mutex: mutex_is_locked can be boolean
    ...

    Linus Torvalds
     
  • The file was converted from print_symbol() to %pSR a while ago in commit
    071361d3473e ("mm: Convert print_symbol to %pSR"). kallsyms does not
    seem to be needed anymore.

    Link: http://lkml.kernel.org/r/20171208025616.16267-3-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Pull libnvdimm updates from Ross Zwisler:

    - Require struct page by default for filesystem DAX to remove a number
    of surprising failure cases. This includes failures with direct I/O,
    gdb and fork(2).

    - Add support for the new Platform Capabilities Structure added to the
    NFIT in ACPI 6.2a. This new table tells us whether the platform
    supports flushing of CPU and memory controller caches on unexpected
    power loss events.

    - Revamp vmem_altmap and dev_pagemap handling to clean up code and
    better support future PCI P2P uses.

    - Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has
    become out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL
    spec, and instead rely on the generic ND_CMD_CALL approach used by
    the two other IOCTL families, NVDIMM_FAMILY_{HPE,MSFT}.

    - Enhance nfit_test so we can test some of the new things added in
    version 1.6 of the DSM specification. This includes testing firmware
    download and simulating the Last Shutdown State (LSS) status.

    * tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (37 commits)
    libnvdimm, namespace: remove redundant initialization of 'nd_mapping'
    acpi, nfit: fix register dimm error handling
    libnvdimm, namespace: make min namespace size 4K
    tools/testing/nvdimm: force nfit_test to depend on instrumented modules
    libnvdimm/nfit_test: adding support for unit testing enable LSS status
    libnvdimm/nfit_test: add firmware download emulation
    nfit-test: Add platform cap support from ACPI 6.2a to test
    libnvdimm: expose platform persistence attribute for nd_region
    acpi: nfit: add persistent memory control flag for nd_region
    acpi: nfit: Add support for detect platform CPU cache flush on power loss
    device-dax: Fix trailing semicolon
    libnvdimm, btt: fix uninitialized err_lock
    dax: require 'struct page' by default for filesystem dax
    ext2: auto disable dax instead of failing mount
    ext4: auto disable dax instead of failing mount
    mm, dax: introduce pfn_t_special()
    mm: Fix devm_memremap_pages() collision handling
    mm: Fix memory size alignment in devm_memremap_pages_release()
    memremap: merge find_dev_pagemap into get_dev_pagemap
    memremap: change devm_memremap_pages interface to use struct dev_pagemap
    ...

    Linus Torvalds
     

01 Feb, 2018

3 commits

    There are multiple comments surrounding do_fault_around that mention
    fault_around_pages() and fault_around_mask(), two routines that do not
    exist. These comments should be reworded to reference
    fault_around_bytes, the value which is used to determine how much
    do_fault_around() will attempt to read when processing a fault.

    These comments should have been updated when fault_around_pages() and
    fault_around_mask() were removed in commit aecd6f44266c ("mm: close race
    between do_fault_around() and fault_around_bytes_set()").

    Fixes: aecd6f44266c1 ("mm: close race between do_fault_around() and fault_around_bytes_set()")
    Link: http://lkml.kernel.org/r/302D0B14-C7E9-44C6-8BED-033F9ACBD030@oracle.com
    Signed-off-by: William Kucharski
    Reviewed-by: Larry Bassel
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    William Kucharski
     
  • Several users of unmap_mapping_range() would prefer to express their
    range in pages rather than bytes. Unfortunately, on a 32-bit kernel, you
    have to remember to cast your page number to a 64-bit type before
    shifting it, and four places in the current tree didn't remember to do
    that. That's a sign of a bad interface.

    Conveniently, unmap_mapping_range() actually converts from bytes into
    pages, so hoist the guts of unmap_mapping_range() into a new function
    unmap_mapping_pages() and convert the callers which want to use pages.
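
    A hedged sketch of the resulting split: unmap_mapping_range() keeps the
    byte-based interface, does the conversion (with the 32-bit overflow
    check) in one place, and hands off to the new page-based helper.

        void unmap_mapping_range(struct address_space *mapping,
                                 loff_t const holebegin, loff_t const holelen,
                                 int even_cows)
        {
                pgoff_t hba = holebegin >> PAGE_SHIFT;
                pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;

                /* Check for overflow: loff_t is wider than pgoff_t on 32-bit. */
                if (sizeof(holelen) > sizeof(hlen)) {
                        long long holeend =
                                (holebegin + holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
                        if (holeend & ~(long long)ULONG_MAX)
                                hlen = ULONG_MAX - hba + 1;
                }

                unmap_mapping_pages(mapping, hba, hlen, even_cows);
        }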

    Link: http://lkml.kernel.org/r/20171206142627.GD32044@bombadil.infradead.org
    Signed-off-by: Matthew Wilcox
    Reported-by: "zhangyi (F)"
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • The comment describes @fullmm argument, but the function has no such
    parameter.

    Update the comment to match the code and convert it to kernel-doc
    markup.

    Link: http://lkml.kernel.org/r/1512394531-2264-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport