01 Feb, 2017

1 commit

  • commit 8310d48b125d19fcd9521d83b8293e63eb1646aa upstream.

    In commit 19be0eaffa3a ("mm: remove gup_flags FOLL_WRITE games from
    __get_user_pages()"), the mm code was changed from unsetting FOLL_WRITE
    after a COW was resolved to setting the (newly introduced) FOLL_COW
    instead. Simultaneously, the check in gup.c was updated to still allow
    writes with FOLL_FORCE set if FOLL_COW had also been set.

    However, a similar check in huge_memory.c was forgotten. As a result,
    remote memory writes to read-only regions of memory backed by transparent
    huge pages cause an infinite loop in the kernel (handle_mm_fault sets
    FOLL_COW and returns 0 causing a retry, but follow_trans_huge_pmd bails
    out immediately because `(flags & FOLL_WRITE) && !pmd_write(*pmd)` is
    true).

    While in this state the process is still SIGKILLable, but little else
    works (e.g. no ptrace attach, no other signals). This is easily
    reproduced with the following code (assuming THP is set to "always"):

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <assert.h>
    #include <string.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <sys/wait.h>

    #define TEST_SIZE 5 * 1024 * 1024

    int main(void) {
            int status;
            pid_t child;
            int fd = open("/proc/self/mem", O_RDWR);
            void *addr = mmap(NULL, TEST_SIZE, PROT_READ,
                              MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
            assert(addr != MAP_FAILED);
            pid_t parent_pid = getpid();
            if ((child = fork()) == 0) {
                    void *addr2 = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
                                       MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
                    assert(addr2 != MAP_FAILED);
                    memset(addr2, 'a', TEST_SIZE);
                    pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr);
                    return 0;
            }
            assert(child == waitpid(child, &status, 0));
            assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
            return 0;
    }

    Fix this by updating follow_trans_huge_pmd in huge_memory.c analogously
    to the update in gup.c in the original commit. The same pattern exists
    in follow_devmap_pmd. However, we should not be able to reach that
    check with FOLL_COW set, so add WARN_ONCE to make sure we notice if we
    ever do.
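
    As a rough sketch (not the verbatim patch), the fix mirrors the
    can_follow_write_pte() helper that commit 19be0eaffa3a added on the
    gup.c side, so a forced write through a read-only THP is allowed once
    the COW has been broken and FOLL_COW recorded:

    /* sketch only: FOLL_FORCE may write to a ro pmd once COW broke it */
    static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
    {
            return pmd_write(pmd) ||
                   ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
                    pmd_dirty(pmd));
    }

    follow_trans_huge_pmd() then uses this helper instead of the bare
    pmd_write() check quoted above.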

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170106015025.GA38411@juliacomputing.com
    Signed-off-by: Keno Fischer
    Acked-by: Kirill A. Shutemov
    Cc: Greg Thelen
    Cc: Nicholas Piggin
    Cc: Willy Tarreau
    Cc: Oleg Nesterov
    Cc: Kees Cook
    Cc: Andy Lutomirski
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Keno Fischer
     

20 Jan, 2017

1 commit

  • commit 20f664aabeb88d582b623a625f83b0454fa34f07 upstream.

    Andreas reported that [1] made a test in jemalloc hang in THP mode on
    arm64:

    http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de

    The problem is that the page fault handler currently doesn't support
    dirty-bit emulation of a pmd on architectures without a hardware dirty
    bit, so the application gets stuck until the VM marks the pmd dirty.

    How the emulation works depends on the architecture. In the case of
    arm64, when a pte is first set up it is marked PTE_RDONLY, so that a page
    fault is triggered on the first store access, giving the kernel a chance
    to mark the pte dirty. Once the page fault occurs, the VM marks the pmd
    dirty and the arch code for setting the pmd clears PTE_RDONLY so the
    application can proceed.

    IOW, if the VM doesn't mark the pmd dirty, the application hangs forever
    in repeated faults (i.e. store accesses to a pmd that stays PTE_RDONLY).

    This patch enables pmd dirty-bit emulation for those architectures.
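
    A minimal sketch of the shape of the change in the huge-pmd access-fault
    path (variable and helper names assumed from the description above, not
    quoted from the patch):

    /* sketch: on a store fault, emulate the dirty bit in software */
    bool write = fault_flags & FAULT_FLAG_WRITE;
    pmd_t entry = pmd_mkyoung(orig_pmd);

    if (write)
            entry = pmd_mkdirty(entry);
    if (pmdp_set_access_flags(vma, haddr, pmd, entry, write))
            update_mmu_cache_pmd(vma, haddr, pmd);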

    [1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called

    Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
    Link: http://lkml.kernel.org/r/1482506098-6149-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reported-by: Andreas Schwab
    Tested-by: Andreas Schwab
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Jason Evans
    Cc: Will Deacon
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     

30 Nov, 2016

1 commit

  • Linus found there still is a race in mremap after commit 5d1904204c99
    ("mremap: fix race between mremap() and page cleanning").

    As described by Linus:
    "the issue is that another thread might make the pte be dirty (in the
    hardware walker, so no locking of ours will make any difference)
    *after* we checked whether it was dirty, but *before* we removed it
    from the page tables"

    Fix it by moving the check after we removed it from the page table.
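
    A simplified sketch of the resulting ordering in move_ptes() (helper
    names as in that code path; locking and error handling omitted):

    /* remove the pte from the page table first ... */
    pte = ptep_get_and_clear(mm, old_addr, old_pte);
    /*
     * ... only then look at the dirty bit: the hardware walker can no
     * longer dirty this pte behind our back.
     */
    if (pte_present(pte) && pte_dirty(pte))
            force_flush = true;
    pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
    set_pte_at(mm, new_addr, new_pte, pte);
    ...
    if (force_flush)
            flush_tlb_range(vma, old_end - len, old_end); /* before dropping the PTL */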

    Suggested-by: Linus Torvalds
    Signed-off-by: Aaron Lu
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

18 Nov, 2016

1 commit

  • Prior to 3.15, there was a race between zap_pte_range() and
    page_mkclean() where writes to a page could be lost. Dave Hansen
    discovered by inspection that there is a similar race between
    move_ptes() and page_mkclean().

    We've been able to reproduce the issue by enlarging the race window with
    a msleep(), but have not been able to hit it without modifying the code.
    So, we think it's a real issue, but it is difficult or impossible to hit
    in practice.

    The zap_pte_range() issue is fixed by commit 1cf35d47712d ("mm: split
    'tlb_flush_mmu()' into tlb flushing and memory freeing parts"), and this
    patch fixes the race between page_mkclean() and mremap().

    Here is one possible way to hit the race: suppose a process mmapped a
    file with READ | WRITE and SHARED, it has two threads and they are bound
    to 2 different CPUs, e.g. CPU1 and CPU2. mmap returned X, then thread
    1 did a write to addr X so that CPU1 now has a writable TLB for addr X
    on it. Thread 2 starts mremapping from addr X to Y while thread 1
    cleaned the page and then did another write to the old addr X again.
    The 2nd write from thread 1 could succeed, but the value will get lost.

    thread 1                                thread 2
    (bound to CPU1)                         (bound to CPU2)

    1: write 1 to addr X to get a
       writeable TLB on this CPU

                                            2: mremap starts

                                            3: move_ptes emptied PTE for addr X
                                               and setup new PTE for addr Y and
                                               then dropped PTL for X and Y

    4: page laundering for N by doing
       fadvise FADV_DONTNEED. When done,
       pageframe N is deemed clean.

    5: *write 2 to addr X

                                            6: tlb flush for addr X

    7: munmap (Y, pagesize) to make the
       page unmapped

    8: fadvise with FADV_DONTNEED again
       to kick the page off the pagecache

    9: pread the page from file to verify
       the value. If 1 is there, it means
       we have lost the written 2.

    *The write may or may not cause a segmentation fault; it depends on
    whether the stale TLB entry is still present on that CPU.

    Please note that this is only one specific way in which the race could
    occur; it does not mean the race can only occur in exactly the above
    configuration. For example, more than 2 threads could be involved,
    fadvise() could be done in another thread, etc.

    Anonymous pages can race between mremap() and page reclaim in a similar
    way. THP: a huge PMD is moved by mremap to a new huge PMD, then the new
    huge PMD gets unmapped/split/paged out before the TLB flush for the old
    huge PMD in move_page_tables() has happened, and we could still write
    data to it. Normal anonymous pages are in a similar situation.

    To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and,
    if one is found, do the flush before dropping the PTL. If we did the
    flush for every move_ptes()/move_huge_pmd() call then we would not need
    to do the flush in move_page_tables() for the whole range. But if we
    didn't, we still need to do the whole-range flush.

    Alternatively, we could track which parts of the range were flushed in
    move_ptes()/move_huge_pmd() and which were not, to avoid flushing the
    whole range in move_page_tables(). But that would require multiple TLB
    flushes for the different sub-ranges and should be less efficient than a
    single whole-range flush.

    A kernel build test on my Sandybridge desktop doesn't show any noticeable
    change.
    v4.9-rc4:
    real 5m14.048s
    user 32m19.800s
    sys 4m50.320s

    With this commit:
    real 5m13.888s
    user 32m19.330s
    sys 4m51.200s

    Reported-by: Dave Hansen
    Signed-off-by: Aaron Lu
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

08 Oct, 2016

3 commits

    vma->vm_page_prot is read locklessly from the rmap_walk while it may be
    updated concurrently; this patch prevents the risk of reading
    intermediate values.

    Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The global zero page is used to satisfy an anonymous read fault. If
    THP(Transparent HugePage) is enabled then the global huge zero page is
    used. The global huge zero page uses an atomic counter for reference
    counting and is allocated/freed dynamically according to its counter
    value.

    CPU time spent on that counter will greatly increase if there are a lot
    of processes doing anonymous read faults. This patch proposes a way to
    reduce the access to the global counter so that the CPU load can be
    reduced accordingly.

    To do this, a new flag of the mm_struct is introduced:
    MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only needs to touch
    the global counter in two cases:

    1. The first time it uses the global huge zero page;
    2. When mm_users of its mm_struct reaches zero.

    Note that right now, the huge zero page is eligible to be freed as soon
    as its last use goes away. With this patch, the page will not be
    eligible to be freed until the exit of the last process from which it
    was ever used.

    And with the use of mm_users, a kthread is not eligible to use the huge
    zero page either. Since no kthread uses the huge zero page today, there
    is no difference after applying this patch. But if that is not desired,
    I can change it to trigger when mm_count reaches zero.
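
    A rough sketch of how such a per-mm flag gates access to the global
    counter (flag and helper names taken from the description above; the
    exact code may differ):

    struct page *mm_get_huge_zero_page(struct mm_struct *mm)
    {
            /* fast path: this mm already holds its reference */
            if (test_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                    return READ_ONCE(huge_zero_page);

            if (!get_huge_zero_page())      /* touches the global counter */
                    return NULL;

            /* raced with another thread of this mm: drop the extra reference */
            if (test_and_set_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                    put_huge_zero_page();

            return READ_ONCE(huge_zero_page);
    }

    The matching put of the global reference then happens once per mm, when
    its mm_users count drops to zero.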

    Test case used on a Haswell EP machine:

    usemem -n 72 --readonly -j 0x200000 100G

    This spawns 72 processes, each of which mmaps 100G of anonymous space
    and then does read-only access to that space sequentially with a step of
    2MB.

    CPU cycles from perf report for base commit:
    54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page
    CPU cycles from perf report for this commit:
    0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page

    Performance(throughput) of the workload for base commit: 1784430792
    Performance(throughput) of the workload for this commit: 4726928591
    164% increase.

    Runtime of the workload for base commit: 707592 us
    Runtime of the workload for this commit: 303970 us
    50% drop.

    Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
    Signed-off-by: Aaron Lu
    Cc: Sergey Senozhatsky
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Ebru Akagunduz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
  • When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using pmd page size.
    This feature relies on both mmap virtual address and FS block (i.e.
    physical address) to be aligned by the pmd page size. Users can use
    mkfs options to specify FS to align block allocations. However,
    aligning mmap address requires code changes to existing applications for
    providing a pmd-aligned address to mmap().

    For instance, fio with "ioengine=mmap" performs I/Os with mmap() [1].
    It calls mmap() with a NULL address, which needs to be changed to
    provide a pmd-aligned address for testing with DAX pmd mappings.
    Changing all applications that call mmap() with NULL is undesirable.
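
    For reference, this is roughly what each such application would have to
    do by hand to get a pmd-aligned address (a self-contained sketch that
    assumes a 2MB pmd size and uses anonymous memory; a real DAX user would
    mmap the file descriptor with MAP_SHARED instead):

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define PMD_ALIGN (2UL * 1024 * 1024)   /* assumed pmd page size */

    /* Over-allocate, then trim so the returned address is pmd-aligned. */
    static void *mmap_pmd_aligned(size_t len)
    {
            size_t padded = len + PMD_ALIGN;
            uintptr_t raw = (uintptr_t)mmap(NULL, padded, PROT_READ | PROT_WRITE,
                                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if ((void *)raw == MAP_FAILED)
                    return MAP_FAILED;

            uintptr_t aligned = (raw + PMD_ALIGN - 1) & ~(PMD_ALIGN - 1);
            if (aligned > raw)                      /* give back the unused head */
                    munmap((void *)raw, aligned - raw);
            if (raw + padded > aligned + len)       /* ... and the unused tail */
                    munmap((void *)(aligned + len), raw + padded - (aligned + len));
            return (void *)aligned;
    }

    int main(void)
    {
            void *p = mmap_pmd_aligned(16 * PMD_ALIGN);
            printf("pmd-aligned mapping at %p\n", p);
            return p == MAP_FAILED;
    }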

    Add thp_get_unmapped_area(), which can be called by filesystem's
    get_unmapped_area to align an mmap address by the pmd size for a DAX
    file. It calls the default handler, mm->get_unmapped_area(), to find a
    range and then aligns it for a DAX file.

    The patch is based on Matthew Wilcox's change that allows adding support
    for the pud page size easily.

    [1]: https://github.com/axboe/fio/blob/master/engines/mmap.c
    Link: http://lkml.kernel.org/r/1472497881-9323-2-git-send-email-toshi.kani@hpe.com
    Signed-off-by: Toshi Kani
    Reviewed-by: Dan Williams
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Theodore Ts'o
    Cc: Andreas Dilger
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     

30 Sep, 2016

1 commit


26 Sep, 2016

1 commit

  • The NUMA balancing logic uses an arch-specific PROT_NONE page table flag
    defined by pte_protnone() or pmd_protnone() to mark PTEs or huge page
    PMDs respectively as requiring balancing upon a subsequent page fault.
    User-defined PROT_NONE memory regions which also have this flag set will
    not normally invoke the NUMA balancing code as do_page_fault() will send
    a segfault to the process before handle_mm_fault() is even called.

    However if access_remote_vm() is invoked to access a PROT_NONE region of
    memory, handle_mm_fault() is called via faultin_page() and
    __get_user_pages() without any access checks being performed, meaning
    the NUMA balancing logic is incorrectly invoked on a non-NUMA memory
    region.

    A simple means of triggering this problem is to access PROT_NONE mmap'd
    memory using /proc/self/mem which reliably results in the NUMA handling
    functions being invoked when CONFIG_NUMA_BALANCING is set.
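
    A minimal sketch of that trigger, written from the description above
    (hypothetical reproducer, not the bugzilla code; on an affected kernel
    with CONFIG_NUMA_BALANCING this reaches the BUG_ON path discussed below):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            size_t len = 2UL * 1024 * 1024;
            char buf[4096];

            /* user-defined PROT_NONE region: p[te|md]_protnone() is set */
            void *addr = mmap(NULL, len, PROT_NONE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (addr == MAP_FAILED)
                    return 1;

            int fd = open("/proc/self/mem", O_RDONLY);
            if (fd < 0)
                    return 1;

            /* remote access: faultin_page() runs without the usual access checks */
            ssize_t n = pread(fd, buf, sizeof(buf), (uintptr_t)addr);
            printf("pread returned %zd\n", n);
            return 0;
    }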

    This issue was reported in bugzilla (issue 99101) which includes some
    simple repro code.

    There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page()
    added at commit c0e7cad to avoid accidentally provoking strange
    behaviour by attempting to apply NUMA balancing to pages that are in
    fact PROT_NONE. The BUG_ON()'s are consistently triggered by the repro.

    This patch moves the PROT_NONE check into mm/memory.c rather than
    invoking BUG_ON() as faulting in these pages via faultin_page() is a
    valid reason for reaching the NUMA check with the PROT_NONE page table
    flag set and is therefore not always a bug.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101
    Reported-by: Trevor Saunders
    Signed-off-by: Lorenzo Stoakes
    Acked-by: Rik van Riel
    Cc: Andrew Morton
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

22 Sep, 2016

1 commit


14 Sep, 2016

1 commit

  • Commit:

    4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations")

    changed NUMA balancing from _PAGE_NUMA to using PROT_NONE, and was quickly
    found to introduce a regression with NUMA grouping.

    It was followed up by these commits:

    53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic")
    bea66fbd11af ("mm: numa: group related processes based on VMA flags instead of page table flags")
    b191f9b106ea ("mm: numa: preserve PTE write permissions across a NUMA hinting fault")

    The first two of those commits try alternate approaches to NUMA
    grouping, which apparently do not work as well as looking at the PTE
    write permissions.

    The latter patch preserves the PTE write permissions across a NUMA
    protection fault. However, it forgets to revert the condition for
    whether or not to group tasks together back to what it was before
    v3.19, even though the information is now preserved in the page tables
    once again.

    This patch brings the NUMA grouping heuristic back to what it was
    before commit 4d9424669946, which the changelogs of subsequent
    commits suggest worked best.

    We have all the information again. We should probably use it.

    Signed-off-by: Rik van Riel
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: aarcange@redhat.com
    Cc: linux-mm@kvack.org
    Cc: mgorman@suse.de
    Link: http://lkml.kernel.org/r/20160908213053.07c992a9@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

10 Sep, 2016

1 commit

    Attempting to dump /proc/<pid>/smaps for a process with pmd dax mappings
    currently results in the following VM_BUG_ONs:

    kernel BUG at mm/huge_memory.c:1105!
    task: ffff88045f16b140 task.stack: ffff88045be14000
    RIP: 0010: follow_trans_huge_pmd+0x2cb/0x340
    [..]
    Call Trace:
      smaps_pte_range+0xa0/0x4b0
      ? vsnprintf+0x255/0x4c0
      __walk_page_range+0x1fe/0x4d0
      walk_page_vma+0x62/0x80
      show_smap+0xa6/0x2b0

    kernel BUG at fs/proc/task_mmu.c:585!
    RIP: 0010: smaps_pte_range+0x499/0x4b0
    Call Trace:
      ? vsnprintf+0x255/0x4c0
      __walk_page_range+0x1fe/0x4d0
      walk_page_vma+0x62/0x80
      show_smap+0xa6/0x2b0

    These locations are sanity checking page flags that must be set for an
    anonymous transparent huge page, but are not set for the zone_device
    pages associated with dax mappings.

    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Acked-by: Andrew Morton
    Signed-off-by: Dan Williams

    Dan Williams
     

27 Aug, 2016

1 commit

  • While adding proper userfaultfd_wp support with bits in pagetable and
    swap entry to avoid false positives WP userfaults through swap/fork/
    KSM/etc, I've been adding a framework that mostly mirrors soft dirty.

    So I noticed one place where I had to add uffd_wp support to the
    pagetables that wasn't covered by soft_dirty, and I think it should have
    been.

    Example: in the THP migration code migrate_misplaced_transhuge_page()
    pmd_mkdirty is called unconditionally after mk_huge_pmd.

    entry = mk_huge_pmd(new_page, vma->vm_page_prot);
    entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);

    That sets soft dirty too (it's a false positive for soft dirty; the soft
    dirty bit could be more fine-grained and transfer the bit like uffd_wp
    will do: pmd/pte_uffd_wp() enforces the invariant that when it's set,
    pmd/pte_write is not set).

    However, in the THP split there's no unconditional pmd_mkdirty after
    mk_huge_pmd, and pte_swp_mksoft_dirty isn't called after the migration
    entry is created. The code sets the dirty bit in the struct page
    instead of setting it in the pagetable (which is fully equivalent as far
    as the real dirty bit is concerned, as the whole point of pagetable bits
    is to be eventually flushed out to the page, but that is not equivalent
    for the soft-dirty bit, which gets lost in translation).

    This was found by code review only and totally untested as I'm working
    to actually replace soft dirty and I don't have time to test potential
    soft dirty bugfixes as well :).

    Transfer the soft_dirty from pmd to pte during THP splits.
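
    A simplified sketch of the transfer in __split_huge_pmd_locked() (the
    surrounding loop and the other pte bits are omitted):

    bool soft_dirty = pmd_soft_dirty(old_pmd);
    ...
    /* for each small pte (or migration entry) created from the huge pmd */
    if (soft_dirty)
            entry = freeze ? pte_swp_mksoft_dirty(entry)    /* migration entry */
                           : pte_mksoft_dirty(entry);       /* present pte */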

    This fix avoids losing the soft_dirty bit and avoids userland memory
    corruption in the checkpoint.

    Fixes: eef1b3ba053aa6 ("thp: implement split_huge_pmd()")
    Link: http://lkml.kernel.org/r/1471610515-30229-2-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

29 Jul, 2016

5 commits

  • After the previous patch, we can distinguish costly allocations that
    should be really lightweight, such as THP page faults, with
    __GFP_NORETRY. This means we don't need to recognize khugepaged
    allocations via PF_KTHREAD anymore. We can also change THP page faults
    in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
    khugepaged, as the process has indicated that it benefits from THPs and
    is willing to pay some initial latency costs.

    We can also make the flags handling less cryptic by distinguishing
    GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
    GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
    __GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.

    The patch effectively changes the current GFP_TRANSHUGE users as
    follows:

    * get_huge_zero_page() - the zero page lifetime should be relatively
    long and it's shared by multiple users, so it's worth spending some
    effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
    This also restores direct reclaim to this allocation, which was
    unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
    by default to madvise and add a stall-free defrag option")

    * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
    is not an issue. So if khugepaged "defrag" is enabled (the default), do
    reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
    PF_KTHREAD check from page alloc.

    As a side-effect, khugepaged will now no longer check if the initial
    compaction was deferred or contended. This is OK, as khugepaged sleep
    times between collapse attempts are long enough to prevent noticeable
    disruption, so we should allow it to spend some effort.

    * migrate_misplaced_transhuge_page() - already was masking out
    __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
    equivalent.

    * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
    are now allocating without __GFP_NORETRY. Other vma's keep using
    __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
    it's allowed only for madvised vma's). The rest is conversion to
    GFP_TRANSHUGE(_LIGHT).
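
    Putting those rules together, the gfp selection described above looks
    roughly like this (a paraphrase of the changelog, with made-up condition
    names, not the actual code):

    gfp_t gfp;

    if (khugepaged || vma_madvised)
            gfp = GFP_TRANSHUGE;                    /* direct reclaim, no __GFP_NORETRY */
    else if (defrag_allowed_for_this_vma)
            gfp = GFP_TRANSHUGE | __GFP_NORETRY;    /* try once, back off quickly */
    else
            gfp = GFP_TRANSHUGE_LIGHT;              /* page-fault default: no reclaim */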

    [mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
    Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    The definition of the return value of madvise_free_huge_pmd() was not
    clear before. Following Minchan Kim's suggestion, change the return type
    to bool and return true if we do MADV_FREE successfully on the entire
    pmd page; otherwise return false. Comments are added too.

    Link: http://lkml.kernel.org/r/1467135452-16688-2-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Minchan Kim
    Cc: "Kirill A. Shutemov"
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Dan Williams
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    There are now a number of accounting oddities, such as mapped file pages
    being accounted for on the node while the total number of file pages is
    accounted on the zone. This can be coped with to some extent but it's
    confusing, so this patch moves the relevant file-based accounting. Due
    to throttling logic in the page allocator for reliable OOM detection, it
    is still necessary to track dirty and writeback pages on a per-zone
    basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    the node level. Most reclaim logic is based on the node counters, but
    the retry logic uses the zone counters, which do not distinguish
    inactive and active sizes. It would be possible to leave the LRU
    counters on a per-zone basis, but it's a heavier calculation across
    multiple cache lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch, but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but use per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later, but it is easier to review this way.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU then there are two potential solutions

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Node-based reclaim requires node-based LRUs and locking. This is a
    preparation patch that just moves the lru_lock to the node so later
    patches are easier to review. It is a mechanical change but note this
    patch makes contention worse because the LRU lock is hotter and direct
    reclaim and kswapd can contend on the same lock even when reclaiming
    from different zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

22 commits

  • To make the comments consistent with the already changed code.

    Link: http://lkml.kernel.org/r/1466200004-6196-1-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • For file mappings, we don't deposit page tables on THP allocation
    because it's not strictly required to implement split_huge_pmd(): we can
    just clear pmd and let following page faults to reconstruct the page
    table.

    But Power makes use of the deposited page table to address an MMU quirk.

    Let's hide the THP page cache, including huge tmpfs, under a separate
    config option, so it can be forbidden on Power.

    We can revert the patch later once a solution for Power is found.

    Link: http://lkml.kernel.org/r/1466021202-61880-36-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    The khugepaged implementation grew to the point where it deserves a
    separate file in the source tree.

    Let's move it to mm/khugepaged.c.

    Link: http://lkml.kernel.org/r/1466021202-61880-32-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Let's wire up existing madvise() hugepage hints for file mappings.

    MADV_HUGEPAGE advises shmem to allocate a huge page on page fault in the
    VMA. It only has effect if the filesystem is mounted with huge=advise
    or huge=within_size.

    MADV_NOHUGEPAGE prevents a huge page from being allocated on page fault
    in the VMA. It doesn't prevent a huge page from being allocated by other
    means, e.g. a page fault into a different mapping or a write(2) into the
    file.
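
    A small self-contained example of the hint on a shmem-backed (shared
    anonymous) mapping; whether a huge page is actually used still depends
    on the huge= mount option or the shmem_enabled sysfs knob described in
    the entries below:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 4UL * 1024 * 1024;
            /* shared anonymous memory is backed by shmem */
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    return 1;
            if (madvise(p, len, MADV_HUGEPAGE))     /* hint: use huge pages here */
                    perror("madvise");
            memset(p, 1, len);                      /* fault the range in */
            /* ShmemPmdMapped in /proc/meminfo shows whether a huge page got mapped */
            getchar();
            return 0;
    }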

    Link: http://lkml.kernel.org/r/1466021202-61880-31-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Here's a basic implementation of huge page support for shmem/tmpfs.

    It's all pretty straightforward:

    - shmem_getpage() allocates a huge page if it can and tries to insert it
    into the radix tree with shmem_add_to_page_cache();

    - shmem_add_to_page_cache() puts the page onto the radix tree if there's
    space for it;

    - shmem_undo_range() removes huge pages if they are fully within the
    range. A partial truncate of a huge page zeroes out that part of the THP.

    This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
    behaviour. As we don't really create a hole in this case,
    lseek(SEEK_HOLE) may have inconsistent results depending on which
    pages happened to be allocated.

    - no need to change shmem_fault(): core mm will map a compound page as
    huge if the VMA is suitable;

    Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    This patch adds a new mount option "huge=". It can have the following
    values:

    - "always":
    Attempt to allocate huge pages every time we need a new page;

    - "never":
    Do not allocate huge pages;

    - "within_size":
    Only allocate huge page if it will be fully within i_size.
    Also respect fadvise()/madvise() hints;

    - "advise:
    Only allocate huge pages if requested with fadvise()/madvise();

    Default is "never" for now.

    "mount -o remount,huge= /mountpoint" works fine after mount: remounting
    huge=never will not attempt to break up huge pages at all, just stop
    more from being allocated.

    No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE, which
    is the appropriate option to protect those who don't want the new bloat,
    and with which we shall share some pmd code.

    Prohibit the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is
    invalid without CONFIG_NUMA (was hidden in mpol_parse_str(): make it
    explicit).

    Allow enabling THP only if the machine has_transparent_hugepage().

    But what about Shmem with no user-visible mount? SysV SHM, memfds,
    shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM
    objects, Ashmem. Though unlikely to suit all usages, provide sysfs knob
    /sys/kernel/mm/transparent_hugepage/shmem_enabled to experiment with
    huge on those.

    And allow shmem_enabled two further values:

    - "deny":
    For use in emergencies, to force the huge option off from
    all mounts;
    - "force":
    Force the huge option on for all - very useful for testing;

    Based on patch by Hugh Dickins.

    Link: http://lkml.kernel.org/r/1466021202-61880-28-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Let's add ShmemHugePages and ShmemPmdMapped fields into meminfo and
    smaps. It indicates how many times we allocate and map shmem THP.

    NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS.

    Link: http://lkml.kernel.org/r/1466021202-61880-27-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • As with anon THP, we only mlock file huge pages if we can prove that the
    page is not mapped with PTE. This way we can avoid mlock leak into
    non-mlocked vma on split.

    We rely on PageDoubleMap() under lock_page() to check whether the page
    may be PTE mapped. PG_double_map is set by page_add_file_rmap() when
    the page is mapped with PTEs.

    Link: http://lkml.kernel.org/r/1466021202-61880-21-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Basic scheme is the same as for anon THP.

    Main differences:

    - File pages are on radix-tree, so we have head->_count offset by
    HPAGE_PMD_NR. The count got distributed to small pages during split.

    - mapping->tree_lock prevents non-lockless access to pages under split
    over radix-tree;

    - Lockless access is prevented by setting the head->_count to 0 during
    split;

    - After split, some pages can be beyond i_size. We drop them from
    radix-tree.

    - We don't set up migration entries. Just unmap the pages. This helps
    handle the case when i_size is in the middle of the page: no need to
    handle unmapping of pages beyond i_size manually.

    Link: http://lkml.kernel.org/r/1466021202-61880-20-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    change_huge_pmd() has an assert which is not relevant for file pages.
    For a shared mapping it's perfectly fine to have the page table entry
    writable, without an explicit mkwrite.

    Link: http://lkml.kernel.org/r/1466021202-61880-18-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • copy_page_range() has a check for "Don't copy ptes where a page fault
    will fill them correctly." It works on VMA level. We still copy all
    page table entries from private mappings, even if they map page cache.

    We can simplify copy_huge_pmd() a bit by skipping file PMDs.

    We don't map file private pages with PMDs, so they can only map page
    cache. It's safe to skip them as they can be re-faulted later.

    Link: http://lkml.kernel.org/r/1466021202-61880-17-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Splitting a THP PMD is simple: just unmap it as in the DAX case. This
    way we can avoid the memory overhead of a deposited page table
    allocation.

    It's probably a good idea to try to allocate a page table with
    GFP_ATOMIC in __split_huge_pmd_locked() to avoid refaulting the area,
    but clearing the pmd should be good enough for now.

    Unlike DAX, we also remove the page from rmap and drop the reference.
    pmd_young() is transferred to PageReferenced().

    Link: http://lkml.kernel.org/r/1466021202-61880-15-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    split_huge_pmd() for file mappings (and DAX too) is implemented by just
    clearing the pmd entry, as we can re-fill this area from page cache on
    the pte level later.

    This means we don't need to deposit page tables when file THP is mapped.
    Therefore we shouldn't try to withdraw a page table on zap_huge_pmd() of
    a file THP PMD.

    Link: http://lkml.kernel.org/r/1466021202-61880-14-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    With postponed page table allocation we have a chance to set up huge
    pages. do_set_pte() calls do_set_pmd() if the following criteria are met
    (see the sketch below):

    - the page is compound;
    - the pmd entry is pmd_none();
    - the vma has suitable size and alignment;
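
    A sketch of the gate (simplified; `fe' is the fault context structure
    mentioned two entries below, and the size/alignment check actually lives
    inside do_set_pmd()):

    /* in the pte-setting path of the fault handler */
    if (pmd_none(*fe->pmd) && PageTransCompound(page)) {
            int ret = do_set_pmd(fe, page);         /* try to map the whole THP */
            if (ret != VM_FAULT_FALLBACK)
                    return ret;
    }
    /* otherwise fall back to setting up a single 4k pte */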

    Link: http://lkml.kernel.org/r/1466021202-61880-12-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Naive approach: on mapping/unmapping the page as compound we update
    ->_mapcount on each 4k page. That's not efficient, but it's not obvious
    how we can optimize this. We can look into optimization later.

    The PG_double_map optimization doesn't work for file pages, since the
    lifecycle of file pages is different compared to anon pages: a file page
    can be mapped again at any time.

    Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The idea borrowed from Peter's patch from patchset on speculative page
    faults[1]:

    Instead of passing around the endless list of function arguments,
    replace the lot with a single structure so we can change context without
    endless function signature changes.

    The changes are mostly mechanical, with the exception of the faultaround
    code: filemap_map_pages() got reworked a bit.

    This patch is preparation for the next one.

    [1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org

    Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Vlastimil noted[1] that the pmd may no longer be valid after we drop
    mmap_sem. We need to recheck it once mmap_sem is taken again.

    [1] http://lkml.kernel.org/r/12918dcd-a695-c6f4-e06f-69141c5f357f@suse.cz

    Link: http://lkml.kernel.org/r/1466021202-61880-6-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    After creating the revalidate vma function, a locking inconsistency
    occurred due to directing the code path to the wrong label. This patch
    directs it to the correct label and fixes the inconsistency.

    Related commit that caused inconsistency:
    http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=da4360877094368f6dfe75bbe804b0f0a5d575b0

    Link: http://lkml.kernel.org/r/1464956884-4644-1-git-send-email-ebru.akagunduz@gmail.com
    Link: http://lkml.kernel.org/r/1466021202-61880-4-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Ebru Akagunduz
    Cc: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: Kirill A. Shutemov
    Cc: Stephen Rothwell
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
    Currently khugepaged does swapin readahead under down_write. This patch
    makes the swapin readahead happen under down_read instead of down_write.

    The patch was tested with a test program that allocates 800MB of memory,
    writes to it, and then sleeps. The system was forced to swap it all out.
    Afterwards, the test program touches the area by writing; it skips a
    page in every 20 pages of the area.

    [akpm@linux-foundation.org: update comment to match new code]
    [kirill.shutemov@linux.intel.com: passing 'vma' to hugepage_vma_revlidate() is useless]
    Link: http://lkml.kernel.org/r/20160530095058.GA53044@black.fi.intel.com
    Link: http://lkml.kernel.org/r/1466021202-61880-3-git-send-email-kirill.shutemov@linux.intel.com
    Link: http://lkml.kernel.org/r/1464335964-6510-4-git-send-email-ebru.akagunduz@gmail.com
    Link: http://lkml.kernel.org/r/1466021202-61880-2-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
    This patch adds swapin readahead to improve the THP collapse rate. When
    khugepaged scans pages, a few of them may be in the swap area.

    With the patch, khugepaged can collapse 4kB pages into a THP when there
    are up to max_ptes_swap swap ptes in a 2MB range.

    The patch was tested with a test program that allocates 400B of memory,
    writes to it, and then sleeps. I forced the system to swap it all out.
    Afterwards, the test program touches the area by writing; it skips a
    page in every 20 pages of the area.

    Without the patch, the system did no swapin readahead. The THP rate was
    65% of the program's memory and did not change over time.

    With this patch, after 10 minutes of waiting khugepaged had collapsed
    99% of the program's memory.

    [kirill.shutemov@linux.intel.com: trivial cleanup of exit path of the function]
    [kirill.shutemov@linux.intel.com: __collapse_huge_page_swapin(): drop unused 'pte' parameter]
    [kirill.shutemov@linux.intel.com: do not hold anon_vma lock during swap in]
    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Xie XiuQi
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
    Introduce a new sysfs integer knob
    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap, which
    bounds the optimistic swapin readahead check used to increase the THP
    collapse rate. Before bringing swapped-out pages back into memory,
    khugepaged checks them and allows up to this number. It also prints out
    the number of unmapped ptes using tracepoints.

    [vdavydov@parallels.com: fix scan not aborted on SCAN_EXCEED_SWAP_PTE]
    [sfr@canb.auug.org.au: build fix]
    Link: http://lkml.kernel.org/r/20160616154503.65806e12@canb.auug.org.au
    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Cc: Kirill A. Shutemov
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Xie XiuQi
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
    This allows an arch which needs to do special handling with respect to
    different page sizes when flushing the TLB to implement the same in the
    mmu gather.

    Link: http://lkml.kernel.org/r/1465049193-22197-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V