26 Mar, 2016

1 commit


20 Mar, 2016

1 commit

  • Pull powerpc updates from Michael Ellerman:
    "This was delayed a day or two by some build-breakage on old toolchains
    which we've now fixed.

    There are two PCI commits, both acked by Bjorn.

    There's one commit to mm/hugepage.c which is (co)authored by Kirill.

    Highlights:
    - Restructure Linux PTE on Book3S/64 to Radix format from Paul
    Mackerras
    - Book3s 64 MMU cleanup in preparation for Radix MMU from Aneesh
    Kumar K.V
    - Add POWER9 cputable entry from Michael Neuling
    - FPU/Altivec/VSX save/restore optimisations from Cyril Bur
    - Add support for new ftrace ABI on ppc64le from Torsten Duwe

    Various cleanups & minor fixes from:
    - Adam Buchbinder, Andrew Donnellan, Balbir Singh, Christophe Leroy,
    Cyril Bur, Luis Henriques, Madhavan Srinivasan, Pan Xinhui, Russell
    Currey, Sukadev Bhattiprolu, Suraj Jitindar Singh.

    General:
    - atomics: Allow architectures to define their own __atomic_op_*
    helpers from Boqun Feng
    - Implement atomic{, 64}_*_return_* variants and acquire/release/
    relaxed variants for (cmp)xchg from Boqun Feng
    - Add powernv_defconfig from Jeremy Kerr
    - Fix BUG_ON() reporting in real mode from Balbir Singh
    - Add xmon command to dump OPAL msglog from Andrew Donnellan
    - Add xmon command to dump process/task similar to ps(1) from Douglas
    Miller
    - Clean up memory hotplug failure paths from David Gibson

    pci/eeh:
    - Redesign SR-IOV on PowerNV to give absolute isolation between VFs
    from Wei Yang.
    - EEH Support for SRIOV VFs from Wei Yang and Gavin Shan.
    - PCI/IOV: Rename and export virtfn_{add, remove} from Wei Yang
    - PCI: Add pcibios_bus_add_device() weak function from Wei Yang
    - MAINTAINERS: Update EEH details and maintainership from Russell
    Currey

    cxl:
    - Support added to the CXL driver for running on both bare-metal and
    hypervisor systems, from Christophe Lombard and Frederic Barrat.
    - Ignore probes for virtual afu pci devices from Vaibhav Jain

    perf:
    - Export Power8 generic and cache events to sysfs from Sukadev
    Bhattiprolu
    - hv-24x7: Fix usage with chip events, display change in counter
    values, display domain indices in sysfs, eliminate domain suffix in
    event names, from Sukadev Bhattiprolu

    Freescale:
    - Updates from Scott: "Highlights include 8xx optimizations, 32-bit
    checksum optimizations, 86xx consolidation, e5500/e6500 cpu
    hotplug, more fman and other dt bits, and minor fixes/cleanup"

    * tag 'powerpc-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (179 commits)
    powerpc: Fix unrecoverable SLB miss during restore_math()
    powerpc/8xx: Fix do_mtspr_cpu6() build on older compilers
    powerpc/rcpm: Fix build break when SMP=n
    powerpc/book3e-64: Use hardcoded mttmr opcode
    powerpc/fsl/dts: Add "jedec,spi-nor" flash compatible
    powerpc/T104xRDB: add tdm riser card node to device tree
    powerpc32: PAGE_EXEC required for inittext
    powerpc/mpc85xx: Add pcsphy nodes to FManV3 device tree
    powerpc/mpc85xx: Add MDIO bus muxing support to the board device tree(s)
    powerpc/86xx: Introduce and use common dtsi
    powerpc/86xx: Update device tree
    powerpc/86xx: Move dts files to fsl directory
    powerpc/86xx: Switch to kconfig fragments approach
    powerpc/86xx: Update defconfigs
    powerpc/86xx: Consolidate common platform code
    powerpc32: Remove one insn in mulhdu
    powerpc32: small optimisation in flush_icache_range()
    powerpc: Simplify test in __dma_sync()
    powerpc32: move xxxxx_dcache_range() functions inline
    powerpc32: Remove clear_pages() and define clear_page() inline
    ...

    Linus Torvalds
     

18 Mar, 2016

7 commits

  • split_huge_pmd() tries to munlock the page with munlock_vma_page(). That
    requires the page to be locked.

    If the page is already locked by the caller, we would get a deadlock:

    Unable to find swap-space signature
    INFO: task trinity-c85:1907 blocked for more than 120 seconds.
    Not tainted 4.4.0-00032-gf19d0bdced41-dirty #1606
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    trinity-c85 D ffff88084d997608 0 1907 309 0x00000000
    Call Trace:
    schedule+0x9f/0x1c0
    schedule_timeout+0x48e/0x600
    io_schedule_timeout+0x1c3/0x390
    bit_wait_io+0x29/0xd0
    __wait_on_bit_lock+0x94/0x140
    __lock_page+0x1d4/0x280
    __split_huge_pmd+0x5a8/0x10f0
    split_huge_pmd_address+0x1d9/0x230
    try_to_unmap_one+0x540/0xc70
    rmap_walk_anon+0x284/0x810
    rmap_walk_locked+0x11e/0x190
    try_to_unmap+0x1b1/0x4b0
    split_huge_page_to_list+0x49d/0x18a0
    follow_page_mask+0xa36/0xea0
    SyS_move_pages+0xaf3/0x1570
    entry_SYSCALL_64_fastpath+0x12/0x6b
    2 locks held by trinity-c85/1907:
    #0: (&mm->mmap_sem){++++++}, at: SyS_move_pages+0x933/0x1570
    #1: (&anon_vma->rwsem){++++..}, at: split_huge_page_to_list+0x402/0x18a0

    I don't think the deadlock is triggerable without the split_huge_page()
    simplification patchset.

    But munlock_vma_page() here is wrong: we want to munlock the page
    unconditionally, with no need for the rmap lookup that munlock_vma_page()
    does.

    Let's use clear_page_mlock() instead. It can be called under ptl.

    Fixes: e90309c9f772 ("thp: allow mlocked THP again")
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The freeze_page() and unfreeze_page() helpers have evolved into rather
    complex beasts. It would be nice to cut the complexity of this code.

    This patch rewrites freeze_page() using standard try_to_unmap().
    unfreeze_page() is rewritten with remove_migration_ptes().

    The result is much simpler.

    But the new variant is somewhat slower for PTE-mapped THPs. The current
    helpers iterate over the VMAs the compound page is mapped into, and then
    over the PTEs within each VMA. The new helpers iterate over the small
    pages, then over the VMAs each small page is mapped into, and only then
    find the relevant PTE.

    We have a shortcut for PMD-mapped THPs: we install migration entries
    directly on PMD split.

    I don't think the slowdown is critical, considering how much simpler the
    result is and that split_huge_page() is quite rare nowadays. It only
    happens due to memory pressure or migration.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Add support for two ttu_flags:

    - TTU_SPLIT_HUGE_PMD splits the PMD, if present, before trying to unmap
    the page;

    - TTU_RMAP_LOCKED indicates that the caller holds the relevant rmap lock;

    Also, change rwc->done to !page_mapcount() instead of !page_mapped().
    try_to_unmap() works on pte level, so we are really interested in the
    mappedness of this small page rather than of the compound page it's a
    part of.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Kernel style prefers a single string over split strings when the string is
    'user-visible' (a brief illustration follows this entry).

    Miscellanea:

    - Add a missing newline
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
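
    A generic userspace illustration of the style rule above (not code from
    the patch): adjacent C string literals are concatenated, so a message can
    be split across source lines without changing the output, but the split
    breaks the ability to grep the source for the full user-visible message.

    #include <stdio.h>

    int main(void)
    {
            /* Preferred: one searchable string, even if the line is long. */
            printf("loading configuration from /etc/example.conf failed, using defaults\n");

            /*
             * Discouraged for user-visible messages: same output, but
             * grep "example.conf failed" no longer finds the source line
             * that printed it.
             */
            printf("loading configuration from /etc/example.conf "
                   "failed, using defaults\n");

            return 0;
    }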
     
  • The success of CMA allocation largely depends on the success of
    migration, and a key factor in that is the page reference count. Until
    now, the page reference count has been manipulated by directly calling
    atomic functions, so we cannot track who manipulates it and where. That
    makes it hard to find the actual reason for a CMA allocation failure.
    CMA allocation should be guaranteed to succeed, so finding the offending
    place is really important.

    In this patch, call sites where the page reference count is manipulated
    are converted to the newly introduced wrapper functions. This is a
    preparation step for adding a tracepoint to each page reference
    manipulation function. With this facility, we can easily find the reason
    for a CMA allocation failure. There is no functional change in this
    patch.

    In addition, this patch also converts the reference read sites. It will
    help the second step, which renames page._count to something else and
    prevents later attempts to access it directly (suggested by Andrew).

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Sergey Senozhatsky
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • THP defrag is enabled by default to direct reclaim/compact but not wake
    kswapd in the event of a THP allocation failure. The problem is that
    THP allocation requests potentially enter reclaim/compaction. This
    potentially incurs a severe stall that is not guaranteed to be offset by
    reduced TLB misses. While there has been considerable effort to reduce
    the impact of reclaim/compaction, it is still a high cost and workloads
    that should fit in memory fail to do so. Specifically, a simple
    anon/file streaming workload will enter direct reclaim on NUMA at least
    even though the working set size is 80% of RAM. It's been years and
    it's time to throw in the towel.

    First, this patch defines THP defrag as follows:

    - madvise: A failed allocation will direct reclaim/compact if the
    application requests it
    - never: Neither reclaim/compact nor wake kswapd
    - defer: A failed allocation will wake kswapd/kcompactd
    - always: A failed allocation will direct reclaim/compact (historical
    behaviour)

    khugepaged defrag will enter direct reclaim but not wake kswapd.

    Next it sets the default defrag option to be "madvise", so that direct
    reclaim/compaction is entered only for applications that specifically
    requested it (a userspace sketch of that request follows this entry).

    Lastly, it removes a check from the page allocator slowpath that is
    related to __GFP_THISNODE to allow "defer" to work. The callers that
    really care are slub/slab, and they are updated accordingly. The slab
    one may be surprising because it also corrects a comment, as kswapd was
    never woken up by that path.

    This means that by default a THP fault will no longer stall for most
    applications, which is the ideal for most users: they get THPs if they
    are immediately available. There are still options for users who prefer
    a stall at startup of a new application, by either restoring historical
    behaviour with "always" or picking a half-way point with "defer", where
    kswapd does some of the work in the background and wakes kcompactd if
    necessary. THP defrag for khugepaged remains enabled and will enter
    direct reclaim but will not wake kswapd or kcompactd.

    After this patch a THP allocation failure will quickly fall back and rely
    on khugepaged to recover the situation at some time in the future. In
    some cases this will reduce THP usage, but the benefit of THP is hard to
    measure and not a universal win, whereas a stall in reclaim/compaction
    is definitely measurable and can be painful.

    The first test for this is using "usemem" to read a large file and write
    a large anonymous mapping (to avoid the zero page) multiple times. The
    total size of the mappings is 80% of RAM and the benchmark simply
    measures how long it takes to complete. It uses multiple threads to see
    if that is a factor. On UMA, the performance is almost identical so is
    not reported but on NUMA, we see this

    usemem
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Amean System-1 102.86 ( 0.00%) 46.81 ( 54.50%)
    Amean System-4 37.85 ( 0.00%) 34.02 ( 10.12%)
    Amean System-7 48.12 ( 0.00%) 46.89 ( 2.56%)
    Amean System-12 51.98 ( 0.00%) 56.96 ( -9.57%)
    Amean System-21 80.16 ( 0.00%) 79.05 ( 1.39%)
    Amean System-30 110.71 ( 0.00%) 107.17 ( 3.20%)
    Amean System-48 127.98 ( 0.00%) 124.83 ( 2.46%)
    Amean Elapsd-1 185.84 ( 0.00%) 105.51 ( 43.23%)
    Amean Elapsd-4 26.19 ( 0.00%) 25.58 ( 2.33%)
    Amean Elapsd-7 21.65 ( 0.00%) 21.62 ( 0.16%)
    Amean Elapsd-12 18.58 ( 0.00%) 17.94 ( 3.43%)
    Amean Elapsd-21 17.53 ( 0.00%) 16.60 ( 5.33%)
    Amean Elapsd-30 17.45 ( 0.00%) 17.13 ( 1.84%)
    Amean Elapsd-48 15.40 ( 0.00%) 15.27 ( 0.82%)

    For a single thread, the benchmark completes 43.23% faster with this
    patch applied, with smaller benefits as the thread count increases.
    Similarly, notice the large reduction in system CPU usage in most cases.
    The overall CPU time is

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    User 10357.65 10438.33
    System 3988.88 3543.94
    Elapsed 2203.01 1634.41

    Which is substantial. Now, the reclaim figures

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 128458477 278352931
    Major Faults 2174976 225
    Swap Ins 16904701 0
    Swap Outs 17359627 0
    Allocation stalls 43611 0
    DMA allocs 0 0
    DMA32 allocs 19832646 19448017
    Normal allocs 614488453 580941839
    Movable allocs 0 0
    Direct pages scanned 24163800 0
    Kswapd pages scanned 0 0
    Kswapd pages reclaimed 0 0
    Direct pages reclaimed 20691346 0
    Compaction stalls 42263 0
    Compaction success 938 0
    Compaction failures 41325 0

    This patch eliminates almost all swapping and direct reclaim activity.
    There is still overhead but it's from NUMA balancing which does not
    identify that it's pointless trying to do anything with this workload.

    I also tried the thpscale benchmark which forces a corner case where
    compaction can be used heavily and measures the latency of whether base
    or huge pages were used

    thpscale Fault Latencies
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Amean fault-base-1 5288.84 ( 0.00%) 2817.12 ( 46.73%)
    Amean fault-base-3 6365.53 ( 0.00%) 3499.11 ( 45.03%)
    Amean fault-base-5 6526.19 ( 0.00%) 4363.06 ( 33.15%)
    Amean fault-base-7 7142.25 ( 0.00%) 4858.08 ( 31.98%)
    Amean fault-base-12 13827.64 ( 0.00%) 10292.11 ( 25.57%)
    Amean fault-base-18 18235.07 ( 0.00%) 13788.84 ( 24.38%)
    Amean fault-base-24 21597.80 ( 0.00%) 24388.03 (-12.92%)
    Amean fault-base-30 26754.15 ( 0.00%) 19700.55 ( 26.36%)
    Amean fault-base-32 26784.94 ( 0.00%) 19513.57 ( 27.15%)
    Amean fault-huge-1 4223.96 ( 0.00%) 2178.57 ( 48.42%)
    Amean fault-huge-3 2194.77 ( 0.00%) 2149.74 ( 2.05%)
    Amean fault-huge-5 2569.60 ( 0.00%) 2346.95 ( 8.66%)
    Amean fault-huge-7 3612.69 ( 0.00%) 2997.70 ( 17.02%)
    Amean fault-huge-12 3301.75 ( 0.00%) 6727.02 (-103.74%)
    Amean fault-huge-18 6696.47 ( 0.00%) 6685.72 ( 0.16%)
    Amean fault-huge-24 8000.72 ( 0.00%) 9311.43 (-16.38%)
    Amean fault-huge-30 13305.55 ( 0.00%) 9750.45 ( 26.72%)
    Amean fault-huge-32 9981.71 ( 0.00%) 10316.06 ( -3.35%)

    The average time to fault pages is substantially reduced in the majority
    of cases, but with the obvious caveat that fewer THPs are actually used
    in this adverse workload

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Percentage huge-1 0.71 ( 0.00%) 14.04 (1865.22%)
    Percentage huge-3 10.77 ( 0.00%) 33.05 (206.85%)
    Percentage huge-5 60.39 ( 0.00%) 38.51 (-36.23%)
    Percentage huge-7 45.97 ( 0.00%) 34.57 (-24.79%)
    Percentage huge-12 68.12 ( 0.00%) 40.07 (-41.17%)
    Percentage huge-18 64.93 ( 0.00%) 47.82 (-26.35%)
    Percentage huge-24 62.69 ( 0.00%) 44.23 (-29.44%)
    Percentage huge-30 43.49 ( 0.00%) 55.38 ( 27.34%)
    Percentage huge-32 50.72 ( 0.00%) 51.90 ( 2.35%)

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 37429143 47564000
    Major Faults 1916 1558
    Swap Ins 1466 1079
    Swap Outs 2936863 149626
    Allocation stalls 62510 3
    DMA allocs 0 0
    DMA32 allocs 6566458 6401314
    Normal allocs 216361697 216538171
    Movable allocs 0 0
    Direct pages scanned 25977580 17998
    Kswapd pages scanned 0 3638931
    Kswapd pages reclaimed 0 207236
    Direct pages reclaimed 8833714 88
    Compaction stalls 103349 5
    Compaction success 270 4
    Compaction failures 103079 1

    Note again that while this does swap as it's an aggressive workload, the
    direct reclaim activity and allocation stalls are substantially reduced.
    There is some kswapd activity but ftrace showed that the kswapd activity
    was due to normal wakeups from 4K pages being allocated.
    Compaction-related stalls and activity are almost eliminated.

    I also tried the stutter benchmark. For this, I do not have figures for
    NUMA but it's something that does impact UMA so I'll report what is
    available

    stutter
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Min mmap 7.3571 ( 0.00%) 7.3438 ( 0.18%)
    1st-qrtle mmap 7.5278 ( 0.00%) 17.9200 (-138.05%)
    2nd-qrtle mmap 7.6818 ( 0.00%) 21.6055 (-181.25%)
    3rd-qrtle mmap 11.0889 ( 0.00%) 21.8881 (-97.39%)
    Max-90% mmap 27.8978 ( 0.00%) 22.1632 ( 20.56%)
    Max-93% mmap 28.3202 ( 0.00%) 22.3044 ( 21.24%)
    Max-95% mmap 28.5600 ( 0.00%) 22.4580 ( 21.37%)
    Max-99% mmap 29.6032 ( 0.00%) 25.5216 ( 13.79%)
    Max mmap 4109.7289 ( 0.00%) 4813.9832 (-17.14%)
    Mean mmap 12.4474 ( 0.00%) 19.3027 (-55.07%)

    This benchmark is trying to fault an anonymous mapping while there is a
    heavy IO load -- a scenario that desktop users used to complain about
    frequently. This shows a mix because the ideal case of mapping with THP
    is not hit as often. However, note that 99% of the mappings complete
    13.79% faster. The CPU usage here is particularly interesting

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    User 67.50 0.99
    System 1327.88 91.30
    Elapsed 2079.00 2128.98

    And once again we look at the reclaim figures

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 335241922 1314582827
    Major Faults 715 819
    Swap Ins 0 0
    Swap Outs 0 0
    Allocation stalls 532723 0
    DMA allocs 0 0
    DMA32 allocs 1822364341 1177950222
    Normal allocs 1815640808 1517844854
    Movable allocs 0 0
    Direct pages scanned 21892772 0
    Kswapd pages scanned 20015890 41879484
    Kswapd pages reclaimed 19961986 41822072
    Direct pages reclaimed 21892741 0
    Compaction stalls 1065755 0
    Compaction success 514 0
    Compaction failures 1065241 0

    Allocation stalls and all direct reclaim activity are eliminated, as are
    compaction-related stalls.

    THP gives impressive gains in some cases, but only if the huge pages are
    quickly available. We're not going to reach the point where they are
    completely free, so let's finally take the costs out of the fast paths
    and defer the cost to kswapd, kcompactd and khugepaged, where it belongs.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
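
    A minimal userspace sketch of the "madvise" request referenced above; an
    illustration, not code from the patch. The program maps an anonymous
    region, asks for huge pages with madvise(MADV_HUGEPAGE) and touches the
    memory; with defrag set to "madvise", only ranges marked like this may
    enter direct reclaim/compaction on a THP allocation failure. The 4MB size
    is arbitrary.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define SZ (4UL << 20)  /* 4MB: room for at least one PMD-sized block */

    int main(void)
    {
            char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            /* Opt this range into THP (and, with defrag=madvise, defrag). */
            if (madvise(p, SZ, MADV_HUGEPAGE))
                    perror("madvise(MADV_HUGEPAGE)");

            /* Touch every page so huge pages can actually be faulted in. */
            for (size_t i = 0; i < SZ; i += 4096)
                    p[i] = 1;

            /* The system-wide policy lives in
             * /sys/kernel/mm/transparent_hugepage/defrag. */
            munmap(p, SZ);
            return 0;
    }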
     
  • Count how many times we put a THP on the split queue. Currently, it
    happens on partial unmap of a THP.

    A rapidly growing value can indicate that an application behaves
    unfriendly with regard to THP: it often faults in a huge page and then
    unmaps part of it. This leads to unnecessary memory fragmentation and
    the application may require tuning.

    The event also can help with debugging kernel [mis-]behaviour (a sketch
    of reading the counter follows this entry).

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
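
    A minimal sketch of observing the new event from userspace; the counter
    is exported in /proc/vmstat and its name is assumed here to be
    "thp_deferred_split_page". The program simply dumps all THP-related
    counters.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            FILE *f = fopen("/proc/vmstat", "r");
            char line[128];

            if (!f) {
                    perror("/proc/vmstat");
                    return 1;
            }
            /* Print every THP counter, including the deferred-split one. */
            while (fgets(line, sizeof(line), f))
                    if (strncmp(line, "thp_", 4) == 0)
                            fputs(line, stdout);
            fclose(f);
            return 0;
    }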
     

16 Mar, 2016

1 commit

  • After one of the bugfixes to freeze_page(), we no longer have frozen
    pages in rmap, therefore the mapcount of all subpages of a frozen THP is
    zero. And we have an assert for that.

    Let's drop the code which deals with non-zero mapcount of subpages.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

03 Mar, 2016

1 commit

  • With the next generation POWER processor, we have a new MMU model [1]
    that requires us to maintain a different Linux page table format.

    In order to support both current and future ppc64 systems with a single
    kernel, we need to make sure the kernel can select between the different
    page table formats at runtime. With the new MMU (radix MMU) added, we
    will have two different pmd hugepage sizes: 16MB for the hash model and
    2MB for the radix model. Hence make the HPAGE_PMD related values
    variables.

    The actual conversion of HPAGE_PMD to a variable for ppc64 happens in a
    follow-up patch.

    [1] http://ibm.biz/power-isa3 (Needs registration).

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman

    Kirill A. Shutemov
     

25 Feb, 2016

2 commits

  • Pull in our current fixes from 4.5; in particular, the "Fix Multi hit
    ERAT" bug is causing folks some grief when testing next.

    Michael Ellerman
     
  • Sebastian Ott and Gerald Schaefer reported random crashes on s390.
    It was bisected to my THP refcounting patchset.

    The problem is that pmdp_invalidate() is called with the wrong virtual
    address: it gets offset up by HPAGE_PMD_SIZE by the loop over the ptes.

    The solution is to introduce a new variable to be used in the loop and
    not touch 'haddr'.

    Signed-off-by: Kirill A. Shutemov
    Reported-and-tested-by: Gerald Schaefer
    Reported-and-tested-by: Sebastian Ott
    Reviewed-by: Will Deacon
    Cc: Christian Borntraeger
    Cc: Martin Schwidefsky
    Cc: Aneesh Kumar K.V
    Cc: Andrea Arcangeli
    Cc: Sasha Levin
    Cc: Jerome Marchand
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Feb, 2016

1 commit

  • Pull powerpc fixes from Michael Ellerman:
    - Fix build error on 32-bit with checkpoint restart from Aneesh Kumar
    - Fix dedotify for binutils >= 2.26 from Andreas Schwab
    - Don't trace hcalls on offline CPUs from Denis Kirjanov
    - eeh: Fix stale cached primary bus from Gavin Shan
    - eeh: Fix stale PE primary bus from Gavin Shan
    - mm: Fix Multi hit ERAT cause by recent THP update from Aneesh Kumar K.V
    - ioda: Set "read" permission when "write" is set from Alexey Kardashevskiy

    * tag 'powerpc-4.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc/ioda: Set "read" permission when "write" is set
    powerpc/mm: Fix Multi hit ERAT cause by recent THP update
    powerpc/powernv: Fix stale PE primary bus
    powerpc/eeh: Fix stale cached primary bus
    powerpc/pseries: Don't trace hcalls on offline CPUs
    powerpc: Fix dedotify for binutils >= 2.26
    powerpc/book3s_32: Fix build error with checkpoint restart

    Linus Torvalds
     

19 Feb, 2016

1 commit


15 Feb, 2016

1 commit

  • With ppc64 we use the deposited pgtable_t to store the hash pte slot
    information. We should not withdraw the deposited pgtable_t without
    marking the pmd none. This ensures that the low level hash fault handling
    will skip this huge pte and we will handle it at upper levels.

    Recent change to pmd splitting changed the above in order to handle the
    race between pmd split and exit_mmap. The race is explained below.

    Consider following race:

    CPU0                                    CPU1
    shrink_page_list()
      add_to_swap()
        split_huge_page_to_list()
          __split_huge_pmd_locked()
            pmdp_huge_clear_flush_notify()
            // pmd_none() == true
                                            exit_mmap()
                                              unmap_vmas()
                                                zap_pmd_range()
                                                // no action on pmd since
                                                // pmd_none() == true
            pmd_populate()

    As a result the THP will not be freed. The leak is detected by check_mm():

    BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512

    The above required us to not mark pmd none during a pmd split.

    The fix for ppc is to clear the huge pte of _PAGE_USER, so that the low
    level fault handling code skips this pte. At the higher level we do take
    the ptl lock. That should serialize us against the pmd split. Once the
    lock is acquired we do check the pmd again using pmd_same. That should
    always return false for us and hence we should retry the access. We do
    the pmd_same check in all cases after taking the ptl with
    THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
    huge_pmd_set_accessed).

    Also make sure we wait for the irq disable section in other cpus to
    finish before flipping a huge pte entry with a regular pmd entry. Code
    paths like find_linux_pte_or_hugepte depend on irq disable to get
    a stable pte_t pointer. A parallel thp split needs to make sure we
    don't convert a pmd pte to a regular pmd entry without waiting for the
    irq disable section to finish.

    Fixes: eef1b3ba053a ("thp: implement split_huge_pmd()")
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman

    Aneesh Kumar K.V
     

06 Feb, 2016

1 commit

  • We need to iterate over split_queue, not the local empty list, to get
    anything split from the shrinker.

    Fixes: e3ae19535c66 ("thp: limit number of object to scan on deferred_split_scan()")
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

04 Feb, 2016

4 commits

  • We allocate a pgtable but do not attach it to anything if the PMD is in
    a DAX VMA, causing it to leak.

    We certainly try to not free pgtables associated with the huge zero page
    if the zero page is in a DAX VMA, so I think this is the right solution.
    This needs to be properly audited.

    Signed-off-by: Matthew Wilcox
    Cc: Dan Williams
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • If we have a lot of pages in the queue to be split, deferred_split_scan()
    can spend an unreasonable amount of time under the spinlock with
    interrupts disabled.

    Let's cap the number of pages to split per scan by sc->nr_to_scan.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • I got the meaning of shrinker::count_objects() wrong: it should return
    the number of potentially freeable objects, which does not necessarily
    correlate with freeable memory.

    Returning 256 per THP in the queue is not reasonable:
    shrinker::scan_objects() is never called with nr_to_scan > 128 in my
    setup.

    Let's return 1 per THP and correct scan_objects() accordingly.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Andrea Arcangeli suggested making the split queue per-node to improve
    scalability. Let's do it.

    Signed-off-by: Kirill A. Shutemov
    Suggested-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

22 Jan, 2016

2 commits

  • This crash is caused by a NULL pointer dereference in the page_to_pfn()
    macro, when page == NULL:

    Unable to handle kernel NULL pointer dereference at virtual address 00000000
    Internal error: Oops: 94000006 [#1] SMP
    Modules linked in:
    CPU: 1 PID: 26 Comm: khugepaged Tainted: G W 4.3.0-rc6-next-20151022ajb-00001-g32f3386-dirty #3
    PC is at khugepaged+0x378/0x1af8
    LR is at khugepaged+0x418/0x1af8
    Process khugepaged (pid: 26, stack limit = 0xffffffc079638020)
    Call trace:
    khugepaged+0x378/0x1af8
    kthread+0xdc/0xf4
    ret_from_fork+0xc/0x40
    Code: 35001700 f0002c60 aa0703e3 f9009fa0 (f94000e0)
    ---[ end trace 637503d8e28ae69e ]---
    Kernel panic - not syncing: Fatal exception
    CPU2: stopping
    CPU: 2 PID: 0 Comm: swapper/2 Tainted: G D W 4.3.0-rc6-next-20151022ajb-00001-g32f3386-dirty #3
    Hardware name: linux,dummy-virt (DT)

    [akpm@linux-foundation.org: fix fat-fingered merge resolution]
    Signed-off-by: yalin wang
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Acked-by: David Rientjes
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yalin wang
     
  • After THP refcounting rework we have only two possible return values
    from pmd_trans_huge_lock(): success and failure. Return-by-pointer for
    ptl doesn't make much sense in this case.

    Let's convert pmd_trans_huge_lock() to return ptl on success and NULL on
    failure.

    Signed-off-by: Kirill A. Shutemov
    Suggested-by: Linus Torvalds
    Cc: Minchan Kim
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Jan, 2016

2 commits

  • split_queue_lock can be taken from interrupt context in some cases, but
    I forgot to convert locking in split_huge_page() to interrupt-safe
    primitives.

    Let's fix this.

    lockdep output:

    ======================================================
    [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
    4.4.0+ #259 Tainted: G W
    ------------------------------------------------------
    syz-executor/18183 [HC0[0]:SC0[2]:HE0:SE0] is trying to acquire:
    (split_queue_lock){+.+...}, at: free_transhuge_page+0x24/0x90 mm/huge_memory.c:3436

    and this task is already holding:
    (slock-AF_INET){+.-...}, at: spin_lock_bh include/linux/spinlock.h:307
    (slock-AF_INET){+.-...}, at: lock_sock_fast+0x45/0x120 net/core/sock.c:2462
    which would create a new lock dependency:
    (slock-AF_INET){+.-...} -> (split_queue_lock){+.+...}

    but this new dependency connects a SOFTIRQ-irq-safe lock:
    (slock-AF_INET){+.-...}
    ... which became SOFTIRQ-irq-safe at:
    mark_irqflags kernel/locking/lockdep.c:2799
    __lock_acquire+0xfd8/0x4700 kernel/locking/lockdep.c:3162
    lock_acquire+0x1dc/0x430 kernel/locking/lockdep.c:3585
    __raw_spin_lock include/linux/spinlock_api_smp.h:144
    _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:302
    udp_queue_rcv_skb+0x781/0x1550 net/ipv4/udp.c:1680
    flush_stack+0x50/0x330 net/ipv6/udp.c:799
    __udp4_lib_mcast_deliver+0x694/0x7f0 net/ipv4/udp.c:1798
    __udp4_lib_rcv+0x17dc/0x23e0 net/ipv4/udp.c:1888
    udp_rcv+0x21/0x30 net/ipv4/udp.c:2108
    ip_local_deliver_finish+0x2b3/0xa50 net/ipv4/ip_input.c:216
    NF_HOOK_THRESH include/linux/netfilter.h:226
    NF_HOOK include/linux/netfilter.h:249
    ip_local_deliver+0x1c4/0x2f0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:498
    ip_rcv_finish+0x5ec/0x1730 net/ipv4/ip_input.c:365
    NF_HOOK_THRESH include/linux/netfilter.h:226
    NF_HOOK include/linux/netfilter.h:249
    ip_rcv+0x963/0x1080 net/ipv4/ip_input.c:455
    __netif_receive_skb_core+0x1620/0x2f80 net/core/dev.c:4154
    __netif_receive_skb+0x2a/0x160 net/core/dev.c:4189
    netif_receive_skb_internal+0x1b5/0x390 net/core/dev.c:4217
    napi_skb_finish net/core/dev.c:4542
    napi_gro_receive+0x2bd/0x3c0 net/core/dev.c:4572
    e1000_clean_rx_irq+0x4e2/0x1100 drivers/net/ethernet/intel/e1000e/netdev.c:1038
    e1000_clean+0xa08/0x24a0 drivers/net/ethernet/intel/e1000/e1000_main.c:3819
    napi_poll net/core/dev.c:5074
    net_rx_action+0x7eb/0xdf0 net/core/dev.c:5139
    __do_softirq+0x26a/0x920 kernel/softirq.c:273
    invoke_softirq kernel/softirq.c:350
    irq_exit+0x18f/0x1d0 kernel/softirq.c:391
    exiting_irq ./arch/x86/include/asm/apic.h:659
    do_IRQ+0x86/0x1a0 arch/x86/kernel/irq.c:252
    ret_from_intr+0x0/0x20 arch/x86/entry/entry_64.S:520
    arch_safe_halt ./arch/x86/include/asm/paravirt.h:117
    default_idle+0x52/0x2e0 arch/x86/kernel/process.c:304
    arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:295
    default_idle_call+0x48/0xa0 kernel/sched/idle.c:92
    cpuidle_idle_call kernel/sched/idle.c:156
    cpu_idle_loop kernel/sched/idle.c:252
    cpu_startup_entry+0x554/0x710 kernel/sched/idle.c:300
    rest_init+0x192/0x1a0 init/main.c:412
    start_kernel+0x678/0x69e init/main.c:683
    x86_64_start_reservations+0x2a/0x2c arch/x86/kernel/head64.c:195
    x86_64_start_kernel+0x158/0x167 arch/x86/kernel/head64.c:184

    to a SOFTIRQ-irq-unsafe lock:
    (split_queue_lock){+.+...}
    which became SOFTIRQ-irq-unsafe at:
    mark_irqflags kernel/locking/lockdep.c:2817
    __lock_acquire+0x146e/0x4700 kernel/locking/lockdep.c:3162
    lock_acquire+0x1dc/0x430 kernel/locking/lockdep.c:3585
    __raw_spin_lock include/linux/spinlock_api_smp.h:144
    _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:302
    split_huge_page_to_list+0xcc0/0x1c50 mm/huge_memory.c:3399
    split_huge_page include/linux/huge_mm.h:99
    queue_pages_pte_range+0xa38/0xef0 mm/mempolicy.c:507
    walk_pmd_range mm/pagewalk.c:50
    walk_pud_range mm/pagewalk.c:90
    walk_pgd_range mm/pagewalk.c:116
    __walk_page_range+0x653/0xcd0 mm/pagewalk.c:204
    walk_page_range+0xfe/0x2b0 mm/pagewalk.c:281
    queue_pages_range+0xfb/0x130 mm/mempolicy.c:687
    migrate_to_node mm/mempolicy.c:1004
    do_migrate_pages+0x370/0x4e0 mm/mempolicy.c:1109
    SYSC_migrate_pages mm/mempolicy.c:1453
    SyS_migrate_pages+0x640/0x730 mm/mempolicy.c:1374
    entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185

    other info that might help us debug this:

    Possible interrupt unsafe locking scenario:

         CPU0                          CPU1
         ----                          ----
    lock(split_queue_lock);
                                  local_irq_disable();
                                  lock(slock-AF_INET);
                                  lock(split_queue_lock);

    lock(slock-AF_INET);

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Acked-by: David Rientjes
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • A newly added tracepoint in the hugepage code uses a variable in the
    error handling that is not initialized at that point:

    include/trace/events/huge_memory.h:81:230: error: 'isolated' may be used uninitialized in this function [-Werror=maybe-uninitialized]

    The result is relatively harmless, as the trace data will in rare
    cases contain incorrect data.

    This works around the problem by adding an explicit initialization (a
    generic illustration of the warning and the fix follows this entry).

    Signed-off-by: Arnd Bergmann
    Fixes: 7d2eba0557c1 ("mm: add tracepoint for scanning pages")
    Reviewed-by: Ebru Akagunduz
    Acked-by: David Rientjes
    Cc: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
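
    A generic illustration of this class of warning and of the fix, using a
    simplified stand-in for the real code (this is not the kernel function):
    a variable is only assigned on the success path but is still read on the
    error path, which gcc flags with -Wmaybe-uninitialized; giving it an
    explicit initial value silences the warning and makes the logged value
    well defined.

    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for work that can fail (e.g. isolating pages). */
    static int do_work(int arg)
    {
            return arg < 0 ? -1 : arg;
    }

    static int scan(int arg)
    {
            int isolated = 0;       /* the fix: explicit initialization */
            int ret;

            ret = do_work(arg);
            if (ret < 0)
                    goto out;       /* without "= 0" above, "isolated" would
                                       be read uninitialized on this path */
            isolated = ret;
    out:
            /* Plays the role of the tracepoint that records the value even
               when the function failed. */
            printf("arg=%d isolated=%d\n", arg, isolated);
            return ret;
    }

    int main(int argc, char **argv)
    {
            scan(argc > 1 ? atoi(argv[1]) : -1);
            return 0;
    }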
     

18 Jan, 2016

1 commit

  • Commit b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when
    MADV_FREE syscall is called") introduced this new function, but got the
    error handling for when pmd_trans_huge_lock() fails wrong. In the
    failure case, the lock has not been taken, and we should not unlock on
    the way out.

    Cc: Minchan Kim
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Jan, 2016

14 commits

  • A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
    has established a devm_memremap_pages() mapping, i.e. when the pfn_t
    returned from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when
    encountering _PAGE_DEVMAP during a page table walk we look up and pin a
    struct dev_pagemap instance to keep the result of pfn_to_page() valid
    until put_page().

    Signed-off-by: Dan Williams
    Tested-by: Logan Gunthorpe
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • A dax-huge-page mapping, while it uses some thp helpers, is ultimately
    not a transparent huge page. The distinction is especially important in
    the get_user_pages() path. pmd_devmap() is used to distinguish dax-pmds
    from pmd_huge() and pmd_trans_huge(), which have slightly different
    semantics.

    Explicitly mark the pmd_trans_huge() helpers that dax needs by adding
    pmd_devmap() checks.

    [kirill.shutemov@linux.intel.com: fix regression in handling mlocked pages in __split_huge_pmd()]
    Signed-off-by: Dan Williams
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Similar to the conversion of vm_insert_mixed(), use pfn_t in
    vmf_insert_pfn_pmd() to tag the resulting pte with _PAGE_DEVMAP when the
    pfn is backed by a devm_memremap_pages() mapping.

    Signed-off-by: Dan Williams
    Cc: Dave Hansen
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Prior to this change DAX PMD mappings that were made read-only were
    never able to be made writable again. This is because the code in
    insert_pfn_pmd() that calls pmd_mkdirty() and pmd_mkwrite() would skip
    these calls if the PMD already existed in the page table.

    Instead, if we are doing a write always mark the PMD entry as dirty and
    writeable. Without this code we can get into a condition where we mark
    the PMD as read-only, and then on a subsequent write fault we get into
    an infinite loop of PMD faults where we try unsuccessfully to make the
    PMD writeable.

    Signed-off-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Reported-by: Jeff Moyer
    Reported-by: Toshi Kani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Sasha Levin has reported a KASAN out-of-bounds bug [1]. It points to "if
    (!is_swap_pte(pte[i]))" in unfreeze_page_vma() as a problematic access.

    The cause is that split_huge_page() doesn't handle THP correctly if it's
    not aligned to a PMD boundary. It can happen after mremap().

    Test case (does not always trigger the bug):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define MB (1024UL*1024)
    #define SIZE (2*MB)
    #define BASE ((void *)0x400000000000)

    int main(void)
    {
            char *p;

            p = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
                     MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                     -1, 0);
            if (p == MAP_FAILED)
                    perror("mmap"), exit(1);
            p = mremap(BASE, SIZE, SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
                       BASE + SIZE + 8192);
            if (p == MAP_FAILED)
                    perror("mremap"), exit(1);
            system("echo 1 > /sys/kernel/debug/split_huge_pages");
            return 0;
    }

    The patch fixes the freeze and unfreeze paths to handle page table
    boundary crossing.

    It also makes the mapcount vs. count check in split_huge_page_to_list()
    stricter:
    - after freeze we don't expect any subpage to be mapped, as we remove
    them from rmap when setting up migration entries;
    - the count must be 1, meaning only the caller has a reference to the
    page.

    [1] https://gist.github.com/sashalevin/c67fbea55e7c0576972a

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We don't need to split a THP page when the MADV_FREE syscall is called
    if [start, len] is aligned with the THP size. The split could be done
    when the VM decides to free it in the reclaim path if memory pressure is
    heavy. With that, we could avoid an unnecessary THP split.

    For this feature, this patch changes the pte dirtiness marking logic of
    THP. Currently, splitting marks every pte of the page dirty
    unconditionally, which makes MADV_FREE void. So, instead, this patch
    propagates pmd dirtiness to all pages via PG_dirty and restores pte
    dirtiness from PG_dirty. With this, if the pmd is clean (i.e.
    MADV_FREEed) when the split happens (e.g. in shrink_page_list), all of
    the pages are clean too, so we can discard them. (A userspace sketch of
    the aligned MADV_FREE call follows this entry.)

    Signed-off-by: Minchan Kim
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc:
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Daniel Micay
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Jason Evans
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Cc: Mika Penttil
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Rik van Riel
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
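
    A userspace sketch of the aligned case described above; an illustration,
    not code from the patch. It faults in a PMD-sized (2MB) anonymous range
    and then marks exactly that THP-aligned range lazily freeable, so no
    split is needed at madvise() time; the page is only split (and its clean
    subpages discarded) if reclaim later needs the memory. MADV_FREE needs a
    kernel with this feature; the fallback define uses the value from the
    kernel UAPI headers.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_FREE
    #define MADV_FREE 8             /* from the kernel UAPI, for older libc */
    #endif

    #define MB  (1024UL * 1024)
    #define THP (2 * MB)

    int main(void)
    {
            /* Over-allocate so a 2MB-aligned, 2MB window can be carved out. */
            char *raw = mmap(NULL, 2 * THP, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            char *p;

            if (raw == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            p = (char *)(((uintptr_t)raw + THP - 1) & ~(THP - 1));

            madvise(p, THP, MADV_HUGEPAGE); /* ask for a THP here */
            memset(p, 1, THP);              /* fault it in */

            /* Aligned [start, len]: the kernel need not split the THP. */
            if (madvise(p, THP, MADV_FREE))
                    perror("madvise(MADV_FREE)");

            return 0;
    }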
     
  • During freeze_page(), we remove the page from rmap. It munlocks the
    page if it was mlocked. clear_page_mlock() uses the lru cache, which
    temporarily pins the page.

    Let's drain the lru cache before checking the page's count vs. mapcount.
    The change makes a mlocked page split on the first attempt, if it was
    not pinned by somebody else.

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Writing 1 into 'split_huge_pages' will try to find and split all huge
    pages in the system. This is useful for debugging (a minimal sketch of
    the write follows this entry).

    [akpm@linux-foundation.org: fix printk text, per Vlastimil]
    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Andrea Arcangeli
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
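
    The sketch referenced above: a minimal program (needs root and a mounted
    debugfs) that writes 1 into the new file to ask the kernel to split every
    huge page in the system.

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/sys/kernel/debug/split_huge_pages", "w");

            if (!f) {
                    perror("split_huge_pages");
                    return 1;
            }
            if (fprintf(f, "1") < 0)
                    perror("write");
            if (fclose(f))
                    perror("close");
            return 0;
    }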
     
  • Both page_referenced() and page_idle_clear_pte_refs_one() assume that
    THP can only be mapped with a PMD, so there's no reason to look at PTEs
    for PageTransHuge() pages. That's not true anymore: THP can be mapped
    with PTEs too.

    The patch removes the PageTransHuge() test from the functions and
    open-codes the page table check.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kirill A. Shutemov
    Cc: Vladimir Davydov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Sasha Levin
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Before THP refcounting rework, THP was not allowed to cross VMA
    boundary. So, if we have THP and we split it, PG_mlocked can be safely
    transferred to small pages.

    With the new THP refcounting and a naive approach to mlocking we can end
    up with this scenario:
    1. we have a mlocked THP, which belongs to one VM_LOCKED VMA.
    2. the process does munlock() on *part* of the THP:
    - the VMA is split into two, one of them VM_LOCKED;
    - the huge PMD is split into a PTE table;
    - the THP is still mlocked;
    3. split_huge_page():
    - it transfers PG_mlocked to *all* small pages regardless of whether
    they belong to any VM_LOCKED VMA.

    We probably could munlock() all small pages on split_huge_page(), but I
    think we have an accounting issue already at step two.

    Instead of forbidding mlocked pages altogether, we just avoid mlocking
    PTE-mapped THPs and munlock THPs on split_huge_pmd().

    This means PTE-mapped THPs will be on normal lru lists and will be split
    under memory pressure by vmscan. After the split vmscan will detect
    unevictable small pages and mlock them.

    With this approach we shouldn't hit the situation described above (a
    userspace illustration of the partial-munlock scenario follows this
    entry).

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
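
    A userspace illustration of the partial-munlock scenario above (not code
    from the patch): the whole THP-backed range is mlocked, then only the
    first megabyte is munlocked, which splits the VMA and any huge PMD on
    the boundary while the compound page stays intact. Running it may
    require root or a raised RLIMIT_MEMLOCK.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define MB (1024UL * 1024)
    #define SZ (4 * MB)

    int main(void)
    {
            char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            madvise(p, SZ, MADV_HUGEPAGE);  /* ask for THPs */
            memset(p, 1, SZ);               /* fault the range in */

            if (mlock(p, SZ)) {             /* step 1: the THP is mlocked */
                    perror("mlock");
                    return 1;
            }
            /* Step 2: munlock() only part of it; the VMA is split in two and
             * the huge PMD covering the boundary becomes a PTE table. */
            if (munlock(p, MB))
                    perror("munlock");

            return 0;
    }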
     
  • Currently we don't split a huge page on partial unmap. It's not an
    ideal situation. It can lead to memory overhead.

    Fortunately, we can detect partial unmap in page_remove_rmap(). But we
    cannot call split_huge_page() from there due to the locking context.

    It's also counterproductive to do it directly from the munmap() codepath:
    in many cases we will hit this from exit(2) and splitting the huge page
    just to free it up in small pages is not what we really want.

    The patch introduces deferred_split_huge_page(), which puts the huge page
    into a queue for splitting. The splitting itself will happen when we get
    memory pressure via the shrinker interface. The page will be dropped from
    the list on freeing, through the compound page destructor.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This patch adds an implementation of split_huge_page() for the new
    refcounting.

    Unlike the previous implementation, the new split_huge_page() can fail if
    somebody holds a GUP pin on the page. It also means that a pin on a page
    will prevent it from being split under you. It makes the situation in
    many places much cleaner.

    The basic scheme of split_huge_page():

    - Check that the sum of mapcounts of all subpages is equal to
    page_count() plus one (the caller's pin). Fail with -EBUSY otherwise.
    This way we can avoid useless PMD-splits.

    - Freeze the page counters by splitting all PMDs and setting up migration
    PTEs.

    - Re-check the sum of mapcounts against page_count(). The page's counts
    are stable now. -EBUSY if the page is pinned.

    - Split compound page.

    - Unfreeze the page by removing migration entries.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are going to use migration PTE entries to stabilize page counts. If
    the page is mapped with PMDs, we need to split the PMD and set up
    migration entries. It's reasonable to combine these operations to avoid
    double-scanning over the page table.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The original split_huge_page() combined two operations: splitting PMDs
    into tables of PTEs and splitting the underlying compound page. This
    patch implements split_huge_pmd(), which splits the given PMD without
    splitting other PMDs this page is mapped with, or the underlying compound
    page.

    Without tail page refcounting, the implementation of split_huge_pmd() is
    pretty straightforward.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov