13 Jun, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "This the bunch that sat in -next + lock_parent() fix. This is the
    minimal set; there's more pending stuff.

    In particular, I really hope to get acct.c fixes merged this cycle -
    we need that to deal sanely with delayed-mntput stuff. In the next
    pile, hopefully - that series is fairly short and localized
    (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
    iov_iter work. Most of the prereqs for ->splice_write with sane locking
    order are there and Kent's dio rewrite would also fit nicely on top of
    this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
    lock_parent: don't step on stale ->d_parent of all-but-freed one
    kill generic_file_splice_write()
    ceph: switch to iter_file_splice_write()
    shmem: switch to iter_file_splice_write()
    nfs: switch to iter_file_splice_write()
    fs/splice.c: remove unneeded exports
    ocfs2: switch to iter_file_splice_write()
    ->splice_write() via ->write_iter()
    bio_vec-backed iov_iter
    optimize copy_page_{to,from}_iter()
    bury generic_file_aio_{read,write}
    lustre: get rid of messing with iovecs
    ceph: switch to ->write_iter()
    ceph_sync_direct_write: stop poking into iov_iter guts
    ceph_sync_read: stop poking into iov_iter guts
    new helper: copy_page_from_iter()
    fuse: switch to ->write_iter()
    btrfs: switch to ->write_iter()
    ocfs2: switch to ->write_iter()
    xfs: switch to ->write_iter()
    ...

    Linus Torvalds
     

09 Jun, 2014

1 commit

  • shrink_inactive_list() used to wait 0.1s to avoid congestion when all
    the pages that were isolated from the inactive list were dirty but not
    under active writeback. That makes no real sense, and apparently causes
    major interactivity issues under some loads since 3.11.

    The ostensible reason for it was to wait for kswapd to start writing
    pages, but that seems questionable as well, since the congestion wait
    code seems to trigger for kswapd itself as well. Also, the logic behind
    delaying anything when we haven't actually started writeback is not
    clear - it only delays actually starting that writeback.

    We'll still trigger the congestion waiting if

    (a) the process is kswapd, and we hit pages flagged for immediate
    reclaim

    (b) the process is not kswapd, and the zone backing dev writeback is
    actually congested.

    This probably needs to be revisited, but as it is this fixes a reported
    regression.

    Reported-by: Felipe Contreras
    Pinpointed-by: Hillf Danton
    Cc: Michal Hocko
    Cc: Andrew Morton
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Jun, 2014

3 commits

  • printk is meant to be used with an associated log level. There are some
    instances of printk scattered around the mm code where the log level is
    missing. Add a log level and adhere to suggestions by
    scripts/checkpatch.pl by moving to the pr_* macros.

    Also add the typical pr_fmt definition so that print statements can be
    easily traced back to the modules where they occur, correlated one with
    another, etc. This will require the removal of some (now redundant)
    prefixes on a few print statements.
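
    For illustration, the sort of conversion involved might look like this
    (the call site and prefix are hypothetical, not taken from the patch):

        /* pr_fmt must be defined before printk.h is pulled in so that the
         * pr_* macros pick up the prefix */
        #define pr_fmt(fmt) "mm: " fmt
        #include <linux/printk.h>

        /* before: no log level, hand-rolled prefix */
        printk("mm: page allocation failure\n");

        /* after: explicit level via pr_*, prefix supplied by pr_fmt() */
        pr_warn("page allocation failure\n");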

    Signed-off-by: Mitchel Humpherys
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mitchel Humpherys
     
  • Memory reclaim always uses swappiness of the reclaim target memcg
    (origin of the memory pressure) or vm_swappiness for global memory
    reclaim. This behavior was consistent (except for difference between
    global and hard limit reclaim) because swappiness was enforced to be
    consistent within each memcg hierarchy.

    After "mm: memcontrol: remove hierarchy restrictions for swappiness and
    oom_control" each memcg can have its own swappiness independent of
    hierarchical parents, though, so the consistency guarantee is gone.
    This can lead to an unexpected behavior. Say that a group is explicitly
    configured to not swapout by memory.swappiness=0 but its memory gets
    swapped out anyway when the memory pressure comes from its parent with a
    different swappiness. It is also unexpected that the knob is meaningless without setting the
    hard limit which would trigger the reclaim and enforce the swappiness.
    There are setups where the hard limit is configured higher in the
    hierarchy by an administrator and children groups are under control of
    somebody else who is interested in the swapout behavior but not
    necessarily about the memory limit.

    From a semantic point of view, swappiness is an attribute defining the
    anon vs. file proportional scanning of the LRU, which is memcg specific
    (unlike charges, which are propagated up the hierarchy), so it should be
    applied to the particular memcg's LRU regardless of where the memory
    pressure comes from.

    This patch removes vmscan_swappiness() and stores the swappiness into
    the scan_control structure. mem_cgroup_swappiness is then used to
    provide the correct value before shrink_lruvec is called. The global
    vm_swappiness is used for the root memcg.
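
    As a rough sketch of the idea (field placement and call site assumed, not
    the verbatim diff):

        struct scan_control {
                /* ... existing fields ... */
                int swappiness;         /* anon vs. file scan bias to apply */
        };

        /* in shrink_zone(), for each memcg visited: */
        sc->swappiness = mem_cgroup_swappiness(memcg);
        shrink_lruvec(lruvec, sc);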

    [hughd@google.com: oopses immediately when booted with cgroup_disable=memory]
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When kswapd exits, it can end up taking locks that were previously held
    by allocating tasks while they waited for reclaim. Lockdep currently
    warns about this:

    On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
    > inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
    > kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
    > (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130
    > {RECLAIM_FS-ON-W} state was registered at:
    > mark_held_locks+0xb9/0x140
    > lockdep_trace_alloc+0x7a/0xe0
    > kmem_cache_alloc_trace+0x37/0x240
    > flex_array_alloc+0x99/0x1a0
    > cgroup_attach_task+0x63/0x430
    > attach_task_by_pid+0x210/0x280
    > cgroup_procs_write+0x16/0x20
    > cgroup_file_write+0x120/0x2c0
    > vfs_write+0xc0/0x1f0
    > SyS_write+0x4c/0xa0
    > tracesys+0xdd/0xe2
    > irq event stamp: 49
    > hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70
    > hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0
    > softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0
    > softirqs last disabled at (0): (null)
    >
    > other info that might help us debug this:
    > Possible unsafe locking scenario:
    >
    > CPU0
    > ----
    > lock(&sig->group_rwsem);
    >
    > lock(&sig->group_rwsem);
    >
    > *** DEADLOCK ***
    >
    > no locks held by kswapd2/1151.
    >
    > stack backtrace:
    > CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
    > Call Trace:
    > dump_stack+0x19/0x1b
    > print_usage_bug+0x1f7/0x208
    > mark_lock+0x21d/0x2a0
    > __lock_acquire+0x52a/0xb60
    > lock_acquire+0xa2/0x140
    > down_read+0x51/0xa0
    > exit_signals+0x24/0x130
    > do_exit+0xb5/0xa50
    > kthread+0xdb/0x100
    > ret_from_fork+0x7c/0xb0

    This is because the kswapd thread is still marked as a reclaimer at the
    time of exit. But because it is exiting, nobody is actually waiting on
    it to make reclaim progress anymore, and it's nothing but a regular
    thread at this point. Be tidy and strip it of all its powers
    (PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
    before returning from the thread function.
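
    The fix amounts to something like the following at the end of the kswapd()
    thread function (a sketch; tsk is kswapd()'s local alias for current, and
    the exact upstream ordering may differ):

        /* kswapd() return path: drop the reclaimer powers before this
         * becomes an ordinary exiting thread */
        tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
        current->reclaim_state = NULL;
        lockdep_clear_current_reclaim_state();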

    Signed-off-by: Johannes Weiner
    Reported-by: Gu Zheng
    Cc: Yasuaki Ishimatsu
    Cc: Tang Chen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

05 Jun, 2014

9 commits

  • Commit "mm: vmscan: obey proportional scanning requirements for kswapd"
    ensured that file/anon lists were scanned proportionally for reclaim from
    kswapd but ignored it for direct reclaim. The intent was to minimise
    direct reclaim latency, but Yuanhan Liu pointed out that it substitutes one
    long stall for many small stalls and distorts aging for normal workloads
    like streaming readers/writers. Hugh Dickins pointed out that a
    side-effect of the same commit was that when one LRU list dropped to zero,
    the entirety of the other list was shrunk, leading to excessive
    reclaim in memcgs. This patch scans the file/anon lists proportionally
    for direct reclaim as well, so pages age similarly whether reclaimed by
    kswapd or direct reclaim, but takes care to abort reclaim if one LRU
    drops to zero after reclaiming the requested number of pages.

    Based on ext4 and using the Intel VM scalability test

    3.15.0-rc5 (shrinker) vs. 3.15.0-rc5 (proportion):
    Unit lru-file-readonce elapsed 5.3500 ( 0.00%) 5.4200 ( -1.31%)
    Unit lru-file-readonce time_range 0.2700 ( 0.00%) 0.1400 ( 48.15%)
    Unit lru-file-readonce time_stddv 0.1148 ( 0.00%) 0.0536 ( 53.33%)
    Unit lru-file-readtwice elapsed 8.1700 ( 0.00%) 8.1700 ( 0.00%)
    Unit lru-file-readtwice time_range 0.4300 ( 0.00%) 0.2300 ( 46.51%)
    Unit lru-file-readtwice time_stddv 0.1650 ( 0.00%) 0.0971 ( 41.16%)

    The test cases run multiple dd instances reading sparse files. The results
    are within the noise for the small test machine. The impact of the patch
    is more noticeable from the vmstats:

    3.15.0-rc5 (shrinker) vs. 3.15.0-rc5 (proportion):
    Minor Faults 35154 36784
    Major Faults 611 1305
    Swap Ins 394 1651
    Swap Outs 4394 5891
    Allocation stalls 118616 44781
    Direct pages scanned 4935171 4602313
    Kswapd pages scanned 15921292 16258483
    Kswapd pages reclaimed 15913301 16248305
    Direct pages reclaimed 4933368 4601133
    Kswapd efficiency 99% 99%
    Kswapd velocity 670088.047 682555.961
    Direct efficiency 99% 99%
    Direct velocity 207709.217 193212.133
    Percentage direct scans 23% 22%
    Page writes by reclaim 4858.000 6232.000
    Page writes file 464 341
    Page writes anon 4394 5891

    Note that there are fewer allocation stalls even though the amount
    of direct reclaim scanning is roughly the same.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tim Chen
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Cc: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently, we use (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1)
    / KSWAPD_ZONE_BALANCE_GAP_RATIO to avoid a zero gap value. It's better to
    use DIV_ROUND_UP macro for neater code and clear meaning.

    Besides, the gap value is calculated against the per-zone "managed pages",
    not "present pages". This patch also corrects the comment and do some
    rephrasing.
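
    As a sketch of the change (variable names assumed from the surrounding
    kswapd code; simplified):

        /* before: open-coded round-up */
        balance_gap = (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
                        KSWAPD_ZONE_BALANCE_GAP_RATIO;

        /* after: same arithmetic, clearer intent */
        balance_gap = DIV_ROUND_UP(zone->managed_pages,
                                   KSWAPD_ZONE_BALANCE_GAP_RATIO);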

    Signed-off-by: Jianyu Zhan
    Acked-by: Rik van Riel
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • cold is used as a bool, so make it one. Also make the likely case the
    "if" part of the block instead of the else, since according to the
    optimisation manual this is preferred.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Now that we are doing NUMA-aware shrinking, and can have shrinkers
    running in parallel, or working on individual nodes, it seems like we
    should also be sticking the node in the output.

    Signed-off-by: Dave Hansen
    Acked-by: Dave Chinner
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • I was looking at a trace of the slab shrinkers (attachment in this comment):

    https://bugs.freedesktop.org/show_bug.cgi?id=72742#c67

    and noticed that "total_scan" can go negative in some cases. We
    used to dump out the "total_scan" variable directly, but some of
    the shrinker modifications along the way changed that.

    This patch just dumps it out directly, again. It doesn't make
    any sense to derive it from new_nr and nr any more since there
    are now other shrinkers that can be running in parallel and
    mucking with those values.

    Here's an example of the negative numbers in the output:

    > kswapd0-840 [000] 160.869398: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 10 new scan count 39 total_scan 29 last shrinker return val 256
    > kswapd0-840 [000] 160.869618: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 39 new scan count 102 total_scan 63 last shrinker return val 256
    > kswapd0-840 [000] 160.870031: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 102 new scan count 47 total_scan -55 last shrinker return val 768
    > kswapd0-840 [000] 160.870464: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 47 new scan count 45 total_scan -2 last shrinker return val 768
    > kswapd0-840 [000] 163.384144: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 45 new scan count 56 total_scan 11 last shrinker return val 0
    > kswapd0-840 [000] 163.384297: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 56 new scan count 15 total_scan -41 last shrinker return val 256
    > kswapd0-840 [000] 163.384414: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 15 new scan count 117 total_scan 102 last shrinker return val 0
    > kswapd0-840 [000] 163.384657: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 117 new scan count 36 total_scan -81 last shrinker return val 512
    > kswapd0-840 [000] 163.384880: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 36 new scan count 111 total_scan 75 last shrinker return val 256
    > kswapd0-840 [000] 163.385256: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 111 new scan count 34 total_scan -77 last shrinker return val 768
    > kswapd0-840 [000] 163.385598: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 34 new scan count 122 total_scan 88 last shrinker return val 512

    Signed-off-by: Dave Hansen
    Acked-by: Dave Chinner
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • When a loopback NFS mount is active and the backing device for the NFS
    mount becomes congested, that can impose throttling delays on the nfsd
    threads.

    These delays significantly reduce throughput and so the NFS mount remains
    congested.

    This results in a livelock and the reduced throughput persists.

    This livelock has been found in testing with the 'wait_iff_congested'
    call, and could possibly be caused by the 'congestion_wait' call.

    This livelock is similar to the deadlock which justified the introduction
    of PF_LESS_THROTTLE, and the same flag can be used to remove this
    livelock.

    To minimise the impact of the change, we still throttle nfsd when the
    filesystem it is writing to is congested, but not when some separate
    filesystem (e.g. the NFS filesystem) is congested.
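
    The mechanism can be sketched as a small helper consulted before the
    throttling calls (close to, but not necessarily verbatim, the upstream
    change):

        /* Only throttle a PF_LESS_THROTTLE task (e.g. nfsd) when the
         * backing device it is writing to is itself congested. */
        static bool current_may_throttle(void)
        {
                return !(current->flags & PF_LESS_THROTTLE) ||
                       current->backing_dev_info == NULL ||
                       bdi_write_congested(current->backing_dev_info);
        }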

    Signed-off-by: NeilBrown
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • throttle_direct_reclaim() is meant to trigger during swap-over-network,
    during which the min watermark is treated as a pfmemalloc reserve. It
    throttles on the first node in the zonelist, but this is flawed.

    The user-visible impact is that a process running on a CPU whose local
    memory node has no ZONE_NORMAL will stall for prolonged periods of time,
    possibly indefinitely. This is due to throttle_direct_reclaim thinking the
    pfmemalloc reserves are depleted when in fact they don't exist on that
    node.

    On a NUMA machine running a 32-bit kernel (I know) allocation requests
    from CPUs on node 1 would detect no pfmemalloc reserves and the process
    gets throttled. This patch adjusts throttling of direct reclaim to
    throttle based on the first node in the zonelist that has a usable
    ZONE_NORMAL or lower zone.
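
    A sketch of the revised node selection (helper and variable names assumed
    from the surrounding vmscan code):

        /* throttle against the first node in the zonelist that actually has
         * a ZONE_NORMAL or lower zone, i.e. usable pfmemalloc reserves */
        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                        gfp_zone(gfp_mask), nodemask) {
                if (zone_idx(zone) > ZONE_NORMAL)
                        continue;

                pgdat = zone->zone_pgdat;
                if (pfmemalloc_watermark_ok(pgdat))
                        goto out;
                break;
        }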

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • kmem_cache_{create,destroy,shrink} need to get a stable value of
    cpu/node online mask, because they init/destroy/access per-cpu/node
    kmem_cache parts, which can be allocated or destroyed on cpu/mem
    hotplug. To protect against cpu hotplug, these functions use
    {get,put}_online_cpus. However, they do nothing to synchronize with
    memory hotplug - taking the slab_mutex does not eliminate the
    possibility of a race, as described in patch 2.

    What we need there is something like get_online_cpus, but for memory.
    We already have lock_memory_hotplug, which serves for the purpose, but
    it's a bit of a hammer right now, because it's backed by a mutex. As a
    result, it imposes some limitations to locking order, which are not
    desirable, and can't be used just like get_online_cpus. That's why in
    patch 1 I substitute it with get/put_online_mems, which work exactly
    like get/put_online_cpus except they block not cpu, but memory hotplug.

    [ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
    myself, because it used an rw semaphore for get/put_online_mems,
    making them deadlock prone. ]

    This patch (of 2):

    {un}lock_memory_hotplug, which is used to synchronize against memory
    hotplug, is currently backed by a mutex, which makes it a bit of a
    hammer - threads that only want to get a stable value of online nodes
    mask won't be able to proceed concurrently. Also, it imposes some
    strong locking ordering rules on it, which narrows down the set of its
    usage scenarios.

    This patch introduces get/put_online_mems, which are the same as
    get/put_online_cpus, but for memory hotplug, i.e. executing code
    inside a get/put_online_mems section will guarantee a stable value of
    online nodes, present pages, etc.

    lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
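
    Usage mirrors get/put_online_cpus(); a sketch (the per-node setup callee
    is hypothetical):

        int nid;

        get_online_mems();
        /* node_states[N_MEMORY], present pages etc. are stable in here */
        for_each_node_state(nid, N_MEMORY)
                setup_kmem_cache_node(cachep, nid);     /* hypothetical */
        put_online_mems();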

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Prior to this change, we would decide whether to force scan an LRU during
    reclaim if that LRU itself was too small for the current priority.
    However, this can lead to the file LRU getting force scanned even if
    there are a lot of anonymous pages we can reclaim, leading to hot file
    pages getting needlessly reclaimed.

    To address this, we instead only force scan when none of the reclaimable
    LRUs are big enough.

    Gives huge improvements with zswap. For example, when doing -j20 kernel
    build in a 500MB container with zswap enabled, runtime (in seconds) is
    greatly reduced:

    x without this change
    + with this change
    N Min Max Median Avg Stddev
    x 5 700.997 790.076 763.928 754.05 39.59493
    + 5 141.634 197.899 155.706 161.9 21.270224
    Difference at 95.0% confidence
    -592.15 +/- 46.3521
    -78.5293% +/- 6.14709%
    (Student's t, pooled s = 31.7819)

    Should also give some improvements in regular (non-zswap) swap cases.

    Yes, hughd found significant speedup using regular swap, with several
    memcgs under pressure; and it should also be effective in the non-memcg
    case, whenever one or another zone LRU is forced too small.

    Signed-off-by: Suleiman Souhlal
    Signed-off-by: Hugh Dickins
    Cc: Suleiman Souhlal
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Rafael Aquini
    Cc: Michal Hocko
    Cc: Yuanhan Liu
    Cc: Seth Jennings
    Cc: Bob Liu
    Cc: Minchan Kim
    Cc: Luigi Semenzato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suleiman Souhlal
     

07 May, 2014

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • This reverts commit 0bf1457f0cfc ("mm: vmscan: do not swap anon pages
    just because free+file is low") because it introduced a regression in
    mostly-anonymous workloads, where reclaim would become ineffective and
    trap every allocating task in direct reclaim.

    The problem is that there is a runaway feedback loop in the scan balance
    between file and anon, where the balance tips heavily towards a tiny
    thrashing file LRU and anonymous pages are no longer being looked at.
    The commit in question removed the safeguard that would detect such
    situations and respond with forced anonymous reclaim.

    This commit was part of a series to fix premature swapping in loads with
    relatively little cache, and while it made a small difference, the cure
    is obviously worse than the disease. Revert it.

    Signed-off-by: Johannes Weiner
    Reported-by: Christian Borntraeger
    Acked-by: Christian Borntraeger
    Acked-by: Rafael Aquini
    Cc: Rik van Riel
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

09 Apr, 2014

1 commit

  • Page reclaim force-scans / swaps anonymous pages when file cache drops
    below the high watermark of a zone in order to prevent what little cache
    remains from thrashing.

    However, on bigger machines the high watermark value can be quite large
    and when the workload is dominated by a static anonymous/shmem set, the
    file set might just be a small window of used-once cache. In such
    situations, the VM starts swapping heavily when instead it should be
    recycling the no longer used cache.

    This is a longer-standing problem, but it's more likely to trigger after
    commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy")
    because file pages can no longer accumulate in a single zone and are
    dispersed into smaller fractions among the available zones.

    To resolve this, do not force scan anon when file pages are low but
    instead rely on the scan/rotation ratios to make the right prediction.

    Signed-off-by: Johannes Weiner
    Acked-by: Rafael Aquini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Suleiman Souhlal
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

08 Apr, 2014

2 commits

  • We abort direct reclaim if we find the zone is ready for compaction.
    Sometimes the zone is just a promoted highmem zone included to force a
    scan of highmem, which is not the zone the caller actually wants to
    allocate a page from. In this situation, setting aborted_reclaim to
    indicate that the caller should turn back and retry the allocation is a
    waste of time and could cause a loop in __alloc_pages_slowpath().

    This patch does not check compaction_ready() on promoted zones, to avoid
    the above situation. aborted_reclaim is only set if the zone the caller
    intended is ready for compaction.

    Signed-off-by: Weijie Yang
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • We promote sc->gfp_mask to __GFP_HIGHMEM to forcibly scan highmem if
    there are too many buffer_heads pinning highmem. See cc715d99e5 ("mm:
    vmscan: forcibly scan highmem if there are too many buffer_heads pinning
    highmem").

    This patch restores sc->gfp_mask to the caller's original value after
    finishing the scan job, to avoid affecting other functions the caller
    subsequently invokes with it, such as vmpressure_prio() and shrink_slab().
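
    In other words (a sketch, simplified from shrink_zones()):

        gfp_t orig_mask = sc->gfp_mask;

        if (buffer_heads_over_limit)
                sc->gfp_mask |= __GFP_HIGHMEM;  /* forced highmem scan */

        /* ... iterate zones, reclaim ... */

        sc->gfp_mask = orig_mask;               /* restore for the caller */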

    Signed-off-by: Weijie Yang
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     

04 Apr, 2014

6 commits

  • The VM maintains cached filesystem pages on two types of lists. One
    list holds the pages recently faulted into the cache, the other list
    holds pages that have been referenced repeatedly on that first list.
    The idea is to prefer reclaiming young pages over those that have shown
    to benefit from caching in the past. We call the recently used list the
    "inactive list" and the frequently used list the "active list". In
    practice, though, this scheme ultimately was not significantly better
    than a FIFO policy and still thrashed cache based on eviction speed,
    rather than actual demand for cache.

    This patch solves one half of the problem by decoupling the ability to
    detect working set changes from the inactive list size. By maintaining
    a history of recently evicted file pages it can detect frequently used
    pages with an arbitrarily small inactive list size, and subsequently
    apply pressure on the active list based on actual demand for cache, not
    just overall eviction speed.

    Every zone maintains a counter that tracks inactive list aging speed.
    When a page is evicted, a snapshot of this counter is stored in the
    now-empty page cache radix tree slot. On refault, the minimum access
    distance of the page can be assessed, to evaluate whether the page
    should be part of the active list or not.

    This fixes the VM's blindness towards working set changes in excess of
    the inactive list. And it's the foundation to further improve the
    protection ability and reduce the minimum inactive list size of 50%.
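
    Conceptually (a simplified sketch; the pack/unpack helpers named here are
    illustrative, and the real code packs the zone and counter into the shadow
    radix-tree entry):

        /* eviction: snapshot how far the inactive list has aged */
        shadow = pack_shadow(zone, atomic_long_read(&zone->inactive_age));

        /* refault: distance = inactive-list aging that happened meanwhile */
        refault_distance = atomic_long_read(&zone->inactive_age) -
                           unpack_shadow_age(shadow);

        /* the page would have stayed resident had the active list been that
         * much smaller, so let it challenge the active list */
        if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE))
                SetPageActive(page);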

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Reclaim will be leaving shadow entries in the page cache radix tree upon
    evicting the real page. As those pages are found from the LRU, an
    iput() can lead to the inode being freed concurrently. At this point,
    reclaim must no longer install shadow pages because the inode freeing
    code needs to ensure the page tree is really empty.

    Add an address_space flag, AS_EXITING, that the inode freeing code sets
    under the tree lock before doing the final truncate. Reclaim will check
    for this flag before installing shadow pages.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The name `max_pass' is misleading, because this variable actually keeps
    the estimated number of freeable objects, not the maximal number of
    objects we can scan in this pass, which can be twice that. Rename it to
    reflect its actual meaning.

    Signed-off-by: Vladimir Davydov
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • There is no need to pass a shrink_control struct on from
    try_to_free_pages() and friends to do_try_to_free_pages() and then to
    shrink_zones(), because it is only used in shrink_zones() and the only
    field initialized on the top level is gfp_mask, which is always equal to
    scan_control.gfp_mask. So let's move shrink_control initialization to
    shrink_zones().
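
    A sketch of where the initialization ends up:

        static void shrink_zones(struct zonelist *zonelist,
                                 struct scan_control *sc)
        {
                struct shrink_control shrink = {
                        .gfp_mask = sc->gfp_mask,
                };
                /* ... zone iteration, shrink_slab(&shrink, ...) ... */
        }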

    Signed-off-by: Vladimir Davydov
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Dave Chinner
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This reduces the indentation level of do_try_to_free_pages() and removes
    the extra loop over all eligible zones counting the number of on-LRU pages.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Glauber Costa
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When direct reclaim is executed by a process bound to a set of NUMA
    nodes, we should scan only those nodes when possible, but currently we
    will scan kmem from all online nodes even if the kmem shrinker is NUMA
    aware. As a result, binding a process to a particular NUMA node won't
    prevent it from shrinking inode/dentry caches from other nodes, which is
    not good. Fix this.

    Signed-off-by: Vladimir Davydov
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Dave Chinner
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

30 Jan, 2014

1 commit

  • The VM is currently heavily tuned to avoid swapping. Whether that is
    good or bad is a separate discussion, but as long as the VM won't swap
    to make room for dirty cache, we can not consider anonymous pages when
    calculating the amount of dirtyable memory, the baseline to which
    dirty_background_ratio and dirty_ratio are applied.

    A simple workload that occupies a significant size (40+%, depending on
    memory layout, storage speeds etc.) of memory with anon/tmpfs pages and
    uses the remainder for a streaming writer demonstrates this problem. In
    that case, the actual cache pages are a small fraction of what is
    considered dirtyable overall, which results in a relatively large
    portion of the cache pages being dirtied. As kswapd starts rotating
    these, random tasks enter direct reclaim and stall on IO.

    Only consider free pages and file pages dirtyable.
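
    Roughly, the dirtyable calculation becomes (a sketch; highmem handling
    omitted):

        unsigned long x;

        x  = global_page_state(NR_FREE_PAGES);
        x -= min(x, dirty_balance_reserve);
        x += global_page_state(NR_INACTIVE_FILE);
        x += global_page_state(NR_ACTIVE_FILE);
        /* the anon LRUs are deliberately no longer added here */

        return x + 1;   /* make sure it is never zero */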

    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Tested-by: Tejun Heo
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Wu Fengguang
    Reviewed-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

24 Jan, 2014

3 commits

  • If a shrinker is not NUMA-aware, shrink_slab() should call it exactly
    once with nid=0, but currently it is not true: if node 0 is not set in
    the nodemask or if it is not online, we will not call such shrinkers at
    all. As a result some slabs will be left untouched under some
    circumstances. Let us fix it.
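
    A sketch of the fixed dispatch in shrink_slab() (names as they appeared
    around that kernel; locking and bookkeeping omitted):

        list_for_each_entry(shrinker, &shrinker_list, list) {
                if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
                        /* not NUMA aware: call exactly once, with nid 0,
                         * regardless of the nodemask */
                        shrinkctl->nid = 0;
                        freed += shrink_slab_node(shrinkctl, shrinker,
                                                  nr_pages_scanned, lru_pages);
                        continue;
                }

                for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan)
                        if (node_online(shrinkctl->nid))
                                freed += shrink_slab_node(shrinkctl, shrinker,
                                                          nr_pages_scanned,
                                                          lru_pages);
        }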

    Signed-off-by: Vladimir Davydov
    Reported-by: Dave Chinner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When reclaiming kmem, we currently don't scan slabs that have less than
    batch_size objects (see shrink_slab_node()):

        while (total_scan >= batch_size) {
                shrinkctl->nr_to_scan = batch_size;
                shrinker->scan_objects(shrinker, shrinkctl);
                total_scan -= batch_size;
        }

    If there are only a few shrinkers available, such a behavior won't cause
    any problems, because the batch_size is usually small, but if we have a
    lot of slab shrinkers, which is perfectly possible since FS shrinkers
    are now per-superblock, we can end up with hundreds of megabytes of
    practically unreclaimable kmem objects. For instance, mounting a
    thousand ext2 FS images with a hundred files in each and iterating
    over all the files using du(1) will result in about 200 MB of FS caches
    that cannot be dropped even with the aid of the vm.drop_caches sysctl!

    This problem was initially pointed out by Glauber Costa [*]. Glauber
    proposed to fix it by making shrink_slab() always take at least one
    pass - put simply, turning the scan loop above into a do{}while()
    loop. However, this proposal was rejected, because it could result in
    more aggressive and frequent slab shrinking even under low memory
    pressure when total_scan is naturally very small.

    This patch is a slightly modified version of Glauber's approach.
    Similarly to Glauber's patch, it lets shrink_slab() scan fewer than
    batch_size objects, but only if the total number of objects we want to
    scan (total_scan) is greater than the total number of objects available
    (max_pass). Since total_scan is clamped to half of max_pass if the
    current delta is small:

        if (delta < max_pass / 4)
                total_scan = min(total_scan, max_pass / 2);

    this is only possible if we are scanning at high prio. That said, this
    patch shouldn't change the vmscan behaviour if the memory pressure is
    low, but if we are tight on memory, we will do our best by trying to
    reclaim all available objects, which sounds reasonable.

    [*] http://www.spinics.net/lists/cgroups/msg06913.html
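
    The adjusted loop can be sketched as follows (not the verbatim patch): a
    final sub-batch pass is taken when we want to scan more objects than are
    actually available:

        while (total_scan >= batch_size ||
               total_scan >= max_pass) {
                unsigned long nr_to_scan = min(total_scan, batch_size);

                shrinkctl->nr_to_scan = nr_to_scan;
                shrinker->scan_objects(shrinker, shrinkctl);
                total_scan -= nr_to_scan;
        }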

    Signed-off-by: Vladimir Davydov
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Dave Chinner
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page at various VM_BUG_ON sites, I've noticed that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which, beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON().
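
    A typical conversion at an assertion site looks like this (the specific
    assertion is illustrative):

        /* before: no information about the offending page on failure */
        VM_BUG_ON(PageTail(page));

        /* after: the page is dumped before the BUG fires */
        VM_BUG_ON_PAGE(PageTail(page), page);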

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

17 Oct, 2013

1 commit

  • This leak was added by commit 1d3d4437eae1 ("vmscan: per-node deferred
    work").

    unreferenced object 0xffff88006ada3bd0 (size 8):
    comm "criu", pid 14781, jiffies 4295238251 (age 105.641s)
    hex dump (first 8 bytes):
    00 00 00 00 00 00 00 00 ........
    backtrace:
    [] kmemleak_alloc+0x5e/0xc0
    [] __kmalloc+0x247/0x310
    [] register_shrinker+0x3c/0xa0
    [] sget+0x5ab/0x670
    [] proc_mount+0x54/0x170
    [] mount_fs+0x43/0x1b0
    [] vfs_kern_mount+0x72/0x110
    [] kern_mount_data+0x19/0x30
    [] pid_ns_prepare_proc+0x20/0x40
    [] alloc_pid+0x466/0x4a0
    [] copy_process+0xc6a/0x1860
    [] do_fork+0x8b/0x370
    [] SyS_clone+0x16/0x20
    [] stub_clone+0x69/0x90
    [] 0xffffffffffffffff

    Signed-off-by: Andrew Vagin
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Glauber Costa
    Cc: Chuck Lever
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Vagin
     

01 Oct, 2013

1 commit

  • Isolated balloon pages can wrongly end up in LRU lists when
    migrate_pages() finishes its round without draining all the isolated
    page list.

    The same issue can happen when reclaim_clean_pages_from_list() tries to
    reclaim pages from an isolated page list, before migration, in the CMA
    path. Such balloon page leak opens a race window against LRU lists
    shrinkers that leads us to the following kernel panic:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
    IP: [] shrink_page_list+0x24e/0x897
    PGD 3cda2067 PUD 3d713067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    RIP: shrink_page_list+0x24e/0x897
    RSP: 0000:ffff88003da499b8 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5
    RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40
    RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45
    R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40
    R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58
    FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0
    Call Trace:
    shrink_inactive_list+0x240/0x3de
    shrink_lruvec+0x3e0/0x566
    __shrink_zone+0x94/0x178
    shrink_zone+0x3a/0x82
    balance_pgdat+0x32a/0x4c2
    kswapd+0x2f0/0x372
    kthread+0xa2/0xaa
    ret_from_fork+0x7c/0xb0
    Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb
    RIP [] shrink_page_list+0x24e/0x897
    RSP
    CR2: 0000000000000028
    ---[ end trace 703d2451af6ffbfd ]---
    Kernel panic - not syncing: Fatal exception

    This patch fixes the issue by ensuring the proper tests are made at
    putback_movable_pages() & reclaim_clean_pages_from_list() to avoid
    isolated balloon pages being wrongly reinserted in LRU lists.

    [akpm@linux-foundation.org: clarify awkward comment text]
    Signed-off-by: Rafael Aquini
    Reported-by: Luiz Capitulino
    Tested-by: Luiz Capitulino
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

13 Sep, 2013

3 commits

  • Merge more patches from Andrew Morton:
    "The rest of MM. Plus one misc cleanup"

    * emailed patches from Andrew Morton: (35 commits)
    mm/Kconfig: add MMU dependency for MIGRATION.
    kernel: replace strict_strto*() with kstrto*()
    mm, thp: count thp_fault_fallback anytime thp fault fails
    thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
    thp: do_huge_pmd_anonymous_page() cleanup
    thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
    mm: cleanup add_to_page_cache_locked()
    thp: account anon transparent huge pages into NR_ANON_PAGES
    truncate: drop 'oldsize' truncate_pagecache() parameter
    mm: make lru_add_drain_all() selective
    memcg: document cgroup dirty/writeback memory statistics
    memcg: add per cgroup writeback pages accounting
    memcg: check for proper lock held in mem_cgroup_update_page_stat
    memcg: remove MEMCG_NR_FILE_MAPPED
    memcg: reduce function dereference
    memcg: avoid overflow caused by PAGE_ALIGN
    memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
    memcg: correct RESOURCE_MAX to ULLONG_MAX
    mm: memcg: do not trap chargers with full callstack on OOM
    mm: memcg: rework and document OOM waiting and wakeup
    ...

    Linus Torvalds
     
  • Clean up some mess made by the "Soft limit rework" series, and a few other
    things.

    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • shrink_zone starts with a soft reclaim pass first and then falls back to
    regular reclaim if nothing has been scanned. This behavior is natural
    but there is a catch. Memcg iterators, when used with the reclaim
    cookie, are designed to help prevent over-reclaim by interleaving
    reclaimers (per node-zone-priority), so the tree walk might miss many
    (even all) nodes in the hierarchy, e.g. when there are direct reclaimers
    racing with each other or with kswapd in the global case, or multiple
    allocators reaching the limit in the target reclaim case. To make it
    even more complicated, targeted reclaim doesn't do the whole tree walk
    because it stops reclaiming once it reclaims sufficient pages. As a
    result, groups over the limit might be missed, thus nothing is scanned,
    and reclaim would fall back to the reclaim-all mode.

    This patch checks for the incomplete tree walk in shrink_zone. If no
    group has been visited and the hierarchy is soft reclaimable, then we
    must have missed some groups, in which case __shrink_zone() is called
    again. This doesn't guarantee there will be some progress, of course,
    because the current reclaimer might still be racing with others, but it
    at least gives a chance to start the walk without a big risk of
    reclaim latencies.

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Glauber Costa
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Tejun Heo
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko