13 Jun, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "This the bunch that sat in -next + lock_parent() fix. This is the
    minimal set; there's more pending stuff.

    In particular, I really hope to get acct.c fixes merged this cycle -
    we need that to deal sanely with delayed-mntput stuff. In the next
    pile, hopefully - that series is fairly short and localized
    (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
    iov_iter work. Most of the prereqs for ->splice_write with sane locking
    order are there and Kent's dio rewrite would also fit nicely on top of
    this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
    lock_parent: don't step on stale ->d_parent of all-but-freed one
    kill generic_file_splice_write()
    ceph: switch to iter_file_splice_write()
    shmem: switch to iter_file_splice_write()
    nfs: switch to iter_file_splice_write()
    fs/splice.c: remove unneeded exports
    ocfs2: switch to iter_file_splice_write()
    ->splice_write() via ->write_iter()
    bio_vec-backed iov_iter
    optimize copy_page_{to,from}_iter()
    bury generic_file_aio_{read,write}
    lustre: get rid of messing with iovecs
    ceph: switch to ->write_iter()
    ceph_sync_direct_write: stop poking into iov_iter guts
    ceph_sync_read: stop poking into iov_iter guts
    new helper: copy_page_from_iter()
    fuse: switch to ->write_iter()
    btrfs: switch to ->write_iter()
    ocfs2: switch to ->write_iter()
    xfs: switch to ->write_iter()
    ...

    Linus Torvalds
     

09 Jun, 2014

1 commit

  • shrink_inactive_list() used to wait 0.1s to avoid congestion when all
    the pages that were isolated from the inactive list were dirty but not
    under active writeback. That makes no real sense, and apparently causes
    major interactivity issues under some loads since 3.11.

    The ostensible reason for it was to wait for kswapd to start writing
    pages, but that seems questionable as well, since the congestion wait
    code seems to trigger for kswapd itself as well. Also, the logic behind
    delaying anything when we haven't actually started writeback is not
    clear - it only delays actually starting that writeback.

    We'll still trigger the congestion waiting if

    (a) the process is kswapd, and we hit pages flagged for immediate
    reclaim

    (b) the process is not kswapd, and the zone backing dev writeback is
    actually congested.

    This probably needs to be revisited, but as it is this fixes a reported
    regression.

    Reported-by: Felipe Contreras
    Pinpointed-by: Hillf Danton
    Cc: Michal Hocko
    Cc: Andrew Morton
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Jun, 2014

3 commits

  • printk is meant to be used with an associated log level. There are some
    instances of printk scattered around the mm code where the log level is
    missing. Add a log level and adhere to suggestions by
    scripts/checkpatch.pl by moving to the pr_* macros.

    Also add the typical pr_fmt definition so that print statements can be
    easily traced back to the modules where they occur, correlated one with
    another, etc. This will require the removal of some (now redundant)
    prefixes on a few print statements.
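
    For illustration, the sort of conversion involved might look like this
    (the call site and prefix are hypothetical, not taken from the patch):

        /* pr_fmt must be defined before printk.h is pulled in so that the
         * pr_* macros pick up the prefix */
        #define pr_fmt(fmt) "mm: " fmt
        #include <linux/printk.h>

        /* before: no log level, hand-rolled prefix */
        printk("mm: page allocation failure\n");

        /* after: explicit level via pr_*, prefix supplied by pr_fmt() */
        pr_warn("page allocation failure\n");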

    Signed-off-by: Mitchel Humpherys
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mitchel Humpherys
     
  • Memory reclaim always uses swappiness of the reclaim target memcg
    (origin of the memory pressure) or vm_swappiness for global memory
    reclaim. This behavior was consistent (except for difference between
    global and hard limit reclaim) because swappiness was enforced to be
    consistent within each memcg hierarchy.

    After "mm: memcontrol: remove hierarchy restrictions for swappiness and
    oom_control" each memcg can have its own swappiness independent of
    hierarchical parents, though, so the consistency guarantee is gone.
    This can lead to an unexpected behavior. Say that a group is explicitly
    configured to not swapout by memory.swappiness=0 but its memory gets
    swapped out anyway when the memory pressure comes from its parent with a
    different swappiness. It is also unexpected that the knob is meaningless without setting the
    hard limit which would trigger the reclaim and enforce the swappiness.
    There are setups where the hard limit is configured higher in the
    hierarchy by an administrator and children groups are under control of
    somebody else who is interested in the swapout behavior but not
    necessarily about the memory limit.

    From a semantic point of view, swappiness is an attribute defining the
    anon vs. file proportional scanning of the LRU, which is memcg specific
    (unlike charges, which are propagated up the hierarchy), so it should be
    applied to the particular memcg's LRU regardless of where the memory
    pressure comes from.

    This patch removes vmscan_swappiness() and stores the swappiness into
    the scan_control structure. mem_cgroup_swappiness is then used to
    provide the correct value before shrink_lruvec is called. The global
    vm_swappiness is used for the root memcg.
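
    As a rough sketch of the idea (field placement and call site assumed, not
    the verbatim diff):

        struct scan_control {
                /* ... existing fields ... */
                int swappiness;         /* anon vs. file scan bias to apply */
        };

        /* in shrink_zone(), for each memcg visited: */
        sc->swappiness = mem_cgroup_swappiness(memcg);
        shrink_lruvec(lruvec, sc);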

    [hughd@google.com: oopses immediately when booted with cgroup_disable=memory]
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When kswapd exits, it can end up taking locks that were previously held
    by allocating tasks while they waited for reclaim. Lockdep currently
    warns about this:

    On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
    > inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
    > kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
    > (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130
    > {RECLAIM_FS-ON-W} state was registered at:
    > mark_held_locks+0xb9/0x140
    > lockdep_trace_alloc+0x7a/0xe0
    > kmem_cache_alloc_trace+0x37/0x240
    > flex_array_alloc+0x99/0x1a0
    > cgroup_attach_task+0x63/0x430
    > attach_task_by_pid+0x210/0x280
    > cgroup_procs_write+0x16/0x20
    > cgroup_file_write+0x120/0x2c0
    > vfs_write+0xc0/0x1f0
    > SyS_write+0x4c/0xa0
    > tracesys+0xdd/0xe2
    > irq event stamp: 49
    > hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70
    > hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0
    > softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0
    > softirqs last disabled at (0): (null)
    >
    > other info that might help us debug this:
    > Possible unsafe locking scenario:
    >
    > CPU0
    > ----
    > lock(&sig->group_rwsem);
    >
    > lock(&sig->group_rwsem);
    >
    > *** DEADLOCK ***
    >
    > no locks held by kswapd2/1151.
    >
    > stack backtrace:
    > CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
    > Call Trace:
    > dump_stack+0x19/0x1b
    > print_usage_bug+0x1f7/0x208
    > mark_lock+0x21d/0x2a0
    > __lock_acquire+0x52a/0xb60
    > lock_acquire+0xa2/0x140
    > down_read+0x51/0xa0
    > exit_signals+0x24/0x130
    > do_exit+0xb5/0xa50
    > kthread+0xdb/0x100
    > ret_from_fork+0x7c/0xb0

    This is because the kswapd thread is still marked as a reclaimer at the
    time of exit. But because it is exiting, nobody is actually waiting on
    it to make reclaim progress anymore, and it's nothing but a regular
    thread at this point. Be tidy and strip it of all its powers
    (PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
    before returning from the thread function.
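
    The fix amounts to something like the following at the end of the kswapd()
    thread function (a sketch; tsk is kswapd()'s local alias for current, and
    the exact upstream ordering may differ):

        /* kswapd() return path: drop the reclaimer powers before this
         * becomes an ordinary exiting thread */
        tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
        current->reclaim_state = NULL;
        lockdep_clear_current_reclaim_state();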

    Signed-off-by: Johannes Weiner
    Reported-by: Gu Zheng
    Cc: Yasuaki Ishimatsu
    Cc: Tang Chen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

05 Jun, 2014

9 commits

  • Commit "mm: vmscan: obey proportional scanning requirements for kswapd"
    ensured that file/anon lists were scanned proportionally for reclaim from
    kswapd but ignored it for direct reclaim. The intent was to minimise
    direct reclaim latency, but Yuanhan Liu pointed out that it substitutes one
    long stall for many small stalls and distorts aging for normal workloads
    like streaming readers/writers. Hugh Dickins pointed out that a
    side-effect of the same commit was that when one LRU list dropped to zero,
    the entirety of the other list was shrunk, leading to excessive
    reclaim in memcgs. This patch scans the file/anon lists proportionally
    for direct reclaim as well, so pages age similarly whether reclaimed by
    kswapd or direct reclaim, but takes care to abort reclaim if one LRU
    drops to zero after reclaiming the requested number of pages.

    Based on ext4 and using the Intel VM scalability test

    3.15.0-rc5 (shrinker) vs. 3.15.0-rc5 (proportion):
    Unit lru-file-readonce elapsed 5.3500 ( 0.00%) 5.4200 ( -1.31%)
    Unit lru-file-readonce time_range 0.2700 ( 0.00%) 0.1400 ( 48.15%)
    Unit lru-file-readonce time_stddv 0.1148 ( 0.00%) 0.0536 ( 53.33%)
    Unit lru-file-readtwice elapsed 8.1700 ( 0.00%) 8.1700 ( 0.00%)
    Unit lru-file-readtwice time_range 0.4300 ( 0.00%) 0.2300 ( 46.51%)
    Unit lru-file-readtwice time_stddv 0.1650 ( 0.00%) 0.0971 ( 41.16%)

    The test cases run multiple dd instances reading sparse files. The results
    are within the noise for the small test machine. The impact of the patch
    is more noticeable from the vmstats:

    3.15.0-rc5 (shrinker) vs. 3.15.0-rc5 (proportion):
    Minor Faults 35154 36784
    Major Faults 611 1305
    Swap Ins 394 1651
    Swap Outs 4394 5891
    Allocation stalls 118616 44781
    Direct pages scanned 4935171 4602313
    Kswapd pages scanned 15921292 16258483
    Kswapd pages reclaimed 15913301 16248305
    Direct pages reclaimed 4933368 4601133
    Kswapd efficiency 99% 99%
    Kswapd velocity 670088.047 682555.961
    Direct efficiency 99% 99%
    Direct velocity 207709.217 193212.133
    Percentage direct scans 23% 22%
    Page writes by reclaim 4858.000 6232.000
    Page writes file 464 341
    Page writes anon 4394 5891

    Note that there are fewer allocation stalls even though the amount
    of direct reclaim scanning is roughly the same.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tim Chen
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Cc: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently, we use (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1)
    / KSWAPD_ZONE_BALANCE_GAP_RATIO to avoid a zero gap value. It's better to
    use DIV_ROUND_UP macro for neater code and clear meaning.

    Besides, the gap value is calculated against the per-zone "managed pages",
    not "present pages". This patch also corrects the comment and do some
    rephrasing.
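
    As a sketch of the change (variable names assumed from the surrounding
    kswapd code; simplified):

        /* before: open-coded round-up */
        balance_gap = (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
                        KSWAPD_ZONE_BALANCE_GAP_RATIO;

        /* after: same arithmetic, clearer intent */
        balance_gap = DIV_ROUND_UP(zone->managed_pages,
                                   KSWAPD_ZONE_BALANCE_GAP_RATIO);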

    Signed-off-by: Jianyu Zhan
    Acked-by: Rik van Riel
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • cold is used as a bool, so make it one. Also make the likely case the
    "if" part of the block instead of the else, since according to the
    optimisation manual this is preferred.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Now that we are doing NUMA-aware shrinking, and can have shrinkers
    running in parallel, or working on individual nodes, it seems like we
    should also be sticking the node in the output.

    Signed-off-by: Dave Hansen
    Acked-by: Dave Chinner
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • I was looking at a trace of the slab shrinkers (attachment in this comment):

    https://bugs.freedesktop.org/show_bug.cgi?id=72742#c67

    and noticed that "total_scan" can go negative in some cases. We
    used to dump out the "total_scan" variable directly, but some of
    the shrinker modifications along the way changed that.

    This patch just dumps it out directly, again. It doesn't make
    any sense to derive it from new_nr and nr any more since there
    are now other shrinkers that can be running in parallel and
    mucking with those values.

    Here's an example of the negative numbers in the output:

    > kswapd0-840 [000] 160.869398: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 10 new scan count 39 total_scan 29 last shrinker return val 256
    > kswapd0-840 [000] 160.869618: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 39 new scan count 102 total_scan 63 last shrinker return val 256
    > kswapd0-840 [000] 160.870031: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 102 new scan count 47 total_scan -55 last shrinker return val 768
    > kswapd0-840 [000] 160.870464: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 47 new scan count 45 total_scan -2 last shrinker return val 768
    > kswapd0-840 [000] 163.384144: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 45 new scan count 56 total_scan 11 last shrinker return val 0
    > kswapd0-840 [000] 163.384297: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 56 new scan count 15 total_scan -41 last shrinker return val 256
    > kswapd0-840 [000] 163.384414: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 15 new scan count 117 total_scan 102 last shrinker return val 0
    > kswapd0-840 [000] 163.384657: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 117 new scan count 36 total_scan -81 last shrinker return val 512
    > kswapd0-840 [000] 163.384880: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 36 new scan count 111 total_scan 75 last shrinker return val 256
    > kswapd0-840 [000] 163.385256: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 111 new scan count 34 total_scan -77 last shrinker return val 768
    > kswapd0-840 [000] 163.385598: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 34 new scan count 122 total_scan 88 last shrinker return val 512

    Signed-off-by: Dave Hansen
    Acked-by: Dave Chinner
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • When a loopback NFS mount is active and the backing device for the NFS
    mount becomes congested, that can impose throttling delays on the nfsd
    threads.

    These delays significantly reduce throughput and so the NFS mount remains
    congested.

    This results in a livelock and the reduced throughput persists.

    This livelock has been found in testing with the 'wait_iff_congested'
    call, and could possibly be caused by the 'congestion_wait' call.

    This livelock is similar to the deadlock which justified the introduction
    of PF_LESS_THROTTLE, and the same flag can be used to remove this
    livelock.

    To minimise the impact of the change, we still throttle nfsd when the
    filesystem it is writing to is congested, but not when some separate
    filesystem (e.g. the NFS filesystem) is congested.
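
    The mechanism can be sketched as a small helper consulted before the
    throttling calls (close to, but not necessarily verbatim, the upstream
    change):

        /* Only throttle a PF_LESS_THROTTLE task (e.g. nfsd) when the
         * backing device it is writing to is itself congested. */
        static bool current_may_throttle(void)
        {
                return !(current->flags & PF_LESS_THROTTLE) ||
                       current->backing_dev_info == NULL ||
                       bdi_write_congested(current->backing_dev_info);
        }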

    Signed-off-by: NeilBrown
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • throttle_direct_reclaim() is meant to trigger during swap-over-network,
    during which the min watermark is treated as a pfmemalloc reserve. It
    throttles on the first node in the zonelist, but this is flawed.

    The user-visible impact is that a process running on a CPU whose local
    memory node has no ZONE_NORMAL will stall for prolonged periods of time,
    possibly indefinitely. This is due to throttle_direct_reclaim thinking the
    pfmemalloc reserves are depleted when in fact they don't exist on that
    node.

    On a NUMA machine running a 32-bit kernel (I know) allocation requests
    from CPUs on node 1 would detect no pfmemalloc reserves and the process
    gets throttled. This patch adjusts throttling of direct reclaim to
    throttle based on the first node in the zonelist that has a usable
    ZONE_NORMAL or lower zone.
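
    A sketch of the revised node selection (helper and variable names assumed
    from the surrounding vmscan code):

        /* throttle against the first node in the zonelist that actually has
         * a ZONE_NORMAL or lower zone, i.e. usable pfmemalloc reserves */
        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                        gfp_zone(gfp_mask), nodemask) {
                if (zone_idx(zone) > ZONE_NORMAL)
                        continue;

                pgdat = zone->zone_pgdat;
                if (pfmemalloc_watermark_ok(pgdat))
                        goto out;
                break;
        }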

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • kmem_cache_{create,destroy,shrink} need to get a stable value of
    cpu/node online mask, because they init/destroy/access per-cpu/node
    kmem_cache parts, which can be allocated or destroyed on cpu/mem
    hotplug. To protect against cpu hotplug, these functions use
    {get,put}_online_cpus. However, they do nothing to synchronize with
    memory hotplug - taking the slab_mutex does not eliminate the
    possibility of a race, as described in patch 2.

    What we need there is something like get_online_cpus, but for memory.
    We already have lock_memory_hotplug, which serves for the purpose, but
    it's a bit of a hammer right now, because it's backed by a mutex. As a
    result, it imposes some limitations to locking order, which are not
    desirable, and can't be used just like get_online_cpus. That's why in
    patch 1 I substitute it with get/put_online_mems, which work exactly
    like get/put_online_cpus except they block not cpu, but memory hotplug.

    [ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
    myself, because it used an rw semaphore for get/put_online_mems,
    making them deadlock prone. ]

    This patch (of 2):

    {un}lock_memory_hotplug, which is used to synchronize against memory
    hotplug, is currently backed by a mutex, which makes it a bit of a
    hammer - threads that only want to get a stable value of online nodes
    mask won't be able to proceed concurrently. Also, it imposes some
    strong locking ordering rules on it, which narrows down the set of its
    usage scenarios.

    This patch introduces get/put_online_mems, which are the same as
    get/put_online_cpus, but for memory hotplug, i.e. executing code
    inside a get/put_online_mems section will guarantee a stable value of
    online nodes, present pages, etc.

    lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
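
    Usage mirrors get/put_online_cpus(); a sketch (the per-node setup callee
    is hypothetical):

        int nid;

        get_online_mems();
        /* node_states[N_MEMORY], present pages etc. are stable in here */
        for_each_node_state(nid, N_MEMORY)
                setup_kmem_cache_node(cachep, nid);     /* hypothetical */
        put_online_mems();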

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Prior to this change, we would decide whether to force scan an LRU during
    reclaim if that LRU itself was too small for the current priority.
    However, this can lead to the file LRU getting force scanned even if
    there are a lot of anonymous pages we can reclaim, leading to hot file
    pages getting needlessly reclaimed.

    To address this, we instead only force scan when none of the reclaimable
    LRUs are big enough.

    Gives huge improvements with zswap. For example, when doing -j20 kernel
    build in a 500MB container with zswap enabled, runtime (in seconds) is
    greatly reduced:

    x without this change
    + with this change
    N Min Max Median Avg Stddev
    x 5 700.997 790.076 763.928 754.05 39.59493
    + 5 141.634 197.899 155.706 161.9 21.270224
    Difference at 95.0% confidence
    -592.15 +/- 46.3521
    -78.5293% +/- 6.14709%
    (Student's t, pooled s = 31.7819)

    Should also give some improvements in regular (non-zswap) swap cases.

    Yes, hughd found significant speedup using regular swap, with several
    memcgs under pressure; and it should also be effective in the non-memcg
    case, whenever one or another zone LRU is forced too small.

    Signed-off-by: Suleiman Souhlal
    Signed-off-by: Hugh Dickins
    Cc: Suleiman Souhlal
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Rafael Aquini
    Cc: Michal Hocko
    Cc: Yuanhan Liu
    Cc: Seth Jennings
    Cc: Bob Liu
    Cc: Minchan Kim
    Cc: Luigi Semenzato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suleiman Souhlal
     

07 May, 2014

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • This reverts commit 0bf1457f0cfc ("mm: vmscan: do not swap anon pages
    just because free+file is low") because it introduced a regression in
    mostly-anonymous workloads, where reclaim would become ineffective and
    trap every allocating task in direct reclaim.

    The problem is that there is a runaway feedback loop in the scan balance
    between file and anon, where the balance tips heavily towards a tiny
    thrashing file LRU and anonymous pages are no longer being looked at.
    The commit in question removed the safeguard that would detect such
    situations and respond with forced anonymous reclaim.

    This commit was part of a series to fix premature swapping in loads with
    relatively little cache, and while it made a small difference, the cure
    is obviously worse than the disease. Revert it.

    Signed-off-by: Johannes Weiner
    Reported-by: Christian Borntraeger
    Acked-by: Christian Borntraeger
    Acked-by: Rafael Aquini
    Cc: Rik van Riel
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

09 Apr, 2014

1 commit

  • Page reclaim force-scans / swaps anonymous pages when file cache drops
    below the high watermark of a zone in order to prevent what little cache
    remains from thrashing.

    However, on bigger machines the high watermark value can be quite large
    and when the workload is dominated by a static anonymous/shmem set, the
    file set might just be a small window of used-once cache. In such
    situations, the VM starts swapping heavily when instead it should be
    recycling the no longer used cache.

    This is a longer-standing problem, but it's more likely to trigger after
    commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy")
    because file pages can no longer accumulate in a single zone and are
    dispersed into smaller fractions among the available zones.

    To resolve this, do not force scan anon when file pages are low but
    instead rely on the scan/rotation ratios to make the right prediction.

    Signed-off-by: Johannes Weiner
    Acked-by: Rafael Aquini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Suleiman Souhlal
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

08 Apr, 2014

2 commits

  • We abort direct reclaim if we find the zone is ready for compaction.
    Sometimes the zone is just a promoted highmem zone included to force a
    scan of highmem, which is not the zone the caller actually wants to
    allocate a page from. In this situation, setting aborted_reclaim to
    indicate that the caller should turn back and retry the allocation is a
    waste of time and could cause a loop in __alloc_pages_slowpath().

    This patch does not check compaction_ready() on promoted zones, to avoid
    the above situation. aborted_reclaim is only set if the zone the caller
    intended is ready for compaction.

    Signed-off-by: Weijie Yang
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • We promote sc->gfp_mask to __GFP_HIGHMEM to forcibly scan highmem if
    there are too many buffer_heads pinning highmem. See cc715d99e5 ("mm:
    vmscan: forcibly scan highmem if there are too many buffer_heads pinning
    highmem").

    This patch restores sc->gfp_mask to the caller's original value after
    finishing the scan job, to avoid affecting other functions the caller
    subsequently invokes with it, such as vmpressure_prio() and shrink_slab().
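
    In other words (a sketch, simplified from shrink_zones()):

        gfp_t orig_mask = sc->gfp_mask;

        if (buffer_heads_over_limit)
                sc->gfp_mask |= __GFP_HIGHMEM;  /* forced highmem scan */

        /* ... iterate zones, reclaim ... */

        sc->gfp_mask = orig_mask;               /* restore for the caller */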

    Signed-off-by: Weijie Yang
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     

04 Apr, 2014

6 commits

  • The VM maintains cached filesystem pages on two types of lists. One
    list holds the pages recently faulted into the cache, the other list
    holds pages that have been referenced repeatedly on that first list.
    The idea is to prefer reclaiming young pages over those that have shown
    to benefit from caching in the past. We call the recently used list the
    "inactive list" and the frequently used list the "active list". In
    practice, though, this scheme ultimately was not significantly better
    than a FIFO policy and still thrashed cache based on eviction speed,
    rather than actual demand for cache.

    This patch solves one half of the problem by decoupling the ability to
    detect working set changes from the inactive list size. By maintaining
    a history of recently evicted file pages it can detect frequently used
    pages with an arbitrarily small inactive list size, and subsequently
    apply pressure on the active list based on actual demand for cache, not
    just overall eviction speed.

    Every zone maintains a counter that tracks inactive list aging speed.
    When a page is evicted, a snapshot of this counter is stored in the
    now-empty page cache radix tree slot. On refault, the minimum access
    distance of the page can be assessed, to evaluate whether the page
    should be part of the active list or not.

    This fixes the VM's blindness towards working set changes in excess of
    the inactive list. And it's the foundation to further improve the
    protection ability and reduce the minimum inactive list size of 50%.
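
    Conceptually (a simplified sketch; the pack/unpack helpers named here are
    illustrative, and the real code packs the zone and counter into the shadow
    radix-tree entry):

        /* eviction: snapshot how far the inactive list has aged */
        shadow = pack_shadow(zone, atomic_long_read(&zone->inactive_age));

        /* refault: distance = inactive-list aging that happened meanwhile */
        refault_distance = atomic_long_read(&zone->inactive_age) -
                           unpack_shadow_age(shadow);

        /* the page would have stayed resident had the active list been that
         * much smaller, so let it challenge the active list */
        if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE))
                SetPageActive(page);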

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Reclaim will be leaving shadow entries in the page cache radix tree upon
    evicting the real page. As those pages are found from the LRU, an
    iput() can lead to the inode being freed concurrently. At this point,
    reclaim must no longer install shadow pages because the inode freeing
    code needs to ensure the page tree is really empty.

    Add an address_space flag, AS_EXITING, that the inode freeing code sets
    under the tree lock before doing the final truncate. Reclaim will check
    for this flag before installing shadow pages.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The name `max_pass' is misleading, because this variable actually keeps
    the estimated number of freeable objects, not the maximal number of
    objects we can scan in this pass, which can be twice that. Rename it to
    reflect its actual meaning.

    Signed-off-by: Vladimir Davydov
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • There is no need to pass a shrink_control struct on from
    try_to_free_pages() and friends to do_try_to_free_pages() and then to
    shrink_zones(), because it is only used in shrink_zones() and the only
    field initialized on the top level is gfp_mask, which is always equal to
    scan_control.gfp_mask. So let's move shrink_control initialization to
    shrink_zones().
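
    A sketch of where the initialization ends up:

        static void shrink_zones(struct zonelist *zonelist,
                                 struct scan_control *sc)
        {
                struct shrink_control shrink = {
                        .gfp_mask = sc->gfp_mask,
                };
                /* ... zone iteration, shrink_slab(&shrink, ...) ... */
        }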

    Signed-off-by: Vladimir Davydov
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Dave Chinner
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This reduces the indentation level of do_try_to_free_pages() and removes
    the extra loop over all eligible zones counting the number of on-LRU pages.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Glauber Costa
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When direct reclaim is executed by a process bound to a set of NUMA
    nodes, we should scan only those nodes when possible, but currently we
    will scan kmem from all online nodes even if the kmem shrinker is NUMA
    aware. As a result, binding a process to a particular NUMA node won't
    prevent it from shrinking inode/dentry caches from other nodes, which is
    not good. Fix this.

    Signed-off-by: Vladimir Davydov
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Dave Chinner
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

30 Jan, 2014

1 commit

  • The VM is currently heavily tuned to avoid swapping. Whether that is
    good or bad is a separate discussion, but as long as the VM won't swap
    to make room for dirty cache, we can not consider anonymous pages when
    calculating the amount of dirtyable memory, the baseline to which
    dirty_background_ratio and dirty_ratio are applied.

    A simple workload that occupies a significant size (40+%, depending on
    memory layout, storage speeds etc.) of memory with anon/tmpfs pages and
    uses the remainder for a streaming writer demonstrates this problem. In
    that case, the actual cache pages are a small fraction of what is
    considered dirtyable overall, which results in a relatively large
    portion of the cache pages being dirtied. As kswapd starts rotating
    these, random tasks enter direct reclaim and stall on IO.

    Only consider free pages and file pages dirtyable.
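
    Roughly, the dirtyable calculation becomes (a sketch; highmem handling
    omitted):

        unsigned long x;

        x  = global_page_state(NR_FREE_PAGES);
        x -= min(x, dirty_balance_reserve);
        x += global_page_state(NR_INACTIVE_FILE);
        x += global_page_state(NR_ACTIVE_FILE);
        /* the anon LRUs are deliberately no longer added here */

        return x + 1;   /* make sure it is never zero */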

    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Tested-by: Tejun Heo
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Wu Fengguang
    Reviewed-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

24 Jan, 2014

3 commits

  • If a shrinker is not NUMA-aware, shrink_slab() should call it exactly
    once with nid=0, but currently it is not true: if node 0 is not set in
    the nodemask or if it is not online, we will not call such shrinkers at
    all. As a result some slabs will be left untouched under some
    circumstances. Let us fix it.
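
    A sketch of the fixed dispatch in shrink_slab() (names as they appeared
    around that kernel; locking and bookkeeping omitted):

        list_for_each_entry(shrinker, &shrinker_list, list) {
                if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
                        /* not NUMA aware: call exactly once, with nid 0,
                         * regardless of the nodemask */
                        shrinkctl->nid = 0;
                        freed += shrink_slab_node(shrinkctl, shrinker,
                                                  nr_pages_scanned, lru_pages);
                        continue;
                }

                for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan)
                        if (node_online(shrinkctl->nid))
                                freed += shrink_slab_node(shrinkctl, shrinker,
                                                          nr_pages_scanned,
                                                          lru_pages);
        }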

    Signed-off-by: Vladimir Davydov
    Reported-by: Dave Chinner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When reclaiming kmem, we currently don't scan slabs that have less than
    batch_size objects (see shrink_slab_node()):

        while (total_scan >= batch_size) {
                shrinkctl->nr_to_scan = batch_size;
                shrinker->scan_objects(shrinker, shrinkctl);
                total_scan -= batch_size;
        }

    If there are only a few shrinkers available, such a behavior won't cause
    any problems, because the batch_size is usually small, but if we have a
    lot of slab shrinkers, which is perfectly possible since FS shrinkers
    are now per-superblock, we can end up with hundreds of megabytes of
    practically unreclaimable kmem objects. For instance, mounting a
    thousand ext2 FS images with a hundred files in each and iterating
    over all the files using du(1) will result in about 200 MB of FS caches
    that cannot be dropped even with the aid of the vm.drop_caches sysctl!

    This problem was initially pointed out by Glauber Costa [*]. Glauber
    proposed to fix it by making shrink_slab() always take at least one
    pass - put simply, turning the scan loop above into a do{}while()
    loop. However, this proposal was rejected, because it could result in
    more aggressive and frequent slab shrinking even under low memory
    pressure when total_scan is naturally very small.

    This patch is a slightly modified version of Glauber's approach.
    Similarly to Glauber's patch, it lets shrink_slab() scan fewer than
    batch_size objects, but only if the total number of objects we want to
    scan (total_scan) is greater than the total number of objects available
    (max_pass). Since total_scan is clamped to half of max_pass if the
    current delta is small:

        if (delta < max_pass / 4)
                total_scan = min(total_scan, max_pass / 2);

    this is only possible if we are scanning at high prio. That said, this
    patch shouldn't change the vmscan behaviour if the memory pressure is
    low, but if we are tight on memory, we will do our best by trying to
    reclaim all available objects, which sounds reasonable.

    [*] http://www.spinics.net/lists/cgroups/msg06913.html
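
    The adjusted loop can be sketched as follows (not the verbatim patch): a
    final sub-batch pass is taken when we want to scan more objects than are
    actually available:

        while (total_scan >= batch_size ||
               total_scan >= max_pass) {
                unsigned long nr_to_scan = min(total_scan, batch_size);

                shrinkctl->nr_to_scan = nr_to_scan;
                shrinker->scan_objects(shrinker, shrinkctl);
                total_scan -= nr_to_scan;
        }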

    Signed-off-by: Vladimir Davydov
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Dave Chinner
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page at various VM_BUG_ON sites, I've noticed that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which, beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON().
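
    A typical conversion at an assertion site looks like this (the specific
    assertion is illustrative):

        /* before: no information about the offending page on failure */
        VM_BUG_ON(PageTail(page));

        /* after: the page is dumped before the BUG fires */
        VM_BUG_ON_PAGE(PageTail(page), page);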

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

17 Oct, 2013

1 commit

  • This leak was added by commit 1d3d4437eae1 ("vmscan: per-node deferred
    work").

    unreferenced object 0xffff88006ada3bd0 (size 8):
    comm "criu", pid 14781, jiffies 4295238251 (age 105.641s)
    hex dump (first 8 bytes):
    00 00 00 00 00 00 00 00 ........
    backtrace:
    [] kmemleak_alloc+0x5e/0xc0
    [] __kmalloc+0x247/0x310
    [] register_shrinker+0x3c/0xa0
    [] sget+0x5ab/0x670
    [] proc_mount+0x54/0x170
    [] mount_fs+0x43/0x1b0
    [] vfs_kern_mount+0x72/0x110
    [] kern_mount_data+0x19/0x30
    [] pid_ns_prepare_proc+0x20/0x40
    [] alloc_pid+0x466/0x4a0
    [] copy_process+0xc6a/0x1860
    [] do_fork+0x8b/0x370
    [] SyS_clone+0x16/0x20
    [] stub_clone+0x69/0x90
    [] 0xffffffffffffffff

    Signed-off-by: Andrew Vagin
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Glauber Costa
    Cc: Chuck Lever
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Vagin
     

01 Oct, 2013

1 commit

  • Isolated balloon pages can wrongly end up in LRU lists when
    migrate_pages() finishes its round without draining all the isolated
    page list.

    The same issue can happen when reclaim_clean_pages_from_list() tries to
    reclaim pages from an isolated page list, before migration, in the CMA
    path. Such balloon page leak opens a race window against LRU lists
    shrinkers that leads us to the following kernel panic:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
    IP: [] shrink_page_list+0x24e/0x897
    PGD 3cda2067 PUD 3d713067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    RIP: shrink_page_list+0x24e/0x897
    RSP: 0000:ffff88003da499b8 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5
    RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40
    RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45
    R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40
    R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58
    FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0
    Call Trace:
    shrink_inactive_list+0x240/0x3de
    shrink_lruvec+0x3e0/0x566
    __shrink_zone+0x94/0x178
    shrink_zone+0x3a/0x82
    balance_pgdat+0x32a/0x4c2
    kswapd+0x2f0/0x372
    kthread+0xa2/0xaa
    ret_from_fork+0x7c/0xb0
    Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb
    RIP [] shrink_page_list+0x24e/0x897
    RSP
    CR2: 0000000000000028
    ---[ end trace 703d2451af6ffbfd ]---
    Kernel panic - not syncing: Fatal exception

    This patch fixes the issue by ensuring the proper tests are made at
    putback_movable_pages() & reclaim_clean_pages_from_list() to avoid
    isolated balloon pages being wrongly reinserted in LRU lists.

    [akpm@linux-foundation.org: clarify awkward comment text]
    Signed-off-by: Rafael Aquini
    Reported-by: Luiz Capitulino
    Tested-by: Luiz Capitulino
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

13 Sep, 2013

3 commits

  • Merge more patches from Andrew Morton:
    "The rest of MM. Plus one misc cleanup"

    * emailed patches from Andrew Morton: (35 commits)
    mm/Kconfig: add MMU dependency for MIGRATION.
    kernel: replace strict_strto*() with kstrto*()
    mm, thp: count thp_fault_fallback anytime thp fault fails
    thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
    thp: do_huge_pmd_anonymous_page() cleanup
    thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
    mm: cleanup add_to_page_cache_locked()
    thp: account anon transparent huge pages into NR_ANON_PAGES
    truncate: drop 'oldsize' truncate_pagecache() parameter
    mm: make lru_add_drain_all() selective
    memcg: document cgroup dirty/writeback memory statistics
    memcg: add per cgroup writeback pages accounting
    memcg: check for proper lock held in mem_cgroup_update_page_stat
    memcg: remove MEMCG_NR_FILE_MAPPED
    memcg: reduce function dereference
    memcg: avoid overflow caused by PAGE_ALIGN
    memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
    memcg: correct RESOURCE_MAX to ULLONG_MAX
    mm: memcg: do not trap chargers with full callstack on OOM
    mm: memcg: rework and document OOM waiting and wakeup
    ...

    Linus Torvalds
     
  • Clean up some mess made by the "Soft limit rework" series, and a few other
    things.

    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • shrink_zone starts with a soft reclaim pass first and then falls back to
    regular reclaim if nothing has been scanned. This behavior is natural
    but there is a catch. Memcg iterators, when used with the reclaim
    cookie, are designed to help prevent over-reclaim by interleaving
    reclaimers (per node-zone-priority), so the tree walk might miss many
    (even all) nodes in the hierarchy, e.g. when there are direct reclaimers
    racing with each other or with kswapd in the global case, or multiple
    allocators reaching the limit in the target reclaim case. To make it
    even more complicated, targeted reclaim doesn't do the whole tree walk
    because it stops reclaiming once it reclaims sufficient pages. As a
    result, groups over the limit might be missed, thus nothing is scanned,
    and reclaim would fall back to the reclaim-all mode.

    This patch checks for the incomplete tree walk in shrink_zone. If no
    group has been visited and the hierarchy is soft reclaimable, then we
    must have missed some groups, in which case __shrink_zone() is called
    again. This doesn't guarantee there will be some progress, of course,
    because the current reclaimer might still be racing with others, but it
    at least gives a chance to start the walk without a big risk of
    reclaim latencies.

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Glauber Costa
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Tejun Heo
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko