27 Jan, 2015

1 commit

  • for_each_zone_zonelist_nodemask wants an enum zone_type argument, but is
    passed gfp_t:

    mm/vmscan.c:2658:9: warning: incorrect type in argument 2 (different base types)
    mm/vmscan.c:2658:9:    expected int enum zone_type [signed] highest_zoneidx
    mm/vmscan.c:2658:9:    got restricted gfp_t [usertype] gfp_mask

    Convert the argument to the correct type.
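
    A minimal sketch of the fix pattern (illustrative, not necessarily the
    verbatim diff): gfp_zone() maps a gfp mask to the highest usable zone
    index, which is what argument 2 of the macro expects.

    for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                    gfp_zone(sc.gfp_mask), sc.nodemask) {
            /* walk only zones usable for this allocation */
    }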

    Signed-off-by: Michael S. Tsirkin
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Suleiman Souhlal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     

09 Jan, 2015

1 commit

  • Charles Shirron and Paul Cassella from Cray Inc have reported kswapd
    stuck in a busy loop with nothing left to balance, but
    kswapd_try_to_sleep() failing to sleep. Their analysis found the cause
    to be a combination of several factors:

    1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait

    2. The process has been killed (by OOM in this case), but has not yet been
    scheduled to remove itself from the waitqueue and die.

    3. kswapd checks for throttled processes in prepare_kswapd_sleep():

    if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
            wake_up(&pgdat->pfmemalloc_wait);
            return false; // kswapd will not go to sleep
    }

    However, for a process that was already killed, wake_up() does not remove
    the process from the waitqueue, since try_to_wake_up() checks its state
    first and returns false when the process is no longer waiting.

    4. kswapd is running on the same CPU as the only CPU that the process is
    allowed to run on (through cpus_allowed, or possibly single-cpu system).

    5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd
    encounters no voluntary preemption points and repeatedly fails
    prepare_kswapd_sleep(), blocking the process from running and removing
    itself from the waitqueue, which would let kswapd sleep.

    So, the source of the problem is that we prevent kswapd from going to
    sleep while there are processes waiting on the pfmemalloc_wait queue,
    and a process waiting on a queue is guaranteed to be removed from the
    queue only when it gets scheduled. This was done to make sure that no
    process is left sleeping on pfmemalloc_wait when kswapd itself goes to
    sleep.

    However, it isn't necessary to postpone kswapd sleep until the
    pfmemalloc_wait queue actually empties. To prevent processes from being
    left sleeping, it's actually enough to guarantee that all processes
    waiting on pfmemalloc_wait queue have been woken up by the time we put
    kswapd to sleep.

    This patch therefore fixes this issue by substituting 'wake_up' with
    'wake_up_all' and removing 'return false' in the code snippet from
    prepare_kswapd_sleep() above. Note that if any process puts itself in
    the queue after this waitqueue_active() check, or after the wake up
    itself, it means that the process will also wake up kswapd - and since
    we are under prepare_to_wait(), the wake up won't be missed. Also, the
    comment in prepare_kswapd_sleep() is updated to more clearly describe
    the races it is preventing.
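
    For reference, a sketch of what the check above looks like after the
    change, reconstructed from this description rather than the verbatim
    diff:

    if (waitqueue_active(&pgdat->pfmemalloc_wait))
            wake_up_all(&pgdat->pfmemalloc_wait);
    /* no "return false" here - kswapd may still go to sleep */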

    Fixes: 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage")
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Vladimir Davydov
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Rik van Riel
    Cc: [3.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

14 Dec, 2014

1 commit

  • The slab shrinkers are currently invoked from the zonelist walkers in
    kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
    eligible LRU pages and assemble a nodemask to pass to NUMA-aware
    shrinkers, which then again have to walk over the nodemask. This is
    redundant code, extra runtime work, and fairly inaccurate when it comes to
    the estimation of actually scannable LRU pages. The code duplication will
    only get worse when making the shrinkers cgroup-aware and requiring them
    to have out-of-band cgroup hierarchy walks as well.

    Instead, invoke the shrinkers from shrink_zone(), which is where all
    reclaimers end up, to avoid this duplication.

    Take the count for eligible LRU pages out of get_scan_count(), which
    considers many more factors than just the availability of swap space, like
    zone_reclaimable_pages() currently does. Accumulate the number over all
    visited lruvecs to get the per-zone value.

    Some nodes have multiple zones due to memory addressing restrictions. To
    avoid putting too much pressure on the shrinkers, only invoke them once
    for each such node, using the class zone of the allocation as the pivot
    zone.
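
    A hedged sketch of that pivot-zone idea; the helper name and argument
    list are illustrative, not the exact upstream code:

    /* run the shrinkers from shrink_zone(), but only once per node */
    bool is_classzone = (zone_idx(zone) == classzone_idx);

    if (global_reclaim(sc) && is_classzone)
            shrink_node_slabs(sc, zone_to_nid(zone)); /* illustrative helper */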

    For now, this integrates the slab shrinking better into the reclaim logic
    and gets rid of duplicative invocations from kswapd, direct reclaim, and
    zone reclaim. It also prepares for cgroup-awareness, allowing
    memcg-capable shrinkers to be added at the lruvec level without much
    duplication of both code and runtime work.

    This changes kswapd behavior, which used to invoke the shrinkers for each
    zone, but with scan ratios gathered from the entire node, resulting in
    meaningless pressure quantities on multi-zone nodes.

    Zone reclaim behavior also changes. It used to shrink slabs until the
    same amount of pages were shrunk as were reclaimed from the LRUs. Now it
    merely invokes the shrinkers once with the zone's scan ratio, which makes
    the shrinkers go easier on caches that implement aging and would prefer
    feeding back pressure from recently used slab objects to unused LRU pages.

    [vdavydov@parallels.com: assure class zone is populated]
    Signed-off-by: Johannes Weiner
    Cc: Dave Chinner
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

12 Dec, 2014

1 commit

  • Pull cgroup update from Tejun Heo:
    "cpuset got simplified a bit. cgroup core got a fix on unified
    hierarchy and grew some effective css related interfaces which will be
    used for blkio support for writeback IO traffic which is currently
    being worked on"

    * 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: implement cgroup_get_e_css()
    cgroup: add cgroup_subsys->css_e_css_changed()
    cgroup: add cgroup_subsys->css_released()
    cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
    cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
    cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
    cpuset: lock vs unlock typo
    cpuset: simplify cpuset_node_allowed API
    cpuset: convert callback_mutex to a spinlock

    Linus Torvalds
     

11 Dec, 2014

3 commits

    Compaction relies on zone watermark checks for decisions such as whether
    it's worth starting compaction in compaction_suitable() or whether
    compaction should stop in compact_finished(). The watermark checks take
    classzone_idx and alloc_flags parameters, which are related to the memory
    allocation request. But from the context of compaction they are currently
    passed as 0, including the direct compaction which is invoked to satisfy
    the allocation request, and could therefore know the proper values.

    The lack of proper values can lead to mismatch between decisions taken
    during compaction and decisions related to the allocation request. Lack
    of proper classzone_idx value means that lowmem_reserve is not taken into
    account. This has manifested (during recent changes to deferred
    compaction) when DMA zone was used as fallback for preferred Normal zone.
    compaction_suitable() without proper classzone_idx would think that the
    watermarks are already satisfied, but watermark check in
    get_page_from_freelist() would fail. Because of this problem, deferring
    compaction has extra complexity that can be removed in the following
    patch.

    The issue (not confirmed in practice) with missing alloc_flags is opposite
    in nature. For allocations that include ALLOC_HIGH, ALLOC_HARDER or
    ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on
    CMA-enabled systems) the watermark checking in compaction with 0 passed
    will be stricter than in get_page_from_freelist(). In these cases
    compaction might be running for a longer time than is really needed.

    Another issue with compaction_suitable() is that the check for "does the
    zone need compaction at all?" comes only after the check "does the zone
    have enough free pages to succeed compaction". The latter considers extra
    pages for migration and can therefore in some situations fail and return
    COMPACT_SKIPPED, although the high-order allocation would succeed and we
    should return COMPACT_PARTIAL.

    This patch fixes these problems by adding alloc_flags and classzone_idx to
    struct compact_control and related functions involved in direct compaction
    and watermark checking. Where possible, all other callers of
    compaction_suitable() pass proper values where those are known. This is
    currently limited to classzone_idx, which is sometimes known in kswapd
    context. However, the direct reclaim callers should_continue_reclaim()
    and compaction_ready() do not currently know the proper values, so the
    coordination between reclaim and compaction may still not be as accurate
    as it could. This can be fixed later, if it's shown to be an issue.

    Additionally, the checks in compaction_suitable() are reordered to
    address the second issue described above.
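
    A sketch of the parameterized check, assuming the usual
    zone_watermark_ok() interface (illustrative, not the verbatim patch):

    /* compaction_suitable() now sees the allocation's real context */
    unsigned long watermark = low_wmark_pages(zone);

    /*
     * reordered: if the high-order watermark is already met, there is
     * no need to compact at all
     */
    if (zone_watermark_ok(zone, order, watermark, classzone_idx, alloc_flags))
            return COMPACT_PARTIAL;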

    The effect of this patch should be slightly better high-order allocation
    success rates and/or less compaction overhead, depending on the type of
    allocations and presence of CMA. It allows simplifying deferred
    compaction code in a followup patch.

    When testing with stress-highalloc, there was some slight improvement
    (which might be just due to variance) in success rates of non-THP-like
    allocations.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • shrink_page_list() counts all pages with a mapping, including clean pages,
    toward nr_congested if they're on a write-congested BDI.
    shrink_inactive_list() then sets ZONE_CONGESTED if nr_dirty ==
    nr_congested. Fix this apples-to-oranges comparison by only counting
    pages for nr_congested if they count for nr_dirty.
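
    A sketch of the corrected accounting in shrink_page_list(), assuming the
    local dirty/writeback flags of that era's code (not the verbatim diff):

    if (dirty || writeback)
            nr_dirty++;

    /* only pages that counted for nr_dirty may count for nr_congested */
    if ((dirty || writeback) && mapping &&
        bdi_write_congested(mapping->backing_dev_info))
            nr_congested++;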

    Signed-off-by: Jamie Liu
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jamie Liu
     
    This patch replaces printk(KERN_ERR ...) with pr_err() under
    shrink_slab(). It also saves one line thanks to the shorter format.

    Signed-off-by: Pintu Kumar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pintu Kumar
     

27 Oct, 2014

1 commit

    The current cpuset API for checking whether a zone/node is allowed to
    allocate from looks rather awkward. We have hardwall and softwall versions of
    cpuset_node_allowed with the softwall version doing literally the same
    as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
    If it isn't, the softwall version may check the given node against the
    enclosing hardwall cpuset, which it needs to take the callback lock to
    do.

    Such a distinction was introduced by commit 02a0e53d8227 ("cpuset:
    rework cpuset_zone_allowed api"). Before, we had the only version with
    the __GFP_HARDWALL flag determining its behavior. The purpose of the
    commit was to avoid sleep-in-atomic bugs when someone would mistakenly
    call the function without the __GFP_HARDWALL flag for an atomic
    allocation. The suffixes introduced were intended to make the callers
    think before using the function.

    However, since the callback lock was converted from mutex to spinlock by
    the previous patch, the softwall check function cannot sleep, and these
    precautions are no longer necessary.

    So let's simplify the API back to the single check.
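
    A hypothetical call site after the simplification (the exact helper used
    on the allocator's zonelist path may differ):

    /* one check; __GFP_HARDWALL in gfp_mask selects the stricter hardwall mode */
    if (!cpuset_zone_allowed(zone, gfp_mask))
            continue;       /* zone is outside the allowed cpuset */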

    Suggested-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Vladimir Davydov
     

10 Oct, 2014

4 commits

  • In a memcg with even just moderate cache pressure, success rates for
    transparent huge page allocations drop to zero, wasting a lot of effort
    that the allocator puts into assembling these pages.

    The reason for this is that the memcg reclaim code was never designed for
    higher-order charges. It reclaims in small batches until there is room
    for at least one page. Huge page charges only succeed when these batches
    add up over a series of huge faults, which is unlikely under any
    significant load involving order-0 allocations in the group.

    Remove that loop on the memcg side in favor of passing the actual reclaim
    goal to direct reclaim, which is already set up and optimized to meet
    higher-order goals efficiently.
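
    A hedged sketch of the resulting charge path - the argument list is
    illustrative, not the exact upstream signature:

    /* ask reclaim for the full charge size instead of looping in small batches */
    nr_reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages,
                                                gfp_mask, may_swap);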

    This brings memcg's THP policy in line with the system policy: if the
    allocator painstakingly assembles a hugepage, memcg will at least make an
    honest effort to charge it. As a result, transparent hugepage allocation
    rates amid cache activity are drastically improved:

    vanilla patched
    pgalloc 4717530.80 ( +0.00%) 4451376.40 ( -5.64%)
    pgfault 491370.60 ( +0.00%) 225477.40 ( -54.11%)
    pgmajfault 2.00 ( +0.00%) 1.80 ( -6.67%)
    thp_fault_alloc 0.00 ( +0.00%) 531.60 (+100.00%)
    thp_fault_fallback 749.00 ( +0.00%) 217.40 ( -70.88%)

    [ Note: this may in turn increase memory consumption from internal
    fragmentation, which is an inherent risk of transparent hugepages.
    Some setups may have to adjust the memcg limits accordingly to
    accommodate this - or, if the machine is already packed to capacity,
    disable the transparent huge page feature. ]

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Page reclaim tests zone_is_reclaim_dirty(), but the site that actually
    sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending the
    reader through layers of indirection just to track down a simple bit.

    Remove all zone flag wrappers and just use bitops against zone->flags
    directly. It's just as readable and the lines are barely any longer.

    Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and
    remove the zone_flags_t typedef.

    Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    The deprecation warnings for the scan_unevictable interface are triggered
    by scripts doing `sysctl -a | grep something else'. This is annoying and not
    helpful.

    The interface has been defunct since 264e56d8247e ("mm: disable user
    interface to manually rescue unevictable pages"), which was in 2011, and
    there haven't been any reports of usecases for it, only reports that the
    deprecation warnings are annoying. It's unlikely that anybody is using
    this interface specifically at this point, so remove it.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When direct sync compaction is often unsuccessful, it may become deferred
    for some time to avoid further useless attempts, both sync and async.
    Successful high-order allocations un-defer compaction, while further
    unsuccessful compaction attempts prolong the compaction deferred period.

    Currently the checking and setting deferred status is performed only on
    the preferred zone of the allocation that invoked direct compaction. But
    compaction itself is attempted on all eligible zones in the zonelist, so
    the behavior is suboptimal and may lead to scenarios where 1) compaction
    is attempted uselessly, or 2) it's not attempted despite good chances of
    succeeding, as shown in the examples below:

    1) A direct compaction with Normal preferred zone failed and set
    deferred compaction for the Normal zone. Another unrelated direct
    compaction with DMA32 as preferred zone will attempt to compact DMA32
    zone even though the first compaction attempt also included DMA32 zone.

    In another scenario, compaction with Normal preferred zone failed to
    compact Normal zone, but succeeded in the DMA32 zone, so it will not
    defer compaction. In the next attempt, it will try Normal zone which
    will fail again, instead of skipping Normal zone and trying DMA32
    directly.

    2) Kswapd will balance DMA32 zone and reset defer status based on
    watermarks looking good. A direct compaction with preferred Normal
    zone will skip compaction of all zones including DMA32 because Normal
    was still deferred. The allocation might have succeeded in DMA32, but
    won't.

    This patch makes compaction deferring work on an individual zone basis
    instead of the preferred zone only. For each zone, it checks compaction_deferred()
    to decide if the zone should be skipped. If watermarks fail after
    compacting the zone, defer_compaction() is called. The zone where
    watermarks passed can still be deferred when the allocation attempt is
    unsuccessful. When allocation is successful, compaction_defer_reset() is
    called for the zone containing the allocated page. This approach should
    approximate calling defer_compaction() only on zones where compaction was
    attempted and did not yield allocated page. There might be corner cases
    but that is inevitable as long as the decision to stop compacting does not
    guarantee that a page will be allocated.

    Due to a new COMPACT_DEFERRED return value, some functions relying
    implicitly on COMPACT_SKIPPED = 0 had to be updated, with comments made
    more accurate. The did_some_progress output parameter of
    __alloc_pages_direct_compact() is removed completely, as the caller
    actually does not use it after compaction sets it - it is only considered
    when direct reclaim sets it.

    During testing on a two-node machine with a single very small Normal zone
    on node 1, this patch has improved success rates in stress-highalloc
    mmtests benchmark. The success rates here were previously made worse by commit
    3a025760fc15 ("mm: page_alloc: spill to remote nodes before waking
    kswapd") as kswapd was no longer resetting often enough the deferred
    compaction for the Normal zone, and DMA32 zones on both nodes were thus
    not considered for compaction. On a different machine, success rates were
    improved with __GFP_NO_KSWAPD allocations.

    [akpm@linux-foundation.org: fix CONFIG_COMPACTION=n build]
    Signed-off-by: Vlastimil Babka
    Acked-by: Minchan Kim
    Reviewed-by: Zhang Yanfei
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

09 Aug, 2014

2 commits

  • Pages are now uncharged at release time, and all sources of batched
    uncharges operate on lists of pages. Directly use those lists, and
    get rid of the per-task batching state.

    This also batches statistics accounting, in addition to the res
    counter charges, to reduce IRQ-disabling and re-enabling.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Naoya Horiguchi
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg uncharging code that is involved towards the end of a page's
    lifetime - truncation, reclaim, swapout, migration - is impressively
    complicated and fragile.

    Because anonymous and file pages were always charged before they had their
    page->mapping established, uncharges had to happen when the page type
    could still be known from the context; as in unmap for anonymous, page
    cache removal for file and shmem pages, and swap cache truncation for swap
    pages. However, these operations happen well before the page is actually
    freed, and so a lot of synchronization is necessary:

    - Charging, uncharging, page migration, and charge migration all need
    to take a per-page bit spinlock as they could race with uncharging.

    - Swap cache truncation happens during both swap-in and swap-out, and
    possibly repeatedly before the page is actually freed. This means
    that the memcg swapout code is called from many contexts that make
    no sense and it has to figure out the direction from page state to
    make sure memory and memory+swap are always correctly charged.

    - On page migration, the old page might be unmapped but then reused,
    so memcg code has to prevent untimely uncharging in that case.
    Because this code - which should be a simple charge transfer - is so
    special-cased, it is not reusable for replace_page_cache().

    But now that charged pages always have a page->mapping, introduce
    mem_cgroup_uncharge(), which is called after the final put_page(), when we
    know for sure that nobody is looking at the page anymore.

    For page migration, introduce mem_cgroup_migrate(), which is called after
    the migration is successful and the new page is fully rmapped. Because
    the old page is no longer uncharged after migration, prevent double
    charges by decoupling the page's memcg association (PCG_USED and
    pc->mem_cgroup) from the page holding an actual charge. The new bits
    PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
    to the new page during migration.

    mem_cgroup_migrate() is suitable for replace_page_cache() as well,
    which gets rid of mem_cgroup_replace_page_cache(). However, care
    needs to be taken because both the source and the target page can
    already be charged and on the LRU when fuse is splicing: grab the page
    lock on the charge moving side to prevent changing pc->mem_cgroup of a
    page under migration. Also, the lruvecs of both pages change as we
    uncharge the old and charge the new during migration, and putback may
    race with us, so grab the lru lock and isolate the pages iff on LRU to
    prevent races and ensure the pages are on the right lruvec afterward.

    Swap accounting is massively simplified: because the page is no longer
    uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
    transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
    before the final put_page() in page reclaim.

    Finally, page_cgroup changes are now protected by whatever protection the
    page itself offers: anonymous pages are charged under the page table lock,
    whereas page cache insertions, swapin, and migration hold the page lock.
    Uncharging happens under full exclusion with no outstanding references.
    Charging and uncharging also ensure that the page is off-LRU, which
    serializes against charge migration. Remove the very costly page_cgroup
    lock and set pc->flags non-atomically.

    [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
    [vdavydov@parallels.com: fix flags definition]
    Signed-off-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Tested-by: Jet Chen
    Acked-by: Michal Hocko
    Tested-by: Felipe Balbi
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

07 Aug, 2014

9 commits

    When memory cgroups are enabled, the code that decides to force scanning
    of anonymous pages in get_scan_count() compares global values (free,
    high_watermark) to a value that is restricted to a memory cgroup (file).
    This makes the code over-eager to force an anon scan.

    For instance, it will force an anon scan when scanning a memcg that is
    mainly populated by anonymous pages, even when there are plenty of file
    pages to get rid of in other memcgs, and even when swappiness == 0. It
    breaks the user's expectations about swappiness and hurts performance.

    This patch makes sure that a forced anon scan only happens when there are
    not enough file pages for the whole zone, not just in one random memcg.
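
    A sketch of the zone-wide comparison described above (close to, but not
    necessarily, the exact patch):

    if (global_reclaim(sc)) {
            unsigned long zonefree = zone_page_state(zone, NR_FREE_PAGES);
            unsigned long zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
                                     zone_page_state(zone, NR_INACTIVE_FILE);

            /* force anon scan only if the whole zone is short on file pages */
            if (unlikely(zonefile + zonefree <= high_wmark_pages(zone))) {
                    scan_balance = SCAN_ANON;
                    goto out;
            }
    }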

    [hannes@cmpxchg.org: cleanups]
    Signed-off-by: Jerome Marchand
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
    Quite a while ago, get_scan_ratio() was renamed get_scan_count();
    however, a comment in shrink_active_list() still mentions it. This patch
    fixes the outdated comment.

    Signed-off-by: Jerome Marchand
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • zone->pages_scanned is a write-intensive cache line during page reclaim
    and it's also updated during page free. Move the counter into vmstat to
    take advantage of the per-cpu updates and do not update it in the free
    paths unless necessary.

    On a small UMA machine running tiobench the difference is marginal. On
    a 4-node machine the overhead is more noticeable. Note that automatic
    NUMA balancing was disabled for this test as otherwise the system CPU
    overhead is unpredictable.

                   3.16.0-rc3    3.16.0-rc3    3.16.0-rc3
                      vanilla  rearrange-v5     vmstat-v5
    User               746.94        759.78        774.56
    System           65336.22      58350.98      32847.27
    Elapsed          27553.52      27282.02      27415.04

    Note that the overhead reduction will vary depending on where exactly
    pages are allocated and freed.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • vm_total_pages is calculated by nr_free_pagecache_pages(), which counts
    the number of pages which are beyond the high watermark within all
    zones. So vm_total_pages is not equal to the total number of pages which
    the VM controls.

    Signed-off-by: Wang Sheng-Hui
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • Reorder the members by input and output, then turn the individual
    integers for may_writepage, may_unmap, may_swap, compaction_ready,
    hibernation_mode into bit fields to save stack space:

    +72/-296 -224
    kswapd 104 176 +72
    try_to_free_pages 80 56 -24
    try_to_free_mem_cgroup_pages 80 56 -24
    shrink_all_memory 88 64 -24
    reclaim_clean_pages_from_list 168 144 -24
    mem_cgroup_shrink_node_zone 104 80 -24
    __zone_reclaim 176 152 -24
    balance_pgdat 152 - -152
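
    A sketch of the reordered structure per the description above (other
    members omitted; the exact grouping is assumed):

    struct scan_control {
            /* ... incoming reclaim parameters first ... */

            /* inputs: what reclaim is allowed to do */
            unsigned int may_writepage:1;
            unsigned int may_unmap:1;
            unsigned int may_swap:1;
            unsigned int hibernation_mode:1;

            /* output: one zone is ready for compaction */
            unsigned int compaction_ready:1;

            /* ... scan/reclaim counters last ... */
    };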

    Signed-off-by: Johannes Weiner
    Suggested-by: Mel Gorman
    Acked-by: Mel Gorman
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Swappiness is determined for each scanned memcg individually in
    shrink_zone() and is not a parameter that applies throughout the reclaim
    scan. Move it out of struct scan_control to prevent accidental use of a
    stale value.
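
    A minimal sketch of the idea (argument order illustrative):

    /* look up swappiness per scanned memcg, right where the lruvec is shrunk */
    int swappiness = mem_cgroup_swappiness(memcg);

    shrink_lruvec(lruvec, swappiness, sc);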

    Signed-off-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Direct reclaim currently calls shrink_zones() to reclaim all members of
    a zonelist, and if that wasn't successful it does another pass through
    the same zonelist to check overall reclaimability.

    Just check reclaimability in shrink_zones() directly and propagate the
    result through the return value. Then remove all_unreclaimable().

    Signed-off-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Page reclaim for a higher-order page runs until compaction is ready,
    then aborts and signals this situation through the return value of
    shrink_zones(). This is an oddly specific signal to encode in the
    return value of shrink_zones(), though, and can be quite confusing.

    Introduce sc->compaction_ready and signal the compactability of the
    zones out-of-band to free up the return value of shrink_zones() for
    actual zone reclaimability.

    Signed-off-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michal Hocko
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • shrink_zones() has a special branch to skip the all_unreclaimable()
    check during hibernation, because a frozen kswapd can't mark a zone
    unreclaimable.

    But ever since commit 6e543d5780e3 ("mm: vmscan: fix
    do_try_to_free_pages() livelock"), determining a zone to be
    unreclaimable is done by directly looking at its scan history and no
    longer relies on kswapd setting the per-zone flag.

    Remove this branch and let shrink_zones() check the reclaimability of
    the target zones regardless of hibernation state.

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Acked-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Jun, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "This the bunch that sat in -next + lock_parent() fix. This is the
    minimal set; there's more pending stuff.

    In particular, I really hope to get acct.c fixes merged this cycle -
    we need that to deal sanely with delayed-mntput stuff. In the next
    pile, hopefully - that series is fairly short and localized
    (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
    iov_iter work. Most of prereqs for ->splice_write with sane locking
    order are there and Kent's dio rewrite would also fit nicely on top of
    this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
    lock_parent: don't step on stale ->d_parent of all-but-freed one
    kill generic_file_splice_write()
    ceph: switch to iter_file_splice_write()
    shmem: switch to iter_file_splice_write()
    nfs: switch to iter_splice_write_file()
    fs/splice.c: remove unneeded exports
    ocfs2: switch to iter_file_splice_write()
    ->splice_write() via ->write_iter()
    bio_vec-backed iov_iter
    optimize copy_page_{to,from}_iter()
    bury generic_file_aio_{read,write}
    lustre: get rid of messing with iovecs
    ceph: switch to ->write_iter()
    ceph_sync_direct_write: stop poking into iov_iter guts
    ceph_sync_read: stop poking into iov_iter guts
    new helper: copy_page_from_iter()
    fuse: switch to ->write_iter()
    btrfs: switch to ->write_iter()
    ocfs2: switch to ->write_iter()
    xfs: switch to ->write_iter()
    ...

    Linus Torvalds
     

09 Jun, 2014

1 commit

  • shrink_inactive_list() used to wait 0.1s to avoid congestion when all
    the pages that were isolated from the inactive list were dirty but not
    under active writeback. That makes no real sense, and apparently causes
    major interactivity issues under some loads since 3.11.

    The ostensible reason for it was to wait for kswapd to start writing
    pages, but that seems questionable as well, since the congestion wait
    code seems to trigger for kswapd itself as well. Also, the logic behind
    delaying anything when we haven't actually started writeback is not
    clear - it only delays actually starting that writeback.

    We'll still trigger the congestion waiting if

    (a) the process is kswapd, and we hit pages flagged for immediate
    reclaim

    (b) the process is not kswapd, and the zone backing dev writeback is
    actually congested.

    This probably needs to be revisited, but as it is this fixes a reported
    regression.

    Reported-by: Felipe Contreras
    Pinpointed-by: Hillf Danton
    Cc: Michal Hocko
    Cc: Andrew Morton
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Jun, 2014

3 commits

  • printk is meant to be used with an associated log level. There are some
    instances of printk scattered around the mm code where the log level is
    missing. Add a log level and adhere to suggestions by
    scripts/checkpatch.pl by moving to the pr_* macros.

    Also add the typical pr_fmt definition so that print statements can be
    easily traced back to the modules where they occur, correlated one with
    another, etc. This will require the removal of some (now redundant)
    prefixes on a few print statements.

    Signed-off-by: Mitchel Humpherys
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mitchel Humpherys
     
  • Memory reclaim always uses swappiness of the reclaim target memcg
    (origin of the memory pressure) or vm_swappiness for global memory
    reclaim. This behavior was consistent (except for difference between
    global and hard limit reclaim) because swappiness was enforced to be
    consistent within each memcg hierarchy.

    After "mm: memcontrol: remove hierarchy restrictions for swappiness and
    oom_control" each memcg can have its own swappiness independent of
    hierarchical parents, though, so the consistency guarantee is gone.
    This can lead to an unexpected behavior. Say that a group is explicitly
    configured to not swap out via memory.swappiness=0, but its memory gets
    swapped out anyway when the memory pressure comes from its parent with a
    higher swappiness setting. It is also unexpected that the knob is
    meaningless without setting the hard limit which would trigger the
    reclaim and enforce the swappiness.
    There are setups where the hard limit is configured higher in the
    hierarchy by an administrator and children groups are under control of
    somebody else who is interested in the swapout behavior but not
    necessarily about the memory limit.

    From a semantic point of view, swappiness is an attribute defining anon
    vs. file proportional scanning of the LRU, which is memcg specific
    (unlike charges, which are propagated up the hierarchy), so it should be
    applied to the particular memcg's LRU regardless of where the memory
    pressure comes from.

    This patch removes vmscan_swappiness() and stores the swappiness into
    the scan_control structure. mem_cgroup_swappiness is then used to
    provide the correct value before shrink_lruvec is called. The global
    vm_swappiness is used for the root memcg.

    [hughd@google.com: oopses immediately when booted with cgroup_disable=memory]
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When kswapd exits, it can end up taking locks that were previously held
    by allocating tasks while they waited for reclaim. Lockdep currently
    warns about this:

    On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
    > inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
    > kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
    > (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130
    > {RECLAIM_FS-ON-W} state was registered at:
    > mark_held_locks+0xb9/0x140
    > lockdep_trace_alloc+0x7a/0xe0
    > kmem_cache_alloc_trace+0x37/0x240
    > flex_array_alloc+0x99/0x1a0
    > cgroup_attach_task+0x63/0x430
    > attach_task_by_pid+0x210/0x280
    > cgroup_procs_write+0x16/0x20
    > cgroup_file_write+0x120/0x2c0
    > vfs_write+0xc0/0x1f0
    > SyS_write+0x4c/0xa0
    > tracesys+0xdd/0xe2
    > irq event stamp: 49
    > hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70
    > hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0
    > softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0
    > softirqs last disabled at (0): (null)
    >
    > other info that might help us debug this:
    > Possible unsafe locking scenario:
    >
    > CPU0
    > ----
    > lock(&sig->group_rwsem);
    >
    > lock(&sig->group_rwsem);
    >
    > *** DEADLOCK ***
    >
    > no locks held by kswapd2/1151.
    >
    > stack backtrace:
    > CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
    > Call Trace:
    > dump_stack+0x19/0x1b
    > print_usage_bug+0x1f7/0x208
    > mark_lock+0x21d/0x2a0
    > __lock_acquire+0x52a/0xb60
    > lock_acquire+0xa2/0x140
    > down_read+0x51/0xa0
    > exit_signals+0x24/0x130
    > do_exit+0xb5/0xa50
    > kthread+0xdb/0x100
    > ret_from_fork+0x7c/0xb0

    This is because the kswapd thread is still marked as a reclaimer at the
    time of exit. But because it is exiting, nobody is actually waiting on
    it to make reclaim progress anymore, and it's nothing but a regular
    thread at this point. Be tidy and strip it of all its powers
    (PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
    before returning from the thread function.

    Signed-off-by: Johannes Weiner
    Reported-by: Gu Zheng
    Cc: Yasuaki Ishimatsu
    Cc: Tang Chen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

05 Jun, 2014

9 commits

  • Commit "mm: vmscan: obey proportional scanning requirements for kswapd"
    ensured that file/anon lists were scanned proportionally for reclaim from
    kswapd but ignored it for direct reclaim. The intent was to minimise
    direct reclaim latency, but Yuanhan Liu pointed out that it substitutes one
    long stall for many small stalls and distorts aging for normal workloads
    like streaming readers/writers. Hugh Dickins pointed out that a
    side-effect of the same commit was that when one LRU list dropped to zero
    that the entirety of the other list was shrunk leading to excessive
    reclaim in memcgs. This patch scans the file/anon lists proportionally
    for direct reclaim to similarly age pages whether reclaimed by kswapd or
    direct reclaim but takes care to abort reclaim if one LRU drops to zero
    after reclaiming the requested number of pages.

    Based on ext4 and using the Intel VM scalability test

    3.15.0-rc5 3.15.0-rc5
    shrinker proportion
    Unit lru-file-readonce elapsed 5.3500 ( 0.00%) 5.4200 ( -1.31%)
    Unit lru-file-readonce time_range 0.2700 ( 0.00%) 0.1400 ( 48.15%)
    Unit lru-file-readonce time_stddv 0.1148 ( 0.00%) 0.0536 ( 53.33%)
    Unit lru-file-readtwice elapsed 8.1700 ( 0.00%) 8.1700 ( 0.00%)
    Unit lru-file-readtwice time_range 0.4300 ( 0.00%) 0.2300 ( 46.51%)
    Unit lru-file-readtwice time_stddv 0.1650 ( 0.00%) 0.0971 ( 41.16%)

    The test cases are running multiple dd instances reading sparse files. The results are within
    the noise for the small test machine. The impact of the patch is more noticeable in the vmstats

    3.15.0-rc5 3.15.0-rc5
    shrinker proportion
    Minor Faults 35154 36784
    Major Faults 611 1305
    Swap Ins 394 1651
    Swap Outs 4394 5891
    Allocation stalls 118616 44781
    Direct pages scanned 4935171 4602313
    Kswapd pages scanned 15921292 16258483
    Kswapd pages reclaimed 15913301 16248305
    Direct pages reclaimed 4933368 4601133
    Kswapd efficiency 99% 99%
    Kswapd velocity 670088.047 682555.961
    Direct efficiency 99% 99%
    Direct velocity 207709.217 193212.133
    Percentage direct scans 23% 22%
    Page writes by reclaim 4858.000 6232.000
    Page writes file 464 341
    Page writes anon 4394 5891

    Note that there are fewer allocation stalls even though the amount
    of direct reclaim scanning is very approximately the same.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tim Chen
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Cc: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    Currently, we use (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1)
    / KSWAPD_ZONE_BALANCE_GAP_RATIO to avoid a zero gap value. It's better to
    use the DIV_ROUND_UP macro for neater code and clearer meaning.

    Besides, the gap value is calculated against the per-zone "managed pages",
    not "present pages". This patch also corrects the comment and does some
    rephrasing.
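
    A before/after sketch of the substitution (the balance_gap variable name
    is taken from the surrounding kswapd code; this is not the verbatim
    diff):

    /* before: open-coded round-up */
    balance_gap = (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
                    KSWAPD_ZONE_BALANCE_GAP_RATIO;

    /* after: same arithmetic, clearer intent */
    balance_gap = DIV_ROUND_UP(zone->managed_pages,
                               KSWAPD_ZONE_BALANCE_GAP_RATIO);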

    Signed-off-by: Jianyu Zhan
    Acked-by: Rik van Riel
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • cold is a bool, make it one. Make the likely case the "if" part of the
    block instead of the else as according to the optimisation manual this is
    preferred.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Now that we are doing NUMA-aware shrinking, and can have shrinkers
    running in parallel, or working on individual nodes, it seems like we
    should also be sticking the node in the output.

    Signed-off-by: Dave Hansen
    Acked-by: Dave Chinner
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • I was looking at a trace of the slab shrinkers (attachment in this comment):

    https://bugs.freedesktop.org/show_bug.cgi?id=72742#c67

    and noticed that "total_scan" can go negative in some cases. We
    used to dump out the "total_scan" variable directly, but some of
    the shrinker modifications along the way changed that.

    This patch just dumps it out directly, again. It doesn't make
    any sense to derive it from new_nr and nr any more since there
    are now other shrinkers that can be running in parallel and
    mucking with those values.

    Here's an example of the negative numbers in the output:

    > kswapd0-840 [000] 160.869398: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 10 new scan count 39 total_scan 29 last shrinker return val 256
    > kswapd0-840 [000] 160.869618: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 39 new scan count 102 total_scan 63 last shrinker return val 256
    > kswapd0-840 [000] 160.870031: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 102 new scan count 47 total_scan -55 last shrinker return val 768
    > kswapd0-840 [000] 160.870464: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 47 new scan count 45 total_scan -2 last shrinker return val 768
    > kswapd0-840 [000] 163.384144: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 45 new scan count 56 total_scan 11 last shrinker return val 0
    > kswapd0-840 [000] 163.384297: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 56 new scan count 15 total_scan -41 last shrinker return val 256
    > kswapd0-840 [000] 163.384414: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 15 new scan count 117 total_scan 102 last shrinker return val 0
    > kswapd0-840 [000] 163.384657: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 117 new scan count 36 total_scan -81 last shrinker return val 512
    > kswapd0-840 [000] 163.384880: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 36 new scan count 111 total_scan 75 last shrinker return val 256
    > kswapd0-840 [000] 163.385256: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 111 new scan count 34 total_scan -77 last shrinker return val 768
    > kswapd0-840 [000] 163.385598: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 34 new scan count 122 total_scan 88 last shrinker return val 512

    Signed-off-by: Dave Hansen
    Acked-by: Dave Chinner
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • When a loopback NFS mount is active and the backing device for the NFS
    mount becomes congested, that can impose throttling delays on the nfsd
    threads.

    These delays significantly reduce throughput and so the NFS mount remains
    congested.

    This results in a livelock and the reduced throughput persists.

    This livelock has been found in testing with the 'wait_iff_congested'
    call, and could possibly be caused by the 'congestion_wait' call.

    This livelock is similar to the deadlock which justified the introduction
    of PF_LESS_THROTTLE, and the same flag can be used to remove this
    livelock.

    To minimise the impact of the change, we still throttle nfsd when the
    filesystem it is writing to is congested, but not when some separate
    filesystem (e.g. the NFS filesystem) is congested.

    Signed-off-by: NeilBrown
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • throttle_direct_reclaim() is meant to trigger during swap-over-network
    during which the min watermark is treated as a pfmemalloc reserve. It
    throttles on the first node in the zonelist but this is flawed.

    The user-visible impact is that a process running on CPU whose local
    memory node has no ZONE_NORMAL will stall for prolonged periods of time,
    possibly indefinitely. This is due to throttle_direct_reclaim thinking the
    pfmemalloc reserves are depleted when in fact they don't exist on that
    node.

    On a NUMA machine running a 32-bit kernel (I know) allocation requests
    from CPUs on node 1 would detect no pfmemalloc reserves and the process
    gets throttled. This patch adjusts throttling of direct reclaim to
    throttle based on the first node in the zonelist that has a usable
    ZONE_NORMAL or lower zone.
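
    A sketch of the adjusted throttling walk (close to the described fix, but
    not the verbatim diff):

    for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                    gfp_zone(gfp_mask), nodemask) {
            /* skip highmem-only entries; they carry no pfmemalloc reserves */
            if (zone_idx(zone) > ZONE_NORMAL)
                    continue;

            /* throttle against the first node with a usable lower zone */
            pgdat = zone->zone_pgdat;
            if (pfmemalloc_watermark_ok(pgdat))
                    goto out;
            break;
    }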

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • kmem_cache_{create,destroy,shrink} need to get a stable value of
    cpu/node online mask, because they init/destroy/access per-cpu/node
    kmem_cache parts, which can be allocated or destroyed on cpu/mem
    hotplug. To protect against cpu hotplug, these functions use
    {get,put}_online_cpus. However, they do nothing to synchronize with
    memory hotplug - taking the slab_mutex does not eliminate the
    possibility of race as described in patch 2.

    What we need there is something like get_online_cpus, but for memory.
    We already have lock_memory_hotplug, which serves for the purpose, but
    it's a bit of a hammer right now, because it's backed by a mutex. As a
    result, it imposes some limitations to locking order, which are not
    desirable, and can't be used just like get_online_cpus. That's why in
    patch 1 I substitute it with get/put_online_mems, which work exactly
    like get/put_online_cpus except they block not cpu, but memory hotplug.

    [ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
    myself, because it used an rw semaphore for get/put_online_mems,
    making them deadlock prone. ]

    This patch (of 2):

    {un}lock_memory_hotplug, which is used to synchronize against memory
    hotplug, is currently backed by a mutex, which makes it a bit of a
    hammer - threads that only want to get a stable value of online nodes
    mask won't be able to proceed concurrently. Also, it imposes some
    strong locking ordering rules on it, which narrows down the set of its
    usage scenarios.

    This patch introduces get/put_online_mems, which are the same as
    get/put_online_cpus, but for memory hotplug, i.e. executing a code
    inside a get/put_online_mems section will guarantee a stable value of
    online nodes, present pages, etc.

    lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Prior to this change, we would decide whether to force scan a LRU during
    reclaim if that LRU itself was too small for the current priority.
    However, this can lead to the file LRU getting force scanned even if
    there are a lot of anonymous pages we can reclaim, leading to hot file
    pages getting needlessly reclaimed.

    To address this, we instead only force scan when none of the reclaimable
    LRUs are big enough.

    Gives huge improvements with zswap. For example, when doing -j20 kernel
    build in a 500MB container with zswap enabled, runtime (in seconds) is
    greatly reduced:

    x without this change
    + with this change
    N Min Max Median Avg Stddev
    x 5 700.997 790.076 763.928 754.05 39.59493
    + 5 141.634 197.899 155.706 161.9 21.270224
    Difference at 95.0% confidence
    -592.15 +/- 46.3521
    -78.5293% +/- 6.14709%
    (Student's t, pooled s = 31.7819)

    Should also give some improvements in regular (non-zswap) swap cases.

    Yes, hughd found significant speedup using regular swap, with several
    memcgs under pressure; and it should also be effective in the non-memcg
    case, whenever one or another zone LRU is forced too small.
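
    An abstract sketch of the decision rule, using an illustrative lru_size()
    helper rather than the kernel's actual accessors:

    bool all_lrus_too_small = true;
    enum lru_list lru;

    for_each_evictable_lru(lru) {
            /* "big enough" means the LRU still yields pages at this priority */
            if (lru_size(lruvec, lru) >> sc->priority) {
                    all_lrus_too_small = false;
                    break;
            }
    }

    /* fall back to force-scanning tiny LRUs only when nothing else is left */
    force_scan = all_lrus_too_small;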

    Signed-off-by: Suleiman Souhlal
    Signed-off-by: Hugh Dickins
    Cc: Suleiman Souhlal
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Rafael Aquini
    Cc: Michal Hocko
    Cc: Yuanhan Liu
    Cc: Seth Jennings
    Cc: Bob Liu
    Cc: Minchan Kim
    Cc: Luigi Semenzato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suleiman Souhlal
     

07 May, 2014

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • This reverts commit 0bf1457f0cfc ("mm: vmscan: do not swap anon pages
    just because free+file is low") because it introduced a regression in
    mostly-anonymous workloads, where reclaim would become ineffective and
    trap every allocating task in direct reclaim.

    The problem is that there is a runaway feedback loop in the scan balance
    between file and anon, where the balance tips heavily towards a tiny
    thrashing file LRU and anonymous pages are no longer being looked at.
    The commit in question removed the safeguard that would detect such
    situations and respond with forced anonymous reclaim.

    This commit was part of a series to fix premature swapping in loads with
    relatively little cache, and while it made a small difference, the cure
    is obviously worse than the disease. Revert it.

    Signed-off-by: Johannes Weiner
    Reported-by: Christian Borntraeger
    Acked-by: Christian Borntraeger
    Acked-by: Rafael Aquini
    Cc: Rik van Riel
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

19 Apr, 2014

1 commit