01 Nov, 2011
40 commits
-
A process spent 30 minutes exiting, just munlocking the pages of a large
anonymous area that had been alternately mprotected into page-sized vmas:
for every single page there's an anon_vma walk through all the other
little vmas to find the right one.
A general fix to that would be a lot more complicated (use prio_tree on
anon_vma?), but there's one very simple thing we can do to speed up the
common case: if a page to be munlocked is mapped only once, then it is our
vma that it is mapped into, and there's no need whatever to walk through
all the others.
Okay, there is a very remote race in munlock_vma_pages_range(): if between
its follow_page() and lock_page(), another process were to munlock the
same page, then page reclaim remove it from our vma, then another process
mlock it again. We would find it with page_mapcount 1, yet it's still
mlocked in another process. But never mind, that's much less likely than
the down_read_trylock() failure which munlocking already tolerates (in
try_to_unmap_one()): in due course page reclaim will discover and move the
page to unevictable instead.
[akpm@linux-foundation.org: add comment]
Signed-off-by: Hugh Dickins
Cc: Michel Lespinasse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
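The change itself is tiny. A hedged sketch of munlock_vma_page() with the
fast path, recalled from mm/mlock.c of that era rather than quoted from
the patch:
===
static void munlock_vma_page(struct page *page)
{
	if (TestClearPageMlocked(page)) {
		dec_zone_page_state(page, NR_MLOCK);
		if (!isolate_lru_page(page)) {
			int ret = SWAP_AGAIN;

			/*
			 * Optimization: if the page was mapped just once,
			 * that's our mapping and we don't need to check
			 * all the other vmas.
			 */
			if (page_mapcount(page) > 1)
				ret = try_to_munlock(page);
			if (ret != SWAP_MLOCK)
				count_vm_event(UNEVICTABLE_PGMUNLOCKED);
			putback_lru_page(page);
		}
	}
}
===
-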
There are three cases of update_mmu_cache() in the file, and the case in
function collapse_huge_page() has a typo, namely the last parameter used,
which is corrected based on the other two cases.
Because x86, currently the only arch that implements THP, defines
update_mmu_cache() as a no-op, the change here has no visible effect; but
it could save one or two minutes of effort for those archs that are likely
to support THP in future.
Signed-off-by: Hillf Danton
Cc: Johannes Weiner
Reviewed-by: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The THP copy-on-write handler falls back to regular-sized pages for a huge
page replacement upon allocation failure or if THP has been individually
disabled in the target VMA. The loop responsible for copying page-sized
chunks accidentally uses multiples of PAGE_SHIFT instead of PAGE_SIZE as
the virtual address arg for copy_user_highpage().
Signed-off-by: Hillf Danton
Acked-by: Johannes Weiner
Reviewed-by: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
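For illustration, a hedged sketch of the corrected loop (function and
variable names assumed from mm/huge_memory.c of that era):
===
/* do_huge_pmd_wp_page_fallback(), sketch: copy the huge page into
 * HPAGE_PMD_NR regular pages, one PAGE_SIZE chunk at a time. */
for (i = 0; i < HPAGE_PMD_NR; i++) {
	copy_user_highpage(pages[i], page + i,
			   haddr + PAGE_SIZE * i, vma);
	/* the bug passed haddr + PAGE_SHIFT * i here: an offset of
	 * 12 bytes per iteration instead of 4096 */
	cond_resched();
}
===
-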
MCL_FUTURE does not move pages between LRU lists, and draining the per-cpu
LRU pagevecs is a nasty activity. Avoid doing it unnecessarily.
Signed-off-by: Christoph Lameter
Cc: David Rientjes
Reviewed-by: Minchan Kim
Acked-by: KOSAKI Motohiro
Cc: Mel Gorman
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
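A hedged sketch of the change in sys_mlockall() (surrounding code
assumed):
===
/* sys_mlockall(), sketch: only MCL_CURRENT moves pages onto the
 * unevictable LRU now, so only then is the expensive drain of the
 * per-cpu pagevecs needed. */
if (flags & MCL_CURRENT)
	lru_add_drain_all();	/* flush pagevec */
===
-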
If compaction can proceed, shrink_zones() stops doing any work but its
callers still call shrink_slab() which raises the priority and potentially
sleeps. This is unnecessary and wasteful so this patch aborts direct
reclaim/compaction entirely if compaction can proceed.
Signed-off-by: Mel Gorman
Acked-by: Rik van Riel
Reviewed-by: Minchan Kim
Acked-by: Johannes Weiner
Cc: Josh Boyer
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
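A hedged sketch of the resulting flow in do_try_to_free_pages()
(structure assumed, not the literal patch):
===
/* do_try_to_free_pages() main loop, sketch */
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
	aborted_reclaim = shrink_zones(priority, zonelist, sc);
	if (aborted_reclaim)
		break;	/* compaction can proceed: skip shrink_slab()
			 * and don't raise the reclaim priority */

	shrink_slab(shrink, sc->nr_scanned, lru_pages);
	/* ... writeback wakeup, congestion handling, etc. ... */
}
===
-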
When suffering from memory fragmentation due to unfreeable pages, THP page
faults will repeatedly try to compact memory. Due to the unfreeable
pages, compaction fails.
Needless to say, at that point page reclaim also fails to create free
contiguous 2MB areas. However, that doesn't stop the current code from
trying, over and over again, and freeing a minimum of 4MB (2UL <<
sc->order pages) at every single invocation.
This resulted in my 12GB system having 2-3GB free memory, a corresponding
amount of used swap and very sluggish response times.
This can be avoided by having the direct reclaim code not reclaim from
zones that already have plenty of free memory available for compaction.
If compaction still fails due to unmovable memory, doing additional
reclaim will only hurt the system, not help.
[jweiner@redhat.com: change comment to explain the order check]
Signed-off-by: Rik van Riel
Acked-by: Johannes Weiner
Acked-by: Mel Gorman
Cc: Andrea Arcangeli
Reviewed-by: Minchan Kim
Signed-off-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
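A hedged sketch of the per-zone check in shrink_zones() (the helper name
matches what was merged around this series; details assumed):
===
/* shrink_zones(), sketch */
if (COMPACTION_BUILD && sc->order && compaction_ready(zone, sc)) {
	/* This zone already has plenty of free memory for
	 * compaction; reclaiming more here would only hurt. */
	aborted_reclaim = true;
	continue;
}
===
-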
When a race between putback_lru_page() and shmem_lock with lock=0 happens,
program execution order is as follows, but clear_bit in processor #1 could
be reordered right before spin_unlock of processor #1. Then, the page
would be stranded on the unevictable list.

processor #1                            processor #2
(shmem_lock, lock=0)                    (putback_lru_page)

                                        spin_lock
                                        SetPageLRU
                                        spin_unlock
clear_bit(AS_UNEVICTABLE)
spin_lock
if PageLRU()
        if !test_bit(AS_UNEVICTABLE)
                move evictable list
                                        smp_mb
                                        if !test_bit(AS_UNEVICTABLE)
                                                move evictable list
spin_unlock

But, pagevec_lookup() in scan_mapping_unevictable_pages() has
rcu_read_[un]lock(), which happens to prevent the reordering from becoming
visible before test_bit(AS_UNEVICTABLE) on processor #1, so this problem
never occurs in practice. But that is an unexpected side effect and we
should solve this problem properly.
This patch adds a barrier after mapping_clear_unevictable.
I did not actually hit this problem; I just found it during review.
Signed-off-by: Minchan Kim
Acked-by: KOSAKI Motohiro
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Lee Schermerhorn
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
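A hedged sketch of the fix in shmem_lock() (surrounding code assumed from
mm/shmem.c of that era):
===
/* shmem_lock(), sketch of the lock=0 path */
if (!lock && (info->flags & VM_LOCKED) && user) {
	user_shm_unlock(inode->i_size, user);
	info->flags &= ~VM_LOCKED;
	mapping_clear_unevictable(file->f_mapping);
	/*
	 * Make the AS_UNEVICTABLE clear visible before the LRU scan
	 * below rechecks the bit; pairs with the smp_mb() in
	 * putback_lru_page().
	 */
	smp_mb();
	scan_mapping_unevictable_pages(file->f_mapping);
}
===
-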
Quiet the sparse noise:
warning: symbol 'khugepaged_scan' was not declared. Should it be static?
warning: context imbalance in 'khugepaged_scan_mm_slot' - unexpected unlock
Signed-off-by: H Hartley Sweeten
Cc: Andrea Arcangeli
Cc: Rik van Riel
Cc: Johannes Weiner
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Quiet the sparse noise:
warning: symbol 'default_policy' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten
Cc: KOSAKI Motohiro
Cc: Stephen Wilson
Cc: Andrea Arcangeli
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Quiet the following sparse noise:
warning: symbol 'swap_token_memcg' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten
Cc: Rik van Riel
Cc: KOSAKI Motohiro
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Quiet the following sparse noise in this file:
warning: symbol 'memblock_overlaps_region' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten
Cc: Yinghai Lu
Cc: "H. Peter Anvin"
Cc: Benjamin Herrenschmidt
Cc: Tomi Valkeinen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
At one point, anonymous pages were supposed to go on the unevictable list
when no swap space was configured, and the idea was to manually rescue
those pages after adding swap and making them evictable again. But
nowadays, swap-backed pages on the anon LRU list are not scanned without
available swap space anyway, so there is no point in moving them to a
separate list anymore.
The manual rescue could also be used in case pages were stranded on the
unevictable list due to race conditions. But the code has been around for
a while now and newly discovered bugs should be properly reported and
dealt with instead of relying on such a manual fixup.
In addition to the lack of a usecase, the sysfs interface to rescue pages
from a specific NUMA node has been broken since its introduction, so it's
unlikely that anybody ever relied on that.
This patch removes the functionality behind the sysctl and the
node-interface and emits a one-time warning when somebody tries to access
either of them.
Signed-off-by: Johannes Weiner
Reported-by: Kautuk Consul
Reviewed-by: Minchan Kim
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
write_scan_unevictable_node() checks the value req returned by
strict_strtoul() and returns 1 if req is 0.
However, when strict_strtoul() returns 0, it means successful conversion
of buf to unsigned long.
Due to this, the function was not proceeding to scan the zones for
unevictable pages even though we write a valid value to the
scan_unevictable_pages sys file.
Change this check slightly to check for invalid value in buf as well as 0
value stored in res after successful conversion via strict_strtoul. In
both cases, we do not perform the scanning of this node's zones.
Signed-off-by: Kautuk Consul
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Johannes Weiner
Cc: Lee Schermerhorn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
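The before/after, sketched (strict_strtoul() returns 0 on success;
variable names assumed):
===
/* write_scan_unevictable_node(), sketch */
unsigned long res;
int err;

err = strict_strtoul(buf, 10, &res);
if (err || !res)	/* invalid input, or "0" written: don't scan */
	return 1;

/* the old check was effectively "if (!strict_strtoul(...)) return 1;",
 * which bailed out precisely when the conversion succeeded */
===
-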
Signed-off-by: Li Haifeng
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
There's no compact_zone_order() user outside file scope, so make it static.
Signed-off-by: Kyungmin Park
Acked-by: David Rientjes
Reviewed-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Commit fb46e73520940b ("HWPOISON: Convert pr_debugs to pr_info") authored
by Andi Kleen converted a number of pr_debug()s to pr_info()s.
About the same time additional code with pr_debug()s was added by two
other commits 8c6c2ecb4466 ("HWPOSION, hugetlb: recover from free hugepage
error when !MF_COUNT_INCREASED") and d950b95882f3d ("HWPOISON, hugetlb:
soft offlining for hugepage"). And these pr_debug()s failed to get
converted to pr_info()s.
This patch converts them as well, and does some minor related whitespace
cleanup.
Signed-off-by: Dean Nelson
Reviewed-by: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
On the ext4 mailing list[1], we got some reports about errors in
__find_get_block_slow(), but the information is very limited.
If the device information is given, we can know the name of the sick
volume. Furthermore, we can get the corresponding status of that
block (group, inode block, etc.) by analyzing the disk layout.
[1] http://marc.info/?l=linux-ext4&m=131379831421147&w=2
Signed-off-by: Tao Ma
Cc: Al Viro
Cc: Theodore Ts'o
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The ret variable is really not needed in mm_take_all_locks().
Signed-off-by: Kautuk Consul
Reviewed-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The callback must not return -1 when nr_to_scan is zero. Fix the bug in
fs/super.c and add this requirement to the callback specification.
Signed-off-by: Mikulas Patocka
Cc: Dave Chinner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
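A hedged sketch of the fs/super.c fix (based on the prune_super() of that
era; details may differ):
===
/* prune_super(), sketch: when nr_to_scan is 0 the caller only wants
 * an object count, so failing the trylock must not return -1 */
if (!grab_super_passive(sb))
	return !sc->nr_to_scan ? 0 : -1;
===
-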
fiddle wording
Cc: Jan Kara
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
try_to_unmap_one() is called by try_to_unmap_ksm(), too.
Signed-off-by: Wanlong Gao
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Some vmalloc failure paths do not report OOM conditions.
Add warn_alloc_failed, which also does a dump_stack, to those failure
paths.
This allows more site specific vmalloc failure logging message printks to
be removed.
Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
There are two places where kswapd reads the pgdat request: one is on
return from a successful balance, the other is on being woken from sleep.
The new_order and new_classzone_idx represent the balance input order and
classzone_idx.
But currently new_order and new_classzone_idx are not assigned after
kswapd_try_to_sleep(), which causes a bug in the following scenario.
1: after a successful balance, kswapd goes to sleep, and new_order = 0,
new_classzone_idx = __MAX_NR_ZONES - 1
2: kswapd is woken up with order = 3 and classzone_idx = ZONE_NORMAL
3: while balance_pgdat() is running, a new balance wakeup happens with
order = 5 and classzone_idx = ZONE_NORMAL
4: the first wakeup (order = 3) finishes successfully and returns order =
3; but new_order is still 0, so this balancing is treated as a failed
balance, and the second, tighter balancing is then missed.
So, to avoid the above problem, new_order and new_classzone_idx need to be
assigned for the later successful comparison.
Signed-off-by: Alex Shi
Acked-by: Mel Gorman
Reviewed-by: Minchan Kim
Tested-by: Pádraig Brady
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
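A hedged sketch of the fix in the kswapd() main loop (surrounding code
assumed):
===
/* kswapd(), sketch: after waking from sleep, re-read the request and
 * keep new_order/new_classzone_idx in sync so the later "was this
 * balance good enough?" comparison works. */
kswapd_try_to_sleep(pgdat, order, classzone_idx);
order = pgdat->kswapd_max_order;
classzone_idx = pgdat->classzone_idx;
new_order = order;
new_classzone_idx = classzone_idx;
pgdat->kswapd_max_order = 0;
pgdat->classzone_idx = pgdat->nr_zones - 1;
===
-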
warning: function 'memblock_memory_can_coalesce'
with external linkage has definition.
Signed-off-by: Jonghwan Choi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
when reclaiming successfully"), Mel Gorman said it is better for kswapd to
sleep after an unsuccessful balancing if a tighter reclaim request is
pending in the balancing. But in the following scenario kswapd does
something that does not match our expectation. The patch fixes this issue.
1, Read pgdat request A (classzone_idx, order = 3)
2, balance_pgdat()
3, During the balancing, a new pgdat request B (classzone_idx, order = 5)
is placed
4, balance_pgdat() returns, but failed since the returned order = 0
5, pgdat of request A is assigned to balance_pgdat(), and balancing is
done again, while the expected behavior is for kswapd to try to sleep.
Signed-off-by: Alex Shi
Reviewed-by: Tim Chen
Acked-by: Mel Gorman
Tested-by: Pádraig Brady
Cc: Rik van Riel
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This adds poisoning and verification of highmem pages to the generic
debug-pagealloc feature, for architectures without their own support.
[akpm@linux-foundation.org: remove unneeded preempt_disable/enable]
Signed-off-by: Akinobu Mita
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Add __attribute__((format(printf, ...))) to the function to validate format
and arguments. Use vsprintf extension %pV to avoid any possible message
interleaving. Coalesce format string. Convert printks/pr_warning to
pr_warn.
[akpm@linux-foundation.org: use the __printf() macro]
Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
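For illustration, a minimal hedged example of the pattern (example_warn()
is hypothetical; __printf() and %pV are the real kernel facilities):
===
/* __printf(1, 2) makes the compiler type-check the format string
 * against the arguments; %pV prints a struct va_format, so the whole
 * message goes out in a single printk and cannot interleave. */
__printf(1, 2)
static void example_warn(const char *fmt, ...)
{
	struct va_format vaf;
	va_list args;

	va_start(args, fmt);
	vaf.fmt = fmt;
	vaf.va = &args;
	pr_warn("example: %pV", &vaf);
	va_end(args);
}
===
-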
On NOMMU architectures, if physical memory doesn't start from 0,
ARCH_PFN_OFFSET is defined to generate page index in mem_map array.
Because virtual address is equal to physical address, PAGE_OFFSET is
always 0. virt_to_page and page_to_virt should not index page by
PAGE_OFFSET directly.
Signed-off-by: Sonic Zhang
Cc: Greg Ungerer
Cc: Geert Uytterhoeven
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
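A hedged sketch of the asm-generic/page.h change, in the spirit of the
description (macro bodies assumed):
===
/* Go through the pfn helpers, which honour ARCH_PFN_OFFSET, instead
 * of indexing mem_map directly with (addr - PAGE_OFFSET) >> PAGE_SHIFT
 * (PAGE_OFFSET being always 0 on NOMMU). */
#define virt_to_pfn(kaddr)	(__pa(kaddr) >> PAGE_SHIFT)
#define pfn_to_virt(pfn)	__va((pfn) << PAGE_SHIFT)

#define virt_to_page(addr)	pfn_to_page(virt_to_pfn(addr))
#define page_to_virt(page)	pfn_to_virt(page_to_pfn(page))
===
-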
This adds THP support to mremap (decreases the number of split_huge_page()
calls).
Here are also some benchmarks with a proggy like this:
===
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/mman.h>

#define SIZE (5UL*1024*1024*1024)

int main()
{
	static struct timeval oldstamp, newstamp;
	long diffsec;
	char *p, *p2, *p3, *p4;

	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
		perror("memalign"), exit(1);

	memset(p, 0xff, SIZE);
	memset(p2, 0xff, SIZE);
	memset(p3, 0x77, 4096);
	gettimeofday(&oldstamp, NULL);
	p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
	gettimeofday(&newstamp, NULL);
	diffsec = newstamp.tv_sec - oldstamp.tv_sec;
	diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
	printf("usec %ld\n", diffsec);
	if (p == MAP_FAILED || p4 != p3)
	//if (p == MAP_FAILED)
		perror("mremap"), exit(1);
	if (memcmp(p4, p2, SIZE))
		printf("mremap bug\n"), exit(1);
	printf("ok\n");

	return 0;
}
===
THP on
Performance counter stats for './largepage13' (3 runs):
69195836 dTLB-loads ( +- 3.546% ) (scaled from 50.30%)
60708 dTLB-load-misses ( +- 11.776% ) (scaled from 52.62%)
676266476 dTLB-stores ( +- 5.654% ) (scaled from 69.54%)
29856 dTLB-store-misses ( +- 4.081% ) (scaled from 89.22%)
1055848782 iTLB-loads ( +- 4.526% ) (scaled from 80.18%)
8689 iTLB-load-misses ( +- 2.987% ) (scaled from 58.20%)
7.314454164 seconds time elapsed ( +- 0.023% )
THP off
Performance counter stats for './largepage13' (3 runs):
1967379311 dTLB-loads ( +- 0.506% ) (scaled from 60.59%)
9238687 dTLB-load-misses ( +- 22.547% ) (scaled from 61.87%)
2014239444 dTLB-stores ( +- 0.692% ) (scaled from 60.40%)
3312335 dTLB-store-misses ( +- 7.304% ) (scaled from 67.60%)
6764372065 iTLB-loads ( +- 0.925% ) (scaled from 79.00%)
8202 iTLB-load-misses ( +- 0.475% ) (scaled from 70.55%)
9.693655243 seconds time elapsed ( +- 0.069% )
grep thp /proc/vmstat
thp_fault_alloc 35849
thp_fault_fallback 0
thp_collapse_alloc 3
thp_collapse_alloc_failed 0
thp_split 0
thp_split 0 confirms no thp split despite plenty of hugepages allocated.
The measurement of only the mremap time (so excluding the 3 long
memset and final long 10GB memory accessing memcmp):
THP on
usec 14824
usec 14862
usec 14859
THP off
usec 256416
usec 255981
usec 255847
With an older kernel without the mremap optimizations (the below patch
optimizes the non THP version too).
THP on
usec 392107
usec 390237
usec 404124
THP off
usec 444294
usec 445237
usec 445820
I guess with a threaded program that sends more IPI on large SMP it'd
create an even larger difference.
All debug options are off except DEBUG_VM to avoid skewing the
results.
The only problem is that, for a native 2M mremap like the one above, both
the source and destination addresses must be 2M aligned, or the huge pmd
can't be moved without a split; but that is a hardware limitation.
[akpm@linux-foundation.org: coding-style nitpicking]
Signed-off-by: Andrea Arcangeli
Acked-by: Johannes Weiner
Acked-by: Mel Gorman
Acked-by: Rik van Riel
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
flush_tlb_range() at the end of the loop, to avoid sending one IPI for
each page.
The mmu_notifier_invalidate_range_start/end section is enlarged
accordingly but this is not going to fundamentally change things. It was
more by accident that the region under mremap was for the most part still
available for secondary MMUs: the primary MMU was never allowed to
reliably access that region for the duration of the mremap (modulo
trapping SIGSEGV on the old address range, which sounds unpractical and
flakey). If users want secondary MMUs not to lose access to a large
region under mremap they should reduce the mremap size accordingly in
userland and run multiple calls. Overall this will run faster so it's
actually going to reduce the time the region is under mremap for the
primary MMU which should provide a net benefit to apps.
For KVM this is a noop because the guest physical memory is never
mremapped; there's just no point ever moving it while the guest runs. One
target of this optimization is JVM GC (so unrelated to the mmu notifier
logic).
Signed-off-by: Andrea Arcangeli
Acked-by: Johannes Weiner
Acked-by: Mel Gorman
Acked-by: Rik van Riel
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
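A hedged sketch of the pattern in move_ptes() (loop shape recalled from
that era's mm/mremap.c, names assumed; the point is one ranged flush
instead of one IPI per page):
===
/* move_ptes(), sketch */
for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
			   new_pte++, new_addr += PAGE_SIZE) {
	pte_t pte;

	if (pte_none(*old_pte))
		continue;
	pte = ptep_get_and_clear(mm, old_addr, old_pte); /* no flush */
	pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
	set_pte_at(mm, new_addr, new_pte, pte);
}

/* one batched flush afterwards (driven by the caller in the real code;
 * old_start is where the loop began) */
flush_tlb_range(vma, old_start, old_end);
===
-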
Using "- 1" relies on the old_end to be page aligned and PAGE_SIZE > 1,
those are reasonable requirements but the check remains obscure and it
looks more like an off by one error than an overflow check. This I feel
will improve readability.Signed-off-by: Andrea Arcangeli
Acked-by: Johannes Weiner
Acked-by: Mel Gorman
Acked-by: Rik van Riel
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
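A hedged sketch of the reworked bound check in move_page_tables() (shape
recalled, may differ in detail):
===
next = (old_addr + PMD_SIZE) & PMD_MASK;
/* even if next wrapped past the end of the address space, the
 * deltas below stay well defined */
extent = next - old_addr;
if (extent > old_end - old_addr)
	extent = old_end - old_addr;
===
-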
With the NO_BOOTMEM symbol added, architectures may now use the following
syntax to tell that they do not need bootmem:
select NO_BOOTMEM
This is much more convenient than adding a new kconfig symbol, which was
otherwise required.
Adding this symbol does not conflict with the architectures that already
define their own symbol.
Signed-off-by: Sam Ravnborg
Cc: Yinghai Lu
Acked-by: Tejun Heo
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
SPARC32 requires access to the start address. Add a new helper
memblock_start_of_DRAM() to give access to the address of the first
memblock - which contains the lowest address.
The awkward name was chosen to match the already present
memblock_end_of_DRAM().
Signed-off-by: Sam Ravnborg
Cc: "David S. Miller"
Cc: Yinghai Lu
Acked-by: Tejun Heo
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
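The helper itself is essentially one line; a hedged sketch (memblock
regions are kept sorted, so region 0 holds the lowest base):
===
phys_addr_t __init_memblock memblock_start_of_DRAM(void)
{
	return memblock.memory.regions[0].base;
}
===
-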
/proc/vmallocinfo shows information about vmalloc allocations in vmlist,
which is a linked list of vm_struct. It, however, may access the pages
field of vm_struct where a page was not allocated. This results in a null
pointer access and leads to a kernel panic.
Why this happens: In __vmalloc_node_range() called from vmalloc(), newly
allocated vm_struct is added to vmlist at __get_vm_area_node() and then,
some fields of vm_struct such as nr_pages and pages are set at
__vmalloc_area_node(). In other words, it is added to vmlist before it is
fully initialized. At the same time, when the /proc/vmallocinfo is read,
it accesses the pages field of vm_struct according to the nr_pages field
at show_numa_info(). Thus, a null pointer access happens.
The patch adds the newly allocated vm_struct to the vmlist *after* it is
fully initialized. So, it can avoid accessing the pages field with
unallocated page when show_numa_info() is called.
Signed-off-by: Mitsuo Hayasaka
Cc: Andrew Morton
Cc: David Rientjes
Cc: Namhyung Kim
Cc: "Paul E. McKenney"
Cc: Jeremy Fitzhardinge
Cc:
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Use newly introduced memchr_inv() for page verification.
Signed-off-by: Akinobu Mita
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
memchr_inv() is mainly used to check whether the whole buffer is filled
with just a specified byte.
The function name and prototype are stolen from logfs and the
implementation is from SLUB.
Signed-off-by: Akinobu Mita
Acked-by: Christoph Lameter
Acked-by: Pekka Enberg
Cc: Matt Mackall
Acked-by: Joern Engel
Cc: Marcin Slusarz
Cc: Eric Dumazet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
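For reference, the prototype plus a small hedged usage example (the
prototype matches lib/string.c; check_zeroed() is made up for
illustration):
===
/* void *memchr_inv(const void *start, int c, size_t bytes);
 * returns a pointer to the first byte that differs from c, or NULL
 * if the whole buffer is filled with c. */
static bool check_zeroed(const void *buf, size_t len)
{
	return memchr_inv(buf, 0, len) == NULL;
}
===
-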
printk_ratelimit() should not be used, because it shares ratelimiting
state with all other unrelated printk_ratelimit() callsites.
Signed-off-by: Akinobu Mita
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
It's possible a zone watermark is ok when entering the balance_pgdat()
loop, while the zone is within the requested classzone_idx. Count pages
from this zone into `balanced'. In this way, we can skip shrinking zones
too much for high order allocation.
Signed-off-by: Shaohua Li
Acked-by: Mel Gorman
Reviewed-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
When direct reclaim encounters a dirty page, it gets recycled around the
LRU for another cycle. This patch marks the page PageReclaim similar to
deactivate_page() so that the page gets reclaimed almost immediately after
the page gets cleaned. This is to avoid reclaiming clean pages that are
younger than a dirty page encountered at the end of the LRU that might
have been something like a use-once page.
Signed-off-by: Mel Gorman
Acked-by: Johannes Weiner
Cc: Dave Chinner
Cc: Christoph Hellwig
Cc: Wu Fengguang
Cc: Jan Kara
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Alex Elder
Cc: Theodore Ts'o
Cc: Chris Mason
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
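A hedged sketch of the idea in shrink_page_list() (heavily simplified; the
real code only takes this path for dirty pages it cannot write back
itself):
===
if (PageDirty(page)) {
	/* Similar in principle to deactivate_page(): tag the page so
	 * that end of writeback rotates it to the LRU tail, where it
	 * is reclaimed almost immediately once clean. */
	SetPageReclaim(page);
	goto keep_locked;
}
===
-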
Workloads that are allocating frequently and writing files place a large
number of dirty pages on the LRU. With use-once logic, it is possible for
them to reach the end of the LRU quickly requiring the reclaimer to scan
more to find clean pages. Ordinarily, processes that are dirtying memory
will get throttled by dirty balancing but this is a global heuristic and
does not take into account that LRUs are maintained on a per-zone basis.
This can lead to a situation whereby reclaim is scanning heavily, skipping
over a large number of pages under writeback and recycling them around the
LRU consuming CPU.
This patch checks how many of the pages isolated from the LRU were dirty
and under writeback. If a percentage of them are under writeback,
the process will be throttled if a backing device or the zone is
congested. Note that this applies whether it is anonymous or file-backed
pages that are under writeback meaning that swapping is potentially
throttled. This is intentional because, if the swap device is
congested, scanning more pages and dispatching more IO is not going to
help matters.
The percentage that must be in writeback depends on the priority. At
default priority, all of them must be dirty. At DEF_PRIORITY-1, 50% of
them must be, DEF_PRIORITY-2, 25% etc. i.e. as pressure increases the
greater the likelihood the process will get throttled to allow the flusher
threads to make some progress.
Signed-off-by: Mel Gorman
Reviewed-by: Minchan Kim
Acked-by: Johannes Weiner
Cc: Dave Chinner
Cc: Christoph Hellwig
Cc: Wu Fengguang
Cc: Jan Kara
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Alex Elder
Cc: Theodore Ts'o
Cc: Chris Mason
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
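A hedged sketch of the throttle check in shrink_inactive_list() (shape as
recalled; names assumed):
===
/*
 * Scale the number of isolated pages that must be under writeback
 * before throttling to the reclaim priority: all of nr_taken at
 * DEF_PRIORITY, half at DEF_PRIORITY-1, a quarter at -2, and so on.
 */
if (nr_writeback && nr_writeback >=
			(nr_taken >> (DEF_PRIORITY - priority)))
	wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
===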