14 Oct, 2013

1 commit

  • commit 117aad1e9e4d97448d1df3f84b08bd65811e6d6a upstream.

    Isolated balloon pages can wrongly end up in LRU lists when
    migrate_pages() finishes its round without draining all the isolated
    page list.

    The same issue can happen when reclaim_clean_pages_from_list() tries to
    reclaim pages from an isolated page list, before migration, in the CMA
    path. Such a balloon page leak opens a race window against the LRU list
    shrinkers and leads to the following kernel panic:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
    IP: [] shrink_page_list+0x24e/0x897
    PGD 3cda2067 PUD 3d713067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    RIP: shrink_page_list+0x24e/0x897
    RSP: 0000:ffff88003da499b8 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5
    RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40
    RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45
    R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40
    R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58
    FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0
    Call Trace:
    shrink_inactive_list+0x240/0x3de
    shrink_lruvec+0x3e0/0x566
    __shrink_zone+0x94/0x178
    shrink_zone+0x3a/0x82
    balance_pgdat+0x32a/0x4c2
    kswapd+0x2f0/0x372
    kthread+0xa2/0xaa
    ret_from_fork+0x7c/0xb0
    Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb
    RIP [] shrink_page_list+0x24e/0x897
    RSP
    CR2: 0000000000000028
    ---[ end trace 703d2451af6ffbfd ]---
    Kernel panic - not syncing: Fatal exception

    This patch fixes the issue by ensuring the proper tests are made at
    putback_movable_pages() and reclaim_clean_pages_from_list(), so that
    isolated balloon pages are not wrongly reinserted into LRU lists.
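
    A minimal sketch of the kind of check this implies in putback_movable_pages()
    (hedged: the helper names follow the balloon-compaction API of that era and
    may not match the upstream patch exactly):

        list_for_each_entry_safe(page, page2, page_list, lru) {
            list_del(&page->lru);
            dec_zone_page_state(page, NR_ISOLATED_ANON +
                                page_is_file_cache(page));
            if (unlikely(isolated_balloon_page(page)))
                balloon_page_putback(page);   /* never back onto an LRU list */
            else
                putback_lru_page(page);
        }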

    [akpm@linux-foundation.org: clarify awkward comment text]
    Signed-off-by: Rafael Aquini
    Reported-by: Luiz Capitulino
    Tested-by: Luiz Capitulino
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Rafael Aquini
     

30 Apr, 2013

3 commits

  • In page reclaim, a huge page is split and split_huge_page() adds the tail
    pages to the LRU list. Since we are reclaiming a huge page, it's better to
    reclaim all subpages of the huge page instead of just the head page. This
    patch adds the split tail pages to the shrink page list so that the tail
    pages can be reclaimed soon.

    Before this patch, run a swap workload:
    thp_fault_alloc 3492
    thp_fault_fallback 608
    thp_collapse_alloc 6
    thp_collapse_alloc_failed 0
    thp_split 916

    With this patch:
    thp_fault_alloc 4085
    thp_fault_fallback 16
    thp_collapse_alloc 90
    thp_collapse_alloc_failed 0
    thp_split 1272

    fallback allocation is reduced a lot.
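
    A hedged sketch of the idea in shrink_page_list(): when anonymous pages are
    added to swap, a THP may get split, and the tail pages are handed straight
    to the shrink list rather than the LRU (add_to_swap() growing a list
    parameter follows the description above; details may differ):

        if (PageAnon(page) && !PageSwapCache(page)) {
            if (!(sc->gfp_mask & __GFP_IO))
                goto keep_locked;
            /* may split a THP; the split tail pages land on page_list */
            if (!add_to_swap(page, page_list))
                goto activate_locked;
            may_enter_fs = 1;
        }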

    [akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
    Signed-off-by: Shaohua Li
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: Wanpeng Li
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • With this patch, userland applications that want to maintain the
    interactivity/memory allocation cost can use the pressure level
    notifications. The levels are defined like this:

    The "low" level means that the system is reclaiming memory for new
    allocations. Monitoring this reclaiming activity might be useful for
    maintaining cache levels. Upon notification, the program (typically an
    "Activity Manager") might analyze vmstat and act in advance (e.g.
    prematurely shut down unimportant services).

    The "medium" level means that the system is experiencing medium memory
    pressure: the system might be swapping, paging out active file caches,
    etc. Upon this event, applications may decide to further analyze
    vmstat/zoneinfo/memcg or internal memory usage statistics and free any
    resources that can be easily reconstructed or re-read from disk.

    The "critical" level means that the system is actively thrashing, is
    about to run out of memory (OOM), or the in-kernel OOM killer is already
    on its way to trigger. Applications should do whatever they can to help
    the system. It might be too late to consult vmstat or any other
    statistics, so it's advisable to take immediate action.
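
    For reference, a hedged userspace sketch of registering for these
    notifications through the eventfd-based cgroup interface (the cgroup mount
    path is only an example, and error handling is omitted):

        #include <sys/eventfd.h>
        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
            int efd = eventfd(0, 0);
            int pfd = open("/sys/fs/cgroup/memory/memory.pressure_level", O_RDONLY);
            int cfd = open("/sys/fs/cgroup/memory/cgroup.event_control", O_WRONLY);
            char buf[64];
            uint64_t cnt;

            /* "<event_fd> <pressure_level_fd> <level>" registers the listener */
            snprintf(buf, sizeof(buf), "%d %d low", efd, pfd);
            write(cfd, buf, strlen(buf));

            read(efd, &cnt, sizeof(cnt));   /* blocks until a "low" event fires */
            printf("memory pressure event(s): %llu\n", (unsigned long long)cnt);
            return 0;
        }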

    The events are propagated upward until the event is handled, i.e. the
    events are not pass-through. Here is what this means: suppose you have
    three cgroups, A->B->C, and you set up an event listener on cgroups A, B
    and C; now suppose group C experiences some pressure. In this situation,
    only group C will receive the notification, i.e. groups A and B will not
    receive it. This is done to avoid excessive "broadcasting" of messages,
    which disturbs the system and which is especially bad if we are low on
    memory or thrashing. So, organize the cgroups wisely, or propagate the
    events manually (or ask us to implement pass-through events, explaining
    why you would need them).

    Performance-wise, the memory pressure notification feature itself is
    lightweight and does not require much bookkeeping, in contrast to the
    rest of memcg's features. Unfortunately, in the current memcg
    implementation, page accounting is an inseparable part and cannot be
    turned off. The good news is that there are some efforts[1] to improve
    the situation; plus, implementing the same, fully API-compatible[2]
    interface for the CONFIG_MEMCG=n case (e.g. embedded) is also a viable
    option, so it will not require any changes on the userland side.

    [1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
    [2] http://lkml.org/lkml/2013/2/21/454

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix CONFIG_CGROUPS=n warnings]
    Signed-off-by: Anton Vorontsov
    Acked-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Mel Gorman
    Cc: Glauber Costa
    Cc: Michal Hocko
    Cc: Luiz Capitulino
    Cc: Greg Thelen
    Cc: Leonid Moiseichuk
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Bartlomiej Zolnierkiewicz
    Cc: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Vorontsov
     
  • Local variable total_scanned is no longer used.

    Signed-off-by: Hillf Danton
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

18 Apr, 2013

1 commit

  • Fix the error return value in kswapd_run(). The bug was introduced by
    commit d5dc0ad928fb ("mm/vmscan: fix error number for failed kthread").

    Signed-off-by: Xishi Qiu
    Reviewed-by: Wanpeng Li
    Reviewed-by: Rik van Riel
    Reported-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     

24 Feb, 2013

16 commits

  • This variable is calculated from nr_free_pagecache_pages so
    change its type to unsigned long.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • Recently, Luigi reported that there is lots of free swap space when OOM
    happens. It's easily reproduced on zram-over-swap, where many instances
    of memory hogs are running and laptop_mode is enabled. He said there
    was no problem when he disabled laptop_mode. My investigation found
    that the problem unfolds as follows.

    Assumption for easy explanation: there are no page cache pages in the
    system because they have all already been reclaimed.

    1. try_to_free_pages disables may_writepage when laptop_mode is enabled.
    2. shrink_inactive_list isolates victim pages from the inactive anon LRU list.
    3. shrink_page_list adds them to the swapcache via add_to_swap, but it doesn't
    page them out because sc->may_writepage is 0, so the pages are rotated back
    onto the inactive anon LRU list. add_to_swap marked the pages dirty via
    SetPageDirty.
    4. Step 3 couldn't reclaim any pages, so do_try_to_free_pages increases the
    priority and retries reclaim with a higher priority.
    5. shrink_inactive_list tries to isolate victim pages from the inactive anon
    LRU list but fails, because it isolates pages in ISOLATE_CLEAN mode while the
    inactive anon LRU list is full of dirty pages from step 3, so it just returns
    without any reclaim progress.
    6. do_try_to_free_pages doesn't set may_writepage due to zero total_scanned,
    because sc->nr_scanned is increased by shrink_page_list but we don't call
    shrink_page_list in step 5 due to the shortage of isolated pages.

    The above loop continues until OOM happens.

    The problem didn't happen before [1] was merged because the old logic's
    isolation in shrink_inactive_list succeeded and shrink_page_list was then
    called to page the pages out; that still failed to page them out because
    of may_writepage, but the important point is that sc->nr_scanned was
    increased even though we couldn't swap them out, so do_try_to_free_pages
    could set may_writepage.

    Since commit f80c0673610e ("mm: zone_reclaim: make isolate_lru_page()
    filter-aware") was introduced, it's no longer a good idea to depend only
    on the number of scanned pages for setting may_writepage. So this patch
    adds a new trigger point for setting may_writepage: once the priority
    drops below DEF_PRIORITY - 2, which indicates significant memory pressure
    in the VM, it is a better fit for our purpose to lose power saving (or a
    quiet disk) than to OOM kill.
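
    A minimal sketch of that trigger in do_try_to_free_pages() (hedged: shown as
    the shape of the change, not the verbatim patch):

        do {
            shrink_zones(zonelist, sc);

            /*
             * If we're getting into trouble reclaiming, start doing
             * writepage even in laptop mode.
             */
            if (sc->priority < DEF_PRIORITY - 2)
                sc->may_writepage = 1;

            /* ... remaining per-priority work elided ... */
        } while (--sc->priority >= 0);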

    Signed-off-by: Minchan Kim
    Reported-by: Luigi Semenzato
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • An inactive file list is considered low when its active counterpart is
    bigger, regardless of whether it is a global zone LRU list or a memcg
    zone LRU list. The only difference is in how the LRU size is assessed.

    get_lru_size() does the right thing for both global and memcg reclaim
    situations.

    Get rid of inactive_file_is_low_global() and
    mem_cgroup_inactive_file_is_low() by using get_lru_size() and comparing
    the numbers in common code.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • swap_lock is heavily contended when I test swapping to 3 fast SSDs (it is
    even slightly slower than swapping to 2 such SSDs). The main contention
    comes from swap_info_get(). This patch tries to close the gap by adding a
    new per-partition lock.

    Global data like nr_swapfiles, total_swap_pages, least_priority and
    swap_list are still protected by swap_lock.

    nr_swap_pages is an atomic now, so it can be changed without swap_lock.
    In theory, it's possible that get_swap_page() finds no swap pages while
    there actually are free swap pages, but that doesn't sound like a big
    problem.

    Accessing partition specific data (like scan_swap_map and so on) is only
    protected by swap_info_struct.lock.

    Changing swap_info_struct.flags needs to hold both swap_lock and
    swap_info_struct.lock, because scan_swap_map() will check it. Reading the
    flags is OK with either lock held.

    If both swap_lock and swap_info_struct.lock must be held, we always take
    the former first to avoid deadlock.
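
    A hedged sketch of the resulting lock ordering when both locks are needed
    ("si" being a struct swap_info_struct pointer; the per-partition lock is the
    new addition):

        spin_lock(&swap_lock);      /* global: swap_list, priorities, ... */
        spin_lock(&si->lock);       /* per-partition: scan_swap_map(), flags */
        si->flags |= SWP_WRITEOK;   /* e.g. flag changes need both locks */
        spin_unlock(&si->lock);
        spin_unlock(&swap_lock);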

    swap_entry_free() can change swap_list. To delete that code, we add a
    new highest_priority_index. Whenever get_swap_page() is called, we check
    it; if it's valid, we use it.

    It's a pity that get_swap_page() still holds swap_lock. But in practice,
    swap_lock isn't heavily contended in my test with this patch (or I can
    say there are other, much heavier bottlenecks like TLB flush). And BTW,
    it looks like get_swap_page() doesn't really need the lock: we never free
    swap_info[] and we check the SWAP_WRITEOK flag. The only risk without the
    lock is that we could swap out to some low-priority swap device, but we
    can quickly recover after several rounds of swap, so it doesn't sound
    like a big deal to me. But I'd prefer to fix this if it's a real problem.

    "swap: make each swap partition have one address_space" improved the
    swapout speed from 1.7GB/s to 2GB/s. This patch further improves the
    speed to 2.3GB/s, around a 15% improvement. It's a multi-process test,
    so TLB flush isn't the biggest bottleneck before the patches.

    [arnd@arndb.de: fix it for nommu]
    [hughd@google.com: add missing unlock]
    [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Greg Kroah-Hartman
    Cc: Seth Jennings
    Cc: Konrad Rzeszutek Wilk
    Cc: Xiao Guangrong
    Cc: Dan Magenheimer
    Cc: Stephen Rothwell
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Hugh Dickins
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • This patch introduces PF_MEMALLOC_NOIO as a process flag (in the 'flags'
    field of 'struct task_struct'), so that the flag can be set by a task to
    avoid doing I/O inside memory allocation in that task's context.

    The patch tries to solve a deadlock problem caused by block devices; the
    problem may happen at least in the following situations:

    - during block device runtime resume: if a memory allocation with
    GFP_KERNEL is made inside the runtime resume callback of any of the
    device's ancestors (or the block device itself), the deadlock may be
    triggered inside the memory allocation, since it might not complete until
    the block device becomes active and the involved page I/O finishes. The
    situation was first pointed out by Alan Stern. It is not a good approach
    to convert every GFP_KERNEL[1] allocation in the path into GFP_NOIO,
    because several subsystems may be involved (for example, PCI, USB and
    SCSI may be involved for a USB mass storage device, and network devices
    are involved too in the iSCSI case).

    - during block device runtime suspend, because runtime resume needs to
    wait for the completion of a concurrent runtime suspend.

    - during error handling of a USB mass storage device, a USB bus reset
    will be performed on the device, so there shouldn't be any memory
    allocation with GFP_KERNEL during the USB bus reset, otherwise a deadlock
    similar to the above may be triggered. Unfortunately, any USB device may
    in theory include a mass storage interface, so this would require all USB
    interface drivers to handle the situation. In fact, most USB drivers
    don't know how to handle a bus reset on the device and don't provide
    .pre_reset() and .post_reset() callbacks at all, so the USB core has to
    unbind and rebind the driver for these devices. So it is still not
    practical to resort to GFP_NOIO for solving the problem.

    The introduced solution can also be used by the block subsystem or block
    drivers, for example by setting the PF_MEMALLOC_NOIO flag before doing
    the actual I/O transfer.

    It is not a good idea to convert all these GFP_KERNEL allocations in the
    affected path into GFP_NOIO, because the functions doing the allocations
    may be implemented as library code and called from many other contexts.

    In fact, memalloc_noio_flags() can convert some of the current static
    GFP_NOIO allocations back into GFP_KERNEL in other, non-affected
    contexts; at least almost all GFP_NOIO allocations in the USB subsystem
    can be converted into GFP_KERNEL after applying this approach, making
    GFP_NOIO allocations generally happen only in runtime resume, bus reset
    and block I/O transfer contexts.
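
    For reference, a hedged sketch of the helpers this approach relies on
    (modelled on the sched.h additions; exact placement may differ):

        static inline unsigned int memalloc_noio_flags(unsigned int flags)
        {
            if (unlikely(current->flags & PF_MEMALLOC_NOIO))
                flags &= ~(__GFP_IO | __GFP_FS);
            return flags;
        }

        static inline unsigned int memalloc_noio_save(void)
        {
            unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
            current->flags |= PF_MEMALLOC_NOIO;
            return flags;
        }

        static inline void memalloc_noio_restore(unsigned int flags)
        {
            current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
        }

    A driver would then bracket its runtime resume path with
    memalloc_noio_save()/memalloc_noio_restore(), and the allocator side would
    filter gfp_mask through memalloc_noio_flags().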

    [1] Several GFP_KERNEL allocation examples in the runtime resume path:

    - pci subsystem
    acpi_os_allocate

    Signed-off-by: Ming Lei
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Cc: Jens Axboe
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • From: Zlatko Calusic

    Commit 92df3a723f84 ("mm: vmscan: throttle reclaim if encountering too
    many dirty pages under writeback") introduced waiting on congested zones
    based on a sane algorithm in shrink_inactive_list().

    What this means is that there's no more need for throttling and
    additional heuristics in balance_pgdat(). So, let's remove it and tidy
    up the code.

    Signed-off-by: Zlatko Calusic
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     
  • Now we have zone->managed_pages for "pages managed by the buddy system
    in the zone", so replace zone->present_pages with zone->managed_pages
    where what the user really wants is the number of allocatable pages.

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Now that balance_pgdat() is slightly tidied up, thanks to the more
    capable pgdat_balanced(), it's become obvious that pgdat_balanced() is
    called to check the status, then the loop is broken if pgdat is balanced,
    just for pgdat_balanced() to be immediately called again. The second call
    is completely unnecessary, of course.

    The patch introduces a pgdat_is_balanced boolean, which helps resolve the
    above suboptimal behavior, with the added benefit of slightly better
    documenting one other place in the function where we jump and skip lots
    of code.

    Signed-off-by: Zlatko Calusic
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     
  • Targeted (hard or soft limit) reclaim has traditionally tried to scan one
    group with decreasing priority until nr_to_reclaim (SWAP_CLUSTER_MAX
    pages) is reclaimed or all priorities are exhausted. The reclaim is
    then retried until the limit is met.

    This approach, however, doesn't work well with deeper hierarchies where
    groups higher in the hierarchy do not have any, or have only very few,
    pages (this usually happens if those groups do not have any tasks and
    they have only re-parented pages after some of their children were
    removed). Those groups are pointlessly reclaimed with decreasing
    priority, as there is nothing to reclaim from them.

    The easiest fix is to break out of the memcg iteration loop in
    shrink_zone only if the whole hierarchy has been visited or sufficient
    pages have been reclaimed. This is also more natural because the
    reclaimer expects that the hierarchy under the given root is reclaimed.
    As a result we can simplify the soft limit reclaim which does its own
    iteration.
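
    A hedged sketch of the resulting loop shape in shrink_zone(), using the
    mem_cgroup_iter() API of that era (the break condition follows the
    changelog; details may differ):

        memcg = mem_cgroup_iter(root, NULL, &reclaim);
        do {
            struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);

            shrink_lruvec(lruvec, sc);

            /*
             * Limit/direct reclaim can stop once enough pages were
             * reclaimed; global reclaim keeps going so the whole
             * hierarchy is aged equally.
             */
            if (!global_reclaim(sc) &&
                sc->nr_reclaimed >= sc->nr_to_reclaim) {
                mem_cgroup_iter_break(root, memcg);
                break;
            }
            memcg = mem_cgroup_iter(root, memcg, &reclaim);
        } while (memcg);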

    [yinghan@google.com: break out of the hierarchy loop only if nr_reclaimed exceeded nr_to_reclaim]
    [akpm@linux-foundation.org: use conventional comparison order]
    Signed-off-by: Michal Hocko
    Reported-by: Ying Han
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Li Zefan
    Signed-off-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • "mm: vmscan: save work scanning (almost) empty LRU lists" made
    SWAP_CLUSTER_MAX an unsigned long.

    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The restart logic for when reclaim operates back to back with compaction
    is currently applied on the lruvec level. But this does not make sense,
    because the container of interest for compaction is a zone as a whole,
    not the zone pages that are part of a certain memory cgroup.

    Negative impact is bounded. For one, the code checks that the lruvec
    has enough reclaim candidates, so it does not risk getting stuck on a
    condition that cannot be fulfilled. And the unfairness of hammering on
    one particular memory cgroup to make progress in a zone will be
    amortized by the round robin manner in which reclaim goes through the
    memory cgroups. Still, this can lead to unnecessary allocation
    latencies when the code elects to restart on a hard to reclaim or small
    group when there are other, more reclaimable groups in the zone.

    Move this logic to the zone level and restart reclaim for all memory
    cgroups in a zone when compaction requires more free pages from it.

    [akpm@linux-foundation.org: no need for min_t]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Reclaim pressure balance between anon and file pages is calculated
    through a tuple of numerators and a shared denominator.

    Exceptional cases that want to force-scan anon or file pages configure
    the numerators and denominator such that one list is preferred, which is
    not necessarily the most obvious way:

    fraction[0] = 1;
    fraction[1] = 0;
    denominator = 1;
    goto out;

    Make this easier by making the force-scan cases explicit and using the
    fractions only when they are calculated from reclaim history.
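
    A hedged sketch of what making the cases explicit can look like: an enum of
    scan targets consulted at the end of get_scan_count() (naming is
    illustrative):

        enum scan_balance {
            SCAN_EQUAL,   /* scan both lists relative to their size */
            SCAN_FRACT,   /* apply the computed fraction[]/denominator */
            SCAN_ANON,    /* force-scan anonymous pages only */
            SCAN_FILE,    /* force-scan file pages only */
        };

        /* e.g. instead of fraction[0] = 1; fraction[1] = 0; denominator = 1; */
        scan_balance = SCAN_ANON;
        goto out;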

    [akpm@linux-foundation.org: avoid using uninitialized_var()]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Fix comment style and elaborate on why anonymous memory is force-scanned
    when file cache runs low.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • A swappiness of 0 has a slightly different meaning for global reclaim
    (it may swap if the file cache is really low) and memory cgroup reclaim
    (never swap, ever).

    In addition, global reclaim at highest priority will scan all LRU lists
    equal to their size and ignore other balancing heuristics. UNLESS
    swappiness forbids swapping, then the lists are balanced based on recent
    reclaim effectiveness. UNLESS file cache is running low, then anonymous
    pages are force-scanned.

    This (total mess of a) behaviour is implicit and not obvious from the
    way the code is organized. At least make it apparent in the code flow
    and document the conditions. That will make it easier to come up with
    sane semantics later.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Satoru Moriya
    Reviewed-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum
    amount of pages is scanned from the LRU lists on each iteration, to make
    progress.

    Do not make this minimum bigger than the respective LRU list size,
    however, and save some busy work trying to isolate and reclaim pages
    that are not there.

    Empty LRU lists are quite common with memory cgroups in NUMA
    environments because there exists a set of LRU lists for each zone for
    each memory cgroup, while the memory of a single cgroup is expected to
    stay on just one node. The number of expected empty LRU lists is thus

    memcgs * (nodes - 1) * lru types

    Each attempt to reclaim from an empty LRU list does expensive size
    comparisons between lists, acquires the zone's lru lock etc. Avoid
    that.
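
    A minimal sketch of the clamping described above, inside get_scan_count()
    (hedged; close to the shape of the change):

        size = get_lru_size(lruvec, lru);
        scan = size >> sc->priority;
        if (!scan && force_scan)
            scan = min(size, SWAP_CLUSTER_MAX);   /* never more than the list holds */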

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit e9868505987a ("mm, vmscan: only evict file pages when we have
    plenty") makes a point of not going for anonymous memory while there is
    still enough inactive cache around.

    The check was added only for global reclaim, but it is just as useful to
    reduce swapping in memory cgroup reclaim:

    200M-memcg-defconfig-j2

    vanilla patched
    Real time 454.06 ( +0.00%) 453.71 ( -0.08%)
    User time 668.57 ( +0.00%) 668.73 ( +0.02%)
    System time 128.92 ( +0.00%) 129.53 ( +0.46%)
    Swap in 1246.80 ( +0.00%) 814.40 ( -34.65%)
    Swap out 1198.90 ( +0.00%) 827.00 ( -30.99%)
    Pages allocated 16431288.10 ( +0.00%) 16434035.30 ( +0.02%)
    Major faults 681.50 ( +0.00%) 593.70 ( -12.86%)
    THP faults 237.20 ( +0.00%) 242.40 ( +2.18%)
    THP collapse 241.20 ( +0.00%) 248.50 ( +3.01%)
    THP splits 157.30 ( +0.00%) 161.40 ( +2.59%)

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

04 Jan, 2013

1 commit

  • CONFIG_HOTPLUG is going away as an option. As a result, the __dev*
    markings need to be removed.

    This change removes the use of __devinit from the file.

    Based on patches originally written by Bill Pemberton, but redone by me
    in order to handle some of the coding style issues better, by hand.

    Cc: Bill Pemberton
    Cc: Andrew Morton
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Konstantin Khlebnikov
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

29 Dec, 2012

1 commit

  • An unintended consequence of commit 4ae0a48b5efc ("mm: modify
    pgdat_balanced() so that it also handles order-0") is that
    wait_iff_congested() can now be called with a NULL 'struct zone *',
    producing a kernel oops like this:

    BUG: unable to handle kernel NULL pointer dereference
    IP: [] wait_iff_congested+0x59/0x140

    This trivial patch fixes it.

    Reported-by: Zhouping Liu
    Reported-and-tested-by: Sedat Dilek
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Zlatko Calusic
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     

24 Dec, 2012

1 commit


20 Dec, 2012

1 commit

  • On a 4GB RAM machine, where Normal zone is much smaller than DMA32 zone,
    the Normal zone gets fragmented in time. This requires relatively more
    pressure in balance_pgdat to get the zone above the required watermark.
    Unfortunately, the congestion_wait() call in there slows it down for a
    completely wrong reason, expecting that there's a lot of
    writeback/swapout, even when there's none (much more common). After a
    few days, when fragmentation progresses, this flawed logic translates to
    very high CPU iowait times, even though there's no I/O congestion at
    all. If THP is enabled, the problem occurs sooner, but I was able to
    see it even on !THP kernels, just by giving it a bit more time to occur.

    The proper way to deal with this is to not wait, unless there's
    congestion. Thanks to Mel Gorman, we already have the function that
    perfectly fits the job. The patch was tested on a machine which nicely
    revealed the problem after only 1 day of uptime, and it's been working
    great.
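
    A hedged sketch of the substitution in balance_pgdat() (wait_iff_congested()
    is the existing helper referred to above; arguments shown per its signature
    at the time):

        /* before: sleeps unconditionally, inflating iowait with no I/O pending */
        congestion_wait(BLK_RW_ASYNC, HZ/10);

        /* after: sleeps only if the zone's backing devices are congested */
        wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);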

    Signed-off-by: Zlatko Calusic
    Acked-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     

19 Dec, 2012

2 commits

  • Neil found that if too_many_isolated() returns true while performing
    direct reclaim we can end up waiting for other threads to complete their
    direct reclaim. If those threads are allowed to enter the FS or IO to
    free memory, but this thread is not, then it is possible that those
    threads will be waiting on this thread and so we get a circular deadlock.

    some task enters direct reclaim with GFP_KERNEL
    => too_many_isolated() false
    => vmscan and run into dirty pages
    => pageout()
    => take some FS lock
    => fs/block code does GFP_NOIO allocation
    => enter direct reclaim again
    => too_many_isolated() true
    => waiting for others to progress, however the other
    tasks may be circular waiting for the FS lock..

    The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
    priority than normal ones, by lowering the throttle threshold for the
    latter.

    Allowing ~1/8 of the pages to be isolated for normal reclaims is large
    enough. For example, for a 1GB LRU list, that's ~128MB of isolated pages,
    or 1k blocked tasks (each isolating 32 4KB pages), or 64 blocked tasks per
    logical CPU (assuming 16 logical CPUs per NUMA node). So it's not likely
    that some CPU goes idle waiting (when it could make progress) because of
    this limit: there are many more sleeping reclaim tasks than the number of
    CPUs, so a task may well be blocked by some low-level queue/lock anyway.

    Now !GFP_IOFS reclaims won't be waiting for GFP_IOFS reclaims to progress.
    They will be blocked only when there are too many concurrent !GFP_IOFS
    reclaims, but that's very unlikely because the IO-less direct reclaims are
    able to progress much faster, and they won't deadlock with each other.
    The threshold is raised high enough for them, so that there can be
    sufficient parallel progress of !GFP_IOFS reclaims.
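
    A hedged sketch of the resulting check in too_many_isolated() (close to the
    shape of the fix; GFP_IOFS is __GFP_IO | __GFP_FS):

        static int too_many_isolated(struct zone *zone, int file,
                                     struct scan_control *sc)
        {
            unsigned long inactive, isolated;

            if (current_is_kswapd())
                return 0;

            if (!global_reclaim(sc))
                return 0;

            if (file) {
                inactive = zone_page_state(zone, NR_INACTIVE_FILE);
                isolated = zone_page_state(zone, NR_ISOLATED_FILE);
            } else {
                inactive = zone_page_state(zone, NR_INACTIVE_ANON);
                isolated = zone_page_state(zone, NR_ISOLATED_ANON);
            }

            /*
             * GFP_NOIO/GFP_NOFS callers are allowed to isolate more
             * pages, so they won't get blocked by normal direct
             * reclaimers, forming the circular deadlock above.
             */
            if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
                inactive >>= 3;

            return isolated > inactive;
        }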

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Wu Fengguang
    Cc: Torsten Kaiser
    Tested-by: NeilBrown
    Reviewed-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Comment "Why it's doing so" rather than "What it does" as proposed by
    Andrew Morton.

    Signed-off-by: Wu Fengguang
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     

13 Dec, 2012

1 commit

  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.
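
    A one-line illustration of the substitution (hedged; do_something_with() is
    a placeholder for the per-node work):

        for_each_node_state(nid, N_MEMORY)   /* was: N_HIGH_MEMORY */
            do_something_with(nid);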

    Signed-off-by: Lai Jiangshan
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

12 Dec, 2012

3 commits

  • kswapd()->try_to_freeze() is defined to return a boolean, so it's better
    to use a bool to hold its return value.

    Signed-off-by: Jie Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Liu
     
  • If we have more inactive file pages than active file pages, we skip
    scanning the active file pages altogether, with the idea that we do not
    want to evict the working set when there is plenty of streaming IO in the
    cache.

    However, the code forgot to also skip scanning anonymous pages in that
    situation. That leads to the curious situation of keeping the active file
    pages protected from being paged out when there are lots of inactive file
    pages, while still scanning and evicting anonymous pages.

    This patch fixes that situation, by only evicting file pages when we have
    plenty of them and most are inactive.
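
    A rough sketch of the added shortcut in get_scan_count(), written in the
    fraction[]/denominator idiom quoted elsewhere in this log (hedged: this is
    the gist, not the verbatim patch):

        /*
         * There is enough inactive page cache: do not reclaim anything
         * from the anonymous working set right now.
         */
        if (global_reclaim(sc) && !inactive_file_is_low(lruvec)) {
            fraction[0] = 0;    /* anon */
            fraction[1] = 1;    /* file */
            denominator = 1;
            goto out;
        }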

    [akpm@linux-foundation.org: adjust comment layout]
    Signed-off-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • We don't need custom COMPACTION_BUILD anymore, since we have handy
    IS_ENABLED().
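
    For example (a hedged one-liner, call site illustrative):

        if (IS_ENABLED(CONFIG_COMPACTION) && order)   /* was: COMPACTION_BUILD && order */
            compact_pgdat(pgdat, order);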

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Minchan Kim
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

09 Dec, 2012

1 commit

  • commit c702418f8a2f ("mm: vmscan: do not keep kswapd looping forever due
    to individual uncompactable zones") removed zone watermark checks from
    the compaction code in kswapd but left in the zone congestion clearing,
    which now happens unconditionally on higher order reclaim.

    This messes up the reclaim throttling logic for zones with
    dirty/writeback pages, where zones should only lose their congestion
    status when their watermarks have been restored.

    Remove the clearing from the zone compaction section entirely. The
    preliminary zone check and the reclaim loop in kswapd will clear it if
    the zone is considered balanced.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

07 Dec, 2012

1 commit

  • When a zone meets its high watermark and is compactable in case of
    higher order allocations, it contributes to the percentage of the node's
    memory that is considered balanced.

    This requirement, that a node be only partially balanced, came about
    when kswapd was desperately trying to balance tiny zones when all bigger
    zones in the node had plenty of free memory. Arguably, the same should
    apply to compaction: if a significant part of the node is balanced
    enough to run compaction, do not get hung up on that tiny zone that
    might never get in shape.

    When the compaction logic in kswapd is reached, we know that at least
    25% of the node's memory is balanced properly for compaction (see
    zone_balanced and pgdat_balanced). Remove the individual zone checks
    that restart the kswapd cycle.

    Otherwise, we may observe more endless looping in kswapd where the
    compaction code loops back to reclaim because of a single zone and
    reclaim does nothing because the node is considered balanced overall.

    See for example

    https://bugzilla.redhat.com/show_bug.cgi?id=866988

    Signed-off-by: Johannes Weiner
    Reported-and-tested-by: Thorsten Leemhuis
    Reported-by: Jiri Slaby
    Tested-by: John Ellson
    Tested-by: Zdenek Kabelac
    Tested-by: Bruno Wolff III
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

01 Dec, 2012

1 commit

  • Kswapd does not in all places have the same criteria for a balanced
    zone. Zones are only being reclaimed when their high watermark is
    breached, but compaction checks loop over the zonelist again when the
    zone does not meet the low watermark plus two times the size of the
    allocation. This gets kswapd stuck in an endless loop over a small
    zone, like the DMA zone, where the high watermark is smaller than the
    compaction requirement.

    Add a function, zone_balanced(), that checks the watermark, and, for
    higher order allocations, if compaction has enough free memory. Then
    use it uniformly to check for balanced zones.

    This makes sure that when the compaction watermark is not met, at least
    reclaim happens and progress is made - or the zone is declared
    unreclaimable at some point and skipped entirely.
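
    A hedged sketch of a zone_balanced() helper along the lines described (a
    watermark check plus, for order > 0, a compaction-suitability check; not
    necessarily identical to the upstream version):

        static bool zone_balanced(struct zone *zone, int order,
                                  unsigned long balance_gap, int classzone_idx)
        {
            if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
                                        balance_gap, classzone_idx, 0))
                return false;

            if (IS_ENABLED(CONFIG_COMPACTION) && order &&
                compaction_suitable(zone, order) == COMPACT_SKIPPED)
                return false;

            return true;
        }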

    Signed-off-by: Johannes Weiner
    Reported-by: George Spelvin
    Reported-by: Johannes Hirte
    Reported-by: Tomas Racek
    Tested-by: Johannes Hirte
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

27 Nov, 2012

1 commit

  • Commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC
    reserves are low and swap is backed by network storage") introduced a
    check for fatal signals after a process gets throttled for network
    storage. The intention was that if a process was throttled and got
    killed that it should not trigger the OOM killer. As pointed out by
    Minchan Kim and David Rientjes, this check is in the wrong place and too
    broad. If a system is in an OOM situation and a process is exiting, it
    can loop in __alloc_pages_slowpath(), calling direct reclaim in a loop.
    As the fatal signal is pending, it returns 1 as if it were making forward
    progress, and can effectively deadlock.

    This patch moves the fatal_signal_pending() check after throttling to
    throttle_direct_reclaim() where it belongs. If the process is killed
    while throttled, it will return immediately without direct reclaim
    except now it will have TIF_MEMDIE set and will use the PFMEMALLOC
    reserves.

    Minchan pointed out that it may be better to direct reclaim before
    returning, to avoid using the reserves, because there may be pages that
    can easily be reclaimed and that would avoid dipping into the reserves.
    However, we do no such targeted reclaim, and there is no guarantee that
    suitable pages are available. As this throttling is expected to happen
    when swap-over-NFS is used, there is a possibility that the process will
    instead swap, which may allocate network buffers from the PFMEMALLOC
    reserves. Hence, in the swap-over-NFS case where a process can be
    throttled and killed, it can either use the reserves to exit or
    potentially use the reserves to swap a few pages and then exit. This
    patch takes the option of using the reserves if necessary to allow the
    process to exit quickly.
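
    A hedged sketch of where the check ends up: throttle_direct_reclaim()
    returns true when the caller should skip direct reclaim and use its
    reserves to exit (the throttling details are elided):

        static bool throttle_direct_reclaim(gfp_t gfp_mask,
                                            struct zonelist *zonelist,
                                            nodemask_t *nodemask)
        {
            /* ... pick pgdat, throttle on pfmemalloc_wait ... */
            wait_event_killable(pgdat->pfmemalloc_wait,
                                pfmemalloc_watermark_ok(pgdat));

            /* killed while throttled: bail out, exit on the reserves */
            if (fatal_signal_pending(current))
                return true;

            return false;
        }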

    If this patch passes review it should be considered a -stable candidate
    for 3.6.

    Signed-off-by: Mel Gorman
    Cc: David Rientjes
    Cc: Luigi Semenzato
    Cc: Dan Magenheimer
    Cc: KOSAKI Motohiro
    Cc: Sonny Rao
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Nov, 2012

1 commit

  • Jiri Slaby reported the following:

    (It's an effective revert of "mm: vmscan: scale number of pages
    reclaimed by reclaim/compaction based on failures".) Given kswapd
    had hours of runtime in ps/top output yesterday in the morning
    and after the revert it's now 2 minutes in sum for the last 24h,
    I would say, it's gone.

    The intention of the patch in question was to compensate for the loss of
    lumpy reclaim. Part of the reason lumpy reclaim worked is because it
    aggressively reclaimed pages and this patch was meant to be a sane
    compromise.

    When compaction fails, it gets deferred, and both compaction and
    reclaim/compaction are deferred to avoid excessive reclaim. However,
    since commit c654345924f7 ("mm: remove __GFP_NO_KSWAPD"), kswapd is woken
    up each time and continues reclaiming, which was not taken into account
    when the patch was developed.

    Attempts to address the problem ended up just changing the shape of the
    problem instead of fixing it. The release window gets closer and while
    a THP allocation failing is not a major problem, kswapd chewing up a lot
    of CPU is.

    This patch reverts commit 83fde0f22872 ("mm: vmscan: scale number of
    pages reclaimed by reclaim/compaction based on failures") and will be
    revisited in the future.

    Signed-off-by: Mel Gorman
    Cc: Zdenek Kabelac
    Tested-by: Valdis Kletnieks
    Cc: Jiri Slaby
    Cc: Rik van Riel
    Cc: Jiri Slaby
    Cc: Johannes Hirte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Nov, 2012

1 commit

  • In kswapd(), set current->reclaim_state to NULL before returning, as
    current->reclaim_state holds reference to variable on kswapd()'s stack.

    In rare cases, while returning from kswapd() during memory offlining,
    __free_slab() and freepages() can access the dangling pointer of
    current->reclaim_state.

    Signed-off-by: Takamori Yamaguchi
    Signed-off-by: Aaditya Kumar
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Takamori Yamaguchi
     

09 Oct, 2012

3 commits

  • Presently CMA cannot migrate mlocked pages, so it ends up failing to
    allocate contiguous memory space.

    This patch makes mlocked pages migratable. Of course, it can affect
    realtime processes, but in the CMA use case, failing a contiguous memory
    allocation is far worse than variable access latency to an mlocked page
    while CMA is running. If someone wants to make the system realtime, they
    shouldn't enable CMA, because stalls can still happen at random times.
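
    A hedged sketch of the mechanism: CMA's isolation passes an extra isolation
    mode bit so that __isolate_lru_page() no longer refuses unevictable
    (mlocked) pages (flag name per the change of that time; details may
    differ):

        /* CMA caller opts in: */
        isolate_mode_t mode = ISOLATE_UNEVICTABLE;

        /* in __isolate_lru_page(): */
        if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
            return -EBUSY;   /* skip mlocked pages unless the caller opted in */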

    [akpm@linux-foundation.org: tweak comment text, per Mel]
    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • page_evictable(page, vma) is an irritant: almost all its callers pass
    NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
    explicitly in the couple of places it's needed. But in those places we
    don't even need page_evictable() itself! They're dealing with a freshly
    allocated anonymous page, which has no "mapping" and cannot be mlocked yet.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Compaction caches whether a pageblock was scanned and no pages were
    isolated, so that such pageblocks can be skipped in the future to reduce
    scanning. This information is not cleared by the page allocator based on
    activity, due to the impact it would have on the page allocator fast
    paths. Hence there is a requirement that something clear the cache, or
    pageblocks will be skipped forever. Currently the cache is cleared if
    there were a number of recent allocation failures and it has not been
    cleared within the last 5 seconds. Time-based decisions like this are
    terrible as they have no relationship to VM activity and are basically a
    big hammer.

    Unfortunately, accurate heuristics would add cost to some hot paths so
    this patch implements a rough heuristic. There are two cases where the
    cache is cleared.

    1. If a !kswapd process completes a compaction cycle (the migrate and
    free scanners meet), the zone is marked compact_blockskip_flush. When
    kswapd goes to sleep, it will clear the cache. This is expected to be
    the common case where the cache is cleared. It does not really matter
    whether kswapd happens to be asleep or going to sleep when the flag is
    set, as it will be woken on the next allocation request.

    2. If there have been multiple failures recently and compaction just
    finished being deferred then a process will clear the cache and start a
    full scan. This situation happens if there are multiple high-order
    allocation requests under heavy memory pressure.

    The clearing of the PG_migrate_skip bits and other scans is inherently
    racy, but the race is harmless. Allocations that can fail, such as THP,
    will simply fail. Requests that cannot fail will retry the allocation.
    Tests indicated that scanning rates were roughly similar to when the
    time-based heuristic was used, and the allocation success rates were
    similar.
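
    A hedged sketch of case 1, in kswapd's go-to-sleep path (the helper is the
    compaction-side reset routine; the exact call site may differ):

        /*
         * Compaction records pageblocks it failed to isolate pages from
         * and skips them in future scans.  Since kswapd is going to sleep,
         * assume allocations and compaction may now succeed and reset the
         * per-zone cache for the whole node.
         */
        reset_isolation_suitable(pgdat);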

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman