15 Aug, 2015

3 commits

  • Bug:

    ------------[ cut here ]------------
    kernel BUG at mm/huge_memory.c:1957!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 snd_hda_codec_realtek snd_hda_codec_generic nfsv4 dns_re
    CPU: 2 PID: 2576 Comm: test_huge Not tainted 4.2.0-rc5-mm1+ #27
    Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
    task: ffff880204e3d600 ti: ffff8800db16c000 task.ti: ffff8800db16c000
    RIP: split_huge_page_to_list+0xdb/0x120
    Call Trace:
    memory_failure+0x32e/0x7c0
    madvise_hwpoison+0x8b/0x160
    SyS_madvise+0x40/0x240
    ? do_page_fault+0x37/0x90
    entry_SYSCALL_64_fastpath+0x12/0x71
    Code: ff f0 41 ff 4c 24 30 74 0d 31 c0 48 83 c4 08 5b 41 5c 41 5d c9 c3 4c 89 e7 e8 e2 58 fd ff 48 83 c4 08 31 c0
    RIP split_huge_page_to_list+0xdb/0x120
    RSP
    ---[ end trace aee7ce0df8e44076 ]---

    Testcase:

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/mman.h>

    #define MB (1024 * 1024)

    int main(void)
    {
            char *mem;

            /* 200MB anonymous buffer, aligned to the 2MB huge page size */
            posix_memalign((void **)&mem, 2 * MB, 200 * MB);

            /* poison injection; needs CAP_SYS_ADMIN and CONFIG_MEMORY_FAILURE */
            madvise(mem, 200 * MB, MADV_HWPOISON);

            free(mem);

            return 0;
    }

    A huge zero page is allocated when a page fault happens without the
    FAULT_FLAG_WRITE flag. get_user_pages_fast(), which is called in
    madvise_hwpoison(), will get the huge zero page if the page has not been
    allocated before. The huge zero page is a transparent huge page;
    however, it is not an anonymous page. memory_failure() will split the
    huge zero page and trigger

    BUG_ON(is_huge_zero_page(page));

    After commit 98ed2b0052e6 ("mm/memory-failure: give up error handling
    for non-tail-refcounted thp"), memory_failure() no longer catches a
    non-anon thp coming from the madvise_hwpoison() path, and this bug
    occurs.

    Fix it by catching non-anon thp in memory_failure() so that the huge
    zero page is never split in the madvise_hwpoison() path.
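
    The shape of the fix, as a minimal sketch (simplified, not the exact
    hunk; the pr_err() text matches the log below):

    /* in memory_failure(), before attempting to split the thp */
    if (!PageHuge(p) && PageTransHuge(hpage)) {
            if (!PageAnon(hpage)) {
                    pr_err("MCE: %#lx: non anonymous thp\n", pfn);
                    put_page(p);
                    return -EBUSY;
            }
    }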

    After this patch:

    Injecting memory failure for page 0x202800 at 0x7fd8ae800000
    MCE: 0x202800: non anonymous thp
    [...]

    [akpm@linux-foundation.org: remove second split, per Wanpeng]
    Signed-off-by: Wanpeng Li
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Hugetlbfs pages get a refcount in get_any_page() or madvise_hwpoison()
    when soft offlining through madvise(). The refcount held by the soft
    offline path should be released if we fail to isolate the hugetlbfs
    page.

    Fix it by dropping the refcount on both isolation success and failure.
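
    A minimal sketch of the fixed isolation path (assuming the caller
    already holds the reference from get_any_page()/madvise_hwpoison()):

    ret = isolate_huge_page(hpage, &pagelist);
    /*
     * The soft offline path took a refcount for us; drop it whether
     * or not the isolation succeeded.
     */
    put_page(hpage);
    if (!ret) {
            pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn);
            return -EBUSY;
    }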

    Signed-off-by: Wanpeng Li
    Acked-by: Naoya Horiguchi
    Cc: [3.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • After trying to drain pages from the pagevec/pageset, we take the
    page's reference count again; however, that reference is not dropped
    if the page is still not on the LRU list.

    Fix it by adding a put_page() to drop the page reference taken in
    __get_any_page().
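
    A sketch of the resulting logic in the soft offline path (simplified
    from the description above, not the verbatim patch):

    /* after drain_all_pages(), retry taking the page reference */
    ret = __get_any_page(page, pfn, 0);
    if (ret == 1 && !PageLRU(page)) {
            /* drop the page reference which is from __get_any_page() */
            put_page(page);
            pr_info("soft_offline: %#lx: unknown non LRU page type %lx\n",
                    pfn, page->flags);
            return -EIO;
    }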

    Signed-off-by: Wanpeng Li
    Acked-by: Naoya Horiguchi
    Cc: [3.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

07 Aug, 2015

4 commits

  • Now the page freeing code doesn't consider PageHWPoison a bad page, so
    by setting it before completing the page containment, we can prevent
    the error page from being reused right after a successful page
    migration.

    I added TTU_IGNORE_HWPOISON for try_to_unmap() to make sure that the
    page table entry is transformed into a migration entry, not into a
    hwpoison entry.
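
    For illustration, the unmap call in the migration path then looks
    roughly like this (a sketch, not the exact hunk):

    /* force a migration entry, never a hwpoison entry */
    try_to_unmap(page, TTU_MIGRATION | TTU_IGNORE_MLOCK |
                       TTU_IGNORE_ACCESS | TTU_IGNORE_HWPOISON);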

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • "non anonymous thp" case is still racy with freeing thp, which causes
    panic due to put_page() for refcount-0 page. It seems that closing up
    this race might be hard (and/or not worth doing,) so let's give up the
    error handling for this case.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When memory_failure() is called on a page which has just been freed
    after page migration from soft offlining, the counter
    num_poisoned_pages is raised twice. So let's fix it by using
    TestSetPageHWPoison.
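
    The counting then becomes idempotent, roughly:

    /* only the first poisoner bumps the counter */
    if (!TestSetPageHWPoison(page))
            atomic_long_inc(&num_poisoned_pages);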

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Recently I addressed a few hwpoison race problems and the patches were
    merged in v4.2-rc1. That made progress, but unfortunately some
    problems remain due to gaps in my test coverage. So I'm trying to fix
    or avoid them in this series.

    One point I'm expecting to discuss is that patch 4/5 changes the set
    of page flags checked at free time. In the current behavior,
    __PG_HWPOISON is not supposed to be set when the page is freed. I
    think there is no strong reason for this behavior, and it causes a
    problem that is hard to fix on the error handler side alone (because
    __PG_HWPOISON can be set at arbitrary timing). So I suggest changing
    it.

    With this patchset, hwpoison stress testing in official mce-test
    testsuite (which previously failed) passes.

    This patch (of 5):

    In "just unpoisoned" path, we do put_page and then unlock_page, which is
    a wrong order and causes "freeing locked page" bug. So let's fix it.
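
    The corrected ordering, for clarity:

    /* never drop what may be the last reference to a locked page */
    unlock_page(page);
    put_page(page);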

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

25 Jun, 2015

9 commits

  • RAS user-space tools like rasdaemon, which are based on trace events,
    can receive an mce error event but no memory recovery result event,
    leaving the scenario incomplete. So I want to add this event to
    complete it.

    This patch adds an event to the ras group for memory-failure.

    The output looks like this:

    # tracer: nop
    #
    # entries-in-buffer/entries-written: 2/2   #P:24
    #
    #                              _-----=> irqs-off
    #                             / _----=> need-resched
    #                            | / _---=> hardirq/softirq
    #                            || / _--=> preempt-depth
    #                            ||| /     delay
    #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
    #              | |       |   ||||       |         |
    mce-inject-13150 [001] .... 277.019359: memory_failure_event: pfn 0x19869: recovery action for free buddy page: Delayed
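
    A sketch of what such a trace event definition looks like (the field
    names and show_* helpers here are illustrative, not the exact ones
    merged):

    TRACE_EVENT(memory_failure_event,
            TP_PROTO(unsigned long pfn, int type, int result),
            TP_ARGS(pfn, type, result),

            TP_STRUCT__entry(
                    __field(unsigned long, pfn)
                    __field(int, type)
                    __field(int, result)
            ),

            TP_fast_assign(
                    __entry->pfn    = pfn;
                    __entry->type   = type;
                    __entry->result = result;
            ),

            TP_printk("pfn %#lx: recovery action for %s: %s",
                      __entry->pfn,
                      show_action_type(__entry->type),     /* illustrative */
                      show_action_result(__entry->result)) /* illustrative */
    );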

    [xiexiuqi@huawei.com: fix build error]
    Signed-off-by: Xie XiuQi
    Reviewed-by: Naoya Horiguchi
    Acked-by: Steven Rostedt
    Cc: Tony Luck
    Cc: Chen Gong
    Cc: Jim Davis
    Signed-off-by: Xie XiuQi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     
  • Change the type of action_result()'s third parameter to the enum for
    type consistency, and rename mf_outcome to mf_result for clarity.

    Signed-off-by: Xie XiuQi
    Acked-by: Naoya Horiguchi
    Cc: Chen Gong
    Cc: Jim Davis
    Cc: Steven Rostedt
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     
  • Export 'outcome' and 'action_page_type' to mm.h, so we can use these
    enums outside.

    This patch is preparation for adding trace events for memory-failure
    recovery action.

    Signed-off-by: Xie XiuQi
    Acked-by: Naoya Horiguchi
    Cc: Chen Gong
    Cc: Jim Davis
    Cc: Steven Rostedt
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     
  • memory_failure() is not supposed to handle thp itself, but to split
    it. But if something goes wrong and page_action() is called on a thp,
    it is better for me_huge_page() (the action routine for hugepages) to
    take no action, rather than take the wrong action prepared for hugetlb
    (which triggers BUG_ON()).

    This change addresses a potential problem, but makes sense to me
    because thp is an actively developing feature and this code path could
    be opened up in the future.
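
    The guard can be sketched as a one-line bail-out (hedged; MF_DELAYED
    means "no action taken now", and hugetlb_action() stands in for the
    existing hugetlb handling):

    static int me_huge_page(struct page *p, unsigned long pfn)
    {
            struct page *hpage = compound_head(p);

            /* a thp reaching here is unexpected: take no action */
            if (!PageHuge(hpage))
                    return MF_DELAYED;

            return hugetlb_action(hpage, pfn);      /* hypothetical rest */
    }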

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Stress testing showed that soft offline events for a process iterating
    "mmap-pagefault-munmap" loop can trigger
    VM_BUG_ON(PAGE_FLAGS_CHECK_AT_PREP) in __free_one_page():

    Soft offlining page 0x70fe1 at 0x70100008d000
    Soft offlining page 0x705fb at 0x70300008d000
    page:ffffea0001c3f840 count:0 mapcount:0 mapping: (null) index:0x2
    flags: 0x1fffff80800000(hwpoison)
    page dumped because: VM_BUG_ON_PAGE(page->flags & ((1 << 25) - 1))
    ------------[ cut here ]------------
    kernel BUG at /src/linux-dev/mm/page_alloc.c:585!
    invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    Modules linked in: cfg80211 rfkill crc32c_intel microcode ppdev parport_pc pcspkr serio_raw virtio_balloon parport i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy
    CPU: 3 PID: 1779 Comm: test_base_madv_ Not tainted 4.0.0-v4.0-150511-1451-00009-g82360a3730e6 #139
    RIP: free_pcppages_bulk+0x52a/0x6f0
    Call Trace:
    drain_pages_zone+0x3d/0x50
    drain_local_pages+0x1d/0x30
    on_each_cpu_mask+0x46/0x80
    drain_all_pages+0x14b/0x1e0
    soft_offline_page+0x432/0x6e0
    SyS_madvise+0x73c/0x780
    system_call_fastpath+0x12/0x17
    Code: ff 89 45 b4 48 8b 45 c0 48 83 b8 a8 00 00 00 00 0f 85 e3 fb ff ff 0f 1f 00 0f 0b 48 8b 7d 90 48 c7 c6 e8 95 a6 81 e8 e6 32 02 00 0b 8b 45 cc 49 89 47 30 41 8b 47 18 83 f8 ff 0f 85 10 ff ff
    RIP [] free_pcppages_bulk+0x52a/0x6f0
    RSP
    ---[ end trace 53926436e76d1f35 ]---

    When soft offline successfully migrates a page, the source page is
    supposed to be freed. But there is a race condition where a source
    page looks isolated (i.e. the refcount is 0 and PageHWPoison is set)
    but is somehow still linked to a pcplist. Then another soft offline
    event calls drain_all_pages() and tries to free such a hwpoisoned
    page, which is forbidden.

    This odd page state seems to happen due to the race between put_page() in
    putback_lru_page() and __pagevec_lru_add_fn(). But I don't want to play
    with tweaking drain code as done in commit 9ab3b598d2df "mm: hwpoison:
    drop lru_add_drain_all() in __soft_offline_page()", or to change page
    freeing code for this soft offline's purpose.

    Instead, let's think about the difference between hard offline and soft
    offline. There is an interesting difference in how to isolate the in-use
    page between these, that is, hard offline marks PageHWPoison of the target
    page at first, and doesn't free it by keeping its refcount 1. OTOH, soft
    offline tries to free the target page then marks PageHWPoison. This
    difference might be the source of complexity and result in bugs like the
    above. So making soft offline isolate the page while keeping its
    refcount can be a solution to this problem.

    We can pass the "reason" identifying the caller to the page migration
    code, so let's use it to avoid calling putback_lru_page() when called
    from soft offline, which effectively performs the isolation for soft
    offline. With this change, target pages of soft offline are never
    reused without changing migratetype, so this patch also removes the
    related code.
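
    The core of the change can be sketched like this (simplified):

    /* in unmap_and_move(), when putting back the source page */
    if (reason != MR_MEMORY_FAILURE)
            putback_lru_page(page);
    /* soft offline keeps its reference, completing the isolation */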

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • memory_failure() can run in two different modes (selected by
    MF_COUNT_INCREASED) from the page refcount perspective. When
    MF_COUNT_INCREASED is set, memory_failure() assumes that the caller
    has taken a refcount on the target page; when it is cleared,
    memory_failure() takes one on its own.

    In current code, however, refcounting is done differently in each caller.
    For example, madvise_hwpoison() uses get_user_pages_fast() and
    hwpoison_inject() uses get_page_unless_zero(). So this inconsistent
    refcounting causes refcount failure especially for thp tail pages.
    Typical user visible effects are like memory leak or
    VM_BUG_ON_PAGE(!page_count(page)) in isolate_lru_page().

    To fix this refcounting issue, this patch introduces get_hwpoison_page()
    to handle thp tail pages in the same manner for each caller of hwpoison
    code.
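
    A simplified sketch of the idea behind get_hwpoison_page() (hedged;
    the real function has more diagnostics):

    int get_hwpoison_page(struct page *page)
    {
            struct page *head = compound_head(page);

            if (PageHuge(head))
                    return get_page_unless_zero(head);

            /* thp tail pages can only be pinned via their head page */
            if (PageTransHuge(head))
                    return get_page_unless_zero(head);

            return get_page_unless_zero(page);
    }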

    memory_failure() might fail to split a thp, in which case it returns
    without completing page isolation. This is not good because
    PageHWPoison on the thp is still set and there's no easy way to
    unpoison such thps. So this patch tries to roll back any action on the
    thp in the "non anonymous thp" and "thp split failed" cases, expecting
    that an MCE(SRAR) generated by a later access will properly free such
    thps.

    [akpm@linux-foundation.org: fix CONFIG_HWPOISON_INJECT=m]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • memory_failure() doesn't handle thp itself at this time and needs to
    split it before doing isolation. Currently thp is split in the middle
    of hwpoison_user_mappings(), but there are corner cases where
    memory_failure() wrongly tries to handle thp without splitting.

    1) A "non anonymous" thp, which is not a normal operating mode of thp,
    but a memory error could hit a thp before its anon_vma is initialized.
    In such a case, split_huge_page() fails and me_huge_page() (intended
    for hugetlb) is called for the thp, which triggers BUG_ON in
    page_hstate().

    2) The !PageLRU case, where hwpoison_user_mappings() returns with
    SWAP_SUCCESS and the result is the same as case 1.

    memory_failure() can't avoid splitting, so let's split earlier, which
    also reduces the code that must be prepared for both normal pages and
    thp.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • All the items mentioned here have been either addressed, or were not
    really needed. So just remove the comment.

    Signed-off-by: Andi Kleen
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Here's another comment fix for hwpoison.

    It describes the "guiding principle" on when to add new
    memory error recovery code.

    Signed-off-by: Andi Kleen
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

06 May, 2015

2 commits

  • If multiple soft offline events hit one free page/hugepage
    concurrently, soft_offline_page() can handle the free page/hugepage
    multiple times, which makes the num_poisoned_pages counter increase
    more than once. This patch fixes this wrong counting by checking
    TestSetPageHWPoison for normal pages and by checking the return value
    of dequeue_hwpoisoned_huge_page() for hugepages.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Dean Nelson
    Cc: Andi Kleen
    Cc: [3.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently memory_failure() calls shake_page() to sweep pages out from
    pcplists only when the victim page is 4kB LRU page or thp head page.
    But we should do this for a thp tail page too.

    Consider a memory error that hits a thp tail page whose head page is
    on a pcplist when memory_failure() runs. The current kernel then skips
    the shake_page() part, so hwpoison_user_mappings() returns without
    calling split_huge_page() or try_to_unmap(), because PageLRU of the
    thp head is still cleared due to the skipped shake_page().

    As a result, me_huge_page() runs for the thp, which is broken behavior.

    One effect is a leak of the thp. Another is a failure to contain the
    memory error, so a later access to the error address causes another
    MCE, which kills the processes that used the thp.

    This patch fixes the problem by calling shake_page() for the thp tail
    case as well.

    Fixes: 385de35722c9 ("thp: allow a hwpoisoned head page to be put back to LRU")
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Acked-by: Dean Nelson
    Cc: Andrea Arcangeli
    Cc: Hidetoshi Seto
    Cc: Jin Dongming
    Cc: [3.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

16 Apr, 2015

2 commits

  • We are not safe from calling isolate_huge_page() on a hugepage
    concurrently, which can leave the victim hugepage in an invalid state
    and result in BUG_ON().

    The root problem is that we don't have any easily accessible
    information on struct page about a hugepage's activeness. Note that a
    hugepage's activeness means just being linked to
    hstate->hugepage_activelist, which is not the same as a normal page's
    activeness represented by the PageActive flag.

    Normal pages are isolated by isolate_lru_page(), which prechecks
    PageLRU before isolation, so let's do similarly for hugetlb with a new
    page_huge_active().

    set/clear_page_huge_active() should be called within hugetlb_lock. But
    hugetlb_cow() and hugetlb_no_page() don't do this, which is justified
    because in these functions set_page_huge_active() is called right
    after the hugepage is allocated and no other thread tries to isolate
    it.
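
    The activeness test and setter can be sketched like this (hedged; the
    flag is parked on the first tail page because the head page's flags
    are already spoken for):

    bool page_huge_active(struct page *page)
    {
            VM_BUG_ON_PAGE(!PageHuge(page), page);
            return PageHead(page) && PagePrivate(&page[1]);
    }

    static void set_page_huge_active(struct page *page)
    {
            VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
            SetPagePrivate(&page[1]);
    }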

    [akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/, make it return bool]
    [fengguang.wu@intel.com: set_page_huge_active() can be static]
    Signed-off-by: Naoya Horiguchi
    Cc: Hugh Dickins
    Reviewed-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • This cleanup patch moves all strings passed to action_result() into a
    single array, action_page_types, so that a reader can easily find
    which kinds of action results are possible. It also fixes the odd
    lines that used to be printed out, like "unknown page state page" or
    "free buddy, 2nd try page".

    [akpm@linux-foundation.org: rename messages, per David]
    [akpm@linux-foundation.org: s/DIRTY_UNEVICTABLE_LRU/CLEAN_UNEVICTABLE_LRU', per Andi]
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Cc: Tony Luck
    Cc: "Xie XiuQi"
    Cc: Steven Rostedt
    Cc: Chen Gong
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

13 Feb, 2015

2 commits

  • A race condition has started to be visible in recent mmotm, where a
    PG_hwpoison flag is set on a migration source page *before* it is back
    in the buddy page pool.

    This is problematic because no page flag is supposed to be set when
    freeing (see __free_one_page()). So the user-visible effect of this
    race is that it can trigger the BUG_ON() when soft-offlining is
    called.

    The root cause is that we call lru_add_drain_all() to make sure that
    the page is in buddy, but that doesn't work because this function just
    schedules a work item and doesn't wait for its completion.
    drain_all_pages() does the draining directly, so simply dropping
    lru_add_drain_all() solves this problem.

    Fixes: f15bdfa802bf ("mm/memory-failure.c: fix memory leak in successful soft offlining")
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Chen Gong
    Cc: [3.11+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • This patch adds SHRINKER_MEMCG_AWARE flag. If a shrinker has this flag
    set, it will be called per memory cgroup. The memory cgroup to scan
    objects from is passed in shrink_control->memcg. If the memory cgroup
    is NULL, a memcg aware shrinker is supposed to scan objects from the
    global list. Unaware shrinkers are only called on global pressure with
    memcg=NULL.
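
    A memcg-aware shrinker registration then looks roughly like this
    (demo_count/demo_scan are hypothetical callbacks):

    static struct shrinker demo_shrinker = {
            .count_objects  = demo_count,   /* reads sc->memcg to pick a list */
            .scan_objects   = demo_scan,
            .seeks          = DEFAULT_SEEKS,
            .flags          = SHRINKER_MEMCG_AWARE,
    };

    /* at init time */
    register_shrinker(&demo_shrinker);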

    Signed-off-by: Vladimir Davydov
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

14 Dec, 2014

3 commits

  • The slab shrinkers are currently invoked from the zonelist walkers in
    kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
    eligible LRU pages and assemble a nodemask to pass to NUMA-aware
    shrinkers, which then again have to walk over the nodemask. This is
    redundant code, extra runtime work, and fairly inaccurate when it comes to
    the estimation of actually scannable LRU pages. The code duplication will
    only get worse when making the shrinkers cgroup-aware and requiring them
    to have out-of-band cgroup hierarchy walks as well.

    Instead, invoke the shrinkers from shrink_zone(), which is where all
    reclaimers end up, to avoid this duplication.

    Take the count for eligible LRU pages out of get_scan_count(), which
    considers many more factors than just the availability of swap space, like
    zone_reclaimable_pages() currently does. Accumulate the number over all
    visited lruvecs to get the per-zone value.

    Some nodes have multiple zones due to memory addressing restrictions. To
    avoid putting too much pressure on the shrinkers, only invoke them once
    for each such node, using the class zone of the allocation as the pivot
    zone.

    For now, this integrates the slab shrinking better into the reclaim logic
    and gets rid of duplicative invocations from kswapd, direct reclaim, and
    zone reclaim. It also prepares for cgroup-awareness, allowing
    memcg-capable shrinkers to be added at the lruvec level without much
    duplication of both code and runtime work.

    This changes kswapd behavior, which used to invoke the shrinkers for each
    zone, but with scan ratios gathered from the entire node, resulting in
    meaningless pressure quantities on multi-zone nodes.

    Zone reclaim behavior also changes. It used to shrink slabs until the
    same amount of pages were shrunk as were reclaimed from the LRUs. Now it
    merely invokes the shrinkers once with the zone's scan ratio, which makes
    the shrinkers go easier on caches that implement aging and would prefer
    feeding back pressure from recently used slab objects to unused LRU pages.

    [vdavydov@parallels.com: assure class zone is populated]
    Signed-off-by: Johannes Weiner
    Cc: Dave Chinner
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • No-brainer conversion: collect_procs_file() only schedules a process
    for later kill; share the lock, similarly to the anon vma variant.

    Signed-off-by: Davidlohr Bueso
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Convert all open coded mutex_lock/unlock calls to the
    i_mmap_[lock/unlock]_write() helpers.

    Signed-off-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

11 Dec, 2014

3 commits

  • Merge first patchbomb from Andrew Morton:
    - a few minor cifs fixes
    - dma-debug updates
    - ocfs2
    - slab
    - about half of MM
    - procfs
    - kernel/exit.c
    - panic.c tweaks
    - printk updates
    - lib/ updates
    - checkpatch updates
    - fs/binfmt updates
    - the drivers/rtc tree
    - nilfs
    - kmod fixes
    - more kernel/exit.c
    - various other misc tweaks and fixes

    * emailed patches from Andrew Morton : (190 commits)
    exit: pidns: fix/update the comments in zap_pid_ns_processes()
    exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
    exit: exit_notify: re-use "dead" list to autoreap current
    exit: reparent: call forget_original_parent() under tasklist_lock
    exit: reparent: avoid find_new_reaper() if no children
    exit: reparent: introduce find_alive_thread()
    exit: reparent: introduce find_child_reaper()
    exit: reparent: document the ->has_child_subreaper checks
    exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper()
    exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting
    exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
    exit: proc: don't try to flush /proc/tgid/task/tgid
    exit: release_task: fix the comment about group leader accounting
    exit: wait: drop tasklist_lock before psig->c* accounting
    exit: wait: don't use zombie->real_parent
    exit: wait: cleanup the ptrace_reparented() checks
    usermodehelper: kill the kmod_thread_locker logic
    usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper()
    fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp
    nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
    ...

    Linus Torvalds
     
  • Memory hotplug and the memory failure mechanism have several places
    where pcplists are drained so that pages are returned to the buddy
    allocator and can be e.g. prepared for offlining. This is always done
    in the context of a single zone, so we can now reduce each pcplists
    drain to that single zone.

    This change should make memory offlining due to hotremove or failure
    faster, and stop disturbing unrelated pcplists.

    Signed-off-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Xishi Qiu
    Cc: Vladimir Davydov
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The functions for draining per-cpu pages back to buddy allocators
    currently always operate on all zones. There are however several cases
    where the drain is only needed in the context of a single zone, and
    spilling other pcplists is a waste of time both due to the extra
    spilling and later refilling.

    This patch introduces new zone pointer parameter to drain_all_pages()
    and changes the dummy parameter of drain_local_pages() to be also a zone
    pointer. When NULL is passed, the functions operate on all zones as
    usual. Passing a specific zone pointer reduces the work to the single
    zone.

    All callers are updated to pass the NULL pointer in this patch.
    Conversion to single zone (where appropriate) is done in further
    patches.
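
    Usage after this patch, for illustration:

    drain_all_pages(NULL);              /* old behavior: drain every zone */
    drain_all_pages(page_zone(page));   /* new: only the page's own zone */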

    Signed-off-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Xishi Qiu
    Cc: Vladimir Davydov
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

22 Oct, 2014

1 commit

  • When an uncorrected error happens, if the poisoned page is referenced
    by more than one user after error recovery, the recovery is not
    successful. But currently the displayed result is wrong.
    Before this patch:

    MCE 0x44e336: dirty mlocked LRU page recovery: Recovered
    MCE 0x44e336: dirty mlocked LRU page still referenced by 1 users
    mce: Memory error not recovered

    After this patch:

    MCE 0x44e336: dirty mlocked LRU page recovery: Failed
    MCE 0x44e336: dirty mlocked LRU page still referenced by 1 users
    mce: Memory error not recovered

    Signed-off-by: Chen, Gong
    Link: http://lkml.kernel.org/r/1406530260-26078-3-git-send-email-gong.chen@linux.intel.com
    Acked-by: Naoya Horiguchi
    Acked-by: Tony Luck
    Signed-off-by: Borislav Petkov

    Chen, Gong
     

19 Sep, 2014

1 commit


07 Aug, 2014

1 commit

  • When a hwpoison page is locked it can change state due to parallel
    modifications. The original compound page can be torn down, after
    which this 4k page becomes part of a differently-sized compound page
    or is a standalone regular page.

    Check after taking the lock whether the page is still the same
    compound page.
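
    Sketched, the recheck looks like this (simplified; the real code also
    reports the bail-out via action_result()):

    lock_page(hpage);
    if (compound_head(p) != hpage) {
            /* the compound page changed under us: bail out */
            unlock_page(hpage);
            put_page(hpage);
            return -EBUSY;
    }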

    We could go back, grab the new head page and try again but it should be
    quite rare, so I thought this was safest. A retry loop would be more
    difficult to test and may have more side effects.

    The hwpoison code by design only tries to handle cases that are
    reasonably common in workloads, as visible in page-flags.

    I'm not really that concerned about handling this (likely rare case),
    just not crashing on it.

    Signed-off-by: Andi Kleen
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

31 Jul, 2014

2 commits

  • hwpoison_user_mappings() could fail for various reasons, so printk()s to
    print out the reasons should be done in each failure check inside
    hwpoison_user_mappings().

    And currently we don't call action_result() when hwpoison_user_mappings()
    fails, which is not consistent with other exit points of memory error
    handler. So this patch fixes these messaging problems.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Chen Yucong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • A recent fix from Chen Yucong, commit 0bc1f8b0682c ("hwpoison: fix the
    handling path of the victimized page frame that belong to non-LRU"),
    rejects going into the unmapping operation for hugetlbfs/thp pages,
    which results in failing error containment on such pages. This patch
    fixes it.

    With this patch, hwpoison functional tests in mce-test testsuite pass.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Chen Yucong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

24 Jul, 2014

1 commit

  • I triggered the VM_BUG_ON() in vma_address() when I tried to migrate
    an anonymous hugepage with mbind() on kernel v3.16-rc3. This is
    because the pgoff calculation in rmap_walk_anon() fails to consider
    compound_order() and ends up with an incorrect value.

    This patch introduces page_to_pgoff(), which gets the page's offset in
    PAGE_CACHE_SIZE units.

    Kirill pointed out that the page cache tree should natively handle
    hugepages, and that in order to make hugetlbfs fit it, page->index of
    a hugetlbfs page should be in PAGE_CACHE_SIZE units. This is beyond
    this patch, but page_to_pgoff() contains the point to be fixed in a
    single function.
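
    A sketch of the helper (hedged; a hugetlbfs page->index is in
    huge-page-size units, so it has to be scaled up):

    static inline pgoff_t page_to_pgoff(struct page *page)
    {
            if (unlikely(PageHeadHuge(page)))
                    return page->index << compound_order(page);
            return page->index >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
    }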

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: Joonsoo Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

04 Jul, 2014

1 commit

  • Until now, the kernel has had the same policy for handling victimized
    page frames that belong to kernel-space (reserved/slab-subsystem) or
    are non-LRU (unknown page state). In other words, the result of
    handling either kind of victimized page frame is (IGNORED | FAILED),
    and the return value of memory_failure() is -EBUSY.

    This patch avoids memory_failure() returning very early because
    (!PageLRU(p)) is "true", and it also ensures that action_result() can
    report more precise information ("reserved kernel", "kernel slab", and
    "unknown page state") instead of "non LRU", especially for memory
    errors detected by memory scrubbing.

    Andi said:

    : While running the mcelog test suite on 3.14 I hit the following VM_BUG_ON:
    :
    : soft_offline: 0x56d4: unknown non LRU page type 3ffff800008000
    : page:ffffea000015b400 count:3 mapcount:2097169 mapping: (null) index:0xffff8800056d7000
    : page flags: 0x3ffff800004081(locked|slab|head)
    : ------------[ cut here ]------------
    : kernel BUG at mm/rmap.c:1495!
    :
    : I think what happened is that a LRU page turned into a slab page in
    : parallel with offlining. memory_failure initially tests for this case,
    : but doesn't retest later after the page has been locked.
    :
    : ...
    :
    : I ran this patch in a loop over night with some stress plus
    : the mcelog test suite running in a loop. I cannot guarantee it hit it,
    : but it should have given it a good beating.
    :
    : The kernel survived with no messages, although the mcelog test suite
    : got killed at some point because it couldn't fork anymore. Probably
    : some unrelated problem.
    :
    : So the patch is ok for me for .16.

    Signed-off-by: Chen Yucong
    Acked-by: Naoya Horiguchi
    Reported-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     

05 Jun, 2014

5 commits

  • Currently memory error handler handles action optional errors in the
    deferred manner by default. And if a recovery aware application wants
    to handle it immediately, it can do it by setting PF_MCE_EARLY flag.
    However, such a signal can be sent only to the main thread, so it's
    problematic if the application wants to have a dedicated thread to
    handle such signals.

    So this patch adds dedicated thread support to memory error handler. We
    have PF_MCE_EARLY flags for each thread separately, so with this patch
    AO signal is sent to the thread with PF_MCE_EARLY flag set, not the main
    thread. If you want to implement a dedicated thread, you call prctl()
    to set PF_MCE_EARLY on the thread.

    Memory error handler collects processes to be killed, so this patch lets
    it check PF_MCE_EARLY flag on each thread in the collecting routines.

    No behavioral change for all non-early kill cases.
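
    A dedicated handler thread would opt in like this (plain user-space
    usage; the thread is assumed to have installed its own SIGBUS
    handler):

    #include <sys/prctl.h>

    /* mark *this* thread as an early-kill recipient (PF_MCE_EARLY) */
    if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0) < 0)
            perror("prctl");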

    Tony said:

    : The old behavior was crazy - someone with a multithreaded process might
    : well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
    : that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if
    : that thread wasn't the main thread for the process.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Tony Luck
    Cc: Kamil Iskra
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When Linux sees an "action optional" machine check (where h/w has reported
    an error that is not in the current execution path) we generally do not
    want to signal a process, since most processes do not have a SIGBUS
    handler - we'd just prematurely terminate the process for a problem that
    they might never actually see.

    task_early_kill() decides whether to consider a process - and it checks
    whether this specific process has been marked for early signals with
    "prctl", or if the system administrator has requested early signals for
    all processes using /proc/sys/vm/memory_failure_early_kill.

    But in the MF_ACTION_REQUIRED case we must not defer. The error is in
    the execution path of the current thread, so we must send the SIGBUS
    immediately.

    Fix by passing a flag argument through collect_procs*() to
    task_early_kill() so it knows whether we can defer or must take action.

    Signed-off-by: Tony Luck
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Luck
     
  • When a thread in a multi-threaded application hits a machine check because
    of an uncorrectable error in memory - we want to send the SIGBUS with
    si.si_code = BUS_MCEERR_AR to that thread. Currently we fail to do that
    if the active thread is not the primary thread in the process.
    collect_procs() just finds primary threads and this test:

    if ((flags & MF_ACTION_REQUIRED) && t == current) {

    will see that the thread we found isn't the current thread and so send a
    si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active
    thread at this time).

    We can fix this by checking whether "current" shares the same mm with the
    process that collect_procs() said owned the page. If so, we send the
    SIGBUS to current (with code BUS_MCEERR_AR).
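
    The essence of the fix, sketched (not the verbatim hunk):

    /* prefer the thread that actually hit the error */
    if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm)
            t = current;    /* gets SIGBUS with si_code = BUS_MCEERR_AR */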

    Signed-off-by: Tony Luck
    Signed-off-by: Naoya Horiguchi
    Reported-by: Otto Bruggeman
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Luck
     
  • The comment about pages under writeback is far from the relevant code, so
    let's move it to the right place.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Memory migration uses a callback defined by the caller to determine how to
    allocate destination pages. When migration fails for a source page,
    however, it frees the destination page back to the system.

    This patch adds a memory migration callback defined by the caller to
    determine how to free destination pages. If a caller, such as memory
    compaction, builds its own freelist for migration targets, this can reuse
    already freed memory instead of scanning additional memory.

    If the caller provides a function to handle freeing of destination pages,
    it is called when page migration fails. If the caller passes NULL then
    freeing back to the system will be handled as usual. This patch
    introduces no functional change.
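
    After this patch the migration entry point carries both callbacks,
    roughly:

    int migrate_pages(struct list_head *from, new_page_t get_new_page,
                      free_page_t put_new_page, unsigned long private,
                      enum migrate_mode mode, int reason);

    /* pass NULL for put_new_page to keep the old free-to-system behavior */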

    Signed-off-by: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes