15 Aug, 2015

3 commits

  • Bug:

    ------------[ cut here ]------------
    kernel BUG at mm/huge_memory.c:1957!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 snd_hda_codec_realtek snd_hda_codec_generic nfsv4 dns_re
    CPU: 2 PID: 2576 Comm: test_huge Not tainted 4.2.0-rc5-mm1+ #27
    Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
    task: ffff880204e3d600 ti: ffff8800db16c000 task.ti: ffff8800db16c000
    RIP: split_huge_page_to_list+0xdb/0x120
    Call Trace:
    memory_failure+0x32e/0x7c0
    madvise_hwpoison+0x8b/0x160
    SyS_madvise+0x40/0x240
    ? do_page_fault+0x37/0x90
    entry_SYSCALL_64_fastpath+0x12/0x71
    Code: ff f0 41 ff 4c 24 30 74 0d 31 c0 48 83 c4 08 5b 41 5c 41 5d c9 c3 4c 89 e7 e8 e2 58 fd ff 48 83 c4 08 31 c0
    RIP split_huge_page_to_list+0xdb/0x120
    RSP
    ---[ end trace aee7ce0df8e44076 ]---

    Testcase:

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/mman.h>

    #define MB (1024 * 1024)

    int main(void)
    {
            char *mem;

            /* 200MB anonymous buffer, aligned to the 2MB huge page size */
            posix_memalign((void **)&mem, 2 * MB, 200 * MB);

            /* poison injection; needs CAP_SYS_ADMIN and CONFIG_MEMORY_FAILURE */
            madvise(mem, 200 * MB, MADV_HWPOISON);

            free(mem);

            return 0;
    }

    A huge zero page is allocated when a page fault happens without the
    FAULT_FLAG_WRITE flag. get_user_pages_fast(), which is called in
    madvise_hwpoison(), will get the huge zero page if the page has not been
    allocated before. The huge zero page is a transparent huge page;
    however, it is not an anonymous page. memory_failure() will split the
    huge zero page and trigger

    BUG_ON(is_huge_zero_page(page));

    After commit 98ed2b0052e6 ("mm/memory-failure: give up error handling
    for non-tail-refcounted thp"), memory_failure() no longer catches a
    non-anon thp coming from the madvise_hwpoison() path, and this bug
    occurs.

    Fix it by catching non-anon thp in memory_failure() so that the huge
    zero page is never split in the madvise_hwpoison() path.
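
    The shape of the fix, as a minimal sketch (simplified, not the exact
    hunk; the pr_err() text matches the log below):

    /* in memory_failure(), before attempting to split the thp */
    if (!PageHuge(p) && PageTransHuge(hpage)) {
            if (!PageAnon(hpage)) {
                    pr_err("MCE: %#lx: non anonymous thp\n", pfn);
                    put_page(p);
                    return -EBUSY;
            }
    }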

    After this patch:

    Injecting memory failure for page 0x202800 at 0x7fd8ae800000
    MCE: 0x202800: non anonymous thp
    [...]

    [akpm@linux-foundation.org: remove second split, per Wanpeng]
    Signed-off-by: Wanpeng Li
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Hugetlbfs pages get a refcount in get_any_page() or madvise_hwpoison()
    when soft offlining through madvise(). The refcount held by the soft
    offline path should be released if we fail to isolate the hugetlbfs
    page.

    Fix it by dropping the refcount on both isolation success and failure.
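
    A minimal sketch of the fixed isolation path (assuming the caller
    already holds the reference from get_any_page()/madvise_hwpoison()):

    ret = isolate_huge_page(hpage, &pagelist);
    /*
     * The soft offline path took a refcount for us; drop it whether
     * or not the isolation succeeded.
     */
    put_page(hpage);
    if (!ret) {
            pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn);
            return -EBUSY;
    }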

    Signed-off-by: Wanpeng Li
    Acked-by: Naoya Horiguchi
    Cc: [3.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • After trying to drain pages from the pagevec/pageset, we take the
    page's reference count again; however, that reference is not dropped
    if the page is still not on the LRU list.

    Fix it by adding a put_page() to drop the page reference taken in
    __get_any_page().
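
    A sketch of the resulting logic in the soft offline path (simplified
    from the description above, not the verbatim patch):

    /* after drain_all_pages(), retry taking the page reference */
    ret = __get_any_page(page, pfn, 0);
    if (ret == 1 && !PageLRU(page)) {
            /* drop the page reference which is from __get_any_page() */
            put_page(page);
            pr_info("soft_offline: %#lx: unknown non LRU page type %lx\n",
                    pfn, page->flags);
            return -EIO;
    }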

    Signed-off-by: Wanpeng Li
    Acked-by: Naoya Horiguchi
    Cc: [3.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

07 Aug, 2015

4 commits

  • Now the page freeing code doesn't consider PageHWPoison a bad page, so
    by setting it before completing the page containment, we can prevent
    the error page from being reused right after a successful page
    migration.

    I added TTU_IGNORE_HWPOISON for try_to_unmap() to make sure that the
    page table entry is transformed into a migration entry, not into a
    hwpoison entry.
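
    For illustration, the unmap call in the migration path then looks
    roughly like this (a sketch, not the exact hunk):

    /* force a migration entry, never a hwpoison entry */
    try_to_unmap(page, TTU_MIGRATION | TTU_IGNORE_MLOCK |
                       TTU_IGNORE_ACCESS | TTU_IGNORE_HWPOISON);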

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • "non anonymous thp" case is still racy with freeing thp, which causes
    panic due to put_page() for refcount-0 page. It seems that closing up
    this race might be hard (and/or not worth doing,) so let's give up the
    error handling for this case.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When memory_failure() is called on a page which has just been freed
    after page migration from soft offlining, the counter
    num_poisoned_pages is raised twice. So let's fix it by using
    TestSetPageHWPoison.
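
    The counting then becomes idempotent, roughly:

    /* only the first poisoner bumps the counter */
    if (!TestSetPageHWPoison(page))
            atomic_long_inc(&num_poisoned_pages);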

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Recently I addressed a few hwpoison race problems and the patches were
    merged in v4.2-rc1. That made progress, but unfortunately some
    problems remain due to gaps in my test coverage. So I'm trying to fix
    or avoid them in this series.

    One point I'm expecting to discuss is that patch 4/5 changes the set
    of page flags checked at free time. In the current behavior,
    __PG_HWPOISON is not supposed to be set when the page is freed. I
    think there is no strong reason for this behavior, and it causes a
    problem that is hard to fix on the error handler side alone (because
    __PG_HWPOISON can be set at arbitrary timing). So I suggest changing
    it.

    With this patchset, hwpoison stress testing in official mce-test
    testsuite (which previously failed) passes.

    This patch (of 5):

    In "just unpoisoned" path, we do put_page and then unlock_page, which is
    a wrong order and causes "freeing locked page" bug. So let's fix it.
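
    The corrected ordering, for clarity:

    /* never drop what may be the last reference to a locked page */
    unlock_page(page);
    put_page(page);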

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

25 Jun, 2015

9 commits

  • RAS user-space tools like rasdaemon, which are based on trace events,
    can receive an mce error event but no memory recovery result event,
    leaving the scenario incomplete. So I want to add this event to
    complete it.

    This patch adds an event to the ras group for memory-failure.

    The output looks like this:

    # tracer: nop
    #
    # entries-in-buffer/entries-written: 2/2   #P:24
    #
    #                              _-----=> irqs-off
    #                             / _----=> need-resched
    #                            | / _---=> hardirq/softirq
    #                            || / _--=> preempt-depth
    #                            ||| /     delay
    #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
    #              | |       |   ||||       |         |
    mce-inject-13150 [001] .... 277.019359: memory_failure_event: pfn 0x19869: recovery action for free buddy page: Delayed
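
    A sketch of what such a trace event definition looks like (the field
    names and show_* helpers here are illustrative, not the exact ones
    merged):

    TRACE_EVENT(memory_failure_event,
            TP_PROTO(unsigned long pfn, int type, int result),
            TP_ARGS(pfn, type, result),

            TP_STRUCT__entry(
                    __field(unsigned long, pfn)
                    __field(int, type)
                    __field(int, result)
            ),

            TP_fast_assign(
                    __entry->pfn    = pfn;
                    __entry->type   = type;
                    __entry->result = result;
            ),

            TP_printk("pfn %#lx: recovery action for %s: %s",
                      __entry->pfn,
                      show_action_type(__entry->type),     /* illustrative */
                      show_action_result(__entry->result)) /* illustrative */
    );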

    [xiexiuqi@huawei.com: fix build error]
    Signed-off-by: Xie XiuQi
    Reviewed-by: Naoya Horiguchi
    Acked-by: Steven Rostedt
    Cc: Tony Luck
    Cc: Chen Gong
    Cc: Jim Davis
    Signed-off-by: Xie XiuQi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     
  • Change the type of action_result()'s third parameter to the enum for
    type consistency, and rename mf_outcome to mf_result for clarity.

    Signed-off-by: Xie XiuQi
    Acked-by: Naoya Horiguchi
    Cc: Chen Gong
    Cc: Jim Davis
    Cc: Steven Rostedt
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     
  • Export 'outcome' and 'action_page_type' to mm.h, so we can use these
    enums outside.

    This patch is preparation for adding trace events for memory-failure
    recovery action.

    Signed-off-by: Xie XiuQi
    Acked-by: Naoya Horiguchi
    Cc: Chen Gong
    Cc: Jim Davis
    Cc: Steven Rostedt
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     
  • memory_failure() is not supposed to handle thp itself, but to split
    it. But if something goes wrong and page_action() is called on a thp,
    it is better for me_huge_page() (the action routine for hugepages) to
    take no action, rather than take the wrong action prepared for hugetlb
    (which triggers BUG_ON()).

    This change addresses a potential problem, but makes sense to me
    because thp is an actively developing feature and this code path could
    be opened up in the future.
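
    The guard can be sketched as a one-line bail-out (hedged; MF_DELAYED
    means "no action taken now", and hugetlb_action() stands in for the
    existing hugetlb handling):

    static int me_huge_page(struct page *p, unsigned long pfn)
    {
            struct page *hpage = compound_head(p);

            /* a thp reaching here is unexpected: take no action */
            if (!PageHuge(hpage))
                    return MF_DELAYED;

            return hugetlb_action(hpage, pfn);      /* hypothetical rest */
    }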

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Stress testing showed that soft offline events for a process iterating
    "mmap-pagefault-munmap" loop can trigger
    VM_BUG_ON(PAGE_FLAGS_CHECK_AT_PREP) in __free_one_page():

    Soft offlining page 0x70fe1 at 0x70100008d000
    Soft offlining page 0x705fb at 0x70300008d000
    page:ffffea0001c3f840 count:0 mapcount:0 mapping: (null) index:0x2
    flags: 0x1fffff80800000(hwpoison)
    page dumped because: VM_BUG_ON_PAGE(page->flags & ((1 << 25) - 1))
    ------------[ cut here ]------------
    kernel BUG at /src/linux-dev/mm/page_alloc.c:585!
    invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    Modules linked in: cfg80211 rfkill crc32c_intel microcode ppdev parport_pc pcspkr serio_raw virtio_balloon parport i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy
    CPU: 3 PID: 1779 Comm: test_base_madv_ Not tainted 4.0.0-v4.0-150511-1451-00009-g82360a3730e6 #139
    RIP: free_pcppages_bulk+0x52a/0x6f0
    Call Trace:
    drain_pages_zone+0x3d/0x50
    drain_local_pages+0x1d/0x30
    on_each_cpu_mask+0x46/0x80
    drain_all_pages+0x14b/0x1e0
    soft_offline_page+0x432/0x6e0
    SyS_madvise+0x73c/0x780
    system_call_fastpath+0x12/0x17
    Code: ff 89 45 b4 48 8b 45 c0 48 83 b8 a8 00 00 00 00 0f 85 e3 fb ff ff 0f 1f 00 0f 0b 48 8b 7d 90 48 c7 c6 e8 95 a6 81 e8 e6 32 02 00 0b 8b 45 cc 49 89 47 30 41 8b 47 18 83 f8 ff 0f 85 10 ff ff
    RIP [] free_pcppages_bulk+0x52a/0x6f0
    RSP
    ---[ end trace 53926436e76d1f35 ]---

    When soft offline successfully migrates a page, the source page is
    supposed to be freed. But there is a race condition where a source
    page looks isolated (i.e. the refcount is 0 and PageHWPoison is set)
    but is somehow still linked to a pcplist. Then another soft offline
    event calls drain_all_pages() and tries to free such a hwpoisoned
    page, which is forbidden.

    This odd page state seems to happen due to the race between put_page() in
    putback_lru_page() and __pagevec_lru_add_fn(). But I don't want to play
    with tweaking drain code as done in commit 9ab3b598d2df "mm: hwpoison:
    drop lru_add_drain_all() in __soft_offline_page()", or to change page
    freeing code for this soft offline's purpose.

    Instead, let's think about the difference between hard offline and soft
    offline. There is an interesting difference in how to isolate the in-use
    page between these, that is, hard offline marks PageHWPoison of the target
    page at first, and doesn't free it by keeping its refcount 1. OTOH, soft
    offline tries to free the target page then marks PageHWPoison. This
    difference might be the source of complexity and result in bugs like the
    above. So making soft offline isolate the page while keeping its
    refcount can be a solution to this problem.

    We can pass the "reason" identifying the caller to the page migration
    code, so let's use it to avoid calling putback_lru_page() when called
    from soft offline, which effectively performs the isolation for soft
    offline. With this change, target pages of soft offline are never
    reused without changing migratetype, so this patch also removes the
    related code.
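
    The core of the change can be sketched like this (simplified):

    /* in unmap_and_move(), when putting back the source page */
    if (reason != MR_MEMORY_FAILURE)
            putback_lru_page(page);
    /* soft offline keeps its reference, completing the isolation */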

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • memory_failure() can run in two different modes (selected by
    MF_COUNT_INCREASED) from the page refcount perspective. When
    MF_COUNT_INCREASED is set, memory_failure() assumes that the caller
    has taken a refcount on the target page; when it is cleared,
    memory_failure() takes one on its own.

    In current code, however, refcounting is done differently in each caller.
    For example, madvise_hwpoison() uses get_user_pages_fast() and
    hwpoison_inject() uses get_page_unless_zero(). So this inconsistent
    refcounting causes refcount failure especially for thp tail pages.
    Typical user visible effects are like memory leak or
    VM_BUG_ON_PAGE(!page_count(page)) in isolate_lru_page().

    To fix this refcounting issue, this patch introduces get_hwpoison_page()
    to handle thp tail pages in the same manner for each caller of hwpoison
    code.
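
    A simplified sketch of the idea behind get_hwpoison_page() (hedged;
    the real function has more diagnostics):

    int get_hwpoison_page(struct page *page)
    {
            struct page *head = compound_head(page);

            if (PageHuge(head))
                    return get_page_unless_zero(head);

            /* thp tail pages can only be pinned via their head page */
            if (PageTransHuge(head))
                    return get_page_unless_zero(head);

            return get_page_unless_zero(page);
    }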

    memory_failure() might fail to split a thp, in which case it returns
    without completing page isolation. This is not good because
    PageHWPoison on the thp is still set and there's no easy way to
    unpoison such thps. So this patch tries to roll back any action on the
    thp in the "non anonymous thp" and "thp split failed" cases, expecting
    that an MCE(SRAR) generated by a later access will properly free such
    thps.

    [akpm@linux-foundation.org: fix CONFIG_HWPOISON_INJECT=m]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • memory_failure() doesn't handle thp itself at this time and needs to
    split it before doing isolation. Currently thp is split in the middle
    of hwpoison_user_mappings(), but there are corner cases where
    memory_failure() wrongly tries to handle thp without splitting.

    1) A "non anonymous" thp, which is not a normal operating mode of thp,
    but a memory error could hit a thp before its anon_vma is initialized.
    In such a case, split_huge_page() fails and me_huge_page() (intended
    for hugetlb) is called for the thp, which triggers BUG_ON in
    page_hstate().

    2) The !PageLRU case, where hwpoison_user_mappings() returns with
    SWAP_SUCCESS and the result is the same as case 1.

    memory_failure() can't avoid splitting, so let's split earlier, which
    also reduces the code that must be prepared for both normal pages and
    thp.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • All the items mentioned here have been either addressed, or were not
    really needed. So just remove the comment.

    Signed-off-by: Andi Kleen
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Here's another comment fix for hwpoison.

    It describes the "guiding principle" on when to add new
    memory error recovery code.

    Signed-off-by: Andi Kleen
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

06 May, 2015

2 commits

  • If multiple soft offline events hit one free page/hugepage
    concurrently, soft_offline_page() can handle the free page/hugepage
    multiple times, which makes the num_poisoned_pages counter increase
    more than once. This patch fixes this wrong counting by checking
    TestSetPageHWPoison for normal pages and by checking the return value
    of dequeue_hwpoisoned_huge_page() for hugepages.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Dean Nelson
    Cc: Andi Kleen
    Cc: [3.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently memory_failure() calls shake_page() to sweep pages out from
    pcplists only when the victim page is 4kB LRU page or thp head page.
    But we should do this for a thp tail page too.

    Consider a memory error that hits a thp tail page whose head page is
    on a pcplist when memory_failure() runs. The current kernel then skips
    the shake_page() part, so hwpoison_user_mappings() returns without
    calling split_huge_page() or try_to_unmap(), because PageLRU of the
    thp head is still cleared due to the skipped shake_page().

    As a result, me_huge_page() runs for the thp, which is broken behavior.

    One effect is a leak of the thp. Another is a failure to contain the
    memory error, so a later access to the error address causes another
    MCE, which kills the processes that used the thp.

    This patch fixes the problem by calling shake_page() for the thp tail
    case as well.

    Fixes: 385de35722c9 ("thp: allow a hwpoisoned head page to be put back to LRU")
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Acked-by: Dean Nelson
    Cc: Andrea Arcangeli
    Cc: Hidetoshi Seto
    Cc: Jin Dongming
    Cc: [3.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

16 Apr, 2015

2 commits

  • We are not safe from calling isolate_huge_page() on a hugepage
    concurrently, which can leave the victim hugepage in an invalid state
    and result in BUG_ON().

    The root problem is that we don't have any easily accessible
    information on struct page about a hugepage's activeness. Note that a
    hugepage's activeness means just being linked to
    hstate->hugepage_activelist, which is not the same as a normal page's
    activeness represented by the PageActive flag.

    Normal pages are isolated by isolate_lru_page(), which prechecks
    PageLRU before isolation, so let's do similarly for hugetlb with a new
    page_huge_active().

    set/clear_page_huge_active() should be called within hugetlb_lock. But
    hugetlb_cow() and hugetlb_no_page() don't do this, which is justified
    because in these functions set_page_huge_active() is called right
    after the hugepage is allocated and no other thread tries to isolate
    it.
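
    The activeness test and setter can be sketched like this (hedged; the
    flag is parked on the first tail page because the head page's flags
    are already spoken for):

    bool page_huge_active(struct page *page)
    {
            VM_BUG_ON_PAGE(!PageHuge(page), page);
            return PageHead(page) && PagePrivate(&page[1]);
    }

    static void set_page_huge_active(struct page *page)
    {
            VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
            SetPagePrivate(&page[1]);
    }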

    [akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/, make it return bool]
    [fengguang.wu@intel.com: set_page_huge_active() can be static]
    Signed-off-by: Naoya Horiguchi
    Cc: Hugh Dickins
    Reviewed-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • This cleanup patch moves all strings passed to action_result() into a
    single array, action_page_types, so that a reader can easily find
    which kinds of action results are possible. It also fixes the odd
    lines that used to be printed out, like "unknown page state page" or
    "free buddy, 2nd try page".

    [akpm@linux-foundation.org: rename messages, per David]
    [akpm@linux-foundation.org: s/DIRTY_UNEVICTABLE_LRU/CLEAN_UNEVICTABLE_LRU', per Andi]
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Cc: Tony Luck
    Cc: "Xie XiuQi"
    Cc: Steven Rostedt
    Cc: Chen Gong
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

13 Feb, 2015

2 commits

  • A race condition has started to be visible in recent mmotm, where a
    PG_hwpoison flag is set on a migration source page *before* it is back
    in the buddy page pool.

    This is problematic because no page flag is supposed to be set when
    freeing (see __free_one_page()). So the user-visible effect of this
    race is that it can trigger the BUG_ON() when soft-offlining is
    called.

    The root cause is that we call lru_add_drain_all() to make sure that
    the page is in buddy, but that doesn't work because this function just
    schedules a work item and doesn't wait for its completion.
    drain_all_pages() does the draining directly, so simply dropping
    lru_add_drain_all() solves this problem.

    Fixes: f15bdfa802bf ("mm/memory-failure.c: fix memory leak in successful soft offlining")
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Chen Gong
    Cc: [3.11+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • This patch adds SHRINKER_MEMCG_AWARE flag. If a shrinker has this flag
    set, it will be called per memory cgroup. The memory cgroup to scan
    objects from is passed in shrink_control->memcg. If the memory cgroup
    is NULL, a memcg aware shrinker is supposed to scan objects from the
    global list. Unaware shrinkers are only called on global pressure with
    memcg=NULL.
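
    A memcg-aware shrinker registration then looks roughly like this
    (demo_count/demo_scan are hypothetical callbacks):

    static struct shrinker demo_shrinker = {
            .count_objects  = demo_count,   /* reads sc->memcg to pick a list */
            .scan_objects   = demo_scan,
            .seeks          = DEFAULT_SEEKS,
            .flags          = SHRINKER_MEMCG_AWARE,
    };

    /* at init time */
    register_shrinker(&demo_shrinker);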

    Signed-off-by: Vladimir Davydov
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

14 Dec, 2014

3 commits

  • The slab shrinkers are currently invoked from the zonelist walkers in
    kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
    eligible LRU pages and assemble a nodemask to pass to NUMA-aware
    shrinkers, which then again have to walk over the nodemask. This is
    redundant code, extra runtime work, and fairly inaccurate when it comes to
    the estimation of actually scannable LRU pages. The code duplication will
    only get worse when making the shrinkers cgroup-aware and requiring them
    to have out-of-band cgroup hierarchy walks as well.

    Instead, invoke the shrinkers from shrink_zone(), which is where all
    reclaimers end up, to avoid this duplication.

    Take the count for eligible LRU pages out of get_scan_count(), which
    considers many more factors than just the availability of swap space, like
    zone_reclaimable_pages() currently does. Accumulate the number over all
    visited lruvecs to get the per-zone value.

    Some nodes have multiple zones due to memory addressing restrictions. To
    avoid putting too much pressure on the shrinkers, only invoke them once
    for each such node, using the class zone of the allocation as the pivot
    zone.

    For now, this integrates the slab shrinking better into the reclaim logic
    and gets rid of duplicative invocations from kswapd, direct reclaim, and
    zone reclaim. It also prepares for cgroup-awareness, allowing
    memcg-capable shrinkers to be added at the lruvec level without much
    duplication of both code and runtime work.

    This changes kswapd behavior, which used to invoke the shrinkers for each
    zone, but with scan ratios gathered from the entire node, resulting in
    meaningless pressure quantities on multi-zone nodes.

    Zone reclaim behavior also changes. It used to shrink slabs until the
    same amount of pages were shrunk as were reclaimed from the LRUs. Now it
    merely invokes the shrinkers once with the zone's scan ratio, which makes
    the shrinkers go easier on caches that implement aging and would prefer
    feeding back pressure from recently used slab objects to unused LRU pages.

    [vdavydov@parallels.com: assure class zone is populated]
    Signed-off-by: Johannes Weiner
    Cc: Dave Chinner
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • No-brainer conversion: collect_procs_file() only schedules a process
    for later kill; share the lock, similarly to the anon vma variant.

    Signed-off-by: Davidlohr Bueso
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Convert all open coded mutex_lock/unlock calls to the
    i_mmap_[lock/unlock]_write() helpers.

    Signed-off-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

11 Dec, 2014

3 commits

  • Merge first patchbomb from Andrew Morton:
    - a few minor cifs fixes
    - dma-debug updates
    - ocfs2
    - slab
    - about half of MM
    - procfs
    - kernel/exit.c
    - panic.c tweaks
    - printk updates
    - lib/ updates
    - checkpatch updates
    - fs/binfmt updates
    - the drivers/rtc tree
    - nilfs
    - kmod fixes
    - more kernel/exit.c
    - various other misc tweaks and fixes

    * emailed patches from Andrew Morton : (190 commits)
    exit: pidns: fix/update the comments in zap_pid_ns_processes()
    exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
    exit: exit_notify: re-use "dead" list to autoreap current
    exit: reparent: call forget_original_parent() under tasklist_lock
    exit: reparent: avoid find_new_reaper() if no children
    exit: reparent: introduce find_alive_thread()
    exit: reparent: introduce find_child_reaper()
    exit: reparent: document the ->has_child_subreaper checks
    exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper()
    exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting
    exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
    exit: proc: don't try to flush /proc/tgid/task/tgid
    exit: release_task: fix the comment about group leader accounting
    exit: wait: drop tasklist_lock before psig->c* accounting
    exit: wait: don't use zombie->real_parent
    exit: wait: cleanup the ptrace_reparented() checks
    usermodehelper: kill the kmod_thread_locker logic
    usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper()
    fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp
    nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
    ...

    Linus Torvalds
     
  • Memory hotplug and the memory failure mechanism have several places
    where pcplists are drained so that pages are returned to the buddy
    allocator and can be e.g. prepared for offlining. This is always done
    in the context of a single zone, so we can now reduce each pcplists
    drain to that single zone.

    This change should make memory offlining due to hotremove or failure
    faster, and stop disturbing unrelated pcplists.

    Signed-off-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Xishi Qiu
    Cc: Vladimir Davydov
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The functions for draining per-cpu pages back to buddy allocators
    currently always operate on all zones. There are however several cases
    where the drain is only needed in the context of a single zone, and
    spilling other pcplists is a waste of time both due to the extra
    spilling and later refilling.

    This patch introduces new zone pointer parameter to drain_all_pages()
    and changes the dummy parameter of drain_local_pages() to be also a zone
    pointer. When NULL is passed, the functions operate on all zones as
    usual. Passing a specific zone pointer reduces the work to the single
    zone.

    All callers are updated to pass the NULL pointer in this patch.
    Conversion to single zone (where appropriate) is done in further
    patches.
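
    Usage after this patch, for illustration:

    drain_all_pages(NULL);              /* old behavior: drain every zone */
    drain_all_pages(page_zone(page));   /* new: only the page's own zone */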

    Signed-off-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Xishi Qiu
    Cc: Vladimir Davydov
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

22 Oct, 2014

1 commit

  • When an uncorrected error happens, if the poisoned page is referenced
    by more than one user after error recovery, the recovery is not
    successful. But currently the displayed result is wrong.
    Before this patch:

    MCE 0x44e336: dirty mlocked LRU page recovery: Recovered
    MCE 0x44e336: dirty mlocked LRU page still referenced by 1 users
    mce: Memory error not recovered

    After this patch:

    MCE 0x44e336: dirty mlocked LRU page recovery: Failed
    MCE 0x44e336: dirty mlocked LRU page still referenced by 1 users
    mce: Memory error not recovered

    Signed-off-by: Chen, Gong
    Link: http://lkml.kernel.org/r/1406530260-26078-3-git-send-email-gong.chen@linux.intel.com
    Acked-by: Naoya Horiguchi
    Acked-by: Tony Luck
    Signed-off-by: Borislav Petkov

    Chen, Gong
     

19 Sep, 2014

1 commit


07 Aug, 2014

1 commit

  • When a hwpoison page is locked it can change state due to parallel
    modifications. The original compound page can be torn down, after
    which this 4k page becomes part of a differently-sized compound page
    or is a standalone regular page.

    Check after taking the lock whether the page is still the same
    compound page.
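
    Sketched, the recheck looks like this (simplified; the real code also
    reports the bail-out via action_result()):

    lock_page(hpage);
    if (compound_head(p) != hpage) {
            /* the compound page changed under us: bail out */
            unlock_page(hpage);
            put_page(hpage);
            return -EBUSY;
    }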

    We could go back, grab the new head page and try again but it should be
    quite rare, so I thought this was safest. A retry loop would be more
    difficult to test and may have more side effects.

    The hwpoison code by design only tries to handle cases that are
    reasonably common in workloads, as visible in page-flags.

    I'm not really that concerned about handling this (likely rare case),
    just not crashing on it.

    Signed-off-by: Andi Kleen
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

31 Jul, 2014

2 commits

  • hwpoison_user_mappings() could fail for various reasons, so printk()s to
    print out the reasons should be done in each failure check inside
    hwpoison_user_mappings().

    And currently we don't call action_result() when hwpoison_user_mappings()
    fails, which is not consistent with other exit points of memory error
    handler. So this patch fixes these messaging problems.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Chen Yucong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • A recent fix from Chen Yucong, commit 0bc1f8b0682c ("hwpoison: fix the
    handling path of the victimized page frame that belong to non-LRU"),
    rejects going into the unmapping operation for hugetlbfs/thp pages,
    which results in failing error containment on such pages. This patch
    fixes it.

    With this patch, hwpoison functional tests in mce-test testsuite pass.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Chen Yucong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

24 Jul, 2014

1 commit

  • I triggered the VM_BUG_ON() in vma_address() when I tried to migrate
    an anonymous hugepage with mbind() on kernel v3.16-rc3. This is
    because the pgoff calculation in rmap_walk_anon() fails to consider
    compound_order() and ends up with an incorrect value.

    This patch introduces page_to_pgoff(), which gets the page's offset in
    PAGE_CACHE_SIZE units.

    Kirill pointed out that the page cache tree should natively handle
    hugepages, and that in order to make hugetlbfs fit it, page->index of
    a hugetlbfs page should be in PAGE_CACHE_SIZE units. This is beyond
    this patch, but page_to_pgoff() contains the point to be fixed in a
    single function.
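
    A sketch of the helper (hedged; a hugetlbfs page->index is in
    huge-page-size units, so it has to be scaled up):

    static inline pgoff_t page_to_pgoff(struct page *page)
    {
            if (unlikely(PageHeadHuge(page)))
                    return page->index << compound_order(page);
            return page->index >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
    }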

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: Joonsoo Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

04 Jul, 2014

1 commit

  • Until now, the kernel has had the same policy for handling victimized
    page frames that belong to kernel-space (reserved/slab-subsystem) or
    are non-LRU (unknown page state). In other words, the result of
    handling either kind of victimized page frame is (IGNORED | FAILED),
    and the return value of memory_failure() is -EBUSY.

    This patch avoids memory_failure() returning very early because
    (!PageLRU(p)) is "true", and it also ensures that action_result() can
    report more precise information ("reserved kernel", "kernel slab", and
    "unknown page state") instead of "non LRU", especially for memory
    errors detected by memory scrubbing.

    Andi said:

    : While running the mcelog test suite on 3.14 I hit the following VM_BUG_ON:
    :
    : soft_offline: 0x56d4: unknown non LRU page type 3ffff800008000
    : page:ffffea000015b400 count:3 mapcount:2097169 mapping: (null) index:0xffff8800056d7000
    : page flags: 0x3ffff800004081(locked|slab|head)
    : ------------[ cut here ]------------
    : kernel BUG at mm/rmap.c:1495!
    :
    : I think what happened is that a LRU page turned into a slab page in
    : parallel with offlining. memory_failure initially tests for this case,
    : but doesn't retest later after the page has been locked.
    :
    : ...
    :
    : I ran this patch in a loop over night with some stress plus
    : the mcelog test suite running in a loop. I cannot guarantee it hit it,
    : but it should have given it a good beating.
    :
    : The kernel survived with no messages, although the mcelog test suite
    : got killed at some point because it couldn't fork anymore. Probably
    : some unrelated problem.
    :
    : So the patch is ok for me for .16.

    Signed-off-by: Chen Yucong
    Acked-by: Naoya Horiguchi
    Reported-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     

05 Jun, 2014

5 commits

  • Currently memory error handler handles action optional errors in the
    deferred manner by default. And if a recovery aware application wants
    to handle it immediately, it can do it by setting PF_MCE_EARLY flag.
    However, such a signal can be sent only to the main thread, so it's
    problematic if the application wants to have a dedicated thread to
    handle such signals.

    So this patch adds dedicated thread support to memory error handler. We
    have PF_MCE_EARLY flags for each thread separately, so with this patch
    AO signal is sent to the thread with PF_MCE_EARLY flag set, not the main
    thread. If you want to implement a dedicated thread, you call prctl()
    to set PF_MCE_EARLY on the thread.

    Memory error handler collects processes to be killed, so this patch lets
    it check PF_MCE_EARLY flag on each thread in the collecting routines.

    No behavioral change for all non-early kill cases.
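
    A dedicated handler thread would opt in like this (plain user-space
    usage; the thread is assumed to have installed its own SIGBUS
    handler):

    #include <sys/prctl.h>

    /* mark *this* thread as an early-kill recipient (PF_MCE_EARLY) */
    if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0) < 0)
            perror("prctl");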

    Tony said:

    : The old behavior was crazy - someone with a multithreaded process might
    : well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
    : that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if
    : that thread wasn't the main thread for the process.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Tony Luck
    Cc: Kamil Iskra
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When Linux sees an "action optional" machine check (where h/w has reported
    an error that is not in the current execution path) we generally do not
    want to signal a process, since most processes do not have a SIGBUS
    handler - we'd just prematurely terminate the process for a problem that
    they might never actually see.

    task_early_kill() decides whether to consider a process - and it checks
    whether this specific process has been marked for early signals with
    "prctl", or if the system administrator has requested early signals for
    all processes using /proc/sys/vm/memory_failure_early_kill.

    But in the MF_ACTION_REQUIRED case we must not defer. The error is in
    the execution path of the current thread, so we must send the SIGBUS
    immediately.

    Fix by passing a flag argument through collect_procs*() to
    task_early_kill() so it knows whether we can defer or must take action.

    Signed-off-by: Tony Luck
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Luck
     
  • When a thread in a multi-threaded application hits a machine check because
    of an uncorrectable error in memory - we want to send the SIGBUS with
    si.si_code = BUS_MCEERR_AR to that thread. Currently we fail to do that
    if the active thread is not the primary thread in the process.
    collect_procs() just finds primary threads and this test:

    if ((flags & MF_ACTION_REQUIRED) && t == current) {

    will see that the thread we found isn't the current thread and so send a
    si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active
    thread at this time).

    We can fix this by checking whether "current" shares the same mm with the
    process that collect_procs() said owned the page. If so, we send the
    SIGBUS to current (with code BUS_MCEERR_AR).
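
    The essence of the fix, sketched (not the verbatim hunk):

    /* prefer the thread that actually hit the error */
    if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm)
            t = current;    /* gets SIGBUS with si_code = BUS_MCEERR_AR */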

    Signed-off-by: Tony Luck
    Signed-off-by: Naoya Horiguchi
    Reported-by: Otto Bruggeman
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Luck
     
  • The comment about pages under writeback is far from the relevant code, so
    let's move it to the right place.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Memory migration uses a callback defined by the caller to determine how to
    allocate destination pages. When migration fails for a source page,
    however, it frees the destination page back to the system.

    This patch adds a memory migration callback defined by the caller to
    determine how to free destination pages. If a caller, such as memory
    compaction, builds its own freelist for migration targets, this can reuse
    already freed memory instead of scanning additional memory.

    If the caller provides a function to handle freeing of destination pages,
    it is called when page migration fails. If the caller passes NULL then
    freeing back to the system will be handled as usual. This patch
    introduces no functional change.
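
    After this patch the migration entry point carries both callbacks,
    roughly:

    int migrate_pages(struct list_head *from, new_page_t get_new_page,
                      free_page_t put_new_page, unsigned long private,
                      enum migrate_mode mode, int reason);

    /* pass NULL for put_new_page to keep the old free-to-system behavior */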

    Signed-off-by: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes