24 Mar, 2019

1 commit

  • commit 46612b751c4941c5c0472ddf04027e877ae5990f upstream.

    When soft_offline_in_use_page() runs on a thp tail page after pmd is
    split, we trigger the following VM_BUG_ON_PAGE():

    Memory failure: 0x3755ff: non anonymous thp
    __get_any_page: 0x3755ff: unknown zero refcount page type 2fffff80000000
    Soft offlining pfn 0x34d805 at process virtual address 0x20fff000
    page:ffffea000d360140 count:0 mapcount:0 mapping:0000000000000000 index:0x1
    flags: 0x2fffff80000000()
    raw: 002fffff80000000 ffffea000d360108 ffffea000d360188 0000000000000000
    raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
    page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
    ------------[ cut here ]------------
    kernel BUG at ./include/linux/mm.h:519!

    soft_offline_in_use_page() passed the refcount and page lock from the
    tail page to the head page, which is not needed because we can pass
    any subpage to split_huge_page().

    Naoya had fixed a similar issue in c3901e722b29 ("mm: hwpoison: fix thp
    split handling in memory_failure()"). But he missed fixing soft
    offline.
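
    A sketch of the fixed THP branch in soft_offline_in_use_page()
    (simplified from the patch; log messages elided): lock and split the
    subpage we were handed instead of bouncing the pin and lock over to
    the head page.

    struct page *hpage = compound_head(page);

    if (!PageHuge(page) && PageTransHuge(hpage)) {
        lock_page(page);        /* lock the subpage, not hpage */
        if (!PageAnon(page) || unlikely(split_huge_page(page))) {
            unlock_page(page);
            put_hwpoison_page(page);
            return -EBUSY;      /* cannot split: give up */
        }
        unlock_page(page);
    }
    /* ... continue soft offline on the now-order-0 subpage ... */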

    Link: http://lkml.kernel.org/r/1551452476-24000-1-git-send-email-zhongjiang@huawei.com
    Fixes: 61f5d698cc97 ("mm: re-enable THP")
    Signed-off-by: zhongjiang
    Acked-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    zhongjiang
     

07 Feb, 2019

1 commit

  • commit 6376360ecbe525a9c17b3d081dfd88ba3e4ed65b upstream.

    Currently memory_failure() is racy against a process exiting, which
    can result in a kernel crash by NULL pointer dereference.

    The root cause is that memory_failure() uses force_sig() to forcibly
    kill asynchronous (meaning not in the current context) processes. As
    discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for OOM
    fixes, this is not the right thing to do. OOM solved this issue by using
    do_send_sig_info() as done in commit d2d393099de2 ("signal:
    oom_kill_task: use SEND_SIG_FORCED instead of force_sig()"), so this
    patch does the same for hwpoison. do_send_sig_info() properly accesses
    the siglock with lock_task_sighand(), so it is free from the reported
    race.

    I confirmed that the reported bug reproduces with inserting some delay
    in kill_procs(), and it never reproduces with this patch.

    Note that memory_failure() can send another type of signal using
    force_sig_mceerr(), and the reported race shouldn't happen there because
    force_sig_mceerr() is called only for synchronous processes (i.e.
    BUS_MCEERR_AR happens only when some process accesses the corrupted
    memory.)
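
    A sketch of the fixed asynchronous kill path (simplified;
    kill_proc_async() is a hypothetical name for illustration):

    static int kill_proc_async(struct task_struct *t)
    {
        int ret;

        /*
         * do_send_sig_info() takes the target's siglock via
         * lock_task_sighand(), so it is safe against 't' exiting
         * concurrently, unlike the old force_sig(SIGBUS, t).
         */
        ret = do_send_sig_info(SIGBUS, SEND_SIG_PRIV, t, PIDTYPE_TGID);
        if (ret < 0)
            pr_info("Memory failure: Error sending signal to %s:%d: %d\n",
                    t->comm, t->pid, ret);
        return ret;
    }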

    Link: http://lkml.kernel.org/r/20190116093046.GA29835@hori1.linux.bs1.fc.nec.co.jp
    Signed-off-by: Naoya Horiguchi
    Reported-by: Jane Chu
    Reviewed-by: Dan Williams
    Reviewed-by: William Kucharski
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     

26 Aug, 2018

1 commit

  • …/linux/kernel/git/nvdimm/nvdimm

    Pull libnvdimm memory-failure update from Dave Jiang:
    "As it stands, memory_failure() gets thoroughly confused by dev_pagemap
    backed mappings. The recovery code has specific enabling for several
    possible page states and needs new enabling to handle poison in dax
    mappings.

    In order to support reliable reverse mapping of user space addresses:

    1/ Add new locking in the memory_failure() rmap path to prevent races
    that would typically be handled by the page lock.

    2/ Since dev_pagemap pages are hidden from the page allocator and the
    "compound page" accounting machinery, add a mechanism to determine
    the size of the mapping that encompasses a given poisoned pfn.

    3/ Given pmem errors can be repaired, change the speculatively
    accessed poison protection, mce_unmap_kpfn(), to be reversible and
    otherwise allow ongoing access from the kernel.

    A side effect of this enabling is that MADV_HWPOISON becomes usable
    for dax mappings, however the primary motivation is to allow the
    system to survive userspace consumption of hardware-poison via dax.
    Specifically the current behavior is:

    mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered
    <reboot>

    ...and with these changes:

    Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
    Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
    Memory failure: 0x20cb00: recovery action for dax page: Recovered

    Given all the cross dependencies I propose taking this through
    nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
    folks"

    * tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm, pmem: Restore page attributes when clearing errors
    x86/memory_failure: Introduce {set, clear}_mce_nospec()
    x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
    mm, memory_failure: Teach memory_failure() about dev_pagemap pages
    filesystem-dax: Introduce dax_lock_mapping_entry()
    mm, memory_failure: Collect mapping size in collect_procs()
    mm, madvise_inject_error: Let memory_failure() optionally take a page reference
    mm, dev_pagemap: Do not clear ->mapping on final put
    mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
    filesystem-dax: Set page->index
    device-dax: Set page->index
    device-dax: Enable page_mapping()
    device-dax: Convert to vmf_insert_mixed and vm_fault_t

    Linus Torvalds
     

24 Aug, 2018

2 commits

  • A process can be killed with SIGBUS(BUS_MCEERR_AR) when it tries to
    allocate a page that was just freed on its way through soft-offline.
    This is undesirable because soft-offline (which is about corrected
    errors) is less aggressive than hard-offline (which is about
    uncorrected errors), and we can let soft-offline fail and keep using
    the page for a good reason like "the system is busy."

    Two main changes of this patch are:

    - setting the migrate type of the target page to MIGRATE_ISOLATE: as
    done in free_unref_page_commit(), this makes the kernel bypass the
    pcplist when freeing the page, so we can assume that the page is in
    the freelist just after put_page() returns;

    - setting PG_hwpoison on the free page under zone->lock, which
    protects the freelists: this allows us to avoid setting PG_hwpoison
    on a page that is about to be allocated (see the sketch below).
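
    A sketch of set_hwpoison_free_buddy_page() as introduced here, close
    to the patch (page_order() is an mm-internal helper): PG_hwpoison is
    set only while zone->lock is held and the page is verified to still
    be in the buddy freelist.

    bool set_hwpoison_free_buddy_page(struct page *page)
    {
        struct zone *zone = page_zone(page);
        unsigned long pfn = page_to_pfn(page);
        unsigned long flags;
        bool hwpoisoned = false;
        int order;

        spin_lock_irqsave(&zone->lock, flags);
        for (order = 0; order < MAX_ORDER; order++) {
            struct page *page_head = page - (pfn & ((1 << order) - 1));

            /* still in the buddy freelist at this order? */
            if (PageBuddy(page_head) && page_order(page_head) >= order) {
                if (!TestSetPageHWPoison(page))
                    hwpoisoned = true;
                break;
            }
        }
        spin_unlock_irqrestore(&zone->lock, flags);

        return hwpoisoned;
    }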

    [akpm@linux-foundation.org: tweak set_hwpoison_free_buddy_page() comment]
    Link: http://lkml.kernel.org/r/1531452366-11661-3-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: Xishi Qiu
    Tested-by: Mike Kravetz
    Cc: Michal Hocko
    Cc:
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Patch series "mm: soft-offline: fix race against page allocation".

    Xishi recently reported an issue about a race on reusing the target
    pages of soft offlining. Discussion and analysis showed that we need
    to make sure that setting PG_hwpoison is done in the right place,
    under zone->lock, for soft offline. Patch 1/2 handles the free
    hugepage case, and patch 2/2 handles the free buddy page case.

    This patch (of 2):

    There's a race condition between soft offline and hugetlb_fault which
    causes unexpected process killing and/or hugetlb allocation failure.

    The process killing is caused by the following flow:

    CPU 0                    CPU 1                    CPU 2

    soft offline
      get_any_page
      // find the hugetlb is free
                             mmap a hugetlb file
                             page fault
                               ...
                                 hugetlb_fault
                                   hugetlb_no_page
                                     alloc_huge_page
                                     // succeed
      soft_offline_free_page
      // set hwpoison flag
                                                      mmap the hugetlb file
                                                      page fault
                                                        ...
                                                          hugetlb_fault
                                                            hugetlb_no_page
                                                              find_lock_page
                                                              return VM_FAULT_HWPOISON
                                                          mm_fault_error
                                                            do_sigbus
                                                            // kill the process

    The hugetlb allocation failure comes from the following flow:

    CPU 0                           CPU 1

                                    mmap a hugetlb file
                                    // reserve all free pages but don't fault-in
    soft offline
      get_any_page
      // find the hugetlb is free
      soft_offline_free_page
      // set hwpoison flag
        dissolve_free_huge_page
        // fail because all free hugepages are reserved
                                    page fault
                                      ...
                                        hugetlb_fault
                                          hugetlb_no_page
                                            alloc_huge_page
                                              ...
                                                dequeue_huge_page_node_exact
                                                // ignore hwpoisoned hugepage
                                                // and finally fail due to no-mem

    The root cause of this is that the current soft-offline code is written
    on the assumption that the PageHWPoison flag should be set first, to
    avoid accessing the corrupted data. This makes sense for
    memory_failure() or hard offline, but not for soft offline, because
    soft offline is about corrected (not uncorrected) errors and there is
    no risk of data loss. This patch changes the soft-offline semantics so
    that the PageHWPoison flag is set only after containment of the error
    page completes successfully.
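
    A minimal sketch of the reordered flow for a free hugepage
    (soft_offline_free_huge_page() is a hypothetical name for
    illustration; the actual patch folds this into
    dissolve_free_huge_page() and its caller):

    static int soft_offline_free_huge_page(struct page *page)
    {
        int rc;

        /* containment first: dissolve the free hugepage into buddy */
        rc = dissolve_free_huge_page(page);
        if (rc)
            return rc;  /* e.g. all free hugepages reserved: no flag set */

        /* only after successful containment is the page marked */
        if (!TestSetPageHWPoison(page))
            num_poisoned_pages_inc();
        return 0;
    }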

    Link: http://lkml.kernel.org/r/1531452366-11661-2-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: Xishi Qiu
    Suggested-by: Xishi Qiu
    Tested-by: Mike Kravetz
    Cc: Michal Hocko
    Cc:
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

23 Aug, 2018

1 commit

  • page_freeze_refs/page_unfreeze_refs have already been replaced by
    page_ref_freeze/page_ref_unfreeze, but the comments still refer to
    the old names; update them.
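
    A usage sketch of the current API (freeze_page_refs() is a
    hypothetical wrapper for illustration):

    static int freeze_page_refs(struct page *page, int expected_count)
    {
        /*
         * page_ref_freeze() succeeds (returns nonzero) only if the
         * refcount is exactly expected_count, atomically dropping it
         * to zero so no new references can be taken.
         */
        if (!page_ref_freeze(page, expected_count))
            return -EAGAIN;

        /* ... exclusive section ... */

        page_ref_unfreeze(page, expected_count);
        return 0;
    }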

    Link: http://lkml.kernel.org/r/1532590226-106038-1-git-send-email-jiang.biao2@zte.com.cn
    Signed-off-by: Jiang Biao
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Biao
     

24 Jul, 2018

2 commits

  • mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered

    In contrast to typical memory, dev_pagemap pages may be dax mapped.
    With dax there is no possibility of mapping in another page
    dynamically, since dax establishes 1:1 physical-address-to-file-offset
    associations. Also, dev_pagemap pages associated with NVDIMM /
    persistent memory devices can internally remap/repair addresses with
    poison. While memory_failure() assumes that it can discard typical
    poisoned pages and keep them unmapped indefinitely, dev_pagemap pages
    may be returned to service after the error is cleared.

    Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
    dev_pagemap pages that have poison consumed by userspace. Mark the
    memory as UC instead of unmapping it completely to allow ongoing access
    via the device driver (nd_pmem). Later, nd_pmem will grow support for
    marking the page back to WB when the error is cleared.
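
    A sketch of the intended use of the new x86 helpers from this series
    ({set,clear}_mce_nospec() take a pfn; the wrapper names here are
    hypothetical):

    /* poison consumed: keep the pfn mapped, but uncached (UC), so the
     * pmem driver can still reach it without new speculative fills */
    static void pmem_mark_poisoned(unsigned long pfn)
    {
        if (set_mce_nospec(pfn))
            pr_warn("could not set UC mapping for pfn %#lx\n", pfn);
    }

    /* error repaired by the driver: restore the write-back (WB) type */
    static void pmem_clear_poisoned(unsigned long pfn)
    {
        if (clear_mce_nospec(pfn))
            pr_warn("could not restore WB mapping for pfn %#lx\n", pfn);
    }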

    Cc: Jan Kara
    Cc: Christoph Hellwig
    Cc: Jérôme Glisse
    Cc: Matthew Wilcox
    Cc: Naoya Horiguchi
    Cc: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Dave Jiang

    Dan Williams
     
  • In preparation for supporting memory_failure() for dax mappings, teach
    collect_procs() to also determine the mapping size. Unlike typical
    mappings the dax mapping size is determined by walking page-table
    entries rather than using the compound-page accounting for THP pages.
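
    A sketch close to the dev_pagemap_mapping_shift() helper the patch
    adds to mm/memory-failure.c: the mapping size is read off the
    page-table level that maps the address.

    static unsigned long dev_pagemap_mapping_shift(struct page *page,
                                                   struct vm_area_struct *vma)
    {
        unsigned long address = vma_address(page, vma);
        pgd_t *pgd;
        p4d_t *p4d;
        pud_t *pud;
        pmd_t *pmd;
        pte_t *pte;

        pgd = pgd_offset(vma->vm_mm, address);
        if (!pgd_present(*pgd))
            return 0;
        p4d = p4d_offset(pgd, address);
        if (!p4d_present(*p4d))
            return 0;
        pud = pud_offset(p4d, address);
        if (!pud_present(*pud))
            return 0;
        if (pud_devmap(*pud))
            return PUD_SHIFT;    /* 1GB dax mapping */
        pmd = pmd_offset(pud, address);
        if (!pmd_present(*pmd))
            return 0;
        if (pmd_devmap(*pmd))
            return PMD_SHIFT;    /* 2MB dax mapping */
        pte = pte_offset_map(pmd, address);
        if (!pte_present(*pte))
            return 0;
        if (pte_devmap(*pte))
            return PAGE_SHIFT;   /* 4kB dax mapping */
        return 0;
    }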

    Acked-by: Naoya Horiguchi
    Signed-off-by: Dan Williams
    Signed-off-by: Dave Jiang

    Dan Williams
     

12 Apr, 2018

1 commit

  • No allocation callback is using this argument anymore. new_page_node
    used to use this parameter to convey the node_id (resp. a migration
    error) up to the move_pages code (do_move_page_to_node_array). The
    error status never made it into the final status field, and we now
    have a better way to communicate the node id to the status field.
    All other allocation callbacks simply ignored the argument, so we can
    finally drop it.
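
    The shape of the interface change, as a diff-style sketch of
    include/linux/migrate.h:

    -typedef struct page *new_page_t(struct page *page, unsigned long private,
    -                                int **reason);
    +typedef struct page *new_page_t(struct page *page, unsigned long private);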

    [mhocko@suse.com: fix migration callback]
    Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
    [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
    [mhocko@kernel.org: fix build]
    Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Zi Yan
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

06 Apr, 2018

1 commit

  • Recently the following BUG was reported:

    Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
    Memory failure: 0x3c0000: recovery action for huge page: Recovered
    BUG: unable to handle kernel paging request at ffff8dfcc0003000
    IP: gup_pgd_range+0x1f0/0xc20
    PGD 17ae72067 P4D 17ae72067 PUD 0
    Oops: 0000 [#1] SMP PTI
    ...
    CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014

    You can easily reproduce this by calling madvise(MADV_HWPOISON) twice
    on a 1GB hugepage. This happens because get_user_pages_fast() is not
    aware of the migration entry on the pud that was created by the first
    madvise() call.

    I think the conversion to a pud-aligned migration entry works, but
    other MM code walking over page tables isn't prepared for it. We need
    some time and effort to make all of this work properly, so this patch
    avoids the reported bug by just disabling error handling for 1GB
    hugepages.
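
    The guard this adds to the hugetlb error-handling path, close to the
    patch (MF_MSG_NON_PMD_HUGE is the message type it introduces):

    /* pud-sized hugepages: hwpoison handling doesn't work yet, bail out */
    if (huge_page_size(page_hstate(head)) > PMD_SIZE) {
        action_result(pfn, MF_MSG_NON_PMD_HUGE, MF_IGNORED);
        res = -EBUSY;
        goto out;
    }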

    [n-horiguchi@ah.jp.nec.com: v2]
    Link: http://lkml.kernel.org/r/1517284444-18149-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Link: http://lkml.kernel.org/r/1517207283-15769-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Punit Agrawal
    Tested-by: Michael Ellerman
    Cc: Anshuman Khandual
    Cc: "Aneesh Kumar K.V"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

13 Feb, 2018

1 commit

  • In the following commit:

    ce0fa3e56ad2 ("x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages")

    ... we added code to memory_failure() to unmap the page from the
    kernel 1:1 virtual address space to avoid speculative access to the
    page logging additional errors.

    But memory_failure() may not always succeed in taking the page offline,
    especially if the page belongs to the kernel. This can happen if
    there are too many corrected errors on a page and either mcelog(8)
    or drivers/ras/cec.c asks to take a page offline.

    Since we remove the 1:1 mapping early in memory_failure(), we can
    end up with the page unmapped, but still in use. On the next access
    the kernel crashes :-(

    There are also various debug paths that call memory_failure() to simulate
    occurrence of an error. Since there is no actual error in memory, we
    don't need to map out the page for those cases.

    Revert most of the previous attempt and keep the solution local to
    arch/x86/kernel/cpu/mcheck/mce.c. Unmap the page only when:

    1) there is a real error
    2) memory_failure() succeeds.

    All of this only applies to 64-bit systems. 32-bit kernel doesn't map
    all of memory into kernel space. It isn't worth adding the code to unmap
    the piece that is mapped because nobody would run a 32-bit kernel on a
    machine that has recoverable machine checks.
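
    A sketch of the resulting logic in arch/x86/kernel/cpu/mcheck/mce.c
    (simplified; the memory_failure() signature is as of that kernel
    version):

    static void do_memory_failure(struct mce *m)
    {
        unsigned long pfn = m->addr >> PAGE_SHIFT;
        int flags = MF_ACTION_REQUIRED;

        pr_err("Uncorrected hardware memory error in user-access at %llx",
               m->addr);
        if (!(m->mcgstatus & MCG_STATUS_RIPV))
            flags |= MF_MUST_KILL;
        if (memory_failure(pfn, MCE_VECTOR, flags))
            pr_err("Memory error not recovered");
        else
            mce_unmap_kpfn(pfn);   /* 1) real error, 2) offline succeeded */
    }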

    Signed-off-by: Tony Luck
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave
    Cc: Denys Vlasenko
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Robert (Persistent Memory)
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Cc: stable@vger.kernel.org #v4.14
    Fixes: ce0fa3e56ad2 ("x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages")
    Signed-off-by: Ingo Molnar

    Tony Luck
     

16 Nov, 2017

1 commit

  • On a failed attempt, we get the following entry:

    soft offline: 0x3c0000: migration failed 1, type 17ffffc0008008 (uptodate|head)

    Make this log more specific, to be straightforward and to follow the
    other error log formats in soft_offline_huge_page().

    Link: http://lkml.kernel.org/r/20171016171757.GA3018@ubuntu-desk-vm
    Signed-off-by: Laszlo Toth
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laszlo Toth
     

17 Aug, 2017

1 commit

  • Speculative processor accesses may reference any memory that has a
    valid page table entry. While a speculative access won't generate
    a machine check, it will log the error in a machine check bank. That
    could cause escalation of a subsequent error, since the overflow bit
    will then be set in the machine check bank status register.

    Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual
    address of the page we want to map out otherwise we may trigger the
    very problem we are trying to avoid. We use a non-canonical address
    that passes through the usual Linux table walking code to get to the
    same "pte".
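
    The heart of the trick, close to the mce_unmap_kpfn() this patch adds
    (64-bit only):

    void mce_unmap_kpfn(unsigned long pfn)
    {
        unsigned long decoy_addr;

        /*
         * A non-canonical alias of the 1:1 address (bit 63 flipped):
         * it walks the page tables to the same pte, but the real
         * kernel virtual address of the poisoned page never sits in
         * a register where a speculative access could dereference it.
         */
        decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));

        if (set_memory_np(decoy_addr, 1))
            pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
    }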

    Thanks to Dave Hansen for reviewing several iterations of this.

    Also see:

    http://marc.info/?l=linux-mm&m=149860136413338&w=2

    Signed-off-by: Tony Luck
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: Elliott, Robert (Persistent Memory)
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20170816171803.28342-1-tony.luck@intel.com
    Signed-off-by: Ingo Molnar

    Tony Luck
     

11 Jul, 2017

9 commits

  • new_page is yet another duplication of the migration callback which has
    to handle hugetlb migration specially. We can safely use the generic
    new_page_nodemask for the same purpose.

    Please note that gigantic hugetlb pages do not need any special handling
    because alloc_huge_page_nodemask will make sure to check pages in all
    per-node pools. The reason this was done previously was that
    alloc_huge_page_node treated NUMA_NO_NODE and a specific node
    differently, and so alloc_huge_page_node(nid) would check only that
    specific node.
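
    The whole callback then reduces to the following (close to the patch):

    static struct page *new_page(struct page *p, unsigned long private, int **x)
    {
        int nid = page_to_nid(p);

        /* prefer the node of the error page; any node is acceptable */
        return new_page_nodemask(p, nid, &node_states[N_MEMORY]);
    }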

    Link: http://lkml.kernel.org/r/20170622193034.28972-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reported-by: Vlastimil Babka
    Reviewed-by: Mike Kravetz
    Tested-by: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Factoring duplicate code into a function.

    Link: http://lkml.kernel.org/r/1496305019-5493-10-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • dequeue_hwpoisoned_huge_page() is no longer used, so let's remove it.

    Link: http://lkml.kernel.org/r/1496305019-5493-9-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to
    keep the error hugepage away from the system, which is OK but not good
    enough, because the hugepage still has a refcount and unpoison doesn't
    work on the error hugepage (the PageHWPoison flags are cleared, but
    the pages are still leaked). There is also the issue of wasting
    healthy subpages. This patch reworks me_huge_page() to solve these
    issues.

    For a hugetlb file, we recently gained truncating code, so let's use
    it in a hugetlbfs-specific ->error_remove_page().

    For an anonymous hugepage, it's helpful to dissolve the error page
    after freeing it into the free hugepage list. The migration entry and
    PageHWPoison in the head page prevent access to it.

    TODO: dissolve_free_huge_page() can fail, but we haven't considered
    that yet. It's not critical (and at least no worse than now), because
    in such a case the error hugepage just stays in the free hugepage list
    without being dissolved. By virtue of PageHWPoison in the head page,
    it's never allocated to processes.
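
    A trimmed sketch of the reworked me_huge_page() (close to the patch;
    truncate_error_page() is the helper it introduces for mapped files):

    static int me_huge_page(struct page *p, unsigned long pfn)
    {
        int res = 0;
        struct page *hpage = compound_head(p);
        struct address_space *mapping = page_mapping(hpage);

        if (mapping) {
            /* hugetlbfs file: truncate the error hugepage out of it */
            res = truncate_error_page(hpage, pfn, mapping);
        } else {
            unlock_page(hpage);
            /*
             * The migration entry and PageHWPoison on the head page
             * prevent further access; free the anonymous hugepage and
             * dissolve it so healthy subpages go back to buddy.
             */
            if (PageAnon(hpage))
                put_page(hpage);
            dissolve_free_huge_page(p);
            res = MF_RECOVERED;
            lock_page(hpage);
        }
        return res;
    }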

    [akpm@linux-foundation.org: fix unused var warnings]
    Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
    Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
    Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: kernel test robot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • memory_failure() is a big function and hard to maintain. Handling the
    hugetlb and non-hugetlb cases in a single function is not good, so this
    patch separates the PageHuge() branch into a new function, which saves
    many PageHuge() checks.

    Link: http://lkml.kernel.org/r/1496305019-5493-7-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Now we have code to rescue most of the healthy pages from a hwpoisoned
    hugepage, so let's apply it to soft_offline_free_page too.

    Link: http://lkml.kernel.org/r/1496305019-5493-6-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently a hugepage migrated by soft-offline (i.e. due to correctable
    memory errors) is contained as a hugepage, which means many non-error
    pages in it are unreusable, i.e. wasted.

    This patch solves this issue by dissolving source hugepages into buddy.
    As done in the previous patch, PageHWPoison is set only on the head page
    of the error hugepage. Then, when dissolving, we move the PageHWPoison
    flag to the raw error page so that all healthy subpages return to buddy.
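
    The core of the handover, close to the hunk this adds to
    dissolve_free_huge_page():

    /*
     * Move the PageHWPoison flag from the head page to the raw error
     * page, so that the healthy subpages become reusable once the
     * hugepage is dissolved into buddy pages.
     */
    if (PageHWPoison(head) && page != head) {
        SetPageHWPoison(page);
        ClearPageHWPoison(head);
    }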

    [arnd@arndb.de: fix warnings: replace some macros with inline functions]
    Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Anshuman Khandual
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • We'd like to narrow down the error region of a memory error on hugetlb
    pages. However, currently we set the PageHWPoison flag on all subpages
    of the error hugepage and add the number of subpages to
    num_poisoned_pages, which doesn't fit our purpose.

    So this patch changes the behavior: we set PageHWPoison only on the
    head page and increase num_poisoned_pages only by 1. This is a
    preparation for the narrowing-down part, which comes in later patches.

    Link: http://lkml.kernel.org/r/1496305019-5493-4-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Patch series "mm: hwpoison: fixlet for hugetlb migration".

    This patchset updates the hwpoison/hugetlb code to address 2 reported
    issues.

    One is a madvise(MADV_HWPOISON) failure reported by Intel's lkp robot
    (see http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop.) The
    first half was already fixed in mainline, and the other half, about the
    hugetlb cases, is solved in this series.

    The other issue is the "narrow down the error-affected region to a
    single 4kB page instead of a whole hugetlb page" issue, which was tried
    by Anshuman
    (http://lkml.kernel.org/r/20170420110627.12307-1-khandual@linux.vnet.ibm.com)
    and which I updated to apply more widely.

    This patch (of 9):

    We no longer use MIGRATE_ISOLATE to prevent reuse of hwpoisoned
    hugepages as we did before, so the current dequeue_huge_page_node()
    doesn't work as intended because it still uses
    is_migrate_isolate_page() for this check. This patch fixes it by
    checking the PageHWPoison flag instead.
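
    The shape of the fix in the hugetlb dequeue path, as a diff-style
    sketch:

     list_for_each_entry(page, &h->hugepage_freelists[nid], lru)
    -    if (!is_migrate_isolate_page(page))
    +    if (!PageHWPoison(page))
             break;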

    Link: http://lkml.kernel.org/r/1496305019-5493-2-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

08 Jul, 2017

1 commit

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may even not be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs that that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can go in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"
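
    As a rough illustration of the errseq_t API described above (the
    wrapper names are hypothetical; the real plumbing lives in the
    address_space and struct file):

    #include <linux/errseq.h>

    /* writeback failure: record the error in the mapping's cursor and
     * mark it as not yet seen by any file description */
    static void record_wb_error(errseq_t *wb_err, int err)
    {
        errseq_set(wb_err, err);
    }

    /* fsync: report an error recorded since *since was last sampled,
     * exactly once per file description, and advance the cursor */
    static int check_wb_error(errseq_t *wb_err, errseq_t *since)
    {
        return errseq_check_and_advance(wb_err, since);
    }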

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
     

07 Jul, 2017

1 commit

  • Though migrating gigantic HugeTLB pages does not sound much like a
    real-world use case, they can be affected by memory errors. Hence,
    migration at the PGD level for HugeTLB pages should be supported, just
    to enable the soft and hard offline use cases.

    While allocating the new gigantic HugeTLB page, it should not matter
    whether the new page comes from the same node or not. There will be
    very few gigantic pages on the system after all; we should not be
    bothered about node locality when trying to save a big page from
    crashing.

    This change renames the dequeue_huge_page_node() function to
    dequeue_huge_page_node_exact(), preserving its original functionality.
    The new dequeue_huge_page_node() scans through all available online
    nodes to allocate a huge page for the NUMA_NO_NODE case and just falls
    back to calling dequeue_huge_page_node_exact() for all other cases.
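
    A sketch of the resulting pair (close to the patch):

    /* Original behaviour: only look at the requested node's free list. */
    static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
    {
        /* ... unchanged body of the old dequeue_huge_page_node() ... */
    }

    static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
    {
        struct page *page;
        int node;

        if (nid != NUMA_NO_NODE)
            return dequeue_huge_page_node_exact(h, nid);

        /* NUMA_NO_NODE: take the first node that can satisfy the request */
        for_each_online_node(node) {
            page = dequeue_huge_page_node_exact(h, node);
            if (page)
                return page;
        }
        return NULL;
    }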

    [arnd@arndb.de: make hstate_is_gigantic() inline]
    Link: http://lkml.kernel.org/r/20170522124748.3911296-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170516100509.20122-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Signed-off-by: Arnd Bergmann
    Cc: "Aneesh Kumar K.V"
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

06 Jul, 2017

1 commit

  • The error code should be negative. Since this ends up in the default case
    anyway, this is harmless, but it's less confusing to negate it. Also,
    later patches will require a negative error code here.

    Link: http://lkml.kernel.org/r/20170525103355.6760-1-jlayton@redhat.com
    Signed-off-by: Jeff Layton
    Reviewed-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Andrew Morton

    Jeff Layton
     

17 Jun, 2017

1 commit

  • memory_failure() chooses a recovery action function based on the page
    flags. For huge pages it uses the tail page flags which don't have
    anything interesting set, resulting in:

    > Memory failure: 0x9be3b4: Unknown page state
    > Memory failure: 0x9be3b4: recovery action for unknown page: Failed

    Instead, save a copy of the head page's flags if this is a huge page;
    this means that if there are no relevant flags for this tail page, we
    use the head page's flags instead. This results in the me_huge_page()
    recovery action being called:

    > Memory failure: 0x9b7969: recovery action for huge page: Delayed

    For hugepages that have not yet been allocated, this allows the hugepage
    to be dequeued.
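
    The gist of the fix in memory_failure(), close to the patch:

    /*
     * For a huge page the interesting flags live on the head page,
     * not on the tail that took the error, so snapshot accordingly
     * before choosing a recovery action.
     */
    if (PageHuge(p))
        page_flags = hpage->flags;
    else
        page_flags = p->flags;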

    Fixes: 524fca1e7356 ("HWPOISON: fix misjudgement of page_action() for errors on mlocked pages")
    Link: http://lkml.kernel.org/r/20170524130204.21845-1-james.morse@arm.com
    Signed-off-by: James Morse
    Tested-by: Punit Agrawal
    Acked-by: Punit Agrawal
    Acked-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morse
     

03 Jun, 2017

1 commit

  • On failing to migrate a page, soft_offline_huge_page() performs the
    necessary update to the hugepage ref-count.

    But when !hugepage_migration_supported(), unmap_and_move_hugepage()
    also decrements the page ref-count for the hugepage. The combined
    behaviour leaves the ref-count in an inconsistent state.

    This leads to soft lockups when running the overcommitted hugepage test
    from the mce-tests suite.

    Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000
    soft offline: 0x83ed600: migration failed 1, type 1fffc00000008008 (uptodate|head)
    INFO: rcu_preempt detected stalls on CPUs/tasks:
    Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715
    (detected by 7, t=5254 jiffies, g=963, c=962, q=321)
    thugetlb_overco R running task 0 2715 2685 0x00000008
    Call trace:
    dump_backtrace+0x0/0x268
    show_stack+0x24/0x30
    sched_show_task+0x134/0x180
    rcu_print_detail_task_stall_rnp+0x54/0x7c
    rcu_check_callbacks+0xa74/0xb08
    update_process_times+0x34/0x60
    tick_sched_handle.isra.7+0x38/0x70
    tick_sched_timer+0x4c/0x98
    __hrtimer_run_queues+0xc0/0x300
    hrtimer_interrupt+0xac/0x228
    arch_timer_handler_phys+0x3c/0x50
    handle_percpu_devid_irq+0x8c/0x290
    generic_handle_irq+0x34/0x50
    __handle_domain_irq+0x68/0xc0
    gic_handle_irq+0x5c/0xb0

    Address this by changing the putback_active_hugepage() in
    soft_offline_huge_page() to putback_movable_pages().

    This only triggers on systems that enable memory failure handling
    (ARCH_SUPPORTS_MEMORY_FAILURE) but not hugepage migration
    (!ARCH_ENABLE_HUGEPAGE_MIGRATION).

    I imagine this wasn't triggered as there aren't many systems running
    this configuration.
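
    The shape of the fix in soft_offline_huge_page(), as a diff-style
    sketch (log line elided):

     if (ret) {
         ...
    -    putback_active_hugepage(hpage);
    +    /*
    +     * Only one hugepage was put on pagelist; if migration already
    +     * dropped its reference, the list is simply empty here.
    +     */
    +    putback_movable_pages(&pagelist);
         if (ret > 0)
             ret = -EIO;
     }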

    [akpm@linux-foundation.org: remove dead comment, per Naoya]
    Link: http://lkml.kernel.org/r/20170525135146.32011-1-punit.agrawal@arm.com
    Reported-by: Manoj Iyer
    Tested-by: Manoj Iyer
    Suggested-by: Naoya Horiguchi
    Signed-off-by: Punit Agrawal
    Cc: Joonsoo Kim
    Cc: Wanpeng Li
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: [3.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
     

13 May, 2017

1 commit

  • Laurent Dufour has noticed that hwpoisoned pages are kept charged. In
    his particular case he hit a bad_page("page still charged to cgroup")
    when onlining a hwpoison page. While this looks like something that
    shouldn't happen in the first place, because onlining hwpoison pages
    and returning them to the page allocator makes little sense, it shows
    a real problem.

    hwpoison pages usually do not get freed, so we do not uncharge them
    (at least not since commit 0a31bc97c80c ("mm: memcontrol: rewrite
    uncharge API")). Each charge also pins the memcg (since e8ea14cc6ead
    ("mm: memcontrol: take a css reference for each charged page")), and
    so the mem_cgroup and the associated state will never go away. Fix
    this leak by forcibly uncharging an LRU hwpoisoned page in
    delete_from_lru_cache(). We also have to tweak uncharge_list, because
    it cannot rely on a zero ref count for these pages.
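
    A sketch of the fixed delete_from_lru_cache(), close to the patch:

    static int delete_from_lru_cache(struct page *p)
    {
        if (!isolate_lru_page(p)) {
            /*
             * Clear sensible page flags, so that the buddy system
             * won't complain when the page is unpoison-and-freed.
             */
            ClearPageActive(p);
            ClearPageUnevictable(p);

            /*
             * Poisoned pages might never drop their ref count to 0,
             * so uncharge them from their memcg here, by hand.
             */
            mem_cgroup_uncharge(p);

            /* drop the reference taken by isolate_lru_page() */
            put_page(p);
            return 0;
        }
        return -EIO;
    }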

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API")
    Link: http://lkml.kernel.org/r/20170502185507.GB19165@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Laurent Dufour
    Tested-by: Laurent Dufour
    Reviewed-by: Balbir Singh
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 May, 2017

5 commits

  • The memory error handler calls try_to_unmap() for error pages in
    various states. If the error page is an mlocked page, error handling
    could fail with a "still referenced by 1 users" message. This is
    because the page is linked into, and stays in, the lru cache after the
    following call chain:

    try_to_unmap_one
      page_remove_rmap
        clear_page_mlock
          putback_lru_page
            lru_cache_add

    memory_failure() calls shake_page() to handle this kind of issue, but
    the current code doesn't cover this case because shake_page() is
    called only before try_to_unmap(). So this patch adds another
    shake_page() call.
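
    The shape of the added call in the unmap path, simplified from the
    patch:

    mlocked = PageMlocked(hpage);    /* sampled before unmapping */

    ret = try_to_unmap(hpage, ttu);
    ...
    /*
     * try_to_unmap() may have put the mlocked page back on the lru
     * cache (clear_page_mlock() -> putback_lru_page()), so flush it
     * out again before the recovery action checks the refcount.
     */
    if (mlocked)
        shake_page(hpage, 0);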

    Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
    Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
    Link: http://lkml.kernel.org/r/1493197841-23986-3-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: kernel test robot
    Cc: Xiaolong Ye
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • shake_page() is called before going into the core error handling code
    in order to ensure that the error page is flushed from the lru_cache
    lists, where pages stay while being transferred among LRU lists.

    But currently it's not fully functional, because when the page is
    linked into the lru cache by activate_page(), its PageLRU flag is set
    and shake_page() is skipped. The result is that error handling fails
    with a "still referenced by 1 users" message.

    When the page is linked into the lru cache by isolate_lru_page(), its
    PageLRU flag is clear, so that case is fine.

    This patch makes shake_page() be called unconditionally to avoid the
    failure.

    Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
    Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
    Link: http://lkml.kernel.org/r/1493197841-23986-2-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: kernel test robot
    Cc: Xiaolong Ye
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • It helps to provide the page flag description along with the raw value
    in error paths during the soft offline process. From sample
    experiments:

    Before the patch:

    soft offline: 0x6100: migration failed 1, type 3ffff800008018
    soft offline: 0x7400: migration failed 1, type 3ffff800008018

    After the patch:

    soft offline: 0x5900: migration failed 1, type 3ffff800008018 (uptodate|dirty|head)
    soft offline: 0x6c00: migration failed 1, type 3ffff800008018 (uptodate|dirty|head)
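
    The resulting log statement, close to what the patch uses (%pGp is
    the printk format that decodes page flags symbolically):

    pr_info("soft offline: %#lx: migration failed %d, type %lx (%pGp)\n",
            pfn, ret, page->flags, &page->flags);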

    Link: http://lkml.kernel.org/r/20170409023829.10788-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • try_to_unmap() returns SWAP_SUCCESS or SWAP_FAIL so it's suitable for
    boolean return. This patch changes it.

    Link: http://lkml.kernel.org/r/1489555493-14659-8-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "mm: fix some MADV_FREE issues", v5.

    We are trying to use MADV_FREE in jemalloc. Several issues were found.
    Without solving these issues, jemalloc can't use the MADV_FREE feature.

    - It doesn't support systems without swap enabled, because if swap is
    off we can't (or can't efficiently) age anonymous pages, and since
    MADV_FREE pages are mixed with other anonymous pages, we can't
    reclaim MADV_FREE pages. In the current implementation, MADV_FREE
    falls back to MADV_DONTNEED without swap enabled. But in our
    environment, a lot of machines don't enable swap. This prevents
    our setup from using MADV_FREE.

    - It increases memory pressure. Page reclaim biases file page reclaim
    against anonymous pages. This doesn't make sense for MADV_FREE pages,
    because those pages could be freed easily and refilled with a very
    slight penalty. Even if page reclaim doesn't bias file pages, there is
    still an issue, because MADV_FREE pages and other anonymous pages are
    mixed together. To reclaim a MADV_FREE page, we probably must scan a
    lot of other anonymous pages, which is inefficient. In our test, we
    usually see oom with MADV_FREE enabled and nothing without it.

    - Accounting. There are two accounting problems. We don't have global
    accounting: if the system is behaving abnormally, we don't know
    whether the problem comes from the MADV_FREE side. The other problem
    is RSS accounting. MADV_FREE pages are accounted as normal anon pages
    and reclaimed lazily, so the application's RSS becomes bigger. This
    confuses our workloads. We have a monitoring daemon running, and if it
    finds an application's RSS becoming abnormal, the daemon will kill the
    application, even though the kernel can reclaim the memory easily.

    To address the first two issues, we can either put MADV_FREE pages
    into a separate LRU list (Minchan's previous patches and V1 patches),
    or put them into the LRU_INACTIVE_FILE list (suggested by Johannes).
    This patchset uses the second idea. The reason is that the
    LRU_INACTIVE_FILE list is tiny nowadays and should be full of
    used-once file pages, so we can still efficiently reclaim MADV_FREE
    pages there without interfering with other anon and active file pages.
    Putting the pages into the inactive file list also has the advantage
    of allowing page reclaim to prioritize MADV_FREE pages and used-once
    file pages. MADV_FREE pages are put into the LRU list with the
    SwapBacked flag cleared, so PageAnon(page) && !PageSwapBacked(page)
    indicates a MADV_FREE page. These pages are freed directly, without
    pageout, if they are clean; otherwise normal swap reclaims them.
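
    A sketch of the resulting lazy-free test (page_is_lazyfree() is a
    hypothetical helper name; the kernel open-codes this check):

    /* MADV_FREE page: anonymous, but no longer swap-backed */
    static inline bool page_is_lazyfree(struct page *page)
    {
        return PageAnon(page) && !PageSwapBacked(page);
    }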

    For the third issue, the previous post added global accounting and a
    separate RSS count for MADV_FREE pages. The problem is that we never
    get accurate accounting for MADV_FREE pages: the pages are mapped to
    userspace and can be dirtied without notice from the kernel side. To
    get accurate accounting, we could write-protect the page, but then
    there is extra page fault overhead, which people don't want to pay.
    The jemalloc folks have concerns about the inaccurate accounting, so
    this post drops the accounting patches temporarily. The info exported
    to /proc/pid/smaps for MADV_FREE pages is kept, which is the only
    place we can get accurate accounting right now.

    This patch (of 6):

    Johannes pointed out that TTU_LZFREE is unnecessary. It's true, because
    we always have the flag set if we want to do an unmap. For cases where
    we don't do an unmap, the TTU_LZFREE part of the code should never run.

    Also, TTU_UNMAP is unnecessary. If no other flags are set (for example,
    TTU_MIGRATION), an unmap is implied.

    The patch includes Johannes's cleanup and the removal of the dead
    TTU_ACTION macro.

    Link: http://lkml.kernel.org/r/4be3ea1bc56b26fd98a54d0a6f70bec63f6d8980.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

25 Feb, 2017

1 commit

  • Extend the soft offlining framework to support non-LRU pages, which
    already support migration since commit bda807d44454 ("mm: migrate:
    support non-lru movable page migration").

    When corrected memory errors occur on a non-LRU movable page, we can
    choose to stop using it by migrating the data onto another page and
    disabling the original (maybe half-broken) one.
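
    A sketch of the isolation branch this enables in the soft-offline
    path (close to the patch; the series also changes
    isolate_movable_page() to return an int):

    if (PageLRU(page))
        ret = isolate_lru_page(page);
    else
        ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);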

    Link: http://lkml.kernel.org/r/1485867981-16037-4-git-send-email-ysxie@foxmail.com
    Signed-off-by: Yisheng Xie
    Suggested-by: Michal Hocko
    Suggested-by: Minchan Kim
    Reviewed-by: Minchan Kim
    Acked-by: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Hanjun Guo
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Taku Izumi
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     

26 Dec, 2016

1 commit