07 Feb, 2019

1 commit

  • commit 6376360ecbe525a9c17b3d081dfd88ba3e4ed65b upstream.

    Currently memory_failure() is racy against a process's exiting, which
    can result in a kernel crash by NULL pointer dereference.

    The root cause is that memory_failure() uses force_sig() to forcibly
    kill asynchronous (meaning not in the current context) processes. As
    discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for OOM
    fixes, this is not the right thing to do. OOM solves this issue by using
    do_send_sig_info() as done in commit d2d393099de2 ("signal:
    oom_kill_task: use SEND_SIG_FORCED instead of force_sig()"), so this
    patch does the same for hwpoison. do_send_sig_info() properly accesses
    siglock via lock_task_sighand(), so it is free from the reported race.
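
    For reference, a simplified sketch of why do_send_sig_info() is safe
    here (modeled on kernel/signal.c of that era; details vary by kernel
    version):

      int do_send_sig_info(int sig, struct kernel_siginfo *info,
                           struct task_struct *p, enum pid_type type)
      {
              unsigned long flags;
              int ret = -ESRCH;

              /*
               * lock_task_sighand() fails gracefully when the target is
               * already exiting and its sighand is gone, instead of
               * dereferencing a stale pointer as force_sig() could.
               */
              if (lock_task_sighand(p, &flags)) {
                      ret = send_signal(sig, info, p, type);
                      unlock_task_sighand(p, &flags);
              }
              return ret;
      }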

    I confirmed that the reported bug reproduces after inserting some delay
    into kill_procs(), and that it never reproduces with this patch.

    Note that memory_failure() can send another type of signal using
    force_sig_mceerr(), and the reported race shouldn't happen with it
    because force_sig_mceerr() is called only for synchronous processes
    (i.e. BUS_MCEERR_AR happens only when some process accesses the
    corrupted memory).

    Link: http://lkml.kernel.org/r/20190116093046.GA29835@hori1.linux.bs1.fc.nec.co.jp
    Signed-off-by: Naoya Horiguchi
    Reported-by: Jane Chu
    Reviewed-by: Dan Williams
    Reviewed-by: William Kucharski
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     

11 Jul, 2018

1 commit

  • commit 31286a8484a85e8b4e91ddb0f5415aee8a416827 upstream.

    Recently the following BUG was reported:

    Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
    Memory failure: 0x3c0000: recovery action for huge page: Recovered
    BUG: unable to handle kernel paging request at ffff8dfcc0003000
    IP: gup_pgd_range+0x1f0/0xc20
    PGD 17ae72067 P4D 17ae72067 PUD 0
    Oops: 0000 [#1] SMP PTI
    ...
    CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014

    You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
    a 1GB hugepage. This happens because get_user_pages_fast() is not aware
    of the migration entry on the pud that was created by the 1st madvise()
    call.

    I think the conversion to a pud-aligned migration entry is working, but
    other MM code walking over page tables isn't prepared for it. We need
    some time and effort to make all of this work properly, so this patch
    avoids the reported bug by simply disabling error handling for 1GB
    hugepages.
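
    A minimal userspace sketch of that reproducer (a hypothetical test
    program; MADV_HWPOISON requires CAP_SYS_ADMIN and a reserved 1GB
    hugepage pool):

      #define _GNU_SOURCE
      #include <stdio.h>
      #include <sys/mman.h>

      #ifndef MADV_HWPOISON
      #define MADV_HWPOISON 100         /* from <linux/mman.h> */
      #endif
      #ifndef MAP_HUGE_1GB
      #define MAP_HUGE_1GB (30 << 26)   /* 30 == log2(1GB), MAP_HUGE_SHIFT == 26 */
      #endif

      int main(void)
      {
              size_t len = 1UL << 30;
              char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                             MAP_HUGE_1GB, -1, 0);
              if (p == MAP_FAILED) { perror("mmap"); return 1; }
              p[0] = 1;                 /* fault the 1GB hugepage in */
              /* 1st call poisons the page, leaving a pud migration entry */
              if (madvise(p, 4096, MADV_HWPOISON)) perror("madvise #1");
              /* 2nd call (before this fix) walks into that entry and oopses */
              if (madvise(p, 4096, MADV_HWPOISON)) perror("madvise #2");
              return 0;
      }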

    [n-horiguchi@ah.jp.nec.com: v2]
    Link: http://lkml.kernel.org/r/1517284444-18149-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Link: http://lkml.kernel.org/r/1517207283-15769-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Punit Agrawal
    Tested-by: Michael Ellerman
    Cc: Anshuman Khandual
    Cc: "Aneesh Kumar K.V"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     

22 Feb, 2018

1 commit

  • commit fd0e786d9d09024f67bd71ec094b110237dc3840 upstream.

    In the following commit:

    ce0fa3e56ad2 ("x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages")

    ... we added code to memory_failure() to unmap the page from the
    kernel 1:1 virtual address space, to prevent speculative access to the
    page from logging additional errors.

    But memory_failure() may not always succeed in taking the page offline,
    especially if the page belongs to the kernel. This can happen if
    there are too many corrected errors on a page and either mcelog(8)
    or drivers/ras/cec.c asks to take a page offline.

    Since we remove the 1:1 mapping early in memory_failure(), we can
    end up with the page unmapped, but still in use. On the next access
    the kernel crashes :-(

    There are also various debug paths that call memory_failure() to simulate
    occurrence of an error. Since there is no actual error in memory, we
    don't need to map out the page for those cases.

    Revert most of the previous attempt and keep the solution local to
    arch/x86/kernel/cpu/mcheck/mce.c. Unmap the page only when:

    1) there is a real error
    2) memory_failure() succeeds.

    All of this only applies to 64-bit systems. A 32-bit kernel doesn't map
    all of memory into kernel space, and it isn't worth adding code to
    unmap the piece that is mapped, because nobody would run a 32-bit
    kernel on a machine that has recoverable machine checks.

    Signed-off-by: Tony Luck
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave
    Cc: Denys Vlasenko
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Robert (Persistent Memory)
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Cc: stable@vger.kernel.org #v4.14
    Fixes: ce0fa3e56ad2 ("x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages")
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Tony Luck
     

17 Aug, 2017

1 commit

  • Speculative processor accesses may reference any memory that has a
    valid page table entry. While a speculative access won't generate
    a machine check, it will log the error in a machine check bank. That
    could cause escalation of a subsequent error since the overflow bit
    will then be set in the machine check bank status register.

    Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual
    address of the page we want to map out, otherwise we may trigger the
    very problem we are trying to avoid. We use a non-canonical address
    that passes through the usual Linux table walking code to get to the
    same "pte".

    Thanks to Dave Hansen for reviewing several iterations of this.

    Also see:

    http://marc.info/?l=linux-mm&m=149860136413338&w=2

    Signed-off-by: Tony Luck
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: Elliott, Robert (Persistent Memory)
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20170816171803.28342-1-tony.luck@intel.com
    Signed-off-by: Ingo Molnar

    Tony Luck
     

11 Jul, 2017

9 commits

  • new_page is yet another duplication of the migration callback which has
    to handle hugetlb migration specially. We can safely use the generic
    new_page_nodemask for the same purpose.

    Please note that gigantic hugetlb pages do not need any special handling
    because alloc_huge_page_nodemask will make sure to check pages in all
    per-node pools. The reason this was done previously was that
    alloc_huge_page_node treated NUMA_NO_NODE and a specific node
    differently, so alloc_huge_page_node(nid) would only check that
    specific node.
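
    The unified callback then presumably collapses to something like this
    sketch (hypothetical shape; the migration-callback signature of that
    era carried an extra, unused result argument):

      static struct page *new_page(struct page *p, unsigned long private,
                                   int **x)
      {
              /* one generic helper now covers hugetlb, thp and base pages */
              return new_page_nodemask(p, page_to_nid(p),
                                       &node_states[N_MEMORY]);
      }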

    Link: http://lkml.kernel.org/r/20170622193034.28972-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reported-by: Vlastimil Babka
    Reviewed-by: Mike Kravetz
    Tested-by: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Factor out duplicate code into a function.

    Link: http://lkml.kernel.org/r/1496305019-5493-10-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • dequeue_hwpoisoned_huge_page() is no longer used, so let's remove it.

    Link: http://lkml.kernel.org/r/1496305019-5493-9-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to
    keep the error hugepage away from the system, which is OK but not good
    enough because the hugepage still has a refcount and unpoison doesn't
    work on the error hugepage (PageHWPoison flags are cleared but the
    pages are still leaked). And there's the "wasting healthy subpages"
    issue too. This patch reworks me_huge_page() to solve these issues.

    For hugetlb files, we recently gained truncating code, so let's use it
    in the hugetlbfs-specific ->error_remove_page().

    For anonymous hugepages, it's helpful to dissolve the error page after
    freeing it into the free hugepage list. The migration entry and
    PageHWPoison in the head page prevent access to it.

    TODO: dissolve_free_huge_page() can fail, but we haven't considered
    that yet. It's not critical (and at least no worse than now) because in
    such a case the error hugepage just stays in the free hugepage list
    without being dissolved. By virtue of PageHWPoison in the head page,
    it's never allocated to processes.

    [akpm@linux-foundation.org: fix unused var warnings]
    Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
    Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
    Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: kernel test robot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • memory_failure() is a big function and hard to maintain. Handling the
    hugetlb and non-hugetlb cases in a single function is not good, so this
    patch separates the PageHuge() branch into a new function, which saves
    many PageHuge() checks.

    Link: http://lkml.kernel.org/r/1496305019-5493-7-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Now we have code to rescue most of the healthy pages from a hwpoisoned
    hugepage. So let's apply it to soft_offline_free_page too.

    Link: http://lkml.kernel.org/r/1496305019-5493-6-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently a hugepage migrated by soft-offline (i.e. due to correctable
    memory errors) is contained as a hugepage, which means many non-error
    pages in it are unreusable, i.e. wasted.

    This patch solves this issue by dissolving source hugepages into the
    buddy allocator. As done in the previous patch, PageHWPoison is set
    only on the head page of the error hugepage. Then, when dissolving it,
    we move the PageHWPoison flag to the raw error page so that all healthy
    subpages return to the buddy allocator.
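
    The flag move presumably amounts to a few lines in
    dissolve_free_huge_page(), along these lines (illustrative):

      /*
       * Move PageHWPoison from the head page to the raw error page,
       * so every healthy subpage can go back to the buddy allocator.
       */
      if (PageHWPoison(head) && page != head) {
              SetPageHWPoison(page);
              ClearPageHWPoison(head);
      }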

    [arnd@arndb.de: fix warnings: replace some macros with inline functions]
    Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Anshuman Khandual
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • We'd like to narrow down the error region in a memory error on hugetlb
    pages. However, currently we set PageHWPoison flags on all subpages in
    the error hugepage and add the number of subpages to
    num_poisoned_pages, which doesn't fit our purpose.

    So this patch changes the behavior: we only set PageHWPoison on the
    head page and increase num_poisoned_pages only by 1. This is a
    preparation for the narrow-down part, which comes in later patches.

    Link: http://lkml.kernel.org/r/1496305019-5493-4-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Patch series "mm: hwpoison: fixlet for hugetlb migration".

    This patchset updates the hwpoison/hugetlb code to address 2 reported
    issues.

    One is a madvise(MADV_HWPOISON) failure reported by Intel's lkp robot
    (see http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop.) The
    first half was already fixed in mainline, and the other half, about the
    hugetlb cases, is solved in this series.

    The other issue is narrowing down the error-affected region to a single
    4kB page instead of a whole hugetlb page, which was attempted by Anshuman
    (http://lkml.kernel.org/r/20170420110627.12307-1-khandual@linux.vnet.ibm.com)
    and which I updated to apply more widely.

    This patch (of 9):

    We no longer use MIGRATE_ISOLATE to prevent reuse of hwpoison hugepages
    as we did before, so the current dequeue_huge_page_node() doesn't work
    as intended because it still uses is_migrate_isolate_page() for this
    check. This patch fixes it by checking the PageHWPoison flag instead,
    as sketched below.
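
    A sketch of the fixed check in dequeue_huge_page_node() (mm/hugetlb.c;
    illustrative):

      list_for_each_entry(page, &h->hugepage_freelists[nid], lru)
              /* was: if (!is_migrate_isolate_page(page)) */
              if (!PageHWPoison(page))
                      break;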

    Link: http://lkml.kernel.org/r/1496305019-5493-2-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

08 Jul, 2017

1 commit

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may not even be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs, that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can go in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
     

07 Jul, 2017

1 commit

  • Though migrating gigantic HugeTLB pages does not sound much like a
    real-world use case, they can be affected by memory errors. Hence
    migration of PGD-level HugeTLB pages should be supported, just to
    enable soft and hard offline use cases.

    While allocating the new gigantic HugeTLB page, it should not matter
    whether the new page comes from the same node or not. There would be
    very few gigantic pages on the system after all; we should not be
    bothered about node locality when trying to save a big page from
    crashing.

    This change renames dequeue_huge_page_node() to
    dequeue_huge_page_node_exact(), preserving its original functionality.
    The new dequeue_huge_page_node() scans through all available online
    nodes to allocate a huge page for the NUMA_NO_NODE case and just falls
    back to calling dequeue_huge_page_node_exact() for all other cases.

    [arnd@arndb.de: make hstate_is_gigantic() inline]
    Link: http://lkml.kernel.org/r/20170522124748.3911296-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170516100509.20122-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Signed-off-by: Arnd Bergmann
    Cc: "Aneesh Kumar K.V"
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

06 Jul, 2017

1 commit

  • The error code should be negative. Since this ends up in the default case
    anyway, this is harmless, but it's less confusing to negate it. Also,
    later patches will require a negative error code here.
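
    The change itself is presumably the one-liner in me_pagecache_dirty()
    (mm/memory-failure.c):

      -       mapping_set_error(mapping, EIO);
      +       mapping_set_error(mapping, -EIO);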

    Link: http://lkml.kernel.org/r/20170525103355.6760-1-jlayton@redhat.com
    Signed-off-by: Jeff Layton
    Reviewed-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Andrew Morton

    Jeff Layton
     

17 Jun, 2017

1 commit

  • memory_failure() chooses a recovery action function based on the page
    flags. For huge pages, it uses the tail page's flags, which don't have
    anything interesting set, resulting in:

    > Memory failure: 0x9be3b4: Unknown page state
    > Memory failure: 0x9be3b4: recovery action for unknown page: Failed

    Instead, save a copy of the head page's flags if this is a huge page;
    this means that if there are no relevant flags for this tail page, we
    use the head page's flags instead. This results in the me_huge_page()
    recovery action being called:

    > Memory failure: 0x9b7969: recovery action for huge page: Delayed

    For hugepages that have not yet been allocated, this allows the hugepage
    to be dequeued.
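
    A sketch of the fix as described (illustrative; in memory_failure(),
    where hpage is the compound head of p):

      /* for huge pages, the interesting flags live on the head page */
      if (PageHuge(p))
              page_flags = hpage->flags;
      else
              page_flags = p->flags;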

    Fixes: 524fca1e7356 ("HWPOISON: fix misjudgement of page_action() for errors on mlocked pages")
    Link: http://lkml.kernel.org/r/20170524130204.21845-1-james.morse@arm.com
    Signed-off-by: James Morse
    Tested-by: Punit Agrawal
    Acked-by: Punit Agrawal
    Acked-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morse
     

03 Jun, 2017

1 commit

  • On failing to migrate a page, soft_offline_huge_page() performs the
    necessary update to the hugepage ref-count.

    But when !hugepage_migration_supported(), unmap_and_move_huge_page()
    also decrements the page ref-count for the hugepage. The combined
    behaviour leaves the ref-count in an inconsistent state.

    This leads to soft lockups when running the overcommitted hugepage test
    from mce-tests suite.

    Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000
    soft offline: 0x83ed600: migration failed 1, type 1fffc00000008008 (uptodate|head)
    INFO: rcu_preempt detected stalls on CPUs/tasks:
    Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715
    (detected by 7, t=5254 jiffies, g=963, c=962, q=321)
    thugetlb_overco R running task 0 2715 2685 0x00000008
    Call trace:
    dump_backtrace+0x0/0x268
    show_stack+0x24/0x30
    sched_show_task+0x134/0x180
    rcu_print_detail_task_stall_rnp+0x54/0x7c
    rcu_check_callbacks+0xa74/0xb08
    update_process_times+0x34/0x60
    tick_sched_handle.isra.7+0x38/0x70
    tick_sched_timer+0x4c/0x98
    __hrtimer_run_queues+0xc0/0x300
    hrtimer_interrupt+0xac/0x228
    arch_timer_handler_phys+0x3c/0x50
    handle_percpu_devid_irq+0x8c/0x290
    generic_handle_irq+0x34/0x50
    __handle_domain_irq+0x68/0xc0
    gic_handle_irq+0x5c/0xb0

    Address this by changing the putback_active_hugepage() in
    soft_offline_huge_page() to putback_movable_pages().

    This only triggers on systems that enable memory failure handling
    (ARCH_SUPPORTS_MEMORY_FAILURE) but not hugepage migration
    (!ARCH_ENABLE_HUGEPAGE_MIGRATION).

    I imagine this wasn't triggered as there aren't many systems running
    this configuration.

    [akpm@linux-foundation.org: remove dead comment, per Naoya]
    Link: http://lkml.kernel.org/r/20170525135146.32011-1-punit.agrawal@arm.com
    Reported-by: Manoj Iyer
    Tested-by: Manoj Iyer
    Suggested-by: Naoya Horiguchi
    Signed-off-by: Punit Agrawal
    Cc: Joonsoo Kim
    Cc: Wanpeng Li
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: [3.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
     

13 May, 2017

1 commit

  • Laurent Dufour has noticed that hwpoisoned pages are kept charged. In
    his particular case he hit a bad_page("page still charged to
    cgroup") when onlining a hwpoison page. While this looks like something
    that shouldn't happen in the first place, because onlining hwpoison
    pages and returning them to the page allocator makes little sense, it
    shows a real problem.

    hwpoison pages do not get freed usually so we do not uncharge them (at
    least not since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge
    API")). Each charge pins memcg (since e8ea14cc6ead ("mm: memcontrol:
    take a css reference for each charged page")) as well and so the
    mem_cgroup and the associated state will never go away. Fix this leak
    by forcibly uncharging a LRU hwpoisoned page in delete_from_lru_cache().
    We also have to tweak uncharge_list because it cannot rely on zero ref
    count for these pages.

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API")
    Link: http://lkml.kernel.org/r/20170502185507.GB19165@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Laurent Dufour
    Tested-by: Laurent Dufour
    Reviewed-by: Balbir Singh
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 May, 2017

5 commits

  • The memory error handler calls try_to_unmap() for error pages in
    various states. If the error page is an mlocked page, error handling
    could fail with a "still referenced by 1 users" message. This is
    because the page is linked to, and stays in, the lru cache after the
    following call chain:

    try_to_unmap_one
      page_remove_rmap
        clear_page_mlock
          putback_lru_page
            lru_cache_add

    memory_failure() calls shake_page() to handle a similar issue, but the
    current code doesn't cover this case because shake_page() is called
    only before try_to_unmap(). So this patch adds another shake_page()
    call, as sketched below.
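
    A sketch of the added call in hwpoison_user_mappings() (illustrative):

      bool mlocked = PageMlocked(hpage);

      /* ... try_to_unmap() runs here ... */

      /*
       * try_to_unmap() may put an mlocked page back onto the lru cache
       * via clear_page_mlock(), so shake it out once more after the
       * unmap, not only before it.
       */
      if (mlocked)
              shake_page(hpage, 0);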

    Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
    Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
    Link: http://lkml.kernel.org/r/1493197841-23986-3-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: kernel test robot
    Cc: Xiaolong Ye
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • shake_page() is called before going into core error handling code in
    order to ensure that the error page is flushed from lru_cache lists
    where pages stay during transferring among LRU lists.

    But currently it's not fully functional: when the page is linked into
    the lru_cache by calling activate_page(), its PageLRU flag is set and
    shake_page() is skipped. The result is that error handling fails with a
    "still referenced by 1 users" message.

    When the page is linked to the lru_cache by isolate_lru_page(), its
    PageLRU flag is clear, so that's fine.

    This patch makes shake_page() be called unconditionally to avoid the
    failure.

    Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
    Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
    Link: http://lkml.kernel.org/r/1493197841-23986-2-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: kernel test robot
    Cc: Xiaolong Ye
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • It helps to provide the page flag description along with the raw value
    in error paths during the soft offline process. From sample
    experiments:

    Before the patch:

    soft offline: 0x6100: migration failed 1, type 3ffff800008018
    soft offline: 0x7400: migration failed 1, type 3ffff800008018

    After the patch:

    soft offline: 0x5900: migration failed 1, type 3ffff800008018 (uptodate|dirty|head)
    soft offline: 0x6c00: migration failed 1, type 3ffff800008018 (uptodate|dirty|head)
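
    The decoded flags are presumably printed with the %pGp printk
    extension, e.g. (illustrative):

      pr_info("soft offline: %#lx: migration failed %d, type %lx (%pGp)\n",
              pfn, ret, page->flags, &page->flags);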

    Link: http://lkml.kernel.org/r/20170409023829.10788-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • try_to_unmap() returns only SWAP_SUCCESS or SWAP_FAIL, so it's suitable
    for a boolean return. This patch changes it accordingly.

    Link: http://lkml.kernel.org/r/1489555493-14659-8-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "mm: fix some MADV_FREE issues", v5.

    We are trying to use MADV_FREE in jemalloc. Several issues are found.
    Without solving the issues, jemalloc can't use the MADV_FREE feature.

    - Doesn't support systems without swap enabled. If swap is off, we
    can't (or can't efficiently) age anonymous pages. And since
    MADV_FREE pages are mixed with other anonymous pages, we can't
    reclaim MADV_FREE pages. In the current implementation, MADV_FREE
    falls back to MADV_DONTNEED without swap enabled. But in our
    environment, a lot of machines don't enable swap. This prevents
    our setup from using MADV_FREE.

    - Increases memory pressure. Page reclaim biases file page reclaim
    against anonymous pages. This doesn't make sense for MADV_FREE pages,
    because those pages could be freed easily and refilled with very
    slight penalty. Even if page reclaim doesn't bias file pages, there
    is still an issue, because MADV_FREE pages and other anonymous pages
    are mixed together. To reclaim a MADV_FREE page, we probably must
    scan a lot of other anonymous pages, which is inefficient. In our
    tests, we usually see OOM with MADV_FREE enabled and nothing without
    it.

    - Accounting. There are two accounting problems. We don't have global
    accounting: if the system is abnormal, we don't know if it's a
    problem from the MADV_FREE side. The other problem is RSS accounting.
    MADV_FREE pages are accounted as normal anon pages and reclaimed
    lazily, so the application's RSS becomes bigger. This confuses our
    workloads. We have a monitoring daemon running, and if it finds an
    application's RSS becoming abnormal, the daemon will kill the
    application even though the kernel can reclaim the memory easily.

    To address the first two issues, we can either put MADV_FREE pages
    into a separate LRU list (Minchan's previous patches and the V1
    patches), or put them into the LRU_INACTIVE_FILE list (suggested by
    Johannes). The patchset uses the second idea. The reason is that the
    LRU_INACTIVE_FILE list is tiny nowadays and should be full of used-once
    file pages, so we can still efficiently reclaim MADV_FREE pages there
    without interference with other anon and active file pages. Putting
    the pages into the inactive file list also has the advantage of
    allowing page reclaim to prioritize MADV_FREE pages and used-once file
    pages. MADV_FREE pages are put onto the lru list with the SwapBacked
    flag cleared, so PageAnon(page) && !PageSwapBacked(page) indicates a
    MADV_FREE page. These pages are freed directly without pageout if they
    are clean; otherwise normal swap reclaims them.
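
    In other words, the distinguishing predicate is simply the following
    (page_is_lazyfree() is a hypothetical helper name, for illustration):

      static inline bool page_is_lazyfree(struct page *page)
      {
              /* MADV_FREE pages: anonymous but no longer swap-backed */
              return PageAnon(page) && !PageSwapBacked(page);
      }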

    For the third issue, the previous post added global accounting and a
    separate RSS count for MADV_FREE pages. The problem is that we never
    get accurate accounting for MADV_FREE pages: the pages are mapped into
    userspace and can be dirtied without notice from the kernel side. To
    get accurate accounting, we could write-protect the page, but then
    there is extra page fault overhead, which people don't want to pay.
    The jemalloc folks have concerns about the inaccurate accounting, so
    this post drops the accounting patches temporarily. The info exported
    to /proc/pid/smaps for MADV_FREE pages is kept, which is the only place
    we can get accurate accounting right now.

    This patch (of 6):

    Johannes pointed out that TTU_LZFREE is unnecessary. It's true because
    we always have the flag set if we want to do an unmap. For cases where
    we don't do an unmap, the TTU_LZFREE part of the code should never run.

    TTU_UNMAP is also unnecessary: if no other flag is set (for example,
    TTU_MIGRATION), an unmap is implied.

    The patch includes Johannes's cleanup and the removal of the dead
    TTU_ACTION macro.

    Link: http://lkml.kernel.org/r/4be3ea1bc56b26fd98a54d0a6f70bec63f6d8980.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

25 Feb, 2017

1 commit

  • Extend the soft offlining framework to support non-LRU pages, which
    already support migration after commit bda807d44454 ("mm: migrate:
    support non-lru movable page migration").

    When corrected memory errors occur on a non-LRU movable page, we can
    choose to stop using it by migrating the data onto another page and
    disabling the original (maybe half-broken) one.
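
    The isolation path in soft offline presumably grows a branch along
    these lines (a sketch; the exact checks differ in the patch):

      if (PageLRU(page))
              ret = isolate_lru_page(page);
      else
              /* non-lru movable pages are isolated through their
               * address_space's ->isolate_page() callback */
              ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);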

    Link: http://lkml.kernel.org/r/1485867981-16037-4-git-send-email-ysxie@foxmail.com
    Signed-off-by: Yisheng Xie
    Suggested-by: Michal Hocko
    Suggested-by: Minchan Kim
    Reviewed-by: Minchan Kim
    Acked-by: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Hanjun Guo
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Taku Izumi
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     

12 Nov, 2016

1 commit

  • When memory_failure() runs on a thp tail page after pmd is split, we
    trigger the following VM_BUG_ON_PAGE():

    page:ffffd7cd819b0040 count:0 mapcount:0 mapping: (null) index:0x1
    flags: 0x1fffc000400000(hwpoison)
    page dumped because: VM_BUG_ON_PAGE(!page_count(p))
    ------------[ cut here ]------------
    kernel BUG at /src/linux-dev/mm/memory-failure.c:1132!

    memory_failure() passed the refcount and page lock from the tail page
    to the head page, which is not needed because we can pass any subpage
    to split_huge_page().

    Fixes: 61f5d698cc97 ("mm: re-enable THP")
    Link: http://lkml.kernel.org/r/1477961577-7183-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

29 Jul, 2016

2 commits

  • dequeue_hwpoisoned_huge_page() can be called without holding the page
    lock, so let's remove the incorrect comment.

    The reason the page lock is not really needed is that
    dequeue_hwpoisoned_huge_page() checks page_huge_active() inside
    hugetlb_lock, which allows us to avoid trying to dequeue a hugepage
    that has just been allocated but not yet linked to the active list,
    even without taking the page lock.

    Link: http://lkml.kernel.org/r/20160720092901.GA15995@www9186uo.sakura.ne.jp
    Signed-off-by: Naoya Horiguchi
    Reported-by: Zhan Chen
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    node level. Most reclaim logic is based on the node counters, but the
    retry logic uses the zone counters, which do not distinguish inactive
    and active sizes. It would be possible to keep the LRU counters on a
    per-zone basis, but that is a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU then there are two potential solutions

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages would get continually scanned. The idea would be that
    lowmem keeps those pages on a separate list until a reclaim for highmem
    pages arrives that splices the highmem pages back onto the LRU. It
    could potentially be implemented similarly to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab
    pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

21 May, 2016

1 commit

  • HWPoison was specific to some particular x86 platforms, and it is
    often seen as a high-level machine check handler; therefore, 'MCE' was
    used as the printk() format prefix. However, 'PowerNV' has also used
    HWPoison for handling memory errors[1], so 'MCE' is no longer a
    suitable prefix for memory_failure.c.

    Additionally, 'MCE' and 'Memory failure' have different contexts: the
    former belongs to exception context and the latter to process context.
    Furthermore, HWPoison can also be used for off-lining sub-healthy
    pages that do not trigger any machine check exception.

    This patch aims to replace 'MCE' with a more appropriate prefix.

    [1] commit 75eb3d9b60c2 ("powerpc/powernv: Get FSP memory errors
    and plumb into memory poison infrastructure.")
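
    Presumably the new prefix is set once per file via pr_fmt, e.g. at the
    top of mm/memory-failure.c:

      #define pr_fmt(fmt) "Memory failure: " fmt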

    Signed-off-by: Chen Yucong
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     

05 Apr, 2016

1 commit

  • The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And it is unlikely to.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion whether a
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files; I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

18 Mar, 2016

1 commit

  • Most of the mm subsystem uses pr_<level>, so make it consistent.

    Miscellanea:

    - Realign arguments
    - Add missing newline to format
    - kmemleak-test.c has a "kmemleak: " prefix added to the
    "Kmemleak testing" logging message via pr_fmt

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

16 Jan, 2016

4 commits

  • Currently memory_failure() doesn't handle the non-anonymous thp case,
    because we can hardly expect the error handling to be successful, and
    it can just hit some corner case which results in a BUG_ON or something
    equally severe. This is also the case for the soft offline code, so
    let's make it behave the same way.

    The original code has an MF_COUNT_INCREASED check before
    put_hwpoison_page(), but it's unnecessary because get_any_page() has
    already been called when this code runs, and it takes a refcount on the
    target page regardless of the flag. So this patch also removes the
    check.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • soft_offline_page() has some deeply indented code, which is a sign
    that it needs cleanup. So let's do that. No functional change.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Some mm-related BUG_ON()s could trigger from hwpoison code due to
    recent changes in the thp refcounting rules. This patch fixes them up.

    In the new refcounting, we no longer use tail->_mapcount to keep the
    tail's refcount, and thereby we can simplify get/put_hwpoison_page().

    Another change is that the tail's refcount is no longer transferred to
    the raw page during thp split (more precisely, under the new rules we
    don't take a refcount on the tail page any more). So when we need a
    thp split, we have to transfer the refcount properly to the 4kB
    soft-offlined page before migration.

    The thp split code goes into the core code only when the precheck
    (total_mapcount(head) == page_count(head) - 1) passes, to avoid a
    useless split; here we assume that one refcount is held by the caller
    of the thp split and the others are taken via mappings. To meet this
    assumption, this patch moves the thp split part of soft_offline_page()
    after get_any_page().
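
    The precheck, written out as a sketch (thp_split_precheck() is a
    hypothetical helper name, for illustration):

      static inline bool thp_split_precheck(struct page *head)
      {
              /* one reference is held by the thp-split caller; all others
               * must come from mappings, otherwise splitting is useless */
              return total_mapcount(head) == page_count(head) - 1;
      }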

    [akpm@linux-foundation.org: remove unneeded #define, per Kirill]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • I saw the following BUG_ON triggered in a testcase where a process
    calls madvise(MADV_SOFT_OFFLINE) on thps, along with a background
    process that runs the migratepages command repeatedly (ping-ponging
    among different NUMA nodes) for the first process:

    Soft offlining page 0x60000 at 0x700000600000
    __get_any_page: 0x60000 free buddy page
    page:ffffea0001800000 count:0 mapcount:-127 mapping: (null) index:0x1
    flags: 0x1fffc0000000000()
    page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
    ------------[ cut here ]------------
    kernel BUG at /src/linux-dev/include/linux/mm.h:342!
    invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    Modules linked in: cfg80211 rfkill crc32c_intel serio_raw virtio_balloon i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi
    CPU: 3 PID: 3035 Comm: test_alloc_gene Tainted: G O 4.4.0-rc8-v4.4-rc8-160107-1501-00000-rc8+ #74
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff88007c63d5c0 ti: ffff88007c210000 task.ti: ffff88007c210000
    RIP: 0010:[] [] put_page+0x5c/0x60
    RSP: 0018:ffff88007c213e00 EFLAGS: 00010246
    Call Trace:
    put_hwpoison_page+0x4e/0x80
    soft_offline_page+0x501/0x520
    SyS_madvise+0x6bc/0x6f0
    entry_SYSCALL_64_fastpath+0x12/0x6a
    Code: 8b fc ff ff 5b 5d c3 48 89 df e8 b0 fa ff ff 48 89 df 31 f6 e8 c6 7d ff ff 5b 5d c3 48 c7 c6 08 54 a2 81 48 89 df e8 a4 c5 01 00 0b 66 90 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 8b 47
    RIP [] put_page+0x5c/0x60
    RSP

    The root cause resides in get_any_page(), which retries getting a
    refcount on the page to be soft-offlined. This function calls
    put_hwpoison_page(), expecting that the target page has been put back
    on the LRU list. But it can also be freed to the buddy allocator, so
    the second check needs to handle that case as well.

    Fixes: af8fae7c0886 ("mm/memory-failure.c: clean up soft_offline_page()")
    Signed-off-by: Naoya Horiguchi
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: [3.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi