19 Oct, 2019

1 commit

  • We should check for pfn_to_online_page() to not access uninitialized
    memmaps. Reshuffle the code so we don't have to duplicate the error
    message.

    Link: http://lkml.kernel.org/r/20191009142435.3975-3-david@redhat.com
    Signed-off-by: David Hildenbrand
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
    Acked-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: [4.13+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

15 Oct, 2019

1 commit

  • Mmap /dev/dax more than once, then read the poison location using the
    address from one of the mappings. Because the other mappings do not have
    the page mapped in, SIGKILLs are delivered to the process.
    SIGKILL takes precedence over SIGBUS, so the user process loses the
    opportunity to handle the UE.

    Although one may add MAP_POPULATE to mmap(2) to work around the issue,
    MAP_POPULATE makes mapping 128GB of pmem several orders of magnitude
    slower, so it isn't always an option.

    Details -

    ndctl inject-error --block=10 --count=1 namespace6.0

    ./read_poison -x dax6.0 -o 5120 -m 2
    mmaped address 0x7f5bb6600000
    mmaped address 0x7f3cf3600000
    doing local read at address 0x7f3cf3601400
    Killed

    Console messages in instrumented kernel -

    mce: Uncorrected hardware memory error in user-access at edbe201400
    Memory failure: tk->addr = 7f5bb6601000
    Memory failure: address edbe201: call dev_pagemap_mapping_shift
    dev_pagemap_mapping_shift: page edbe201: no PUD
    Memory failure: tk->size_shift == 0
    Memory failure: Unable to find user space address edbe201 in read_poison
    Memory failure: tk->addr = 7f3cf3601000
    Memory failure: address edbe201: call dev_pagemap_mapping_shift
    Memory failure: tk->size_shift = 21
    Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of failure to unmap corrupted page
    => to deliver SIGKILL
    Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory corruption
    => to deliver SIGBUS
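
    For reference, the multiple-mapping scenario above can be reproduced with a
    small program roughly like the sketch below. It is illustrative only: the
    device path, mapping size and poison offset are hypothetical, and an error
    has to be injected first (as with ndctl above) to see the SIGBUS vs SIGKILL
    behaviour.

    /* Illustrative sketch only: device path, mapping size and poison
     * offset are made up; a real /dev/dax device with an injected error
     * is needed to observe the behaviour described above. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            size_t len = 2UL << 20;            /* one 2MB device-dax mapping */
            size_t poison_off = 5120;          /* hypothetical poisoned offset */
            int fd = open("/dev/dax6.0", O_RDWR);
            char *m1, *m2;

            if (fd < 0) { perror("open"); return 1; }

            /* map the device twice; only the second mapping gets touched */
            m1 = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            m2 = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (m1 == MAP_FAILED || m2 == MAP_FAILED) { perror("mmap"); return 1; }

            printf("mmaped address %p\n", (void *)m1);
            printf("mmaped address %p\n", (void *)m2);

            /* reading the poison through m2 leaves m1 unpopulated; before
             * the fix that unpopulated mapping led to SIGKILL over SIGBUS */
            printf("doing local read at address %p\n", (void *)(m2 + poison_off));
            printf("read back: %d\n", (int)m2[poison_off]);

            munmap(m1, len);
            munmap(m2, len);
            close(fd);
            return 0;
    }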

    Link: http://lkml.kernel.org/r/1565112345-28754-3-git-send-email-jane.chu@oracle.com
    Signed-off-by: Jane Chu
    Suggested-by: Naoya Horiguchi
    Reviewed-by: Dan Williams
    Acked-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jane Chu
     

15 Jul, 2019

1 commit

  • Pull HMM updates from Jason Gunthorpe:
    "Improvements and bug fixes for the hmm interface in the kernel:

    - Improve clarity, locking and APIs related to the 'hmm mirror'
    feature merged last cycle. In linux-next we now see AMDGPU and
    nouveau using this API.

    - Remove old or transitional hmm APIs. These are holdovers from the
    past with no users, or APIs that existed only to manage cross-tree
    conflicts. There are still a few more of these cleanups that didn't
    make the merge window cutoff.

    - Improve some core mm APIs:
    - export alloc_pages_vma() for driver use
    - refactor into devm_request_free_mem_region() to manage
    DEVICE_PRIVATE resource reservations
    - refactor duplicative driver code into the core dev_pagemap
    struct

    - Remove hmm wrappers of improved core mm APIs, instead have drivers
    use the simplified API directly

    - Remove DEVICE_PUBLIC

    - Simplify the kconfig flow for the hmm users and core code"
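
    To make the reworked interfaces above concrete, a driver that hosts
    DEVICE_PRIVATE memory might end up looking roughly like the sketch below.
    This is an illustration only: struct my_dev, my_probe() and the callback
    bodies are hypothetical, and the field/callback names follow the shortlog
    below (struct dev_pagemap_ops, migrate_to_ram, the optional internal
    refcount, devm_request_free_mem_region), so details may differ slightly
    between kernel versions.

    /* Hypothetical driver sketch of the post-series flow; not from any
     * real driver. */
    static vm_fault_t my_migrate_to_ram(struct vm_fault *vmf)
    {
            /* migrate the faulting device-private page back to system RAM */
            return 0;
    }

    static void my_page_free(struct page *page)
    {
            /* release the device backing for this page */
    }

    static const struct dev_pagemap_ops my_pgmap_ops = {
            .page_free      = my_page_free,
            .migrate_to_ram = my_migrate_to_ram,
    };

    static int my_probe(struct device *dev, struct my_dev *mdev)
    {
            struct resource *res;
            void *addr;

            /* reserve an unused physical range to host the struct pages */
            res = devm_request_free_mem_region(dev, &iomem_resource, SZ_1G);
            if (IS_ERR(res))
                    return PTR_ERR(res);

            mdev->pgmap.type = MEMORY_DEVICE_PRIVATE;
            mdev->pgmap.res  = *res;
            mdev->pgmap.ops  = &my_pgmap_ops;
            /* leaving pgmap.ref NULL opts in to the new internal refcount */

            addr = devm_memremap_pages(dev, &mdev->pgmap);
            return PTR_ERR_OR_ZERO(addr);
    }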

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
    mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
    mm: remove the HMM config option
    mm: sort out the DEVICE_PRIVATE Kconfig mess
    mm: simplify ZONE_DEVICE page private data
    mm: remove hmm_devmem_add
    mm: remove hmm_vma_alloc_locked_page
    nouveau: use devm_memremap_pages directly
    nouveau: use alloc_page_vma directly
    PCI/P2PDMA: use the dev_pagemap internal refcount
    device-dax: use the dev_pagemap internal refcount
    memremap: provide an optional internal refcount in struct dev_pagemap
    memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
    memremap: remove the data field in struct dev_pagemap
    memremap: add a migrate_to_ram method to struct dev_pagemap_ops
    memremap: lift the devmap_enable manipulation into devm_memremap_pages
    memremap: pass a struct dev_pagemap to ->kill and ->cleanup
    memremap: move dev_pagemap callbacks into a separate structure
    memremap: validate the pagemap type passed to devm_memremap_pages
    mm: factor out a devm_request_free_mem_region helper
    mm: export alloc_pages_vma
    ...

    Linus Torvalds
     

13 Jul, 2019

1 commit

  • Some users who install a SIGBUS handler that does a longjmp out, thereby
    keeping the process alive, are confused by the error message

    "[188988.765862] Memory failure: 0x1840200: Killing cellsrv:33395 due to hardware memory corruption"

    Slightly modify the error message to improve clarity.
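
    For context, the kind of handler the message is aimed at looks roughly like
    the sketch below (illustrative only; actually triggering it requires a real
    or injected memory error):

    #define _GNU_SOURCE
    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>

    static sigjmp_buf env;

    static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
    {
            (void)sig; (void)ctx;
            /* BUS_MCEERR_AR/BUS_MCEERR_AO identify memory-error SIGBUS */
            if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO)
                    siglongjmp(env, 1);     /* jump out, keep the process alive */
    }

    int main(void)
    {
            struct sigaction sa;

            memset(&sa, 0, sizeof(sa));
            sa.sa_sigaction = sigbus_handler;
            sa.sa_flags = SA_SIGINFO;
            sigaction(SIGBUS, &sa, NULL);

            if (sigsetjmp(env, 1)) {
                    printf("recovered from memory error, still alive\n");
                    return 0;
            }

            /* ... access memory that may be hardware-poisoned ... */
            return 0;
    }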

    Link: http://lkml.kernel.org/r/1558403523-22079-1-git-send-email-jane.chu@oracle.com
    Signed-off-by: Jane Chu
    Acked-by: Naoya Horiguchi
    Acked-by: Pankaj Gupta
    Reviewed-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jane Chu
     

09 Jul, 2019

1 commit

  • …iederm/user-namespace

    Pull force_sig() argument change from Eric Biederman:
    "A source of error over the years has been that force_sig has taken a
    task parameter when it is only safe to use force_sig with the current
    task.

    The force_sig function is built for delivering synchronous signals
    such as SIGSEGV where the userspace application caused a synchronous
    fault (such as a page fault) and the kernel responded with a signal.

    Because the name force_sig does not make this clear, and because
    force_sig takes a task parameter, the function has been abused for
    sending other kinds of signals over the years. Slowly those have
    been fixed as the oopses have been tracked down.

    This set of changes fixes the remaining abusers of force_sig and
    carefully rips out the task parameter from force_sig and friends
    making this kind of error almost impossible in the future"

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
    signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
    signal: Remove the signal number and task parameters from force_sig_info
    signal: Factor force_sig_info_to_task out of force_sig_info
    signal: Generate the siginfo in force_sig
    signal: Move the computation of force into send_signal and correct it.
    signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
    signal: Remove the task parameter from force_sig_fault
    signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
    signal: Explicitly call force_sig_fault on current
    signal/unicore32: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from ptrace_break
    signal/nds32: Remove tsk parameter from send_sigtrap
    signal/riscv: Remove tsk parameter from do_trap
    signal/sh: Remove tsk parameter from force_sig_info_fault
    signal/um: Remove task parameter from send_sigtrap
    signal/x86: Remove task parameter from send_sigtrap
    signal: Remove task parameter from force_sig_mceerr
    signal: Remove task parameter from force_sig
    signal: Remove task parameter from force_sigsegv
    ...

    Linus Torvalds
     

03 Jul, 2019

1 commit

  • The code hasn't been used since it was added to the tree, and doesn't
    appear to actually be usable.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Acked-by: Michal Hocko
    Reviewed-by: Dan Williams
    Tested-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

29 Jun, 2019

2 commits

  • madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline
    for hugepages with overcommitting enabled. That is caused by suboptimal
    handling in the current soft-offline code. See the following part:
    ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
                        MIGRATE_SYNC, MR_MEMORY_FAILURE);
    if (ret) {
            ...
    } else {
            /*
             * We set PG_hwpoison only when the migration source hugepage
             * was successfully dissolved, because otherwise hwpoisoned
             * hugepage remains on free hugepage list, then userspace will
             * find it as SIGBUS by allocation failure. That's not expected
             * in soft-offlining.
             */
            ret = dissolve_free_huge_page(page);
            if (!ret) {
                    if (set_hwpoison_free_buddy_page(page))
                            num_poisoned_pages_inc();
            }
    }
    return ret;

    Here dissolve_free_huge_page() returns -EBUSY if the migration source page
    was freed into buddy in migrate_pages(), but even in that case we actually
    have a chance that set_hwpoison_free_buddy_page() succeeds. So that means
    the current code gives up offlining too early.

    dissolve_free_huge_page() checks that a given hugepage is suitable for
    dissolving, where we should return success for !PageHuge() case because
    the given hugepage is considered as already dissolved.

    This change also affects other callers of dissolve_free_huge_page(), which
    are cleaned up together.
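
    For reference, the madvise(MADV_SOFT_OFFLINE) call being discussed can be
    exercised from userspace with a sketch like the one below (illustrative
    only; it needs CAP_SYS_ADMIN, CONFIG_MEMORY_FAILURE and a reserved 2MB
    hugepage, and the -EBUSY behaviour additionally depends on hugepage
    overcommit as described above):

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_SOFT_OFFLINE
    #define MADV_SOFT_OFFLINE 101           /* from asm-generic/mman-common.h */
    #endif

    int main(void)
    {
            size_t len = 2UL << 20;         /* one 2MB hugepage */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

            if (p == MAP_FAILED) { perror("mmap"); return 1; }
            p[0] = 1;                       /* fault the hugepage in */

            if (madvise(p, len, MADV_SOFT_OFFLINE))
                    /* before this fix, -EBUSY was commonly seen here with
                     * overcommitted hugepages even though containment via
                     * set_hwpoison_free_buddy_page() could still succeed */
                    fprintf(stderr, "soft offline: %s\n", strerror(errno));
            else
                    printf("soft offline succeeded\n");

            munmap(p, len);
            return 0;
    }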

    [n-horiguchi@ah.jp.nec.com: v3]
    Link: http://lkml.kernel.org/r/1560761476-4651-3-git-send-email-n-horiguchi@ah.jp.nec.com
    Link: http://lkml.kernel.org/r/1560154686-18497-3-git-send-email-n-horiguchi@ah.jp.nec.com
    Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
    Signed-off-by: Naoya Horiguchi
    Reported-by: Chen, Jerry T
    Tested-by: Chen, Jerry T
    Reviewed-by: Mike Kravetz
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Xishi Qiu
    Cc: "Chen, Jerry T"
    Cc: "Zhuo, Qiuxu"
    Cc: [4.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The pass/fail of soft offline should be judged by checking whether the
    raw error page was finally contained or not (i.e. the result of
    set_hwpoison_free_buddy_page()), but the current code does not work like
    that. It might lead us to misjudge the test result when
    set_hwpoison_free_buddy_page() fails.

    Without this fix, there are cases where madvise(MADV_SOFT_OFFLINE) may
    not offline the original page and will not return an error.

    Link: http://lkml.kernel.org/r/1560154686-18497-2-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
    Reviewed-by: Mike Kravetz
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Xishi Qiu
    Cc: "Chen, Jerry T"
    Cc: "Zhuo, Qiuxu"
    Cc: [4.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this software may be redistributed and or modified under the terms
    of the gnu general public license gpl version 2 only as published by
    the free software foundation

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 1 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Richard Fontana
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190529141333.676969322@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

06 Mar, 2019

1 commit

  • When soft_offline_in_use_page() runs on a thp tail page after pmd is
    split, we trigger the following VM_BUG_ON_PAGE():

    Memory failure: 0x3755ff: non anonymous thp
    __get_any_page: 0x3755ff: unknown zero refcount page type 2fffff80000000
    Soft offlining pfn 0x34d805 at process virtual address 0x20fff000
    page:ffffea000d360140 count:0 mapcount:0 mapping:0000000000000000 index:0x1
    flags: 0x2fffff80000000()
    raw: 002fffff80000000 ffffea000d360108 ffffea000d360188 0000000000000000
    raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
    page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
    ------------[ cut here ]------------
    kernel BUG at ./include/linux/mm.h:519!

    soft_offline_in_use_page() passed refcount and page lock from tail page
    to head page, which is not needed because we can pass any subpage to
    split_huge_page().

    Naoya had fixed a similar issue in c3901e722b29 ("mm: hwpoison: fix thp
    split handling in memory_failure()"). But he missed fixing soft
    offline.

    Link: http://lkml.kernel.org/r/1551452476-24000-1-git-send-email-zhongjiang@huawei.com
    Fixes: 61f5d698cc97 ("mm: re-enable THP")
    Signed-off-by: zhongjiang
    Acked-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhongjiang
     

02 Feb, 2019

1 commit

  • Currently memory_failure() is racy against process's exiting, which
    results in kernel crash by null pointer dereference.

    The root cause is that memory_failure() uses force_sig() to forcibly
    kill asynchronous (meaning not in the current context) processes. As
    discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for OOM
    fixes, this is not the right thing to do. OOM solves this issue by using
    do_send_sig_info() as done in commit d2d393099de2 ("signal:
    oom_kill_task: use SEND_SIG_FORCED instead of force_sig()"), so this
    patch suggests doing the same for hwpoison. do_send_sig_info()
    properly accesses siglock with lock_task_sighand(), so it is free from
    the reported race.

    I confirmed that the reported bug reproduces with inserting some delay
    in kill_procs(), and it never reproduces with this patch.

    Note that memory_failure() can send another type of signal using
    force_sig_mceerr(), and the reported race shouldn't happen there because
    force_sig_mceerr() is called only for synchronous processes (i.e.
    BUS_MCEERR_AR happens only when some process accesses the corrupted
    memory).

    Link: http://lkml.kernel.org/r/20190116093046.GA29835@hori1.linux.bs1.fc.nec.co.jp
    Signed-off-by: Naoya Horiguchi
    Reported-by: Jane Chu
    Reviewed-by: Dan Williams
    Reviewed-by: William Kucharski
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

09 Jan, 2019

1 commit

  • This reverts commit b43a9990055958e70347c56f90ea2ae32c67334c.

    The reverted commit caused issues with migration and poisoning of anon
    huge pages. The LTP move_pages12 test triggers an "unable to handle
    kernel NULL pointer" BUG with a stack similar to:

    RIP: 0010:down_write+0x1b/0x40
    Call Trace:
    migrate_pages+0x81f/0xb90
    __ia32_compat_sys_migrate_pages+0x190/0x190
    do_move_pages_to_node.isra.53.part.54+0x2a/0x50
    kernel_move_pages+0x566/0x7b0
    __x64_sys_move_pages+0x24/0x30
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The purpose of the reverted patch was to fix some long existing races
    with huge pmd sharing. It used i_mmap_rwsem for this purpose with the
    idea that this could also be used to address truncate/page fault races
    with another patch. Further analysis has determined that i_mmap_rwsem
    can not be used to address all these hugetlbfs synchronization issues.
    Therefore, revert this patch while working on another approach to the
    underlying issues.

    Link: http://lkml.kernel.org/r/20190103235452.29335-2-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Jan Stancek
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

29 Dec, 2018

1 commit

  • While looking at BUGs associated with invalid huge page map counts, it was
    observed that a huge pte pointer could become 'invalid' and
    point to another task's page table. Consider the following:

    A task takes a page fault on a shared hugetlbfs file and calls
    huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
    shared pmd.

    Now, another task truncates the hugetlbfs file. As part of truncation, it
    unmaps everyone who has the file mapped. If the range being truncated is
    covered by a shared pmd, huge_pmd_unshare will be called. For all but the
    last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
    to the pmd. If the task in the middle of the page fault is not the last
    user, the ptep returned by huge_pte_alloc now points to another task's
    page table or worse. This leads to bad things such as incorrect page
    map/reference counts or invalid memory references.

    To fix, expand the use of i_mmap_rwsem as follows (a rough sketch of the
    resulting call pattern follows this list):

    - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
    huge_pmd_share is only called via huge_pte_alloc, so callers of
    huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
    of huge_pte_alloc continue to hold the semaphore until finished with the
    ptep.

    - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
    called.
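
    A rough sketch of the call pattern these two rules imply is below. It is
    illustrative only (kernel-internal code, assuming the 4.20-era
    huge_pte_alloc()/huge_pmd_unshare() signatures), not the literal patch:

    /* fault / huge_pte_alloc() side: hold i_mmap_rwsem for read across
     * pmd sharing and for as long as the returned ptep is used */
    i_mmap_lock_read(mapping);
    ptep = huge_pte_alloc(mm, address, huge_page_size(h));
    /* ... fault handling that dereferences ptep ... */
    i_mmap_unlock_read(mapping);

    /* truncate / unmap side: hold i_mmap_rwsem for write around
     * huge_pmd_unshare() so the pud cannot be cleared under a
     * concurrent page fault */
    i_mmap_lock_write(mapping);
    /* ... unmap_hugepage_range() -> huge_pmd_unshare(mm, &address, ptep) ... */
    i_mmap_unlock_write(mapping);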

    [mike.kravetz@oracle.com: add explicit check for mapping != null]
    Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
    Fixes: 39dde65c9940 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc: Colin Ian King
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

05 Dec, 2018

1 commit

  • Internal to dax_unlock_mapping_entry(), dax_unlock_entry() is used to
    store a replacement entry in the Xarray at the given xas-index with the
    DAX_LOCKED bit clear. When called, dax_unlock_entry() expects the unlocked
    value of the entry relative to the current Xarray state to be specified.

    In most contexts dax_unlock_entry() is operating in the same scope as
    the matched dax_lock_entry(). However, in the dax_unlock_mapping_entry()
    case the implementation needs to recall the original entry. In the case
    where the original entry is a 'pmd' entry, it is possible that the pfn
    used to do the lookup is misaligned relative to the value retrieved from
    the Xarray.

    Change the API to return the unlock cookie from dax_lock_page() and pass
    it to dax_unlock_page(). This fixes a bug where dax_unlock_page() was
    assuming that the page was PMD-aligned if the entry was a PMD entry, with
    signatures like:

    WARNING: CPU: 38 PID: 1396 at fs/dax.c:340 dax_insert_entry+0x2b2/0x2d0
    RIP: 0010:dax_insert_entry+0x2b2/0x2d0
    [..]
    Call Trace:
    dax_iomap_pte_fault.isra.41+0x791/0xde0
    ext4_dax_huge_fault+0x16f/0x1f0
    ? up_read+0x1c/0xa0
    __do_fault+0x1f/0x160
    __handle_mm_fault+0x1033/0x1490
    handle_mm_fault+0x18b/0x3d0
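
    With the updated API, the caller side (memory_failure() being the main
    user) presumably ends up with a pattern along these lines (sketch only):

    dax_entry_t cookie;

    cookie = dax_lock_page(page);
    if (!cookie)
            return -EBUSY;          /* mapping gone or entry not found */

    /* ... derive the mapping size from the locked Xarray entry instead of
     *     assuming PMD alignment, then collect and signal the affected
     *     processes ... */

    dax_unlock_page(page, cookie);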

    Link: https://lkml.kernel.org/r/20181130154902.GL10377@bombadil.infradead.org
    Fixes: 9f32d221301c ("dax: Convert dax_lock_mapping_entry to XArray")
    Reported-by: Dan Williams
    Signed-off-by: Matthew Wilcox
    Tested-by: Dan Williams
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams

    Matthew Wilcox
     

26 Aug, 2018

1 commit

  • …/linux/kernel/git/nvdimm/nvdimm

    Pull libnvdimm memory-failure update from Dave Jiang:
    "As it stands, memory_failure() gets thoroughly confused by dev_pagemap
    backed mappings. The recovery code has specific enabling for several
    possible page states and needs new enabling to handle poison in dax
    mappings.

    In order to support reliable reverse mapping of user space addresses:

    1/ Add new locking in the memory_failure() rmap path to prevent races
    that would typically be handled by the page lock.

    2/ Since dev_pagemap pages are hidden from the page allocator and the
    "compound page" accounting machinery, add a mechanism to determine
    the size of the mapping that encompasses a given poisoned pfn.

    3/ Given pmem errors can be repaired, change the speculatively
    accessed poison protection, mce_unmap_kpfn(), to be reversible and
    otherwise allow ongoing access from the kernel.

    A side effect of this enabling is that MADV_HWPOISON becomes usable
    for dax mappings, however the primary motivation is to allow the
    system to survive userspace consumption of hardware-poison via dax.
    Specifically the current behavior is:

    mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered
    <reboot>

    ...and with these changes:

    Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
    Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
    Memory failure: 0x20cb00: recovery action for dax page: Recovered

    Given all the cross dependencies I propose taking this through
    nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
    folks"

    * tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm, pmem: Restore page attributes when clearing errors
    x86/memory_failure: Introduce {set, clear}_mce_nospec()
    x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
    mm, memory_failure: Teach memory_failure() about dev_pagemap pages
    filesystem-dax: Introduce dax_lock_mapping_entry()
    mm, memory_failure: Collect mapping size in collect_procs()
    mm, madvise_inject_error: Let memory_failure() optionally take a page reference
    mm, dev_pagemap: Do not clear ->mapping on final put
    mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
    filesystem-dax: Set page->index
    device-dax: Set page->index
    device-dax: Enable page_mapping()
    device-dax: Convert to vmf_insert_mixed and vm_fault_t

    Linus Torvalds
     

24 Aug, 2018

2 commits

  • A process can be killed with SIGBUS(BUS_MCEERR_AR) when it tries to
    allocate a page that was just freed on the way of soft-offline. This is
    undesirable because soft-offline (which is about corrected error) is
    less aggressive than hard-offline (which is about uncorrected error),
    and we can make soft-offline fail and keep using the page for good
    reason like "system is busy."

    The two main changes of this patch are:

    - setting the migrate type of the target page to MIGRATE_ISOLATE. As done
    in free_unref_page_commit(), this makes the kernel bypass the pcplist when
    freeing the page, so we can assume that the page is in the freelist just
    after put_page() returns,

    - setting PG_hwpoison on the free page under zone->lock, which protects the
    freelists; this allows us to avoid setting PG_hwpoison on a page that is
    about to be allocated.

    [akpm@linux-foundation.org: tweak set_hwpoison_free_buddy_page() comment]
    Link: http://lkml.kernel.org/r/1531452366-11661-3-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: Xishi Qiu
    Tested-by: Mike Kravetz
    Cc: Michal Hocko
    Cc:
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Patch series "mm: soft-offline: fix race against page allocation".

    Xishi recently reported an issue about a race on reusing the target pages
    of soft offlining. Discussion and analysis showed that we need to make
    sure that setting PG_hwpoison is done in the right place under
    zone->lock for soft offline. Patch 1/2 handles the free hugepage case, and
    2/2 handles the free buddy page case.

    This patch (of 2):

    There's a race condition between soft offline and hugetlb_fault which
    causes unexpected process killing and/or hugetlb allocation failure.

    The process killing is caused by the following flow:

    CPU 0                    CPU 1                    CPU 2

    soft offline
      get_any_page
      // find the hugetlb is free
                             mmap a hugetlb file
                             page fault
                               ...
                                 hugetlb_fault
                                   hugetlb_no_page
                                     alloc_huge_page
                                     // succeed
      soft_offline_free_page
      // set hwpoison flag
                                                      mmap the hugetlb file
                                                      page fault
                                                        ...
                                                          hugetlb_fault
                                                            hugetlb_no_page
                                                              find_lock_page
                                                              return VM_FAULT_HWPOISON
                                                        mm_fault_error
                                                          do_sigbus
                                                          // kill the process

    The hugetlb allocation failure comes from the following flow:

    CPU 0                            CPU 1

                                     mmap a hugetlb file
                                     // reserve all free page but don't fault-in
    soft offline
      get_any_page
      // find the hugetlb is free
      soft_offline_free_page
      // set hwpoison flag
      dissolve_free_huge_page
      // fail because all free hugepages are reserved
                                     page fault
                                       ...
                                         hugetlb_fault
                                           hugetlb_no_page
                                             alloc_huge_page
                                               ...
                                                 dequeue_huge_page_node_exact
                                                 // ignore hwpoisoned hugepage
                                                 // and finally fail due to no-mem

    The root cause of this is that the current soft-offline code is written
    based on the assumption that the PageHWPoison flag should be set first to
    avoid accessing the corrupted data. This makes sense for memory_failure()
    or hard offline, but not for soft offline, because soft offline is about a
    corrected (not uncorrected) error and is safe from data loss. This patch
    changes the soft offline semantics so that the PageHWPoison flag is set
    only after containment of the error page completes successfully.

    Link: http://lkml.kernel.org/r/1531452366-11661-2-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: Xishi Qiu
    Suggested-by: Xishi Qiu
    Tested-by: Mike Kravetz
    Cc: Michal Hocko
    Cc:
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

23 Aug, 2018

1 commit

  • page_freeze_refs/page_unfreeze_refs have already been replaced by
    page_ref_freeze/page_ref_unfreeze, but the comments still refer to the
    old names.

    Link: http://lkml.kernel.org/r/1532590226-106038-1-git-send-email-jiang.biao2@zte.com.cn
    Signed-off-by: Jiang Biao
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Biao
     

24 Jul, 2018

2 commits

  • mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered

    In contrast to typical memory, dev_pagemap pages may be dax mapped. With
    dax there is no possibility to map in another page dynamically since dax
    establishes 1:1 physical address to file offset associations. Also
    dev_pagemap pages associated with NVDIMM / persistent memory devices can
    internally remap/repair addresses with poison. While memory_failure()
    assumes that it can discard typical poisoned pages and keep them
    unmapped indefinitely, dev_pagemap pages may be returned to service
    after the error is cleared.

    Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
    dev_pagemap pages that have poison consumed by userspace. Mark the
    memory as UC instead of unmapping it completely to allow ongoing access
    via the device driver (nd_pmem). Later, nd_pmem will grow support for
    marking the page back to WB when the error is cleared.

    Cc: Jan Kara
    Cc: Christoph Hellwig
    Cc: Jérôme Glisse
    Cc: Matthew Wilcox
    Cc: Naoya Horiguchi
    Cc: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Dave Jiang

    Dan Williams
     
  • In preparation for supporting memory_failure() for dax mappings, teach
    collect_procs() to also determine the mapping size. Unlike typical
    mappings the dax mapping size is determined by walking page-table
    entries rather than using the compound-page accounting for THP pages.

    Acked-by: Naoya Horiguchi
    Signed-off-by: Dan Williams
    Signed-off-by: Dave Jiang

    Dan Williams
     

12 Apr, 2018

1 commit

  • No allocation callback is using this argument anymore. new_page_node
    used to use this parameter to convey node_id resp. migration error up
    to move_pages code (do_move_page_to_node_array). The error status never
    made it into the final status field and we have a better way to
    communicate node id to the status field now. All other allocation
    callbacks simply ignored the argument so we can drop it finally.

    [mhocko@suse.com: fix migration callback]
    Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
    [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
    [mhocko@kernel.org: fix build]
    Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Zi Yan
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

06 Apr, 2018

1 commit

  • Recently the following BUG was reported:

    Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
    Memory failure: 0x3c0000: recovery action for huge page: Recovered
    BUG: unable to handle kernel paging request at ffff8dfcc0003000
    IP: gup_pgd_range+0x1f0/0xc20
    PGD 17ae72067 P4D 17ae72067 PUD 0
    Oops: 0000 [#1] SMP PTI
    ...
    CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014

    You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
    a 1GB hugepage. This happens because get_user_pages_fast() is not aware
    of a migration entry on pud that was created in the 1st madvise() event.
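
    A sketch of that reproducer (illustrative only; it needs CAP_SYS_ADMIN and
    a reserved 1GB hugepage):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MADV_HWPOISON
    #define MADV_HWPOISON 100
    #endif
    #ifndef MAP_HUGE_SHIFT
    #define MAP_HUGE_SHIFT 26
    #endif
    #ifndef MAP_HUGE_1GB
    #define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
    #endif

    int main(void)
    {
            size_t len = 1UL << 30;
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                           -1, 0);

            if (p == MAP_FAILED) { perror("mmap"); return 1; }
            p[0] = 1;                               /* fault the 1GB page in */

            madvise(p, 4096, MADV_HWPOISON);        /* 1st injection */
            madvise(p, 4096, MADV_HWPOISON);        /* 2nd injection hit the
                                                       gup_pgd_range() oops */
            return 0;
    }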

    I think that the conversion to a pud-aligned migration entry is working,
    but other MM code walking over page tables isn't prepared for it. We need
    some time and effort to make all this work properly, so this patch
    avoids the reported bug by simply disabling error handling for 1GB
    hugepages.

    [n-horiguchi@ah.jp.nec.com: v2]
    Link: http://lkml.kernel.org/r/1517284444-18149-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Link: http://lkml.kernel.org/r/1517207283-15769-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Punit Agrawal
    Tested-by: Michael Ellerman
    Cc: Anshuman Khandual
    Cc: "Aneesh Kumar K.V"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

13 Feb, 2018

1 commit

  • In the following commit:

    ce0fa3e56ad2 ("x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages")

    ... we added code to memory_failure() to unmap the page from the
    kernel 1:1 virtual address space to avoid speculative access to the
    page logging additional errors.

    But memory_failure() may not always succeed in taking the page offline,
    especially if the page belongs to the kernel. This can happen if
    there are too many corrected errors on a page and either mcelog(8)
    or drivers/ras/cec.c asks to take a page offline.

    Since we remove the 1:1 mapping early in memory_failure(), we can
    end up with the page unmapped, but still in use. On the next access
    the kernel crashes :-(

    There are also various debug paths that call memory_failure() to simulate
    occurrence of an error. Since there is no actual error in memory, we
    don't need to map out the page for those cases.

    Revert most of the previous attempt and keep the solution local to
    arch/x86/kernel/cpu/mcheck/mce.c. Unmap the page only when:

    1) there is a real error
    2) memory_failure() succeeds.

    All of this only applies to 64-bit systems. 32-bit kernel doesn't map
    all of memory into kernel space. It isn't worth adding the code to unmap
    the piece that is mapped because nobody would run a 32-bit kernel on a
    machine that has recoverable machine checks.

    Signed-off-by: Tony Luck
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave
    Cc: Denys Vlasenko
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Robert (Persistent Memory)
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Cc: stable@vger.kernel.org #v4.14
    Fixes: ce0fa3e56ad2 ("x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages")
    Signed-off-by: Ingo Molnar

    Tony Luck
     

16 Nov, 2017

1 commit

  • On a failed attempt, we get the following entry:

      soft offline: 0x3c0000: migration failed 1, type 17ffffc0008008 (uptodate|head)

    Make this message more specific and have it follow the other error
    log formats in soft_offline_huge_page().

    Link: http://lkml.kernel.org/r/20171016171757.GA3018@ubuntu-desk-vm
    Signed-off-by: Laszlo Toth
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laszlo Toth
     

17 Aug, 2017

1 commit

  • Speculative processor accesses may reference any memory that has a
    valid page table entry. While a speculative access won't generate
    a machine check, it will log the error in a machine check bank. That
    could cause escalation of a subsequent error since the overflow bit
    will then be set in the machine check bank status register.

    Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual
    address of the page we want to map out, otherwise we may trigger the
    very problem we are trying to avoid. We use a non-canonical address
    that passes through the usual Linux table walking code to get to the
    same "pte".
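
    In other words, the fix presumably boils down to something like the sketch
    below (not the literal patch): flip bit 63 of the page's 1:1 mapping
    address so the value is non-canonical, yet still walks to the same pte
    when the mapping is marked not-present.

    unsigned long decoy_addr;

    /* bit 63 is set in direct-map addresses; flipping it yields a
     * non-canonical alias that the page-table walk still resolves */
    decoy_addr = (unsigned long)page_address(page) ^ BIT(63);
    if (set_memory_np(decoy_addr, 1))
            pr_warn("Could not unmap pfn=%#lx from 1:1 map\n",
                    page_to_pfn(page));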

    Thanks to Dave Hansen for reviewing several iterations of this.

    Also see:

    http://marc.info/?l=linux-mm&m=149860136413338&w=2

    Signed-off-by: Tony Luck
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: Elliott, Robert (Persistent Memory)
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20170816171803.28342-1-tony.luck@intel.com
    Signed-off-by: Ingo Molnar

    Tony Luck
     

11 Jul, 2017

9 commits

  • new_page is yet another duplication of the migration callback which has
    to handle hugetlb migration specially. We can safely use the generic
    new_page_nodemask for the same purpose.

    Please note that gigantic hugetlb pages do not need any special handling
    because alloc_huge_page_nodemask will make sure to check pages in all
    per-node pools. The reason this was done previously was that
    alloc_huge_page_node treated NUMA_NO_NODE and a specific node
    differently, and so alloc_huge_page_node(nid) would check only that
    specific node.

    Link: http://lkml.kernel.org/r/20170622193034.28972-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reported-by: Vlastimil Babka
    Reviewed-by: Mike Kravetz
    Tested-by: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Factoring duplicate code into a function.

    Link: http://lkml.kernel.org/r/1496305019-5493-10-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • dequeue_hwpoisoned_huge_page() is no longer used, so let's remove it.

    Link: http://lkml.kernel.org/r/1496305019-5493-9-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to
    keep the error hugepage away from the system, which is OK but not good
    enough because the hugepage still has a refcount and unpoison doesn't
    work on the error hugepage (PageHWPoison flags are cleared but pages are
    still leaked.) And there's a "wasting healthy subpages" issue too. This
    patch reworks me_huge_page() to solve these issues.

    For hugetlb files, we recently gained truncating code, so let's use it in
    the hugetlbfs-specific ->error_remove_page().

    For anonymous hugepage, it's helpful to dissolve the error page after
    freeing it into free hugepage list. Migration entry and PageHWPoison in
    the head page prevent the access to it.

    TODO: dissolve_free_huge_page() can fail, but we haven't considered that
    yet. It's not critical (and at least no worse than now) because in such a
    case the error hugepage just stays in the free hugepage list without being
    dissolved. By virtue of PageHWPoison in the head page, it's never allocated
    to processes.

    [akpm@linux-foundation.org: fix unused var warnings]
    Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
    Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
    Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: kernel test robot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • memory_failure() is a big function and hard to maintain. Handling the
    hugetlb and non-hugetlb cases in a single function is not good, so this
    patch separates the PageHuge() branch into a new function, which saves
    many PageHuge() checks.

    Link: http://lkml.kernel.org/r/1496305019-5493-7-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Now we have code to rescue most of the healthy pages from a hwpoisoned
    hugepage, so let's apply it to soft_offline_free_page too.

    Link: http://lkml.kernel.org/r/1496305019-5493-6-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently a hugepage migrated by soft-offline (i.e. due to correctable
    memory errors) is contained as a hugepage, which means the many non-error
    pages in it are unreusable, i.e. wasted.

    This patch solves this issue by dissolving source hugepages into buddy.
    As done in the previous patch, PageHWPoison is set only on the head page
    of the error hugepage. Then, when dissolving, we move the PageHWPoison
    flag to the raw error page so that all healthy subpages return back to
    buddy.

    [arnd@arndb.de: fix warnings: replace some macros with inline functions]
    Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Anshuman Khandual
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • We'd like to narrow down the error region of a memory error on hugetlb
    pages. However, currently we set the PageHWPoison flag on all subpages of
    the error hugepage and add the number of subpages to num_hwpoison_pages,
    which doesn't fit our purpose.

    So this patch changes the behavior: we only set PageHWPoison on the
    head page and increase num_hwpoison_pages by only 1. This is a
    preparation for the narrow-down part, which comes in later patches.

    Link: http://lkml.kernel.org/r/1496305019-5493-4-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Patch series "mm: hwpoison: fixlet for hugetlb migration".

    This patchset updates the hwpoison/hugetlb code to address 2 reported
    issues.

    One is a madvise(MADV_HWPOISON) failure reported by Intel's lkp robot (see
    http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop.) The first
    half was already fixed in mainline, and the other half, about hugetlb
    cases, is solved in this series.

    Another issue is the "narrow the error-affected region down to a single 4kB
    page instead of a whole hugetlb page" issue, which was attempted by Anshuman
    (http://lkml.kernel.org/r/20170420110627.12307-1-khandual@linux.vnet.ibm.com)
    and which I updated to apply more widely.

    This patch (of 9):

    We no longer use MIGRATE_ISOLATE to prevent reuse of hwpoison hugepages
    as we did before, so the current dequeue_huge_page_node() doesn't work as
    intended because it still uses is_migrate_isolate_page() for this check.
    This patch fixes it by checking the PageHWPoison flag instead.

    Link: http://lkml.kernel.org/r/1496305019-5493-2-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

08 Jul, 2017

1 commit

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.
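
    For illustration, that usage pattern is simply (userspace sketch,
    hypothetical file name):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("data.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);

            if (fd < 0) { perror("open"); return 1; }
            if (write(fd, "important\n", 10) != 10) { perror("write"); return 1; }

            /* with errseq_t based reporting, every file description that was
             * open when a writeback error happened sees it here, not just
             * the first caller of fsync() */
            if (fsync(fd)) {
                    fprintf(stderr, "fsync: %s -- data may not be durable\n",
                            strerror(errno));
                    return 1;
            }
            close(fd);
            return 0;
    }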

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may even not be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs that that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can go in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
     

07 Jul, 2017

1 commit

  • Though migrating gigantic HugeTLB pages does not sound much like a real
    world use case, they can be affected by memory errors. Hence migration
    of PGD-level HugeTLB pages should be supported, just to enable the soft
    and hard offline use cases.

    While allocating the new gigantic HugeTLB page, it should not matter
    whether the new page comes from the same node or not. There would be very
    few gigantic pages on the system after all, so we should not be bothered
    about node locality when trying to save a big page from crashing.

    This change renames the dequeue_huge_page_node() function to
    dequeue_huge_page_node_exact(), preserving its original functionality. The
    new dequeue_huge_page_node() function scans through all available online
    nodes to allocate a huge page for the NUMA_NO_NODE case and simply falls
    back to calling dequeue_huge_page_node_exact() for all other cases.
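
    A rough sketch of the resulting logic (not necessarily the exact patch) is:

    static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
    {
            struct page *page;

            if (nid != NUMA_NO_NODE)
                    return dequeue_huge_page_node_exact(h, nid);

            /* NUMA_NO_NODE: fall back to scanning every online node */
            for_each_online_node(nid) {
                    page = dequeue_huge_page_node_exact(h, nid);
                    if (page)
                            return page;
            }
            return NULL;
    }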

    [arnd@arndb.de: make hstate_is_gigantic() inline]
    Link: http://lkml.kernel.org/r/20170522124748.3911296-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170516100509.20122-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Signed-off-by: Arnd Bergmann
    Cc: "Aneesh Kumar K.V"
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual