08 Dec, 2018

1 commit

  • commit 9e368259ad988356c4c95150fafd1a06af095d98 upstream.

    Patch series "userfaultfd shmem updates".

    Jann found two bugs in the userfaultfd shmem MAP_SHARED backend: the
    lack of the VM_MAYWRITE check and the lack of i_size checks.

    Then looking into the above we also fixed the MAP_PRIVATE case.

    Hugh by source review also found a data loss source if UFFDIO_COPY is
    used on shmem MAP_SHARED PROT_READ mappings (the production usages
    incidentally run with PROT_READ|PROT_WRITE, so the data loss couldn't
    happen in those production usages like with QEMU).

    The whole patchset is marked for stable.

    We verified that QEMU postcopy live migration with the guest running on
    shmem MAP_PRIVATE runs as well after the shmem MAP_PRIVATE fix as it
    did before. Regardless of whether it's shmem or hugetlbfs, MAP_PRIVATE
    or MAP_SHARED, QEMU unconditionally invokes a punch hole if the guest
    mapping is file-backed, and a MADV_DONTNEED too (needed to get rid of
    the MAP_PRIVATE COWs and for the anon backend).

    This patch (of 5):

    We internally used EFAULT to communicate with the caller; switch to
    ENOENT so that EFAULT can be used as a non-internal retval.

    Link: http://lkml.kernel.org/r/20181126173452.26955-2-aarcange@redhat.com
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Cc: Mike Kravetz
    Cc: Jann Horn
    Cc: Peter Xu
    Cc: "Dr. David Alan Gilbert"
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

21 Nov, 2018

1 commit

  • commit 5e41540c8a0f0e98c337dda8b391e5dda0cde7cf upstream.

    This bug has been experienced several times by the Oracle DB team. The
    BUG is in remove_inode_hugepages() as follows:

    /*
     * If page is mapped, it was faulted in after being
     * unmapped in caller. Unmap (again) now after taking
     * the fault mutex. The mutex will prevent faults
     * until we finish removing the page.
     *
     * This race can only happen in the hole punch case.
     * Getting here in a truncate operation is a bug.
     */
    if (unlikely(page_mapped(page))) {
            BUG_ON(truncate_op);

    In this case, the elevated map count is not the result of a race.
    Rather it was incorrectly incremented as the result of a bug in the huge
    pmd sharing code. Consider the following:

    - Process A maps a hugetlbfs file of sufficient size and alignment
    (PUD_SIZE) that a pmd page could be shared.

    - Process B maps the same hugetlbfs file with the same size and
    alignment such that a pmd page is shared.

    - Process B then calls mprotect() to change protections for the mapping
    with the shared pmd. As a result, the pmd is 'unshared'.

    - Process B then calls mprotect() again to change protections for the
    mapping back to their original value. pmd remains unshared.

    - Process B then forks and process C is created. During the fork
    process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
    tables. Copying page tables for hugetlb mappings is done in the
    routine copy_hugetlb_page_range.

    In copy_hugetlb_page_range(), the destination pte is obtained by:

    dst_pte = huge_pte_alloc(dst, addr, sz);

    If pmd sharing is possible, the returned pointer will be to a pte in an
    existing page table. In the situation above, process C could share with
    either process A or process B. Since process A is first in the list,
    the returned pte is a pointer to a pte in process A's page table.

    However, the check for pmd sharing in copy_hugetlb_page_range is:

    /* If the pagetables are shared don't copy or take references */
    if (dst_pte == src_pte)
    continue;

    Since process C is sharing with process A instead of process B, the
    above test fails. The code in copy_hugetlb_page_range which follows
    assumes dst_pte points to a huge_pte_none pte. It copies the pte entry
    from src_pte to dst_pte and increments the map count of the associated
    page. This is how we end up with an elevated map count.

    To solve, check the dst_pte entry for huge_pte_none. If !none, this
    implies PMD sharing so do not copy.
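
    A minimal sketch of the check described above, in the style of the
    copy_hugetlb_page_range() loop body (illustrative rather than the exact
    upstream diff; variable names follow the surrounding function):

    dst_pte = huge_pte_alloc(dst, addr, sz);
    if (!dst_pte) {
        ret = -ENOMEM;
        break;
    }

    /*
     * dst_pte == src_pte only catches sharing with this particular source
     * process; sharing with a third process (process A above) leaves
     * dst_pte != src_pte but already populated, so also treat a non-none
     * destination entry as "already covered by pmd sharing".
     */
    if (dst_pte == src_pte || !huge_pte_none(huge_ptep_get(dst_pte)))
        continue;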

    Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
    Fixes: c5c99429fa57 ("fix hugepages leak due to pagetable page sharing")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

14 Nov, 2018

1 commit

  • commit 22146c3ce98962436e401f7b7016a6f664c9ffb5 upstream.

    Some test systems were experiencing negative huge page reserve counts and
    incorrect file block counts. This was traced to /proc/sys/vm/drop_caches
    removing clean pages from hugetlbfs file pagecaches. When non-hugetlbfs
    explicit code removes the pages, the appropriate accounting is not
    performed.

    This can be recreated as follows:

    fallocate -l 2M /dev/hugepages/foo
    echo 1 > /proc/sys/vm/drop_caches
    fallocate -l 2M /dev/hugepages/foo
    grep -i huge /proc/meminfo
        AnonHugePages: 0 kB
        ShmemHugePages: 0 kB
        HugePages_Total: 2048
        HugePages_Free: 2047
        HugePages_Rsvd: 18446744073709551615
        HugePages_Surp: 0
        Hugepagesize: 2048 kB
        Hugetlb: 4194304 kB
    ls -lsh /dev/hugepages/foo
        4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo

    To address this issue, dirty pages as they are added to pagecache. This
    can easily be reproduced with fallocate as shown above. Read faulted
    pages will eventually end up being marked dirty. But there is a window
    where they are clean and could be impacted by code such as drop_caches.
    So, just dirty them all as they are added to the pagecache.
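
    A hedged sketch of what the dirtying looks like in the helper that adds
    huge pages to the page cache (simplified, based on the
    huge_add_to_page_cache() path mentioned elsewhere in this log; not the
    exact upstream code):

    int huge_add_to_page_cache(struct page *page,
                               struct address_space *mapping, pgoff_t idx)
    {
        struct inode *inode = mapping->host;
        struct hstate *h = hstate_inode(inode);
        int err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);

        if (err)
            return err;
        ClearPagePrivate(page);

        /*
         * Dirty the page right away so generic clean-page reclaim such as
         * drop_caches never removes it behind hugetlbfs's back, which
         * would corrupt the reserve and block counts shown above.
         */
        set_page_dirty(page);

        spin_lock(&inode->i_lock);
        inode->i_blocks += blocks_per_huge_page(h);
        spin_unlock(&inode->i_lock);
        return 0;
    }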

    Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com
    Fixes: 6bda666a03f0 ("hugepages: fold find_or_alloc_pages into huge_no_page()")
    Signed-off-by: Mike Kravetz
    Acked-by: Michal Hocko
    Reviewed-by: Khalid Aziz
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

13 Oct, 2018

1 commit

  • commit 017b1660df89f5fb4bfe66c34e35f7d2031100c7 upstream.

    The page migration code employs try_to_unmap() to try and unmap the source
    page. This is accomplished by using rmap_walk to find all vmas where the
    page is mapped. This search stops when page mapcount is zero. For shared
    PMD huge pages, the page map count is always 1 no matter the number of
    mappings. Shared mappings are tracked via the reference count of the PMD
    page. Therefore, try_to_unmap stops prematurely and does not completely
    unmap all mappings of the source page.

    This problem can result in data corruption as writes to the original
    source page can happen after contents of the page are copied to the target
    page. Hence, data is lost.

    This problem was originally seen as DB corruption of shared global areas
    after a huge page was soft offlined due to ECC memory errors. DB
    developers noticed they could reproduce the issue by (hotplug) offlining
    memory used to back huge pages. A simple testcase can reproduce the
    problem by creating a shared PMD mapping (note that this must be at least
    PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
    migrate_pages() to migrate process pages between nodes while continually
    writing to the huge pages being migrated.

    To fix, have the try_to_unmap_one routine check for huge PMD sharing by
    calling huge_pmd_unshare for hugetlbfs huge pages. If it is a shared
    mapping it will be 'unshared' which removes the page table entry and drops
    the reference on the PMD page. After this, flush caches and TLB.

    mmu notifiers are called before locking page tables, but we can not be
    sure of PMD sharing until page tables are locked. Therefore, check for
    the possibility of PMD sharing before locking so that notifiers can
    prepare for the worst possible case.
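
    A rough sketch of the unmap-path change (not the upstream diff; it
    assumes the try_to_unmap_one() walk context, where mm, address, vma and
    pvmw are already set up):

    if (PageHuge(page)) {
        unsigned long start = address & PUD_MASK;
        unsigned long end = start + PUD_SIZE;

        if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
            /*
             * The shared PMD is gone: the PUD entry was cleared and the
             * reference on the PMD page table page was dropped.  Flush
             * the caches and TLB for the whole shared range and stop;
             * the page is no longer mapped through this VMA.
             */
            flush_cache_range(vma, start, end);
            flush_tlb_range(vma, start, end);
            break;
        }
    }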

    Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
    [mike.kravetz@oracle.com: make _range_in_vma() a static inline]
    Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
    Fixes: 39dde65c9940 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Jerome Glisse
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

11 Jul, 2018

1 commit

  • commit 520495fe96d74e05db585fc748351e0504d8f40d upstream.

    When booting with very large numbers of gigantic (i.e. 1G) pages, the
    operations in the loop of gather_bootmem_prealloc, and specifically
    prep_compound_gigantic_page, takes a very long time, and can cause a
    softlockup if enough pages are requested at boot.

    For example booting with 3844 1G pages requires prepping
    (set_compound_head, init the count) over 1 billion 4K tail pages, which
    takes considerable time.

    Add a cond_resched() to the outer loop in gather_bootmem_prealloc() to
    prevent this lockup.
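
    A minimal sketch of the loop with the added rescheduling point
    (simplified; the per-page prep work is elided):

    static void __init gather_bootmem_prealloc(void)
    {
        struct huge_bootmem_page *m;

        list_for_each_entry(m, &huge_boot_pages, list) {
            /* ... existing prep_compound_gigantic_page() work ... */

            /*
             * Prepping a single 1G page initialises ~262k tail pages, so
             * yield between pages when thousands are requested at boot.
             */
            cond_resched();
        }
    }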

    Tested: Booted with softlockup_panic=1 hugepagesz=1G hugepages=3844 and
    no softlockup is reported, and the hugepages are reported as
    successfully setup.

    Link: http://lkml.kernel.org/r/20180627214447.260804-1-cannonmatthews@google.com
    Signed-off-by: Cannon Matthews
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Cannon Matthews
     

29 Mar, 2018

1 commit

  • commit 63489f8e821144000e0bdca7e65a8d1cc23a7ee7 upstream.

    A vma with vm_pgoff large enough to overflow a loff_t type when
    converted to a byte offset can be passed via the remap_file_pages system
    call. The hugetlbfs mmap routine uses the byte offset to calculate
    reservations and file size.

    A sequence such as:

    mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
    remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);

    will result in the following when task exits/file closed,

    kernel BUG at mm/hugetlb.c:749!
    Call Trace:
    hugetlbfs_evict_inode+0x2f/0x40
    evict+0xcb/0x190
    __dentry_kill+0xcb/0x150
    __fput+0x164/0x1e0
    task_work_run+0x84/0xa0
    exit_to_usermode_loop+0x7d/0x80
    do_syscall_64+0x18b/0x190
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The overflowed pgoff value causes hugetlbfs to try to set up a mapping
    with a negative range (end < start) that leaves invalid state which
    causes the BUG.

    The previous overflow fix to this code was incomplete and did not take
    the remap_file_pages system call into account.
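
    A sketch of the kind of guard this implies for the hugetlbfs mmap path
    (the helper name and mask constant here are illustrative, not the exact
    upstream definitions):

    /* Reject a vm_pgoff whose byte offset cannot fit in a signed loff_t. */
    static int hugetlbfs_pgoff_overflows_loff_t(struct vm_area_struct *vma)
    {
        unsigned long overflow_mask =
            ~0UL << (BITS_PER_LONG - (PAGE_SHIFT + 1));

        return sizeof(unsigned long) == sizeof(loff_t) &&
               (vma->vm_pgoff & overflow_mask);
    }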

    [mike.kravetz@oracle.com: v3]
    Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com
    [akpm@linux-foundation.org: include mmdebug.h]
    [akpm@linux-foundation.org: fix -ve left shift count on sh]
    Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com
    Fixes: 045c7a3f53d9 ("hugetlbfs: fix offset overflow in hugetlbfs mmap")
    Signed-off-by: Mike Kravetz
    Reported-by: Nic Losby
    Acked-by: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Yisheng Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

05 Dec, 2017

2 commits

  • commit f4f0a3d85b50a65a348e2b8635041d6b30f01deb upstream.

    I made a mistake during converting hugetlb code to 5-level paging: in
    huge_pte_alloc() we have to use p4d_alloc(), not p4d_offset().

    Otherwise it leads to crash -- NULL-pointer dereference in pud_alloc()
    if p4d table is not yet allocated.

    It only can happen in 5-level paging mode. In 4-level paging mode
    p4d_offset() always returns pgd, so we are fine.
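
    The corrected walk, roughly (a fragment in the huge_pte_alloc() style;
    the remainder of the function is unchanged):

    pgd = pgd_offset(mm, addr);
    p4d = p4d_alloc(mm, pgd, addr);    /* was p4d_offset(pgd, addr) */
    if (!p4d)
        return NULL;
    pud = pud_alloc(mm, p4d, addr);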

    Link: http://lkml.kernel.org/r/20171122121921.64822-1-kirill.shutemov@linux.intel.com
    Fixes: c2febafc6773 ("mm: convert generic code to 5-level paging")
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit 31383c6865a578834dd953d9dbc88e6b19fe3997 upstream.

    Patch series "device-dax: fix unaligned munmap handling"

    When device-dax is operating in huge-page mode we want it to behave like
    hugetlbfs and fail attempts to split vmas into unaligned ranges. It
    would be messy to teach the munmap path about device-dax alignment
    constraints in the same (hstate) way that hugetlbfs communicates this
    constraint. Instead, these patches introduce a new ->split() vm
    operation.

    This patch (of 2):

    The device-dax interface has similar constraints as hugetlbfs in that it
    requires the munmap path to unmap in huge page aligned units. Rather
    than add more custom vma handling code in __split_vma() introduce a new
    vm operation to perform this vma specific check.
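
    A minimal sketch of how such a hook is consulted from the vma split
    path (illustrative; it assumes the ->split() member this series adds to
    struct vm_operations_struct):

    /* In __split_vma(), before any state is modified: */
    if (vma->vm_ops && vma->vm_ops->split) {
        int err = vma->vm_ops->split(vma, addr);

        if (err)
            return err;
    }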

    Link: http://lkml.kernel.org/r/151130418135.4029.6783191281930729710.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
    Signed-off-by: Dan Williams
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

03 Nov, 2017

1 commit

  • This oops:

    kernel BUG at fs/hugetlbfs/inode.c:484!
    RIP: remove_inode_hugepages+0x3d0/0x410
    Call Trace:
    hugetlbfs_setattr+0xd9/0x130
    notify_change+0x292/0x410
    do_truncate+0x65/0xa0
    do_sys_ftruncate.constprop.3+0x11a/0x180
    SyS_ftruncate+0xe/0x10
    tracesys+0xd9/0xde

    was caused by the lack of i_size check in hugetlb_mcopy_atomic_pte.

    mmap() can still succeed beyond the end of the i_size after vmtruncate
    zapped vmas in those ranges, but the faults must not succeed, and that
    includes UFFDIO_COPY.

    We could differentiate the retval to userland to represent a SIGBUS like
    a page fault would do (vs SIGSEGV), but it doesn't seem very useful and
    we'd need to pick a random retval as there's no meaningful syscall
    retval that would differentiate from SIGSEGV and SIGBUS, there's just
    -EFAULT.
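
    A sketch of the missing check in the hugetlb_mcopy_atomic_pte() style
    (variable names follow the surrounding function; not the exact diff):

    /*
     * A truncate may already have zapped the vmas beyond i_size; the
     * UFFDIO_COPY must then fail with -EFAULT just like a page fault
     * would raise SIGBUS there.
     */
    size = i_size_read(mapping->host) >> huge_page_shift(h);
    ret = -EFAULT;
    if (idx >= size)
        goto out_release_unlock;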

    Link: http://lkml.kernel.org/r/20171016223914.2421-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Mike Kravetz
    Cc: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

08 Sep, 2017

1 commit

  • Pull powerpc updates from Michael Ellerman:
    "Nothing really major this release, despite quite a lot of activity.
    Just lots of things all over the place.

    Some things of note include:

    - Access via perf to a new type of PMU (IMC) on Power9, which can
    count both core events as well as nest unit events (Memory
    controller etc).

    - Optimisations to the radix MMU TLB flushing, mostly to avoid
    unnecessary Page Walk Cache (PWC) flushes when the structure of the
    tree is not changing.

    - Reworks/cleanups of do_page_fault() to modernise it and bring it
    closer to other architectures where possible.

    - Rework of our page table walking so that THP updates only need to
    send IPIs to CPUs where the affected mm has run, rather than all
    CPUs.

    - The size of our vmalloc area is increased to 56T on 64-bit hash MMU
    systems. This avoids problems with the percpu allocator on systems
    with very sparse NUMA layouts.

    - STRICT_KERNEL_RWX support on PPC32.

    - A new sched domain topology for Power9, to capture the fact that
    pairs of cores may share an L2 cache.

    - Power9 support for VAS, which is a new mechanism for accessing
    coprocessors, and initial support for using it with the NX
    compression accelerator.

    - Major work on the instruction emulation support, adding support for
    many new instructions, and reworking it so it can be used to
    implement the emulation needed to fixup alignment faults.

    - Support for guests under PowerVM to use the Power9 XIVE interrupt
    controller.

    And probably that many things again that are almost as interesting,
    but I had to keep the list short. Plus the usual fixes and cleanups as
    always.

    Thanks to: Alexey Kardashevskiy, Alistair Popple, Andreas Schwab,
    Aneesh Kumar K.V, Anju T Sudhakar, Arvind Yadav, Balbir Singh,
    Benjamin Herrenschmidt, Bhumika Goyal, Breno Leitao, Bryant G. Ly,
    Christophe Leroy, Cédric Le Goater, Dan Carpenter, Dou Liyang,
    Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff Levand, Hannes
    Reinecke, Haren Myneni, Ivan Mikhaylov, John Allen, Julia Lawall,
    LABBE Corentin, Laurentiu Tudor, Madhavan Srinivasan, Markus Elfring,
    Masahiro Yamada, Matt Brown, Michael Neuling, Murilo Opsfelder Araujo,
    Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran,
    Paul Mackerras, Rashmica Gupta, Rob Herring, Rui Teng, Sam Bobroff,
    Santosh Sivaraj, Scott Wood, Shilpasri G Bhat, Sukadev Bhattiprolu,
    Suraj Jitindar Singh, Tobin C. Harding, Victor Aoqui"

    * tag 'powerpc-4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (321 commits)
    powerpc/xive: Fix section __init warning
    powerpc: Fix kernel crash in emulation of vector loads and stores
    powerpc/xive: improve debugging macros
    powerpc/xive: add XIVE Exploitation Mode to CAS
    powerpc/xive: introduce H_INT_ESB hcall
    powerpc/xive: add the HW IRQ number under xive_irq_data
    powerpc/xive: introduce xive_esb_write()
    powerpc/xive: rename xive_poke_esb() in xive_esb_read()
    powerpc/xive: guest exploitation of the XIVE interrupt controller
    powerpc/xive: introduce a common routine xive_queue_page_alloc()
    powerpc/sstep: Avoid used uninitialized error
    axonram: Return directly after a failed kzalloc() in axon_ram_probe()
    axonram: Improve a size determination in axon_ram_probe()
    axonram: Delete an error message for a failed memory allocation in axon_ram_probe()
    powerpc/powernv/npu: Move tlb flush before launching ATSD
    powerpc/macintosh: constify wf_sensor_ops structures
    powerpc/iommu: Use permission-specific DEVICE_ATTR variants
    powerpc/eeh: Delete an error out of memory message at init time
    powerpc/mm: Use seq_putc() in two functions
    macintosh: Convert to using %pOF instead of full_name
    ...

    Linus Torvalds
     

07 Sep, 2017

3 commits

  • alloc_gigantic_page doesn't consider the movability of the gigantic
    hugetlb page when scanning eligible ranges for the allocation. As 1GB
    hugetlb pages are not currently movable, this can break the movable
    zone assumption that all allocations are migratable, and as such break
    memory hotplug.

    Reorganize the code and use the standard zonelist allocation scheme
    that we use for standard hugetlb pages. htlb_alloc_mask will ensure
    that only migratable hugetlb pages will ever see a movable zone.

    Link: http://lkml.kernel.org/r/20170803083549.21407-1-mhocko@kernel.org
    Fixes: 944d9fec8d7a ("hugetlb: add support for gigantic page allocation at runtime")
    Signed-off-by: Michal Hocko
    Reviewed-by: Mike Kravetz
    Cc: Luiz Capitulino
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • attribute_group structures are not supposed to change at runtime. All
    functions working with attribute_group provided by <linux/sysfs.h> work
    with const attribute_group. So mark the non-const structs as const.

    Link: http://lkml.kernel.org/r/1501157260-3922-1-git-send-email-arvind.yadav.cs@gmail.com
    Signed-off-by: Arvind Yadav
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arvind Yadav
     
  • When walking the page tables to resolve an address that points to
    !p*d_present() entry, huge_pte_offset() returns inconsistent values
    depending on the level of page table (PUD or PMD).

    It returns NULL in the case of a PUD entry while in the case of a PMD
    entry, it returns a pointer to the page table entry.

    A similar inconsistency exists when handling swap entries: NULL is
    returned for a PUD entry while a pointer to the pte_t is returned for
    the PMD entry.

    Update huge_pte_offset() to make the behaviour consistent - return a
    pointer to the pte_t for hugepage or swap entries. Only return NULL in
    instances where we have a p*d_none() entry and the size parameter
    doesn't match the hugepage size at this level of the page table.

    Document the behaviour to clarify the expected behaviour of this
    function. This is to set clear semantics for architecture specific
    implementations of huge_pte_offset().

    Discussions on the arm64 implementation of huge_pte_offset()
    (http://www.spinics.net/lists/linux-mm/msg133699.html) showed that there
    is benefit from returning a pte_t* in the case of p*d_none().

    The fault handling code in hugetlb_fault() can handle p*d_none() entries
    and saves an extra round trip to huge_pte_alloc(). Other callers of
    huge_pte_offset() should be ok as well.

    [punit.agrawal@arm.com: v2]
    Link: http://lkml.kernel.org/r/20170725154114.24131-2-punit.agrawal@arm.com
    Signed-off-by: Punit Agrawal
    Reviewed-by: Catalin Marinas
    Reviewed-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Will Deacon
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
     

15 Aug, 2017

1 commit

  • When running in guest mode, ppc64 supports a different mechanism for
    hugetlb allocation/reservation. The LPAR management application called
    HMC can be used to reserve a set of hugepages, and we pass the details
    of reserved pages via device tree to the guest (more details in
    htab_dt_scan_hugepage_blocks()). We do the memblock_reserve of the
    range and, later in the boot sequence, we add the reserved range to
    huge_boot_pages.

    But to enable 16G hugetlb on a bare metal config (when we are not
    running as a guest) we want to do the memblock reservation during boot.
    Generic code already does this

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman

    Aneesh Kumar K.V
     

11 Aug, 2017

1 commit

  • huge_add_to_page_cache->add_to_page_cache implicitly unlocks the page
    before returning in case of errors.

    The error returned was -EEXIST, from running UFFDIO_COPY on a non-hole
    offset of a VM_SHARED hugetlbfs mapping. A userland bug triggered it,
    and the kernel must cope with it by returning -EEXIST from
    ioctl(UFFDIO_COPY) as expected.

    page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
    kernel BUG at mm/filemap.c:964!
    invalid opcode: 0000 [#1] SMP
    CPU: 1 PID: 22582 Comm: qemu-system-x86 Not tainted 4.11.11-300.fc26.x86_64 #1
    RIP: unlock_page+0x4a/0x50
    Call Trace:
    hugetlb_mcopy_atomic_pte+0xc0/0x320
    mcopy_atomic+0x96f/0xbe0
    userfaultfd_ioctl+0x218/0xe90
    do_vfs_ioctl+0xa5/0x600
    SyS_ioctl+0x79/0x90
    entry_SYSCALL_64_fastpath+0x1a/0xa9

    Link: http://lkml.kernel.org/r/20170802165145.22628-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Tested-by: Maxime Coquelin
    Reviewed-by: Mike Kravetz
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Rapoport
    Cc: Alexey Perevalov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

03 Aug, 2017

1 commit

  • Commit 9a291a7c9428 ("mm/hugetlb: report -EHWPOISON not -EFAULT when
    FOLL_HWPOISON is specified") causes __get_user_pages to ignore certain
    errors from follow_hugetlb_page. After such error, __get_user_pages
    subsequently calls faultin_page on the same VMA and start address that
    follow_hugetlb_page failed on instead of returning the error immediately
    as it should.

    In follow_hugetlb_page, when hugetlb_fault returns a value covered under
    VM_FAULT_ERROR, follow_hugetlb_page returns it without setting nr_pages
    to 0 as __get_user_pages expects in this case, which causes the
    following to happen in __get_user_pages: the "while (nr_pages)" check
    succeeds, we skip the "if (!vma..." check because we got a VMA the last
    time around, we find no page with follow_page_mask, and we call
    faultin_page, which calls hugetlb_fault for the second time.

    This issue also slightly changes how __get_user_pages works. Before, it
    only returned error if it had made no progress (i = 0). But now,
    follow_hugetlb_page can clobber "i" with an error code since its new
    return path doesn't check for progress. So if "i" is nonzero before a
    failing call to follow_hugetlb_page, that indication of progress is lost
    and __get_user_pages can return error even if some pages were
    successfully pinned.

    To fix this, change follow_hugetlb_page so that it updates nr_pages,
    allowing __get_user_pages to fail immediately and restoring the "error
    only if no progress" behavior to __get_user_pages.
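
    A sketch of the corrected error handling inside follow_hugetlb_page()
    (names follow the surrounding function; "remainder" is what gets
    written back through *nr_pages):

    if (ret & VM_FAULT_ERROR) {
        err = vm_fault_to_errno(ret, flags);
        /*
         * Zero the remaining-page count so __get_user_pages() returns
         * immediately instead of calling faultin_page() on the same
         * address, while "i" still reports any progress already made.
         */
        remainder = 0;
        break;
    }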

    Tested that __get_user_pages returns when expected on error from
    hugetlb_fault in follow_hugetlb_page.

    Fixes: 9a291a7c9428 ("mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified")
    Link: http://lkml.kernel.org/r/1500406795-58462-1-git-send-email-daniel.m.jordan@oracle.com
    Signed-off-by: Daniel Jordan
    Acked-by: Punit Agrawal
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Gerald Schaefer
    Cc: James Morse
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: zhong jiang
    Cc: [4.12.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Jordan
     

13 Jul, 2017

1 commit

  • __GFP_REPEAT was designed to allow a retry-but-eventually-fail semantic
    in the page allocator. This has been true, but only for allocation
    requests larger than PAGE_ALLOC_COSTLY_ORDER; it has always been
    ignored for smaller sizes. This is a bit unfortunate because there is
    no way to express the same semantic for those requests and they are
    considered too important to fail, so they might end up looping in the
    page allocator forever, similarly to GFP_NOFAIL requests.

    Now that the whole tree has been cleaned up and accidental or misled
    usage of __GFP_REPEAT flag has been removed for !costly requests we can
    give the original flag a better name and more importantly a more useful
    semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
    that the allocator would try really hard but there is no promise of a
    success. This will work independent of the order and overrides the
    default allocator behavior. Page allocator users have several levels of
    guarantee vs. cost options (take GFP_KERNEL as an example)

    - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
    attempt to free memory at all. The most light weight mode which even
    doesn't kick the background reclaim. Should be used carefully because
    it might deplete the memory and the next user might hit the more
    aggressive reclaim

    - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
    allocation without any attempt to free memory from the current
    context but can wake kswapd to reclaim memory if the zone is below
    the low watermark. Can be used from either atomic contexts or when
    the request is a performance optimization and there is another
    fallback for a slow path.

    - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
    non sleeping allocation with an expensive fallback so it can access
    some portion of memory reserves. Usually used from interrupt/bh
    context with an expensive slow path fallback.

    - GFP_KERNEL - both background and direct reclaim are allowed and the
    _default_ page allocator behavior is used. That means that !costly
    allocation requests are basically nofail but there is no guarantee of
    that behavior so failures have to be checked properly by callers
    (e.g. OOM killer victim is allowed to fail currently).

    - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
    and all allocation requests fail early rather than cause disruptive
    reclaim (one round of reclaim in this implementation). The OOM killer
    is not invoked.

    - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
    behavior and all allocation requests try really hard. The request
    will fail if the reclaim cannot make any progress. The OOM killer
    won't be triggered.

    - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
    and all allocation requests will loop endlessly until they succeed.
    This might be really dangerous especially for larger orders.

    Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
    because they already had their semantic. No new users are added.
    __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
    there is no progress and we have already passed the OOM point.

    This means that all the reclaim opportunities have been exhausted except
    the most disruptive one (the OOM killer) and a user defined fallback
    behavior is more sensible than keep retrying in the page allocator.
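
    A hypothetical caller illustrating the intended use of the new flag
    (not from the patch itself): try hard for a large physically
    contiguous buffer, but tolerate failure and fall back rather than
    looping or triggering the OOM killer.

    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    static void *alloc_big_table(size_t size)
    {
        void *p = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL |
                                __GFP_NOWARN);

        if (!p)
            p = vmalloc(size);    /* non-contiguous fallback */
        return p;
    }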

    [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
    [mhocko@suse.com: semantic fix]
    Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
    [mhocko@kernel.org: address other thing spotted by Vlastimil]
    Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

11 Jul, 2017

9 commits

  • alloc_huge_page_nodemask tries to allocate from any numa node in the
    allowed node mask starting from lower numa nodes. This might lead to
    filling up those low NUMA nodes while others are not used. We can
    reduce this risk by introducing a concept of the preferred node similar
    to what we have in the regular page allocator. We will start allocating
    from the preferred nid and then iterate over all allowed nodes in the
    zonelist order until we try them all.

    This mimics the page allocator logic except that it operates on
    per-node mempools. dequeue_huge_page_vma already does this, so distill
    the zonelist logic into a more generic dequeue_huge_page_nodemask and
    use it in alloc_huge_page_nodemask.

    This will allow us to use proper per numa distance fallback also for
    alloc_huge_page_node which can use alloc_huge_page_nodemask now and we
    can get rid of alloc_huge_page_node helper which doesn't have any user
    anymore.
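
    A simplified sketch of the zonelist-based dequeue this describes
    (cpuset retry handling omitted; not the exact upstream code):

    static struct page *dequeue_huge_page_nodemask(struct hstate *h,
                    gfp_t gfp_mask, int nid, nodemask_t *nmask)
    {
        struct zonelist *zonelist = node_zonelist(nid, gfp_mask);
        struct zoneref *z;
        struct zone *zone;
        int node = NUMA_NO_NODE;

        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                        gfp_zone(gfp_mask), nmask) {
            struct page *page;

            /* The pool is per node, not per zone: ask each node once. */
            if (zone_to_nid(zone) == node)
                continue;
            node = zone_to_nid(zone);

            page = dequeue_huge_page_node_exact(h, node);
            if (page)
                return page;
        }
        return NULL;
    }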

    Link: http://lkml.kernel.org/r/20170622193034.28972-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Mike Kravetz
    Tested-by: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "mm, hugetlb: allow proper node fallback dequeue".

    While working on a hugetlb migration issue addressed in a separate
    patchset[1] I have noticed that the hugetlb allocations from the
    preallocated pool are quite suboptimal.

    [1] //lkml.kernel.org/r/20170608074553.22152-1-mhocko@kernel.org

    There is no fallback mechanism implemented and no notion of preferred
    node. I have tried to work around it but Vlastimil was right to push
    back for a more robust solution. It seems that such a solution is to
    reuse the zonelist approach we use for the page allocator.

    This series has 3 patches. The first one tries to make hugetlb
    allocation layers more clear. The second one implements the zonelist
    hugetlb pool allocation and introduces a preferred node semantic which
    is used by the migration callbacks. The last patch is a clean up.

    This patch (of 3):

    Hugetlb allocation path for fresh huge pages is unnecessarily complex
    and it mixes different interfaces between layers.

    __alloc_buddy_huge_page is the central place to perform a new
    allocation. It checks for the hugetlb overcommit and then relies on
    __hugetlb_alloc_buddy_huge_page to invoke the page allocator. This is
    all good except that __alloc_buddy_huge_page pushes vma and address down
    the callchain and so __hugetlb_alloc_buddy_huge_page has to deal with
    two different allocation modes - one for memory policy and other node
    specific (or to make it more obscure node non-specific) requests.

    This just screams for a reorganization.

    This patch pulls out all the vma specific handling up to
    __alloc_buddy_huge_page_with_mpol where it belongs.
    __alloc_buddy_huge_page will get nodemask argument and
    __hugetlb_alloc_buddy_huge_page will become a trivial wrapper over the
    page allocator.

    In short:
    __alloc_buddy_huge_page_with_mpol - memory policy handling
    __alloc_buddy_huge_page - overcommit handling and accounting
    __hugetlb_alloc_buddy_huge_page - page allocator layer

    Also note that __hugetlb_alloc_buddy_huge_page and its cpuset retry loop
    is not really needed because the page allocator already handles the
    cpusets update.

    Finally __hugetlb_alloc_buddy_huge_page had a special case for node
    specific allocations (when no policy is applied and there is a node
    given). This has relied on __GFP_THISNODE to not fallback to a different
    node. alloc_huge_page_node is the only caller which relies on this
    behavior so move the __GFP_THISNODE there.

    Not only does this remove quite some code, it should also make those
    layers easier to follow and clear wrt responsibilities.

    Link: http://lkml.kernel.org/r/20170622193034.28972-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Mike Kravetz
    Tested-by: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The hugetlb code has its own function to report human-readable sizes.
    Convert it to use the shared string_get_size() function. This will lead
    to a minor difference in user visible output (MiB/GiB instead of MB/GB),
    but some would argue that's desirable anyway.
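
    A hypothetical snippet showing the shared helper in use (the function
    and message here are illustrative, not the patch itself):

    #include <linux/printk.h>
    #include <linux/string_helpers.h>

    static void report_huge_page_size(unsigned long size_bytes)
    {
        char buf[32];

        /* Produces "2.00 MiB", "1.00 GiB", etc. */
        string_get_size(size_bytes, 1, STRING_UNITS_2, buf, sizeof(buf));
        pr_info("HugeTLB registered %s page size\n", buf);
    }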

    Link: http://lkml.kernel.org/r/20170606190350.GA20010@bombadil.infradead.org
    Signed-off-by: Matthew Wilcox
    Cc: Liam R. Howlett
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Aneesh Kumar K.V
    Cc: Gerald Schaefer
    Cc: zhong jiang
    Cc: Andrea Arcangeli
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • A few hugetlb allocators loop while calling the page allocator and can
    potentially prevent rescheduling if the page allocator slowpath is not
    utilized.

    Conditionally schedule when large numbers of hugepages can be allocated.

    Anshuman:
    "Fixes a task which was getting hung while writing like 10000 hugepages
    (16MB on POWER8) into /proc/sys/vm/nr_hugepages."

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706091535300.66176@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Reviewed-by: Mike Kravetz
    Tested-by: Anshuman Khandual
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • new_node_page will try to use the origin's next NUMA node as the
    migration destination for hugetlb pages. If such a node doesn't have
    any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
    to allocate a surplus page instead. This is quite suboptimal for any
    configuration where hugetlb pages are not distributed to all NUMA nodes
    evenly. Say we have a hotpluggable node 4 and spare hugetlb pages on
    node 0:

    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
    /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0

    Now we consume the whole pool on node 4 and try to offline this node.
    All the allocated pages should be moved to node0 which has enough
    preallocated pages to hold them. With the current implementation
    offlining very likely fails because hugetlb allocations during runtime
    are much less reliable.

    Fix this by reusing the nodemask which excludes migration source and try
    to find a first node which has a page in the preallocated pool first and
    fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool is
    consumed.

    [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
    Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: zhong jiang
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When the user specifies too many hugepages or an invalid
    default_hugepagesz the communication to the user is implicit in the
    allocation message. This patch adds a warning when the desired page
    count is not allocated and prints an error when the default_hugepagesz
    is invalid on boot.

    During boot hugepages will allocate until there is a fraction of the
    hugepage size left. That is, we allocate until either the request is
    satisfied or memory for the pages is exhausted. When memory for the
    pages is exhausted, it will most likely lead to the system failing with
    the OOM manager not finding enough (or anything) to kill (unless you're
    using really big hugepages in the order of 100s of MB or in the GBs).
    The user will most likely see the OOM messages much later in the boot
    sequence than the implicitly stated message. Worse yet, you may even
    get an OOM for each processor which causes many pages of OOMs on modern
    systems. Although these messages will be printed earlier than the OOM
    messages, at least giving the user errors and warnings will highlight
    the configuration as an issue. I'm trying to point the user in the
    right direction by providing a more robust statement of what is failing.

    During the sysctl or echo command, the user can check the results much
    easier than if the system hangs during boot and the scenario of having
    nothing to OOM for kernel memory is highly unlikely.

    Mike said:
    "Before sending out this patch, I asked Liam off list why he was doing
    it. Was it something he just thought would be useful? Or, was there
    some type of user situation/need. He said that he had been called in
    to assist on several occasions when a system OOMed during boot. In
    almost all of these situations, the user had grossly misconfigured
    huge pages.

    DB users want to pre-allocate just the right amount of huge pages, but
    sometimes they can be really off. In such situations, the huge page
    init code just allocates as many huge pages as it can and reports the
    number allocated. There is no indication that it quit allocating
    because it ran out of memory. Of course, a user could compare the
    number in the message to what they requested on the command line to
    determine if they got all the huge pages they requested. The thought
    was that it would be useful to at least flag this situation. That way,
    the user might be able to better relate the huge page allocation
    failure to the OOM.

    I'm not sure if the e-mail discussion made it obvious that this is
    something he has seen on several occasions.

    I see Michal's point that this will only flag the situation where
    someone configures huge pages very badly. And, a more extensive look
    at the situation of misconfiguring huge pages might be in order. But,
    this has happened on several occasions which led to the creation of
    this patch"

    [akpm@linux-foundation.org: reposition memfmt() to avoid forward declaration]
    Link: http://lkml.kernel.org/r/20170603005413.10380-1-Liam.Howlett@Oracle.com
    Signed-off-by: Liam R. Howlett
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Aneesh Kumar K.V
    Cc: Gerald Schaefer
    Cc: zhongjiang
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liam R. Howlett
     
  • dequeue_hwpoisoned_huge_page() is no longer used, so let's remove it.

    Link: http://lkml.kernel.org/r/1496305019-5493-9-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently hugepage migrated by soft-offline (i.e. due to correctable
    memory errors) is contained as a hugepage, which means many non-error
    pages in it are unreusable, i.e. wasted.

    This patch solves this issue by dissolving source hugepages into buddy.
    As done in previous patch, PageHWPoison is set only on a head page of
    the error hugepage. Then in dissolving we move the PageHWPoison flag
    to the raw error page so that all healthy subpages return back to buddy.

    [arnd@arndb.de: fix warnings: replace some macros with inline functions]
    Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Anshuman Khandual
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • Patch series "mm: hwpoison: fixlet for hugetlb migration".

    This patchset updates the hwpoison/hugetlb code to address 2 reported
    issues.

    One is madvise(MADV_HWPOISON) failure reported by Intel's lkp robot (see
    http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop.) First
    half was already fixed in mainline, and another half about hugetlb cases
    are solved in this series.

    Another issue is "narrow-down error affected region into a single 4kB
    page instead of a whole hugetlb page" issue, which was tried by Anshuman
    (http://lkml.kernel.org/r/20170420110627.12307-1-khandual@linux.vnet.ibm.com)
    and I updated it to apply it more widely.

    This patch (of 9):

    We no longer use MIGRATE_ISOLATE to prevent reuse of hwpoison hugepages
    as we did before. So current dequeue_huge_page_node() doesn't work as
    intended because it still uses is_migrate_isolate_page() for this check.
    This patch fixes it with PageHWPoison flag.
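
    A sketch of the corrected check when picking a page from the per-node
    free list (names follow dequeue_huge_page_node(); not the exact diff):

    list_for_each_entry(page, &h->hugepage_freelists[nid], lru)
        if (!PageHWPoison(page))
            break;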

    Link: http://lkml.kernel.org/r/1496305019-5493-2-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

07 Jul, 2017

9 commits

  • The main allocator function __alloc_pages_nodemask() takes a zonelist
    pointer as one of its parameters. All of its callers directly or
    indirectly obtain the zonelist via node_zonelist() using a preferred
    node id and gfp_mask. We can make the code a bit simpler by doing the
    zonelist lookup in __alloc_pages_nodemask(), passing it a preferred node
    id instead (gfp_mask is already another parameter).

    There are some code size benefits thanks to removal of inlined
    node_zonelist():

    bloat-o-meter add/remove: 2/2 grow/shrink: 4/36 up/down: 399/-1351 (-952)

    This will also make things simpler if we proceed with converting cpusets
    to zonelists.

    Link: http://lkml.kernel.org/r/20170517081140.30654-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Christoph Lameter
    Acked-by: Michal Hocko
    Cc: Dimitri Sivanich
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Anshuman Khandual
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • set_huge_pte_at(), an architecture callback to populate hugepage ptes,
    does not provide the range of virtual memory that is targeted. This
    leads to ambiguity when dealing with swap entries on architectures that
    support hugepages consisting of contiguous ptes.

    Fix the problem by introducing an overridable helper that is called when
    populating the page tables with swap entries. The size of the targeted
    region is provided to the helper to help determine the number of entries
    to be updated.

    Provide a default implementation that maintains the current behaviour.

    [punit.agrawal@arm.com: v4]
    Link: http://lkml.kernel.org/r/20170524115409.31309-8-punit.agrawal@arm.com
    [punit.agrawal@arm.com: add an empty definition for set_huge_swap_pte_at()]
    Link: http://lkml.kernel.org/r/20170525171331.31469-1-punit.agrawal@arm.com
    Link: http://lkml.kernel.org/r/20170522133604.11392-6-punit.agrawal@arm.com
    Signed-off-by: Punit Agrawal
    Acked-by: Steve Capper
    Cc: Mike Kravetz
    Cc: "Aneesh Kumar K.V"
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Mark Rutland
    Cc: Hillf Danton
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
     
  • When unmapping a hugepage range, huge_pte_clear() is used to clear the
    page table entries that are marked as not present. huge_pte_clear()
    internally just ends up calling pte_clear() which does not correctly
    deal with hugepages consisting of contiguous page table entries.

    Add a size argument to address this issue and allow architectures to
    override huge_pte_clear() by wrapping it in a #ifndef block.

    Update s390 implementation with the size parameter as well.

    Note that the change only affects huge_pte_clear() - the other generic
    hugetlb functions don't need any change.
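
    A sketch of the updated generic fallback (close to the asm-generic
    form, reproduced here for illustration):

    #ifndef huge_pte_clear
    static inline void huge_pte_clear(struct mm_struct *mm,
                    unsigned long addr, pte_t *ptep, unsigned long sz)
    {
        /*
         * Architectures with contiguous-PTE hugepages override this and
         * use sz to clear every entry backing the huge page; the default
         * keeps the old single-entry behaviour.
         */
        pte_clear(mm, addr, ptep);
    }
    #endif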

    Link: http://lkml.kernel.org/r/20170522162555.4313-1-punit.agrawal@arm.com
    Signed-off-by: Punit Agrawal
    Acked-by: Martin Schwidefsky [s390 bits]
    Cc: Heiko Carstens
    Cc: Arnd Bergmann
    Cc: "Aneesh Kumar K.V"
    Cc: Mike Kravetz
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Steve Capper
    Cc: Mark Rutland
    Cc: Hillf Danton
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
     
  • A poisoned or migrated hugepage is stored as a swap entry in the page
    tables. On architectures that support hugepages consisting of
    contiguous page table entries (such as on arm64) this leads to ambiguity
    in determining the page table entry to return in huge_pte_offset() when
    a poisoned entry is encountered.

    Let's remove the ambiguity by adding a size parameter to convey
    additional information about the requested address. Also fixup the
    definition/usage of huge_pte_offset() throughout the tree.

    Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.com
    Signed-off-by: Punit Agrawal
    Acked-by: Steve Capper
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: James Hogan (odd fixer:METAG ARCHITECTURE)
    Cc: Ralf Baechle (supporter:MIPS)
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: "Kirill A. Shutemov"
    Cc: Hillf Danton
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
     
  • This moves the #ifdef in C code to a Kconfig dependency. Also we move
    the gigantic_page_supported() function to be arch specific.

    This allows architectures to conditionally enable runtime allocation of
    gigantic huge pages. Architectures like ppc64 support different
    gigantic huge page sizes (16G and 1G) based on the translation mode
    selected. This provides an opportunity for ppc64 to enable runtime
    allocation only w.r.t. 1G hugepages.

    No functional change in this patch.

    Link: http://lkml.kernel.org/r/1494995292-4443-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Michael Ellerman (powerpc)
    Cc: Anshuman Khandual
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Architectures like ppc64 support hugepage sizes that are not mapped to
    any of the page table levels. Instead they add an alternate page
    table entry format called hugepage directory (hugepd). hugepd indicates
    that the page table entry maps to a set of hugetlb pages. Add support
    for this in the generic follow_page_mask code. We already support this
    format in the generic gup code.

    The default implementation prints a warning and returns NULL. We will
    add ppc64 support in later patches.

    Link: http://lkml.kernel.org/r/1494926612-23928-7-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Anshuman Khandual
    Cc: Naoya Horiguchi
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • ppc64 supports pgd hugetlb entries. Add code to handle hugetlb pgd
    entries to follow_page_mask so that ppc64 can switch to it to handle
    hugetlb entries.

    Link: http://lkml.kernel.org/r/1494926612-23928-5-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Signed-off-by: Aneesh Kumar K.V
    Cc: Naoya Horiguchi
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • We will be using this later from the ppc64 code. Change the return type
    to bool.

    Link: http://lkml.kernel.org/r/1494926612-23928-4-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: Anshuman Khandual
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Though migrating gigantic HugeTLB pages does not sound much like a real
    world use case, they can be affected by memory errors. Hence migration
    of PGD level HugeTLB pages should be supported, just to enable the soft
    and hard offline use cases.

    While allocating the new gigantic HugeTLB page, it should not matter
    whether the new page comes from the same node or not. There would be
    very few gigantic pages on the system after all, so we should not be
    bothered about node locality when trying to save a big page from
    crashing.

    This change renames the dequeue_huge_page_node() function to
    dequeue_huge_page_node_exact(), preserving its original functionality.
    Now the new dequeue_huge_page_node() function scans through all
    available online nodes to allocate a huge page for the NUMA_NO_NODE
    case and just falls back to calling dequeue_huge_page_node_exact() for
    all other cases.

    [arnd@arndb.de: make hstate_is_gigantic() inline]
    Link: http://lkml.kernel.org/r/20170522124748.3911296-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170516100509.20122-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Signed-off-by: Arnd Bergmann
    Cc: "Aneesh Kumar K.V"
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

03 Jun, 2017

1 commit

  • KVM uses get_user_pages() to resolve its stage2 faults. KVM sets the
    FOLL_HWPOISON flag causing faultin_page() to return -EHWPOISON when it
    finds a VM_FAULT_HWPOISON. KVM handles these hwpoison pages as a
    special case. (check_user_page_hwpoison())

    When huge pages are involved, this doesn't work so well.
    get_user_pages() calls follow_hugetlb_page(), which stops early if it
    receives VM_FAULT_HWPOISON from hugetlb_fault(), eventually returning
    -EFAULT to the caller. The step to map this to -EHWPOISON based on the
    FOLL_ flags is missing. The hwpoison special case is skipped, and
    -EFAULT is returned to user-space, causing Qemu or kvmtool to exit.

    Instead, move this VM_FAULT_ to errno mapping code into a header file
    and use it from faultin_page() and follow_hugetlb_page().

    With this, KVM works as expected.
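
    The shared mapping, roughly (close to the helper this adds to
    include/linux/mm.h, reproduced here for illustration):

    static inline int vm_fault_to_errno(int vm_fault, int foll_flags)
    {
        if (vm_fault & VM_FAULT_OOM)
            return -ENOMEM;
        if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
            return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
        if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
            return -EFAULT;
        return 0;
    }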

    This isn't a problem for arm64 today as we haven't enabled
    MEMORY_FAILURE, but I can't see any reason this doesn't happen on x86
    too, so I think this should be a fix. This doesn't apply earlier than
    stable's v4.11.1 due to all sorts of cleanup.

    [james.morse@arm.com: add vm_fault_to_errno() call to faultin_page()]
    Link: http://lkml.kernel.org/r/20170525171035.16359-1-james.morse@arm.com
    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170524160900.28786-1-james.morse@arm.com
    Signed-off-by: James Morse
    Acked-by: Punit Agrawal
    Acked-by: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Cc: [4.11.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morse
     

01 Apr, 2017

2 commits

  • Changes to hugetlbfs reservation maps are a two step process. The first
    step is a call to region_chg to determine what needs to be changed, and
    to prepare that change. This should be followed by a call to region_add
    to commit the change, or region_abort to abort the change.

    The error path in hugetlb_reserve_pages called region_abort after a
    failed call to region_chg. As a result, the adds_in_progress counter in
    the reservation map is off by 1. This is caught by a VM_BUG_ON in
    resv_map_release when the reservation map is freed.
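
    A simplified sketch of the corrected error path in
    hugetlb_reserve_pages() (names follow the surrounding function; not the
    exact upstream diff):

    chg = region_chg(resv_map, from, to);
    if (chg < 0) {
        ret = chg;
        goto out_no_abort;    /* region_chg failed: nothing to undo */
    }

    ret = hugetlb_acct_memory(h, gbl_reserve);
    if (ret < 0)
        goto out_abort;

    return 0;

    out_abort:
        /* Only a successful region_chg() bumped adds_in_progress. */
        region_abort(resv_map, from, to);
    out_no_abort:
        return ret;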

    syzkaller fuzzer (when using an injected kmalloc failure) found this
    bug, that resulted in the following:

    kernel BUG at mm/hugetlb.c:742!
    Call Trace:
    hugetlbfs_evict_inode+0x7b/0xa0 fs/hugetlbfs/inode.c:493
    evict+0x481/0x920 fs/inode.c:553
    iput_final fs/inode.c:1515 [inline]
    iput+0x62b/0xa20 fs/inode.c:1542
    hugetlb_file_setup+0x593/0x9f0 fs/hugetlbfs/inode.c:1306
    newseg+0x422/0xd30 ipc/shm.c:575
    ipcget_new ipc/util.c:285 [inline]
    ipcget+0x21e/0x580 ipc/util.c:639
    SYSC_shmget ipc/shm.c:673 [inline]
    SyS_shmget+0x158/0x230 ipc/shm.c:657
    entry_SYSCALL_64_fastpath+0x1f/0xc2
    RIP: resv_map_release+0x265/0x330 mm/hugetlb.c:742

    Link: http://lkml.kernel.org/r/1490821682-23228-1-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Dmitry Vyukov
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • I found the race condition which triggers the following bug when
    move_pages() and soft offline are called on a single hugetlb page
    concurrently.

    Soft offlining page 0x119400 at 0x700000000000
    BUG: unable to handle kernel paging request at ffffea0011943820
    IP: follow_huge_pmd+0x143/0x190
    PGD 7ffd2067
    PUD 7ffd1067
    PMD 0
    [61163.582052] Oops: 0000 [#1] SMP
    Modules linked in: binfmt_misc ppdev virtio_balloon parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq ip_tables xfs libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel ata_piix serio_raw libata virtio_pci 8139cp virtio_ring virtio mii floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cap_check]
    CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P OE 4.11.0-rc2-mm1+ #2
    Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
    RIP: 0010:follow_huge_pmd+0x143/0x190
    RSP: 0018:ffffc90004bdbcd0 EFLAGS: 00010202
    RAX: 0000000465003e80 RBX: ffffea0004e34d30 RCX: 00003ffffffff000
    RDX: 0000000011943800 RSI: 0000000000080001 RDI: 0000000465003e80
    RBP: ffffc90004bdbd18 R08: 0000000000000000 R09: ffff880138d34000
    R10: ffffea0004650000 R11: 0000000000c363b0 R12: ffffea0011943800
    R13: ffff8801b8d34000 R14: ffffea0000000000 R15: 000077ff80000000
    FS: 00007fc977710740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffea0011943820 CR3: 000000007a746000 CR4: 00000000001406f0
    Call Trace:
    follow_page_mask+0x270/0x550
    SYSC_move_pages+0x4ea/0x8f0
    SyS_move_pages+0xe/0x10
    do_syscall_64+0x67/0x180
    entry_SYSCALL64_slow_path+0x25/0x25
    RIP: 0033:0x7fc976e03949
    RSP: 002b:00007ffe72221d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000117
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc976e03949
    RDX: 0000000000c22390 RSI: 0000000000001400 RDI: 0000000000005827
    RBP: 00007ffe72221e00 R08: 0000000000c2c3a0 R09: 0000000000000004
    R10: 0000000000c363b0 R11: 0000000000000246 R12: 0000000000400650
    R13: 00007ffe72221ee0 R14: 0000000000000000 R15: 0000000000000000
    Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 0c 4c 01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49 01 d4 f6 45 bc 04 74 90 8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a
    RIP: follow_huge_pmd+0x143/0x190 RSP: ffffc90004bdbcd0
    CR2: ffffea0011943820
    ---[ end trace e4f81353a2d23232 ]---
    Kernel panic - not syncing: Fatal exception
    Kernel Offset: disabled

    This bug is triggered when pmd_present() returns true for non-present
    hugetlb, so fixing the present check in follow_huge_pmd() prevents it.
    Using pmd_present() to determine present/non-present for hugetlb is not
    correct, because pmd_present() checks multiple bits (not only
    _PAGE_PRESENT) for historical reasons and it can misjudge hugetlb state.

    Fixes: e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()")
    Link: http://lkml.kernel.org/r/1490149898-20231-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Mike Kravetz
    Cc: Christian Borntraeger
    Cc: Gerald Schaefer
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

10 Mar, 2017

1 commit


02 Mar, 2017

1 commit