10 Sep, 2020

1 commit

  • commit 17743798d81238ab13050e8e2833699b54e15467 upstream.

    There is a race between the assignment of `table->data` and the write
    through that pointer in __do_proc_doulongvec_minmax() running on
    another thread:

    CPU0:                                 CPU1:
                                          proc_sys_write
    hugetlb_sysctl_handler                  proc_sys_call_handler
    hugetlb_sysctl_handler_common             hugetlb_sysctl_handler
      table->data = &tmp;                       hugetlb_sysctl_handler_common
                                                  table->data = &tmp;
    proc_doulongvec_minmax
      do_proc_doulongvec_minmax               sysctl_head_finish
        __do_proc_doulongvec_minmax             unuse_table
          i = table->data;
          *i = val; // corrupt CPU1's stack

    Fix this by duplicating `table` and only updating the duplicate. Also
    introduce a helper, proc_hugetlb_doulongvec_minmax(), to simplify the
    code.
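
    A minimal sketch of this approach (illustrative, not necessarily the
    exact upstream patch): copy the ctl_table onto the stack, point the
    copy's ->data at the handler's local variable, and hand only the copy
    to proc_doulongvec_minmax():

    static int proc_hugetlb_doulongvec_minmax(struct ctl_table *table,
                    int write, void *buffer, size_t *length, loff_t *ppos,
                    unsigned long *out)
    {
            struct ctl_table dup_table;

            /*
             * Only the on-stack duplicate is written, so concurrent sysctl
             * handlers can no longer corrupt each other through the shared
             * table->data pointer.
             */
            dup_table = *table;
            dup_table.data = out;

            return proc_doulongvec_minmax(&dup_table, write, buffer, length, ppos);
    }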

    The following oops was seen:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor instruction fetch in kernel mode
    #PF: error_code(0x0010) - not-present page
    Code: Bad RIP value.
    ...
    Call Trace:
    ? set_max_huge_pages+0x3da/0x4f0
    ? alloc_pool_huge_page+0x150/0x150
    ? proc_doulongvec_minmax+0x46/0x60
    ? hugetlb_sysctl_handler_common+0x1c7/0x200
    ? nr_hugepages_store+0x20/0x20
    ? copy_fd_bitmaps+0x170/0x170
    ? hugetlb_sysctl_handler+0x1e/0x20
    ? proc_sys_call_handler+0x2f1/0x300
    ? unregister_sysctl_table+0xb0/0xb0
    ? __fd_install+0x78/0x100
    ? proc_sys_write+0x14/0x20
    ? __vfs_write+0x4d/0x90
    ? vfs_write+0xef/0x240
    ? ksys_write+0xc0/0x160
    ? __ia32_sys_read+0x50/0x50
    ? __close_fd+0x129/0x150
    ? __x64_sys_write+0x43/0x50
    ? do_syscall_64+0x6c/0x200
    ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: Andi Kleen
    Link: http://lkml.kernel.org/r/20200828031146.43035-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Muchun Song
     

26 Aug, 2020

1 commit

  • commit 75802ca66354a39ab8e35822747cd08b3384a99a upstream.

    This is found by code observation only.

    Firstly, the worst case scenario should assume the whole range was covered
    by pmd sharing. The old algorithm might not work as expected for ranges
    like (1g-2m, 1g+2m), where the adjusted range should be (0, 1g+2m) but the
    expected range should be (0, 2g).

    While at it, remove the loop, since it should not be required. With
    that, the new code should also be faster when the range being
    invalidated is huge.
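
    A sketch of the corrected adjustment (illustrative of the idea above):
    align the range outward to PUD_SIZE for the worst case, then clamp it
    to the vma:

    void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
                                              unsigned long *start, unsigned long *end)
    {
            unsigned long a_start, a_end;

            if (!(vma->vm_flags & VM_MAYSHARE))
                    return;

            /* Extend the range to be PUD aligned for a worst case scenario */
            a_start = ALIGN_DOWN(*start, PUD_SIZE);
            a_end = ALIGN(*end, PUD_SIZE);

            /* Intersect with the vma; pmd sharing never crosses a vma boundary */
            *start = max(vma->vm_start, a_start);
            *end = min(vma->vm_end, a_end);
    }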

    Mike said:

    : With range (1g-2m, 1g+2m) within a vma (0, 2g) the existing code will only
    : adjust to (0, 1g+2m) which is incorrect.
    :
    : We should cc stable. The original reason for adjusting the range was to
    : prevent data corruption (getting wrong page). Since the range is not
    : always adjusted correctly, the potential for corruption still exists.
    :
    : However, I am fairly confident that adjust_range_if_pmd_sharing_possible
    : is only going to be called in two cases:
    :
    : 1) for a single page
    : 2) for range == entire vma
    :
    : In those cases, the current code should produce the correct results.
    :
    : To be safe, let's just cc stable.

    Fixes: 017b1660df89 ("mm: migration: fix migration of huge PMD shared pages")
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox
    Cc:
    Link: http://lkml.kernel.org/r/20200730201636.74778-1-peterx@redhat.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mike Kravetz
    Signed-off-by: Greg Kroah-Hartman

    Peter Xu
     

29 Apr, 2020

1 commit

  • commit 3c1d7e6ccb644d517a12f73a7ff200870926f865 upstream.

    Our machine encountered a panic (addressing exception) after running
    for a long time. The calltrace is:

    RIP: hugetlb_fault+0x307/0xbe0
    RSP: 0018:ffff9567fc27f808 EFLAGS: 00010286
    RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
    RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
    RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
    R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
    R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
    FS: 00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    follow_hugetlb_page+0x175/0x540
    __get_user_pages+0x2a0/0x7e0
    __get_user_pages_unlocked+0x15d/0x210
    __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
    try_async_pf+0x6e/0x2a0 [kvm]
    tdp_page_fault+0x151/0x2d0 [kvm]
    ...
    kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
    kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
    do_vfs_ioctl+0x3f0/0x540
    SyS_ioctl+0xa1/0xc0
    system_call_fastpath+0x22/0x27

    For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
    may return a wrong 'pmdp' if there is a race. Please look at the
    following code snippet:

    ...
    pud = pud_offset(p4d, addr);
    if (sz != PUD_SIZE && pud_none(*pud))
            return NULL;
    /* hugepage or swap? */
    if (pud_huge(*pud) || !pud_present(*pud))
            return (pte_t *)pud;

    pmd = pmd_offset(pud, addr);
    if (sz != PMD_SIZE && pmd_none(*pmd))
            return NULL;
    /* hugepage or swap? */
    if (pmd_huge(*pmd) || !pmd_present(*pmd))
            return (pte_t *)pmd;
    ...

    The following sequence would trigger this bug:

    - CPU0: sz = PUD_SIZE and *pud = 0, continue
    - CPU0: "pud_huge(*pud)" is false
    - CPU1: calling hugetlb_no_page and set *pud to xxxx8e7(PRESENT)
    - CPU0: "!pud_present(*pud)" is false, continue
    - CPU0: pmd = pmd_offset(pud, addr) and maybe return a wrong pmdp

    However, we want CPU0 to return NULL or pudp in this case.

    We must make sure there is exactly one dereference of pud and pmd.
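
    A minimal sketch of that rule (assuming the fix reads each entry once
    into a local via READ_ONCE() and tests only the local copy; the exact
    upstream patch may differ):

    pud_t *pud, pud_entry;
    pmd_t *pmd, pmd_entry;

    pud = pud_offset(p4d, addr);
    pud_entry = READ_ONCE(*pud);            /* single dereference */
    if (sz != PUD_SIZE && pud_none(pud_entry))
            return NULL;
    /* hugepage or swap? */
    if (pud_huge(pud_entry) || !pud_present(pud_entry))
            return (pte_t *)pud;

    pmd = pmd_offset(pud, addr);
    pmd_entry = READ_ONCE(*pmd);            /* single dereference */
    if (sz != PMD_SIZE && pmd_none(pmd_entry))
            return NULL;
    /* hugepage or swap? */
    if (pmd_huge(pmd_entry) || !pmd_present(pmd_entry))
            return (pte_t *)pmd;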

    Signed-off-by: Longpeng
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Reviewed-by: Jason Gunthorpe
    Cc: Matthew Wilcox
    Cc: Sean Christopherson
    Cc:
    Link: http://lkml.kernel.org/r/20200413010342.771-1-longpeng2@huawei.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Longpeng
     

09 Jan, 2020

1 commit

  • [ Upstream commit c77c0a8ac4c522638a8242fcb9de9496e3cdbb2d ]

    The following lockdep splat was observed when a certain hugetlbfs test
    was run:

    ================================
    WARNING: inconsistent lock state
    4.18.0-159.el8.x86_64+debug #1 Tainted: G W --------- - -
    --------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    swapper/30/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    ffffffff9acdc038 (hugetlb_lock){+.?.}, at: free_huge_page+0x36f/0xaa0
    {SOFTIRQ-ON-W} state was registered at:
    lock_acquire+0x14f/0x3b0
    _raw_spin_lock+0x30/0x70
    __nr_hugepages_store_common+0x11b/0xb30
    hugetlb_sysctl_handler_common+0x209/0x2d0
    proc_sys_call_handler+0x37f/0x450
    vfs_write+0x157/0x460
    ksys_write+0xb8/0x170
    do_syscall_64+0xa5/0x4d0
    entry_SYSCALL_64_after_hwframe+0x6a/0xdf
    irq event stamp: 691296
    hardirqs last enabled at (691296): [] _raw_spin_unlock_irqrestore+0x4b/0x60
    hardirqs last disabled at (691295): [] _raw_spin_lock_irqsave+0x22/0x81
    softirqs last enabled at (691284): [] irq_enter+0xc3/0xe0
    softirqs last disabled at (691285): [] irq_exit+0x23e/0x2b0

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(hugetlb_lock);

    lock(hugetlb_lock);

    *** DEADLOCK ***
    :
    Call Trace:

    __lock_acquire+0x146b/0x48c0
    lock_acquire+0x14f/0x3b0
    _raw_spin_lock+0x30/0x70
    free_huge_page+0x36f/0xaa0
    bio_check_pages_dirty+0x2fc/0x5c0
    clone_endio+0x17f/0x670 [dm_mod]
    blk_update_request+0x276/0xe50
    scsi_end_request+0x7b/0x6a0
    scsi_io_completion+0x1c6/0x1570
    blk_done_softirq+0x22e/0x350
    __do_softirq+0x23d/0xad8
    irq_exit+0x23e/0x2b0
    do_IRQ+0x11a/0x200
    common_interrupt+0xf/0xf

    Both the hugetlb_lock and the subpool lock can be acquired in
    free_huge_page(). One way to solve the problem is to make both locks
    irq-safe. However, Mike Kravetz had learned that the hugetlb_lock is
    held for a linear scan of ALL hugetlb pages during a cgroup reparenting
    operation, which is just too long to have irqs disabled unless we can
    break hugetlb_lock down into finer-grained locks with shorter lock hold
    times.

    Another alternative is to defer the freeing to a workqueue job. This
    patch implements the deferred freeing by adding a free_hpage_workfn()
    work function to do the actual freeing. The free_huge_page() call in a
    non-task context saves the page to be freed in the hpage_freelist linked
    list in a lockless manner using the llist APIs.

    The generic workqueue is used to process the work, but a dedicated
    workqueue can be used instead if it is desirable to have the huge page
    freed ASAP.

    Thanks to Kirill Tkhai for suggesting the use of the
    llist APIs, which simplify the code.
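
    A condensed sketch of the mechanism (illustrative; __free_huge_page
    below stands in for the existing freeing path that takes hugetlb_lock):

    static struct llist_head hpage_freelist = LLIST_HEAD_INIT(hpage_freelist);

    static void free_hpage_workfn(struct work_struct *work)
    {
            struct llist_node *node = llist_del_all(&hpage_freelist);

            while (node) {
                    struct page *page;

                    /* The llist node is stashed in the unused ->mapping field */
                    page = container_of((struct address_space **)node,
                                        struct page, mapping);
                    node = node->next;
                    __free_huge_page(page);         /* takes hugetlb_lock */
            }
    }
    static DECLARE_WORK(free_hpage_work, free_hpage_workfn);

    void free_huge_page(struct page *page)
    {
            /*
             * Defer the free when not in task context (e.g. softirq), where
             * taking the non-irq-safe hugetlb_lock could deadlock against an
             * interrupted lock holder.
             */
            if (!in_task()) {
                    if (llist_add((struct llist_node *)&page->mapping,
                                  &hpage_freelist))
                            schedule_work(&free_hpage_work);
                    return;
            }

            __free_huge_page(page);
    }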

    Link: http://lkml.kernel.org/r/20191217170331.30893-1-longman@redhat.com
    Signed-off-by: Waiman Long
    Reviewed-by: Mike Kravetz
    Acked-by: Davidlohr Bueso
    Acked-by: Michal Hocko
    Reviewed-by: Kirill Tkhai
    Cc: Aneesh Kumar K.V
    Cc: Matthew Wilcox
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Waiman Long
     

19 Oct, 2019

1 commit

  • Uninitialized memmaps contain garbage and in the worst case trigger
    kernel BUGs, especially with CONFIG_PAGE_POISONING. They should not get
    touched.

    Let's make sure that we only consider online memory (managed by the
    buddy) that has initialized memmaps. ZONE_DEVICE is not applicable.

    page_zone() will call page_to_nid(), which will trigger
    VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING
    and CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps. This
    can be the case when an offline memory block (e.g., never onlined) is
    spanned by a zone.

    Note: As explained by Michal in [1], alloc_contig_range() will verify
    the range. So it boils down to the wrong access in this function.

    [1] http://lkml.kernel.org/r/20180423000943.GO17484@dhcp22.suse.cz
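
    A sketch of the safe access pattern described above (a fragment,
    assuming a helper that validates a pfn range against a given zone; pfn
    and zone are the helper's loop variable and target zone):

    struct page *page = pfn_to_online_page(pfn);

    /* Offline ranges (and ZONE_DEVICE) may have uninitialized memmaps */
    if (!page)
            return false;

    if (page_zone(page) != zone)
            return false;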

    Link: http://lkml.kernel.org/r/20191015120717.4858-1-david@redhat.com
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
    Signed-off-by: David Hildenbrand
    Reported-by: Michal Hocko
    Acked-by: Michal Hocko
    Reviewed-by: Mike Kravetz
    Cc: Anshuman Khandual
    Cc: [4.13+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

25 Sep, 2019

1 commit

  • When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages, the
    pages will be interleaved between all nodes of the system. If nodes are
    not equal, it is quite possible for one node to fill up before the others.
    When this happens, the code still attempts to allocate pages from the
    full node. This results in calls to direct reclaim and compaction which
    slow things down considerably.

    When allocating pool pages, note the state of the previous allocation for
    each node. If previous allocation failed, do not use the aggressive retry
    algorithm on successive attempts. The allocation will still succeed if
    there is memory available, but it will not try as hard to free up memory.
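
    A sketch of the per-node retry tracking (illustrative; the nodemask
    name and exact flag handling are assumptions): drop __GFP_RETRY_MAYFAIL
    for nodes whose previous pool-page allocation failed, and update the
    mask after each attempt:

    struct page *page;
    gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_COMP | __GFP_NOWARN;
    bool alloc_try_hard = true;

    /* A previous failure on this node means: do not try as hard now */
    if (node_alloc_noretry && node_isset(nid, *node_alloc_noretry))
            alloc_try_hard = false;
    if (alloc_try_hard)
            gfp_mask |= __GFP_RETRY_MAYFAIL;

    page = __alloc_pages_node(nid, gfp_mask, huge_page_order(h));

    /* Remember the outcome for the next allocation attempt on this node */
    if (node_alloc_noretry) {
            if (page)
                    node_clear(nid, *node_alloc_noretry);
            else
                    node_set(nid, *node_alloc_noretry);
    }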

    Link: http://lkml.kernel.org/r/20190806014744.15446-5-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

14 Aug, 2019

1 commit

  • Li Wang discovered that LTP/move_page12 V2 sometimes triggers SIGBUS in
    the kernel-v5.2.3 testing. This is caused by a race between hugetlb
    page migration and page fault.

    If a hugetlb page can not be allocated to satisfy a page fault, the task
    is sent SIGBUS. This is normal hugetlbfs behavior. A hugetlb fault
    mutex exists to prevent two tasks from trying to instantiate the same
    page. This protects against the situation where there is only one
    hugetlb page, and both tasks would try to allocate. Without the mutex,
    one would fail and SIGBUS even though the other fault would be
    successful.

    There is a similar race between hugetlb page migration and fault.
    Migration code will allocate a page for the target of the migration. It
    will then unmap the original page from all page tables. It does this
    unmap by first clearing the pte and then writing a migration entry. The
    page table lock is held for the duration of this clear and write
    operation. However, the beginning of the hugetlb page fault code
    optimistically checks the pte without taking the page table lock. If
    the pte is clear (as it can be during the migration unmap operation), a
    hugetlb page allocation is attempted to satisfy the fault. Note that the page
    which will eventually satisfy this fault was already allocated by the
    migration code. However, the allocation within the fault path could
    fail which would result in the task incorrectly being sent SIGBUS.

    Ideally, we could take the hugetlb fault mutex in the migration code
    when modifying the page tables. However, locks must be taken in the
    order of hugetlb fault mutex, page lock, page table lock. This would
    require significant rework of the migration code. Instead, the issue is
    addressed in the hugetlb fault code. After failing to allocate a huge
    page, take the page table lock and check for huge_pte_none before
    returning an error. This is the same check that must be made further in
    the code even if page allocation is successful.
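
    A sketch of the fix in the fault path (illustrative fragment; error
    handling simplified): if the allocation fails, recheck the pte under
    the page table lock before sending SIGBUS:

    page = alloc_huge_page(vma, haddr, 0);
    if (IS_ERR(page)) {
            /*
             * The pte may have been cleared transiently by the migration
             * unmap path; if it is populated again by the time we hold the
             * lock, just return success instead of SIGBUS.
             */
            ptl = huge_pte_lock(h, mm, ptep);
            if (!huge_pte_none(huge_ptep_get(ptep))) {
                    ret = 0;
                    spin_unlock(ptl);
                    goto out;
            }
            spin_unlock(ptl);
            ret = vmf_error(PTR_ERR(page));
            goto out;
    }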

    Link: http://lkml.kernel.org/r/20190808000533.7701-1-mike.kravetz@oracle.com
    Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
    Signed-off-by: Mike Kravetz
    Reported-by: Li Wang
    Tested-by: Li Wang
    Reviewed-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Cc: Cyril Hrubis
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

29 Jun, 2019

1 commit

  • madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when soft offlining
    hugepages with overcommitting enabled. That is caused by suboptimal
    handling in the current soft-offline code. See the following part:

    ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
                        MIGRATE_SYNC, MR_MEMORY_FAILURE);
    if (ret) {
            ...
    } else {
            /*
             * We set PG_hwpoison only when the migration source hugepage
             * was successfully dissolved, because otherwise hwpoisoned
             * hugepage remains on free hugepage list, then userspace will
             * find it as SIGBUS by allocation failure. That's not expected
             * in soft-offlining.
             */
            ret = dissolve_free_huge_page(page);
            if (!ret) {
                    if (set_hwpoison_free_buddy_page(page))
                            num_poisoned_pages_inc();
            }
    }
    return ret;

    Here dissolve_free_huge_page() returns -EBUSY if the migration source
    page was freed into the buddy allocator in migrate_pages(), but even in
    that case set_hwpoison_free_buddy_page() still has a chance to succeed.
    That means the current code gives up offlining too early.

    dissolve_free_huge_page() checks whether a given hugepage is suitable
    for dissolving; we should return success for the !PageHuge() case
    because the given hugepage is then considered already dissolved.
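
    A sketch of the adjusted check (illustrative): treat a non-hugetlb page
    as already dissolved and return success instead of -EBUSY:

    int dissolve_free_huge_page(struct page *page)
    {
            int rc = -EBUSY;

            /* Not a hugetlb page (e.g. already freed to buddy): nothing to do */
            if (!PageHuge(page))
                    return 0;

            spin_lock(&hugetlb_lock);
            if (!PageHuge(page)) {
                    rc = 0;
                    goto out;
            }
            /* ... existing dissolve logic ... */
    out:
            spin_unlock(&hugetlb_lock);
            return rc;
    }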

    This change also affects other callers of dissolve_free_huge_page(), which
    are cleaned up together.

    [n-horiguchi@ah.jp.nec.com: v3]
    Link: http://lkml.kernel.org/r/1560761476-4651-3-git-send-email-n-horiguchi@ah.jp.nec.com
    Link: http://lkml.kernel.org/r/1560154686-18497-3-git-send-email-n-horiguchi@ah.jp.nec.com
    Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
    Signed-off-by: Naoya Horiguchi
    Reported-by: Chen, Jerry T
    Tested-by: Chen, Jerry T
    Reviewed-by: Mike Kravetz
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Xishi Qiu
    Cc: "Chen, Jerry T"
    Cc: "Zhuo, Qiuxu"
    Cc: [4.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

9 commits

  • Continuing discussion about 58b6e5e8f1ad ("hugetlbfs: fix memory leak for
    resv_map") brought up the issue that inode->i_mapping may not point to the
    address space embedded within the inode at inode eviction time. The
    hugetlbfs truncate routine handles this by explicitly using inode->i_data.
    However, code cleaning up the resv_map will still use the address space
    pointed to by inode->i_mapping. Luckily, private_data is NULL for
    address spaces in all such cases today, but there is no guarantee this
    will continue.

    Change all hugetlbfs code getting a resv_map pointer to explicitly get it
    from the address space embedded within the inode. In addition, add more
    comments in the code to indicate why this is being done.
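
    A sketch of the accessor this implies (illustrative; the helper name is
    an assumption): always go through the address space embedded in the
    inode rather than inode->i_mapping:

    static struct resv_map *inode_resv_map(struct inode *inode)
    {
            /*
             * At inode evict time, i_mapping may not point to the original
             * address space within the inode, so use the embedded i_data.
             */
            return (struct resv_map *)(&inode->i_data)->private_data;
    }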

    Link: http://lkml.kernel.org/r/20190419204435.16984-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Yufen Yu
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • This updates each existing invalidation to use the correct mmu notifier
    event that represents what is happening to the CPU page table. See the
    patch which introduced the events for the rationale behind this.

    Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • CPU page table updates can happen for many reasons, not only as a result
    of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
    a result of kernel activities (memory compression, reclaim, migration,
    ...).

    Users of the mmu notifier API track changes to the CPU page table and
    take specific action for them. However, the current API only provides
    the range of virtual addresses affected by the change, not why the
    change is happening.

    This patchset does the initial mechanical conversion of all the places
    that call mmu_notifier_range_init to also provide the default
    MMU_NOTIFY_UNMAP event as well as the vma if it is known (most
    invalidations happen against a given vma). Passing down the vma allows
    the users of mmu notifiers to inspect the new vma page protection.

    MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifiers
    should assume that everything in the range is going away when that
    event happens. A later patch converts the mm call paths to use more
    appropriate events for each call.

    This is done as two patches so that no call site is forgotten,
    especially as it uses the following coccinelle patch:

    %<----------------------------------------------------------------------
    @@
    expression E1, E3, E4;
    identifier I2;
    @@
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, I2,
    I2->vm_mm, E3, E4)
    ...>

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(..., struct vm_area_struct *VMA, ...) {
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, VMA,
    E2, E3, E4)
    ...>
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(...) {
    struct vm_area_struct *VMA;
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, VMA,
    E2, E3, E4)
    ...>
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN;
    @@
    FN(...) {
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, NULL,
    E2, E3, E4)
    ...>
    }
    ---------------------------------------------------------------------->%

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • hugetlb uses a fault mutex hash table to prevent concurrent page faults
    on the same pages. The key for shared and private mappings is
    different: shared mappings key off the address_space and file index,
    while private mappings key off the mm and virtual address. Consider a
    private mapping of a populated hugetlbfs file. A fault will map the
    page from the file and, if needed, do a COW to map a writable page.

    Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
    pages. It uses the address_space file index key. However, private
    mappings will use a different key and could race with this code to map
    the file page. This causes problems (BUG) for the page cache remove
    code as it expects the page to be unmapped. A sample stack is:

    page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
    kernel BUG at mm/filemap.c:169!
    ...
    RIP: 0010:unaccount_page_cache_page+0x1b8/0x200
    ...
    Call Trace:
    __delete_from_page_cache+0x39/0x220
    delete_from_page_cache+0x45/0x70
    remove_inode_hugepages+0x13c/0x380
    ? __add_to_page_cache_locked+0x162/0x380
    hugetlbfs_fallocate+0x403/0x540
    ? _cond_resched+0x15/0x30
    ? __inode_security_revalidate+0x5d/0x70
    ? selinux_file_permission+0x100/0x130
    vfs_fallocate+0x13f/0x270
    ksys_fallocate+0x3c/0x80
    __x64_sys_fallocate+0x1a/0x20
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    There seems to be another potential COW issue/race with this approach
    of different private and shared keys as noted in commit 8382d914ebf7
    ("mm, hugetlb: improve page-fault scalability").

    Since every hugetlb mapping (even anon and private) is actually a file
    mapping, just use the address_space index key for all mappings. This
    results in potentially more hash collisions. However, this should not
    be the common case.
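
    A sketch of a single-key hash along these lines (illustrative; the
    exact upstream helper may differ in signature; num_fault_mutexes is the
    existing, power-of-two table size):

    u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
    {
            unsigned long key[2];
            u32 hash;

            /* Every hugetlb mapping is file backed: key off mapping + index */
            key[0] = (unsigned long)mapping;
            key[1] = idx;

            hash = jhash2((u32 *)&key, sizeof(key) / sizeof(u32), 0);

            return hash & (num_fault_mutexes - 1);
    }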

    Link: http://lkml.kernel.org/r/20190328234704.27083-3-mike.kravetz@oracle.com
    Link: http://lkml.kernel.org/r/20190412165235.t4sscoujczfhuiyt@linux-r8p5
    Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Davidlohr Bueso
    Cc: Joonsoo Kim
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • When a huge page is allocated, PagePrivate() is set if the allocation
    consumed a reservation. When freeing a huge page, PagePrivate is checked.
    If set, it indicates the reservation should be restored. PagePrivate
    being set at free huge page time mostly happens on error paths.

    When huge page reservations are created, a check is made to determine if
    the mapping is associated with an explicitly mounted filesystem. If so,
    pages are also reserved within the filesystem. The default action when
    freeing a huge page is to decrement the usage count in any associated
    explicitly mounted filesystem. However, if the reservation is to be
    restored, the reservation/use count within the filesystem should not be
    decremented. Otherwise, a subsequent page allocation and free for the
    same mapping location will cause the filesystem usage to go 'negative'.

    Filesystem   Size   Used  Avail  Use%  Mounted on
    nodev        4.0G  -4.0M   4.1G     -  /opt/hugepool

    To fix, when freeing a huge page do not adjust filesystem usage if
    PagePrivate() is set to indicate the reservation should be restored.
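
    A sketch of this in the free path (illustrative fragment; spool is the
    page's subpool): skip the subpool accounting when the page carries the
    restore-reservation marker:

    bool restore_reserve = PagePrivate(page);

    ClearPagePrivate(page);

    /*
     * Only return the page's count to the subpool when the reservation is
     * NOT being restored; otherwise the same offset would be credited back
     * twice.
     */
    if (!restore_reserve)
            hugepage_subpool_put_pages(spool, 1);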

    I did not cc stable as the problem has been around since reserves were
    added to hugetlbfs and nobody has noticed.

    Link: http://lkml.kernel.org/r/20190328234704.27083-2-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • NODEMASK_ALLOC is used to allocate a nodemask bitmap, and it does it by
    first determining whether it should be allocated on the stack or
    dynamically, depending on NODES_SHIFT. Right now, it goes the dynamic
    path whenever the nodemask_t is above 32 bytes.

    Although we could bump it to a reasonable value, the largest a nodemask_t
    can get is 128 bytes, so since __nr_hugepages_store_common is called from
    a rather short stack we can just get rid of the NODEMASK_ALLOC call here.

    This reduces some code churn and complexity.

    Link: http://lkml.kernel.org/r/20190402133415.21983-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Alex Ghiti
    Cc: David Rientjes
    Cc: Jing Xiangfeng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • The number of node specific huge pages can be set via a file such as:
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
    When a node specific value is specified, the global number of huge pages
    must also be adjusted. This adjustment is calculated as the specified
    node specific value + (global value - current node value). If the node
    specific value provided by the user is large enough, this calculation
    could overflow an unsigned long leading to a smaller than expected number
    of huge pages.

    To fix, check the calculation for overflow. If overflow is detected,
    use ULONG_MAX as the requested value. This is in line with the user's
    request to allocate as many huge pages as possible.

    It was also noticed that the above calculation was done outside the
    hugetlb_lock. Therefore, the values could be inconsistent and result in
    underflow. To fix, the calculation is moved within the routine
    set_max_huge_pages() where the lock is held.

    In addition, the code in __nr_hugepages_store_common() which tries to
    handle the case of not being able to allocate a node mask would likely
    result in incorrect behavior. Luckily, it is very unlikely we will ever
    take this path. If we do, simply return ENOMEM.
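
    A sketch of the check (illustrative fragment, performed with
    hugetlb_lock held in set_max_huge_pages()):

    if (nid != NUMA_NO_NODE) {
            unsigned long old_count = count;

            count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
            /*
             * A very large user value can wrap the sum; treat that as a
             * request for as many huge pages as possible.
             */
            if (count < old_count)
                    count = ULONG_MAX;
    }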

    Link: http://lkml.kernel.org/r/20190328220533.19884-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Jing Xiangfeng
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Oscar Salvador
    Cc: David Rientjes
    Cc: Alex Ghiti
    Cc: Jing Xiangfeng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • On systems that support gigantic pages but do not have CONTIG_ALLOC
    enabled, boot-time reserved gigantic pages cannot be freed at all. This
    patch simply enables the possibility of handing those pages back to the
    memory allocator.

    Link: http://lkml.kernel.org/r/20190327063626.18421-5-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: David S. Miller [sparc]
    Reviewed-by: Mike Kravetz
    Cc: Andy Lutomirsky
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Dave Hansen
    Cc: Heiko Carstens
    Cc: "H . Peter Anvin"
    Cc: Ingo Molnar
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rich Felker
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • A spinlock recursion happened when running the LTP hugetlb tests:
    #!/bin/bash
    ./runltp -p -f hugetlb &
    ./runltp -p -f hugetlb &
    ./runltp -p -f hugetlb &
    ./runltp -p -f hugetlb &
    ./runltp -p -f hugetlb &

    The dtor returned by get_compound_page_dtor() in __put_compound_page()
    may be free_huge_page(), which takes hugetlb_lock, so do not call
    put_page() while holding hugetlb_lock.

    BUG: spinlock recursion on CPU#0, hugemmap05/1079
    lock: hugetlb_lock+0x0/0x18, .magic: dead4ead, .owner: hugemmap05/1079, .owner_cpu: 0
    Call trace:
    dump_backtrace+0x0/0x198
    show_stack+0x24/0x30
    dump_stack+0xa4/0xcc
    spin_dump+0x84/0xa8
    do_raw_spin_lock+0xd0/0x108
    _raw_spin_lock+0x20/0x30
    free_huge_page+0x9c/0x260
    __put_compound_page+0x44/0x50
    __put_page+0x2c/0x60
    alloc_surplus_huge_page.constprop.19+0xf0/0x140
    hugetlb_acct_memory+0x104/0x378
    hugetlb_reserve_pages+0xe0/0x250
    hugetlbfs_file_mmap+0xc0/0x140
    mmap_region+0x3e8/0x5b0
    do_mmap+0x280/0x460
    vm_mmap_pgoff+0xf4/0x128
    ksys_mmap_pgoff+0xb4/0x258
    __arm64_sys_mmap+0x34/0x48
    el0_svc_common+0x78/0x130
    el0_svc_handler+0x38/0x78
    el0_svc+0x8/0xc
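
    A sketch of the fix (illustrative fragment from the surplus-page path):
    release hugetlb_lock around the final put_page(), re-acquiring it
    afterwards if the caller still needs it:

    /* put_page() may reach free_huge_page(), which takes hugetlb_lock */
    spin_unlock(&hugetlb_lock);
    put_page(page);
    spin_lock(&hugetlb_lock);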

    Link: http://lkml.kernel.org/r/b8ade452-2d6b-0372-32c2-703644032b47@huawei.com
    Fixes: 9980d744a0 ("mm, hugetlb: get rid of surplus page accounting tricks")
    Signed-off-by: Kai Shen
    Signed-off-by: Feilong Lin
    Reported-by: Wang Wang
    Reviewed-by: Oscar Salvador
    Reviewed-by: Mike Kravetz
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kai Shen
     

07 May, 2019

1 commit

  • Pull unified TLB flushing from Ingo Molnar:
    "This contains the generic mmu_gather feature from Peter Zijlstra,
    which is an all-arch unification of TLB flushing APIs, via the
    following (broad) steps:

    - enhance the APIs to cover more arch details

    - convert most TLB flushing arch implementations to the generic
    APIs.

    - remove leftovers of per arch implementations

    After this series every single architecture makes use of the unified
    TLB flushing APIs"

    * 'core-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mm/resource: Use resource_overlaps() to simplify region_intersects()
    ia64/tlb: Eradicate tlb_migrate_finish() callback
    asm-generic/tlb: Remove tlb_table_flush()
    asm-generic/tlb: Remove tlb_flush_mmu_free()
    asm-generic/tlb: Remove CONFIG_HAVE_GENERIC_MMU_GATHER
    asm-generic/tlb: Remove arch_tlb*_mmu()
    s390/tlb: Convert to generic mmu_gather
    asm-generic/tlb: Introduce CONFIG_HAVE_MMU_GATHER_NO_GATHER=y
    arch/tlb: Clean up simple architectures
    um/tlb: Convert to generic mmu_gather
    sh/tlb: Convert SH to generic mmu_gather
    ia64/tlb: Convert to generic mmu_gather
    arm/tlb: Convert to generic mmu_gather
    asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
    asm-generic/tlb, ia64: Conditionally provide tlb_migrate_finish()
    asm-generic/tlb: Provide generic tlb_flush() based on flush_tlb_mm()
    asm-generic/tlb, arch: Provide generic tlb_flush() based on flush_tlb_range()
    asm-generic/tlb, arch: Provide generic VIPT cache flush
    asm-generic/tlb, arch: Provide CONFIG_HAVE_MMU_GATHER_PAGE_SIZE
    asm-generic/tlb: Provide a comment

    Linus Torvalds
     

15 Apr, 2019

2 commits

  • Merge page ref overflow branch.

    Jann Horn reported that he can overflow the page ref count with
    sufficient memory (and a filesystem that is intentionally extremely
    slow).

    Admittedly it's not exactly easy. To have more than four billion
    references to a page requires a minimum of 32GB of kernel memory just
    for the pointers to the pages, much less any metadata to keep track of
    those pointers. Jann needed a total of 140GB of memory and a specially
    crafted filesystem that leaves all reads pending (in order to not ever
    free the page references and just keep adding more).

    Still, we have a fairly straightforward way to limit the two obvious
    user-controllable sources of page references: direct-IO like page
    references gotten through get_user_pages(), and the splice pipe page
    duplication. So let's just do that.

    * branch page-refs:
    fs: prevent page refcount overflow in pipe_buf_get
    mm: prevent get_user_pages() from overflowing page refcount
    mm: add 'try_get_page()' helper function
    mm: make page ref count overflow check tighter and more explicit

    Linus Torvalds
     
  • If the page refcount wraps around past zero, it will be freed while
    there are still four billion references to it. One of the possible
    avenues for an attacker to try to make this happen is by doing direct IO
    on a page multiple times. This patch makes get_user_pages() refuse to
    take a new page reference if there are already more than two billion
    references to the page.
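
    A sketch of the guard (illustrative; treating the refcount as a signed
    value makes "more than two billion references" show up as a negative
    count):

    /*
     * A negative signed refcount means over two billion references already
     * exist; refuse to add another rather than risk wrapping to zero and a
     * use-after-free.
     */
    if (WARN_ON_ONCE(page_ref_count(page) < 0))
            return NULL;            /* caller fails the GUP instead */
    get_page(page);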

    Reported-by: Jann Horn
    Acked-by: Matthew Wilcox
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Apr, 2019

1 commit

  • Move the mmu_gather::page_size things into the generic code instead of
    PowerPC specific bits.

    No change in behavior intended.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

06 Mar, 2019

4 commits

  • This patch updates get_user_pages_longterm to migrate pages allocated
    out of CMA region. This makes sure that we don't keep non-movable pages
    (due to page reference count) in the CMA area.

    This will be used by ppc64 in a later patch to avoid pinning pages in
    the CMA region. ppc64 uses the CMA region for allocating the hardware
    page table (hash page table), and not being able to migrate pages out
    of the CMA region results in page table allocation failures.

    One case where we hit this easily is when a guest uses a VFIO
    passthrough device. VFIO locks all of the guest's memory, and if the
    guest memory is backed by the CMA region it becomes unmovable,
    fragmenting the CMA and possibly preventing other guests from
    allocating a large enough hash page table.

    NOTE: We allocate the new page without using __GFP_THISNODE

    Link: http://lkml.kernel.org/r/20190114095438.32470-3-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Alexey Kardashevskiy
    Cc: Andrea Arcangeli
    Cc: David Gibson
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Architectures like ppc64 need to do a conditional TLB flush based on
    the old and new pte values. Follow the regular pte change-protection
    sequence for hugetlb too, so that architectures can override the update
    sequence.
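
    A sketch of the resulting sequence in hugetlb_change_protection()
    (illustrative; the helpers follow the regular ptep_modify_prot_start/
    commit pattern introduced for hugetlb by this series):

    pte_t old_pte, new_pte;

    old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
    new_pte = pte_mkhuge(huge_pte_modify(old_pte, newprot));
    huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, new_pte);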

    Link: http://lkml.kernel.org/r/20190116085035.29729-5-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Martin Schwidefsky
    Cc: Nicholas Piggin
    Cc: Paul Mackerras
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Patch series "arm64/mm: Enable HugeTLB migration", v4.

    This patch series enables HugeTLB migration support for all supported
    huge page sizes at all levels including contiguous bit implementation.
    Following HugeTLB migration support matrix has been enabled with this
    patch series. All permutations have been tested except for the 16GB.

            CONT PTE    PMD    CONT PMD    PUD
            --------    ---    --------    ---
      4K:      64K      2M        32M      1G
      16K:      2M     32M         1G
      64K:      2M    512M        16G

    First the series adds migration support for PUD based huge pages. It
    then adds a platform specific hook to query an architecture if a given
    huge page size is supported for migration while also providing a default
    fallback option preserving the existing semantics which just checks for
    (PMD|PUD|PGDIR)_SHIFT macros. The last two patches enables HugeTLB
    migration on arm64 and subscribe to this new platform specific hook by
    defining an override.

    The second patch differentiates between movability and migratability
    aspects of huge pages and implements hugepage_movable_supported() which
    can then be used during allocation to decide whether to place the huge
    page in movable zone or not.

    This patch (of 5):

    During huge page allocation, its migratability is checked to determine
    whether it should be placed in a movable zone with GFP_HIGHUSER_MOVABLE.
    But the movability aspect of the huge page could depend on other factors
    than just migratability. Movability in itself is a distinct property
    which should not be tied with migratability alone.

    This differentiates these two and implements an enhanced movability
    check which also considers the huge page size, to determine whether the
    page can feasibly be placed in a movable zone. At present it just
    checks for gigantic pages, but going forward it can incorporate other
    checks.
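
    A sketch of such a check (illustrative):

    static inline bool hugepage_movable_supported(struct hstate *h)
    {
            if (!hugepage_migration_supported(h))
                    return false;

            /*
             * Gigantic pages cannot be (re)allocated at runtime, so keep
             * them out of ZONE_MOVABLE even when they are migratable.
             */
            if (hstate_is_gigantic(h))
                    return false;

            return true;
    }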

    Link: http://lkml.kernel.org/r/1545121450-1663-2-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: Steve Capper
    Reviewed-by: Naoya Horiguchi
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

    All these places for replacement were found by running the following
    grep patterns on the entire kernel code. Please let me know if this
    might have missed some instances. This might also have replaced some
    false positives. I will appreciate suggestions, inputs and review.

    1. git grep "nid == -1"
    2. git grep "node == -1"
    3. git grep "nid = -1"
    4. git grep "node = -1"

    This patch (of 2):

    At present there are multiple places where an invalid node number is
    encoded as -1. Even though this is implicitly understood, it is always
    better to use a macro. Replace these open encodings for an invalid node
    number with the global macro NUMA_NO_NODE. This helps remove NUMA
    related assumptions like 'invalid node' from various places redirecting
    them to a common definition.

    Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: David Hildenbrand
    Acked-by: Jeff Kirsher [ixgbe]
    Acked-by: Jens Axboe [mtip32xx]
    Acked-by: Vinod Koul [dmaengine.c]
    Acked-by: Michael Ellerman [powerpc]
    Acked-by: Doug Ledford [drivers/infiniband]
    Cc: Joseph Qi
    Cc: Hans Verkuil
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

02 Mar, 2019

1 commit

  • hugetlb pages should only be migrated if they are 'active'. The
    routines set/clear_page_huge_active() modify the active state of hugetlb
    pages.

    When a new hugetlb page is allocated at fault time, set_page_huge_active
    is called before the page is locked. Therefore, another thread could
    race and migrate the page while it is being added to page table by the
    fault code. This race is somewhat hard to trigger, but can be seen by
    strategically adding udelay to simulate worst case scheduling behavior.
    Depending on 'how' the code races, various BUG()s could be triggered.

    To address this issue, simply delay the set_page_huge_active call until
    after the page is successfully added to the page table.

    Hugetlb pages can also be leaked at migration time if the pages are
    associated with a file in an explicitly mounted hugetlbfs filesystem.
    For example, consider a two node system with 4GB worth of huge pages
    available. A program mmaps a 2G file in a hugetlbfs filesystem. It
    then migrates the pages associated with the file from one node to
    another. When the program exits, huge page counts are as follows:

    node0
    1024 free_hugepages
    1024 nr_hugepages

    node1
    0 free_hugepages
    1024 nr_hugepages

    Filesystem Size Used Avail Use% Mounted on
    nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool

    That is as expected. 2G of huge pages are taken from the free_hugepages
    counts, and 2G is the size of the file in the explicitly mounted
    filesystem. If the file is then removed, the counts become:

    node0
    1024 free_hugepages
    1024 nr_hugepages

    node1
    1024 free_hugepages
    1024 nr_hugepages

    Filesystem Size Used Avail Use% Mounted on
    nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool

    Note that the filesystem still shows 2G of pages used, while there
    actually are no huge pages in use. The only way to 'fix' the filesystem
    accounting is to unmount the filesystem.

    If a hugetlb page is associated with an explicitly mounted filesystem,
    this information is contained in the page_private field. At migration
    time, this information is not preserved. To fix, simply transfer
    page_private from old to new page at migration time if necessary.
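
    A sketch of that transfer in the hugetlbfs migration callback
    (illustrative fragment):

    /*
     * page_private holds the subpool pointer for hugetlb pages; carry it
     * over so the filesystem accounting follows the page.
     */
    if (page_private(page)) {
            set_page_private(newpage, page_private(page));
            set_page_private(page, 0);
    }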

    There is a related race with removing a huge page from a file and
    migration. When a huge page is removed from the pagecache, the
    page_mapping() field is cleared, yet page_private remains set until the
    page is actually freed by free_huge_page(). A page could be migrated
    while in this state. However, since page_mapping() is not set the
    hugetlbfs specific routine to transfer page_private is not called and we
    leak the page count in the filesystem.

    To fix that, check for this condition before migrating a huge page. If
    the condition is detected, return EBUSY for the page.

    Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
    Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
    Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Mel Gorman
    Cc: Davidlohr Bueso
    Cc:
    [mike.kravetz@oracle.com: v2]
    Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
    [mike.kravetz@oracle.com: update comment and changelog]
    Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

02 Feb, 2019

1 commit

  • hugetlb needs the same fix as faultin_nopage (which was applied in
    commit 96312e61282a ("mm/gup.c: teach get_user_pages_unlocked to handle
    FOLL_NOWAIT")) or KVM hangs because it thinks the mmap_sem was already
    released by hugetlb_fault() if it returned VM_FAULT_RETRY, but it wasn't
    in the FOLL_NOWAIT case.
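
    A sketch of the corresponding fix in follow_hugetlb_page()
    (illustrative fragment), mirroring what faultin_page() already does:

    if (flags & FOLL_NOWAIT)
            fault_flags |= FAULT_FLAG_ALLOW_RETRY |
                           FAULT_FLAG_RETRY_NOWAIT;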

    Link: http://lkml.kernel.org/r/20190109020203.26669-2-aarcange@redhat.com
    Fixes: ce53053ce378 ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()")
    Signed-off-by: Andrea Arcangeli
    Tested-by: "Dr. David Alan Gilbert"
    Reported-by: "Dr. David Alan Gilbert"
    Reviewed-by: Mike Kravetz
    Reviewed-by: Peter Xu
    Cc: Mike Rapoport
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

09 Jan, 2019

2 commits

  • This reverts b43a9990055958e70347c56f90ea2ae32c67334c

    The reverted commit caused issues with migration and poisoning of anon
    huge pages. The LTP move_pages12 test triggers an "unable to handle
    kernel NULL pointer" BUG with a stack similar to:

    RIP: 0010:down_write+0x1b/0x40
    Call Trace:
    migrate_pages+0x81f/0xb90
    __ia32_compat_sys_migrate_pages+0x190/0x190
    do_move_pages_to_node.isra.53.part.54+0x2a/0x50
    kernel_move_pages+0x566/0x7b0
    __x64_sys_move_pages+0x24/0x30
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The purpose of the reverted patch was to fix some long existing races
    with huge pmd sharing. It used i_mmap_rwsem for this purpose with the
    idea that this could also be used to address truncate/page fault races
    with another patch. Further analysis has determined that i_mmap_rwsem
    can not be used to address all these hugetlbfs synchronization issues.
    Therefore, revert this patch while working on another approach to the
    underlying issues.

    Link: http://lkml.kernel.org/r/20190103235452.29335-2-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Jan Stancek
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • This reverts c86aa7bbfd5568ba8a82d3635d8f7b8a8e06fe54

    The reverted commit caused ABBA deadlocks when file migration raced with
    file eviction for specific hugetlbfs files. This was discovered with a
    modified version of the LTP move_pages12 test.

    The purpose of the reverted patch was to close a long existing race
    between hugetlbfs file truncation and page faults. After more analysis
    of the patch and impacted code, it was determined that i_mmap_rwsem can
    not be used for all required synchronization. Therefore, revert this
    patch while working an another approach to the underlying issue.

    Link: http://lkml.kernel.org/r/20190103235452.29335-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Jan Stancek
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

05 Jan, 2019

1 commit


29 Dec, 2018

3 commits

  • hugetlbfs page faults can race with truncate and hole punch operations.
    Current code in the page fault path attempts to handle this by 'backing
    out' operations if we encounter the race. One obvious omission in the
    current code is removing a page newly added to the page cache. This is
    pretty straightforward to address, but there is a more subtle and
    difficult issue of backing out hugetlb reservations. To handle this
    correctly, the 'reservation state' before page allocation needs to be
    noted so that it can be properly backed out. There are four distinct
    possibilities for reservation state: shared/reserved, shared/no-resv,
    private/reserved and private/no-resv. Backing out a reservation may
    require memory allocation which could fail so that needs to be taken into
    account as well.

    Instead of writing the required complicated code for this rare occurrence,
    just eliminate the race. i_mmap_rwsem is now held in read mode for the
    duration of page fault processing. Hold i_mmap_rwsem longer in truncation
    and hole punch code to cover the call to remove_inode_hugepages.

    With this modification, code in remove_inode_hugepages checking for races
    becomes 'dead' as it can no longer happen. Remove the dead code and
    expand comments to explain reasoning. Similarly, checks for races with
    truncation in the page fault path can be simplified and removed.
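
    A sketch of the resulting fault-path locking (illustrative fragment):

    mapping = vma->vm_file->f_mapping;
    i_mmap_lock_read(mapping);
    ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
    /* ... fault processing, including page cache insertion / COW ... */
    i_mmap_unlock_read(mapping);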

    [mike.kravetz@oracle.com: incorporate suggestions from Kirill]
    Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
    Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
    Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • While looking at BUGs associated with invalid huge page map counts, it was
    discovered and observed that a huge pte pointer could become 'invalid' and
    point to another task's page table. Consider the following:

    A task takes a page fault on a shared hugetlbfs file and calls
    huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
    shared pmd.

    Now, another task truncates the hugetlbfs file. As part of truncation, it
    unmaps everyone who has the file mapped. If the range being truncated is
    covered by a shared pmd, huge_pmd_unshare will be called. For all but the
    last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
    to the pmd. If the task in the middle of the page fault is not the last
    user, the ptep returned by huge_pte_alloc now points to another task's
    page table or worse. This leads to bad things such as incorrect page
    map/reference counts or invalid memory references.

    To fix, expand the use of i_mmap_rwsem as follows:

    - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
    huge_pmd_share is only called via huge_pte_alloc, so callers of
    huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
    of huge_pte_alloc continue to hold the semaphore until finished with the
    ptep.

    - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
    called.

    [mike.kravetz@oracle.com: add explicit check for mapping != null]
    Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
    Fixes: 39dde65c9940 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc: Colin Ian King
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.
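
    A sketch of what call sites look like after the conversion
    (illustrative):

    struct mmu_notifier_range range;

    mmu_notifier_range_init(&range, mm, start, end);
    mmu_notifier_invalidate_range_start(&range);
    /* ... modify the page tables ... */
    mmu_notifier_invalidate_range_end(&range);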

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

15 Dec, 2018

1 commit

  • A stack trace was triggered by VM_BUG_ON_PAGE(page_mapcount(page), page)
    in free_huge_page(). Unfortunately, the page->mapping field was set to
    NULL before this test. This made it more difficult to determine the
    root cause of the problem.

    Move the VM_BUG_ON_PAGE tests earlier in the function so that if they do
    trigger more information is present in the page struct.

    Link: http://lkml.kernel.org/r/1543491843-23438-1-git-send-email-nic_w@163.com
    Signed-off-by: Yongkai Wu
    Acked-by: Michal Hocko
    Acked-by: Mike Kravetz
    Reviewed-by: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yongkai Wu
     

01 Dec, 2018

1 commit

  • Patch series "userfaultfd shmem updates".

    Jann found two bugs in the userfaultfd shmem MAP_SHARED backend: the
    lack of the VM_MAYWRITE check and the lack of i_size checks.

    Then looking into the above we also fixed the MAP_PRIVATE case.

    Hugh by source review also found a data loss source if UFFDIO_COPY is
    used on shmem MAP_SHARED PROT_READ mappings (the production usages
    incidentally run with PROT_READ|PROT_WRITE, so the data loss couldn't
    happen in those production usages like with QEMU).

    The whole patchset is marked for stable.

    We verified that QEMU postcopy live migration, with the guest running
    on shmem MAP_PRIVATE, runs as well as before after the shmem
    MAP_PRIVATE fix.
    Regardless if it's shmem or hugetlbfs or MAP_PRIVATE or MAP_SHARED, QEMU
    unconditionally invokes a punch hole if the guest mapping is filebacked
    and a MADV_DONTNEED too (needed to get rid of the MAP_PRIVATE COWs and
    for the anon backend).

    This patch (of 5):

    We internally used EFAULT to communicate with the caller, switch to
    ENOENT, so EFAULT can be used as a non internal retval.

    Link: http://lkml.kernel.org/r/20181126173452.26955-2-aarcange@redhat.com
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Cc: Mike Kravetz
    Cc: Jann Horn
    Cc: Peter Xu
    Cc: "Dr. David Alan Gilbert"
    Cc:
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

19 Nov, 2018

1 commit

  • This bug has been experienced several times by the Oracle DB team. The
    BUG is in remove_inode_hugepages() as follows:

    /*
     * If page is mapped, it was faulted in after being
     * unmapped in caller. Unmap (again) now after taking
     * the fault mutex. The mutex will prevent faults
     * until we finish removing the page.
     *
     * This race can only happen in the hole punch case.
     * Getting here in a truncate operation is a bug.
     */
    if (unlikely(page_mapped(page))) {
            BUG_ON(truncate_op);

    In this case, the elevated map count is not the result of a race.
    Rather it was incorrectly incremented as the result of a bug in the huge
    pmd sharing code. Consider the following:

    - Process A maps a hugetlbfs file of sufficient size and alignment
    (PUD_SIZE) that a pmd page could be shared.

    - Process B maps the same hugetlbfs file with the same size and
    alignment such that a pmd page is shared.

    - Process B then calls mprotect() to change protections for the mapping
    with the shared pmd. As a result, the pmd is 'unshared'.

    - Process B then calls mprotect() again to change protections for the
    mapping back to their original value. pmd remains unshared.

    - Process B then forks and process C is created. During the fork
    process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
    tables. Copying page tables for hugetlb mappings is done in the
    routine copy_hugetlb_page_range.

    In copy_hugetlb_page_range(), the destination pte is obtained by:

    dst_pte = huge_pte_alloc(dst, addr, sz);

    If pmd sharing is possible, the returned pointer will be to a pte in an
    existing page table. In the situation above, process C could share with
    either process A or process B. Since process A is first in the list,
    the returned pte is a pointer to a pte in process A's page table.

    However, the check for pmd sharing in copy_hugetlb_page_range is:

    /* If the pagetables are shared don't copy or take references */
    if (dst_pte == src_pte)
            continue;

    Since process C is sharing with process A instead of process B, the
    above test fails. The code in copy_hugetlb_page_range which follows
    assumes dst_pte points to a huge_pte_none pte. It copies the pte entry
    from src_pte to dst_pte and increments the map count of the associated
    page. This is how we end up with an elevated map count.

    To solve, check the dst_pte entry for huge_pte_none. If !none, this
    implies PMD sharing so do not copy.
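
    A sketch of the added check in copy_hugetlb_page_range() (illustrative
    fragment):

    pte_t dst_entry;

    dst_entry = huge_ptep_get(dst_pte);

    /* A populated destination pte implies a shared pmd: skip it */
    if (dst_pte == src_pte || !huge_pte_none(dst_entry))
            continue;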

    Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
    Fixes: c5c99429fa57 ("fix hugepages leak due to pagetable page sharing")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

31 Oct, 2018

3 commits

  • Move remaining definitions and declarations from include/linux/bootmem.h
    into include/linux/memblock.h and remove the redundant header.

    The includes were replaced with the semantic patch below and then
    semi-automated removal of duplicated '#include <linux/memblock.h>'.

    @@
    @@
    - #include <linux/bootmem.h>
    + #include <linux/memblock.h>

    [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
    [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
    [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
    Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
    Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Stephen Rothwell
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Drop BOOTMEM_ALLOC_ACCESSIBLE and BOOTMEM_ALLOC_ANYWHERE in favor of
    identical MEMBLOCK definitions.

    Link: http://lkml.kernel.org/r/1536927045-23536-29-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The conversion is done using

    sed -i 's@memblock_virt_alloc@memblock_alloc@g' \
    $(git grep -l memblock_virt_alloc)

    Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport