20 Jan, 2017

1 commit

  • commit e5bbc8a6c992901058bc09e2ce01d16c111ff047 upstream.

    return_unused_surplus_pages() decrements the global reservation count,
    and frees any unused surplus pages that were backing the reservation.

    Commit 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in
    return_unused_surplus_pages()") added a call to cond_resched_lock in the
    loop freeing the pages.

    As a result, the hugetlb_lock could be dropped, and someone else could
    use the pages that will be freed in subsequent iterations of the loop.
    This could result in inconsistent global hugetlb page state, application
    API (such as mmap) failures, or application crashes.

    When dropping the lock in return_unused_surplus_pages, make sure that
    the global reservation count (resv_huge_pages) remains sufficiently
    large to prevent someone else from claiming pages about to be freed.
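
    A minimal sketch of the idea (simplified, not the exact upstream diff):
    give up reservations one page at a time, just before each page is freed,
    so that any point where cond_resched_lock() drops hugetlb_lock still
    shows a reservation count large enough to cover the pages that remain
    to be freed.

    static void return_unused_surplus_pages(struct hstate *h,
                                            unsigned long unused_resv_pages)
    {
            unsigned long nr_pages = min(unused_resv_pages,
                                         h->surplus_huge_pages);

            while (nr_pages--) {
                    h->resv_huge_pages--;           /* give up one reservation */
                    unused_resv_pages--;
                    if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
                            goto out;
                    /* hugetlb_lock may be dropped and re-taken here */
                    cond_resched_lock(&hugetlb_lock);
            }

    out:
            /* uncommit whatever reservations could not be backed by pages */
            h->resv_huge_pages -= unused_resv_pages;
    }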

    Analyzed by Paul Cassella.

    Fixes: 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()")
    Link: http://lkml.kernel.org/r/1483991767-6879-1-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Paul Cassella
    Suggested-by: Michal Hocko
    Cc: Masayoshi Mizuma
    Cc: Naoya Horiguchi
    Cc: Aneesh Kumar
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

12 Jan, 2017

1 commit

  • commit 3999f52e3198e76607446ab1a4610c1ddc406c56 upstream.

    We cannot use the pte value passed to set_pte_at for the pte_same
    comparison, because archs like ppc64 filter/add new pte flags in
    set_pte_at. Instead, fetch the pte value inside hugetlb_cow. We compare
    pte values to make sure the pte didn't change since we dropped the page
    table lock. hugetlb_cow gets called with the page table lock held, so
    we can take a copy of the pte value before we drop the lock.

    With hugetlbfs, we optimize the MAP_PRIVATE write fault path when there
    is no previous mapping (huge_pte_none entries) by forcing a COW in the
    fault path. This avoids taking an additional fault to convert a
    read-only mapping to read/write. Here we were comparing a recently
    instantiated pte (via set_pte_at) to the pte value from the Linux page
    table. As explained above, on ppc64 such a pte_same check returned the
    wrong result, causing us to take an additional fault on ppc64.
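
    A hedged sketch of the resulting flow in hugetlb_cow() (heavily
    simplified; details omitted): the pte value used for the later
    pte_same() check is read back from the page table with huge_ptep_get()
    while the page table lock is still held, rather than reusing the value
    that was passed to set_pte_at().

    static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
                           unsigned long address, pte_t *ptep,
                           struct page *pagecache_page, spinlock_t *ptl)
    {
            pte_t pte = huge_ptep_get(ptep);        /* called with ptl held */
            struct page *old_page = pte_page(pte);

            /* ... ptl is dropped while allocating and copying the new page ... */

            spin_lock(ptl);                         /* re-take before the check */
            if (likely(pte_same(huge_ptep_get(ptep), pte))) {
                    /* pte unchanged since the lock was dropped: install new page */
            }
            /* ... (returns with ptl held, as the caller expects) */
            return 0;
    }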

    Fixes: 6a119eae942c ("powerpc/mm: Add a _PAGE_PTE bit")
    Link: http://lkml.kernel.org/r/20161018154245.18023-1-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reported-by: Jan Stancek
    Acked-by: Hillf Danton
    Cc: Mike Kravetz
    Cc: Scott Wood
    Cc: Michael Ellerman
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     

12 Nov, 2016

1 commit

  • Error paths in hugetlb_cow() and hugetlb_no_page() may free a newly
    allocated huge page.

    If a reservation was associated with the huge page, alloc_huge_page()
    consumed the reservation while allocating. When the newly allocated
    page is freed in free_huge_page(), it will increment the global
    reservation count. However, the reservation entry in the reserve map
    will remain.

    This is not an issue for shared mappings as the entry in the reserve map
    indicates a reservation exists. But, an entry in a private mapping
    reserve map indicates the reservation was consumed and no longer exists.
    This results in an inconsistency between the reserve map and the global
    reservation count. This 'leaks' a reserved huge page.

    Create a new routine restore_reserve_on_error() to restore the reserve
    entry in these specific error paths. This routine makes use of a new
    function vma_add_reservation() which will add a reserve entry for a
    specific address/page.

    In general, these error paths were rarely (if ever) taken on most
    architectures. However, powerpc contained arch specific code that
    resulted in an extra fault and execution of these error paths on all
    private mappings.
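
    A hedged sketch of the new helper, based on the description above (the
    upstream version handles a few more corner cases):

    static void restore_reserve_on_error(struct hstate *h,
                                         struct vm_area_struct *vma,
                                         unsigned long address,
                                         struct page *page)
    {
            if (unlikely(PagePrivate(page))) {
                    long rc = vma_needs_reservation(h, vma, address);

                    if (unlikely(rc < 0))
                            /*
                             * Reserve map manipulation failed; clear
                             * PagePrivate so free_huge_page() does not
                             * increment the global reserve count.
                             */
                            ClearPagePrivate(page);
                    else if (rc)
                            /* re-create the missing reserve map entry */
                            (void)vma_add_reservation(h, vma, address);
                    else
                            vma_end_reservation(h, vma, address);
            }
    }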

    Fixes: 67961f9db8c4 ("mm/hugetlb: fix huge page reserve accounting for private mappings)
    Link: http://lkml.kernel.org/r/1476933077-23091-2-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Jan Stancek
    Tested-by: Jan Stancek
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Hillf Danton
    Cc: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Kirill A . Shutemov
    Cc: Dave Hansen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

08 Oct, 2016

5 commits

    When the huge page is added to the page cache (huge_add_to_page_cache),
    the page private flag will be cleared. Since this code
    (remove_inode_hugepages) will only be called for pages in the page
    cache, PagePrivate(page) will always be false.

    The patch removes the code without any functional change.

    Link: http://lkml.kernel.org/r/1475113323-29368-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Mike Kravetz
    Tested-by: Mike Kravetz
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     
    Avoid the #ifdef becoming unwieldy if many architectures support
    gigantic pages. No functional change with this patch.

    Link: http://lkml.kernel.org/r/1475227569-63446-2-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Acked-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Hanjun Guo
    Cc: Will Deacon
    Cc: Dave Hansen
    Cc: Sudeep Holla
    Cc: Catalin Marinas
    Cc: Mark Rutland
    Cc: Rob Herring
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     
    For every pfn aligned to minimum_order, dissolve_free_huge_pages() will
    call dissolve_free_huge_page(), which takes the hugetlb spinlock, even
    if the page is not huge at all or is a hugepage that is in use.

    Improve this by doing the PageHuge() and page_count() checks already in
    dissolve_free_huge_pages() before calling dissolve_free_huge_page(). In
    dissolve_free_huge_page(), when holding the spinlock, those checks need
    to be revalidated.
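
    A simplified sketch of the pattern (not the exact upstream code): the
    cheap checks are done lock-free in the caller and repeated once the
    lock is held, before anything is modified.

    for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) {
            struct page *page = pfn_to_page(pfn);

            /* lock-free pre-check: skip pages that clearly cannot be
             * dissolved, so hugetlb_lock is not taken for every pfn */
            if (PageHuge(page) && !page_count(page))
                    dissolve_free_huge_page(page);
    }

    /* ... and inside dissolve_free_huge_page(): */
    spin_lock(&hugetlb_lock);
    if (PageHuge(page) && !page_count(page)) {      /* revalidate under lock */
            /* remove the free hugepage from its hstate's free list */
    }
    spin_unlock(&hugetlb_lock);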

    Link: http://lkml.kernel.org/r/20160926172811.94033-4-gerald.schaefer@de.ibm.com
    Signed-off-by: Gerald Schaefer
    Acked-by: Michal Hocko
    Acked-by: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Cc: Vlastimil Babka
    Cc: Mike Kravetz
    Cc: "Aneesh Kumar K . V"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rui Teng
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • In dissolve_free_huge_pages(), free hugepages will be dissolved without
    making sure that there are enough of them left to satisfy hugepage
    reservations.

    Fix this by adding a return value to dissolve_free_huge_pages() and
    checking h->free_huge_pages vs. h->resv_huge_pages. Note that this may
    lead to the situation where dissolve_free_huge_page() returns an error
    and all free hugepages that were dissolved before that error are lost,
    while the memory block still cannot be set offline.
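
    A hedged sketch of the reservation check added under hugetlb_lock
    (simplified from the description above):

    static int dissolve_free_huge_page(struct page *page)
    {
            int rc = 0;

            spin_lock(&hugetlb_lock);
            if (PageHuge(page) && !page_count(page)) {
                    struct hstate *h = page_hstate(page);

                    if (h->free_huge_pages - h->resv_huge_pages == 0) {
                            /* dissolving would leave reservations unbacked */
                            rc = -EBUSY;
                            goto out;
                    }
                    /* ... dissolve the page ... */
            }
    out:
            spin_unlock(&hugetlb_lock);
            return rc;
    }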

    Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
    Link: http://lkml.kernel.org/r/20160926172811.94033-3-gerald.schaefer@de.ibm.com
    Signed-off-by: Gerald Schaefer
    Acked-by: Michal Hocko
    Acked-by: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Cc: Vlastimil Babka
    Cc: Mike Kravetz
    Cc: "Aneesh Kumar K . V"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rui Teng
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Patch series "mm/hugetlb: memory offline issues with hugepages", v4.

    This addresses several issues with hugepages and memory offline. While
    the first patch fixes a panic, and is therefore rather important, the
    last patch is just a performance optimization.

    The second patch fixes a theoretical issue with reserved hugepages,
    while still leaving some ugly usability issue, see description.

    This patch (of 3):

    dissolve_free_huge_pages() will either run into the VM_BUG_ON() or a
    list corruption and addressing exception when trying to set a memory
    block offline that is part (but not the first part) of a "gigantic"
    hugetlb page with a size > memory block size.

    When no other smaller hugetlb page sizes are present, the VM_BUG_ON()
    will trigger directly. In the other case we will run into an addressing
    exception later, because dissolve_free_huge_page() will not work on the
    head page of the compound hugetlb page which will result in a NULL
    hstate from page_hstate().

    To fix this, first remove the VM_BUG_ON() because it is wrong, and then
    use the compound head page in dissolve_free_huge_page(). This means
    that an unused pre-allocated gigantic page that has any part of itself
    inside the memory block that is going offline will be dissolved
    completely. Losing an unused gigantic hugepage is preferable to failing
    the memory offline, for example in the situation where a (possibly
    faulty) memory DIMM needs to go offline.
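
    A hedged sketch of the body of dissolve_free_huge_page() under
    hugetlb_lock after this change (simplified): always operate on the
    compound head, so a pfn that lands in the middle of a free gigantic
    page dissolves the whole page instead of tripping over a tail page with
    no valid hstate.

    if (PageHuge(page) && !page_count(page)) {
            struct page *head = compound_head(page);
            struct hstate *h = page_hstate(head);
            int nid = page_to_nid(head);

            list_del(&head->lru);
            h->free_huge_pages--;
            h->free_huge_pages_node[nid]--;
            h->max_huge_pages--;
            update_and_free_page(h, head);
    }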

    Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
    Link: http://lkml.kernel.org/r/20160926172811.94033-2-gerald.schaefer@de.ibm.com
    Signed-off-by: Gerald Schaefer
    Acked-by: Michal Hocko
    Acked-by: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Cc: Vlastimil Babka
    Cc: Mike Kravetz
    Cc: "Aneesh Kumar K . V"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rui Teng
    Cc: Dave Hansen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

12 Aug, 2016

1 commit

  • When memory hotplug operates, free hugepages will be freed if the
    movable node is offline. Therefore, /proc/sys/vm/nr_hugepages will be
    incorrect.

    Fix it by reducing max_huge_pages when the node is offlined.

    n-horiguchi@ah.jp.nec.com said:

    : dissolve_free_huge_page intends to break a hugepage into buddy, and the
    : destination hugepage is supposed to be allocated from the pool of the
    : destination node, so the system-wide pool size is reduced. So adding
    : h->max_huge_pages-- makes sense to me.

    Link: http://lkml.kernel.org/r/1470624546-902-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Cc: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

05 Aug, 2016

1 commit

  • Pull more powerpc updates from Michael Ellerman:
    "These were delayed for various reasons, so I let them sit in next a
    bit longer, rather than including them in my first pull request.

    Fixes:
    - Fix early access to cpu_spec relocation from Benjamin Herrenschmidt
    - Fix incorrect event codes in power9-event-list from Madhavan Srinivasan
    - Move register_process_table() out of ppc_md from Michael Ellerman

    Use jump_label for [cpu|mmu]_has_feature():
    - Add mmu_early_init_devtree() from Michael Ellerman
    - Move disable_radix handling into mmu_early_init_devtree() from Michael Ellerman
    - Do hash device tree scanning earlier from Michael Ellerman
    - Do radix device tree scanning earlier from Michael Ellerman
    - Do feature patching before MMU init from Michael Ellerman
    - Check features don't change after patching from Michael Ellerman
    - Make MMU_FTR_RADIX a MMU family feature from Aneesh Kumar K.V
    - Convert mmu_has_feature() to returning bool from Michael Ellerman
    - Convert cpu_has_feature() to returning bool from Michael Ellerman
    - Define radix_enabled() in one place & use static inline from Michael Ellerman
    - Add early_[cpu|mmu]_has_feature() from Michael Ellerman
    - Convert early cpu/mmu feature check to use the new helpers from Aneesh Kumar K.V
    - jump_label: Make it possible for arches to invoke jump_label_init() earlier from Kevin Hao
    - Call jump_label_init() in apply_feature_fixups() from Aneesh Kumar K.V
    - Remove mfvtb() from Kevin Hao
    - Move cpu_has_feature() to a separate file from Kevin Hao
    - Add kconfig option to use jump labels for cpu/mmu_has_feature() from Michael Ellerman
    - Add option to use jump label for cpu_has_feature() from Kevin Hao
    - Add option to use jump label for mmu_has_feature() from Kevin Hao
    - Catch usage of cpu/mmu_has_feature() before jump label init from Aneesh Kumar K.V
    - Annotate jump label assembly from Michael Ellerman

    TLB flush enhancements from Aneesh Kumar K.V:
    - radix: Implement tlb mmu gather flush efficiently
    - Add helper for finding SLBE LLP encoding
    - Use hugetlb flush functions
    - Drop multiple definition of mm_is_core_local
    - radix: Add tlb flush of THP ptes
    - radix: Rename function and drop unused arg
    - radix/hugetlb: Add helper for finding page size
    - hugetlb: Add flush_hugetlb_tlb_range
    - remove flush_tlb_page_nohash

    Add new ptrace regsets from Anshuman Khandual and Simon Guo:
    - elf: Add powerpc specific core note sections
    - Add the function flush_tmregs_to_thread
    - Enable in transaction NT_PRFPREG ptrace requests
    - Enable in transaction NT_PPC_VMX ptrace requests
    - Enable in transaction NT_PPC_VSX ptrace requests
    - Adapt gpr32_get, gpr32_set functions for transaction
    - Enable support for NT_PPC_CGPR
    - Enable support for NT_PPC_CFPR
    - Enable support for NT_PPC_CVMX
    - Enable support for NT_PPC_CVSX
    - Enable support for TM SPR state
    - Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
    - Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
    - Enable support for EBB registers
    - Enable support for Performance Monitor registers"

    * tag 'powerpc-4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (48 commits)
    powerpc/mm: Move register_process_table() out of ppc_md
    powerpc/perf: Fix incorrect event codes in power9-event-list
    powerpc/32: Fix early access to cpu_spec relocation
    powerpc/ptrace: Enable support for Performance Monitor registers
    powerpc/ptrace: Enable support for EBB registers
    powerpc/ptrace: Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
    powerpc/ptrace: Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
    powerpc/ptrace: Enable support for TM SPR state
    powerpc/ptrace: Enable support for NT_PPC_CVSX
    powerpc/ptrace: Enable support for NT_PPC_CVMX
    powerpc/ptrace: Enable support for NT_PPC_CFPR
    powerpc/ptrace: Enable support for NT_PPC_CGPR
    powerpc/ptrace: Adapt gpr32_get, gpr32_set functions for transaction
    powerpc/ptrace: Enable in transaction NT_PPC_VSX ptrace requests
    powerpc/ptrace: Enable in transaction NT_PPC_VMX ptrace requests
    powerpc/ptrace: Enable in transaction NT_PRFPREG ptrace requests
    powerpc/process: Add the function flush_tmregs_to_thread
    elf: Add powerpc specific core note sections
    powerpc/mm: remove flush_tlb_page_nohash
    powerpc/mm/hugetlb: Add flush_hugetlb_tlb_range
    ...

    Linus Torvalds
     

03 Aug, 2016

2 commits

    Zhong Jiang has reported a BUG_ON in huge_pte_alloc hitting when he
    runs his database load with memory onlining and offlining running in
    parallel. The reason is that huge_pmd_share might detect a shared pmd
    which is currently under migration and so carries a migration pte,
    which is !pte_huge.

    There doesn't seem to be any easy way to prevent the race, and in fact
    seeing the migration swap entry is not harmful. Both callers of
    huge_pte_alloc are prepared to handle it. copy_hugetlb_page_range will
    copy the swap entry and make it COW if needed. hugetlb_fault will back
    off, so the page fault is retried if the page is still under migration,
    and it waits for the migration to complete in hugetlb_fault.

    That means that the BUG_ON is wrong and we should update it. Let's
    simply check that all present ptes are pte_huge instead.
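
    A hedged sketch of the relaxed assertion at the end of huge_pte_alloc()
    (a migration entry is !pte_present(), so it no longer trips the check):

    BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));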

    Link: http://lkml.kernel.org/r/20160721074340.GA26398@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: zhongjiang
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    On powerpc servers with large memory (32TB), we observed several soft
    lockups for hugepages under stress tests.

    The call traces are as follows:
    1.
    get_page_from_freelist+0x2d8/0xd50
    __alloc_pages_nodemask+0x180/0xc20
    alloc_fresh_huge_page+0xb0/0x190
    set_max_huge_pages+0x164/0x3b0

    2.
    prep_new_huge_page+0x5c/0x100
    alloc_fresh_huge_page+0xc8/0x190
    set_max_huge_pages+0x164/0x3b0

    This patch fixes such soft lockups. It is safe to call cond_resched()
    there because it is outside the spin_lock/unlock section.
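
    A hedged sketch of the allocation loop in set_max_huge_pages() with the
    added reschedule point (simplified):

    while (count > persistent_huge_pages(h)) {
            /* ... */
            spin_unlock(&hugetlb_lock);

            /* yield the CPU; safe because hugetlb_lock is not held here */
            cond_resched();

            ret = alloc_fresh_huge_page(h, nodes_allowed);
            spin_lock(&hugetlb_lock);
            if (!ret)
                    goto out;
    }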

    Link: http://lkml.kernel.org/r/1469674442-14848-1-git-send-email-hejianet@gmail.com
    Signed-off-by: Jia He
    Reviewed-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Acked-by: Dave Hansen
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Cc: Paul Gortmaker
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jia He
     

29 Jul, 2016

3 commits

  • Merge more updates from Andrew Morton:
    "The rest of MM"

    * emailed patches from Andrew Morton : (101 commits)
    mm, compaction: simplify contended compaction handling
    mm, compaction: introduce direct compaction priority
    mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
    mm, page_alloc: make THP-specific decisions more generic
    mm, page_alloc: restructure direct compaction handling in slowpath
    mm, page_alloc: don't retry initial attempt in slowpath
    mm, page_alloc: set alloc_flags only once in slowpath
    lib/stackdepot.c: use __GFP_NOWARN for stack allocations
    mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB
    mm, kasan: account for object redzone in SLUB's nearest_obj()
    mm: fix use-after-free if memory allocation failed in vma_adjust()
    zsmalloc: Delete an unnecessary check before the function call "iput"
    mm/memblock.c: fix index adjustment error in __next_mem_range_rev()
    mem-hotplug: alloc new page from a nearest neighbor node when mem-offline
    mm: optimize copy_page_to/from_iter_iovec
    mm: add cond_resched() to generic_swapfile_activate()
    Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"
    mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
    mm: hwpoison: remove incorrect comments
    make __section_nr() more efficient
    ...

    Linus Torvalds
     
    dequeue_hwpoisoned_huge_page() can be called without the page lock
    held, so let's remove the incorrect comment.

    The reason the page lock is not really needed is that
    dequeue_hwpoisoned_huge_page() checks page_huge_active() inside
    hugetlb_lock, which allows us to avoid trying to dequeue a hugepage
    that has just been allocated but not yet linked to the active list,
    even without taking the page lock.

    Link: http://lkml.kernel.org/r/20160720092901.GA15995@www9186uo.sakura.ne.jp
    Signed-off-by: Naoya Horiguchi
    Reported-by: Zhan Chen
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Pull vfs updates from Al Viro:
    "Assorted cleanups and fixes.

    Probably the most interesting part long-term is ->d_init() - that will
    have a bunch of followups in (at least) ceph and lustre, but we'll
    need to sort the barrier-related rules before it can get used for
    really non-trivial stuff.

    Another fun thing is the merge of ->d_iput() callers (dentry_iput()
    and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
    except the one in __d_lookup_lru())"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    fs/dcache.c: avoid soft-lockup in dput()
    vfs: new d_init method
    vfs: Update lookup_dcache() comment
    bdev: get rid of ->bd_inodes
    Remove last traces of ->sync_page
    new helper: d_same_name()
    dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
    vfs: clean up documentation
    vfs: document ->d_real()
    vfs: merge .d_select_inode() into .d_real()
    unify dentry_iput() and dentry_unlink_inode()
    binfmt_misc: ->s_root is not going anywhere
    drop redundant ->owner initializations
    ufs: get rid of redundant checks
    orangefs: constify inode_operations
    missed comment updates from ->direct_IO() prototype change
    file_inode(f)->i_mapping is f->f_mapping
    trim fsnotify hooks a bit
    9p: new helper - v9fs_parent_fid()
    debugfs: ->d_parent is never NULL or negative
    ...

    Linus Torvalds
     

27 Jul, 2016

4 commits

  • Merge updates from Andrew Morton:

    - a few misc bits

    - ocfs2

    - most(?) of MM

    * emailed patches from Andrew Morton : (125 commits)
    thp: fix comments of __pmd_trans_huge_lock()
    cgroup: remove unnecessary 0 check from css_from_id()
    cgroup: fix idr leak for the first cgroup root
    mm: memcontrol: fix documentation for compound parameter
    mm: memcontrol: remove BUG_ON in uncharge_list
    mm: fix build warnings in
    mm, thp: convert from optimistic swapin collapsing to conservative
    mm, thp: fix comment inconsistency for swapin readahead functions
    thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
    shmem: split huge pages beyond i_size under memory pressure
    thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
    khugepaged: add support of collapse for tmpfs/shmem pages
    shmem: make shmem_inode_info::lock irq-safe
    khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
    thp: extract khugepaged from mm/huge_memory.c
    shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
    shmem: add huge pages support
    shmem: get_unmapped_area align huge page
    shmem: prepare huge= mount option and sysfs knob
    mm, rmap: account shmem thp pages
    ...

    Linus Torvalds
     
    This allows an arch which needs to do special handling with respect to
    different page sizes when flushing the TLB to implement the same in the
    mmu gather.

    Link: http://lkml.kernel.org/r/1465049193-22197-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
    For hugetlb, like THP (and unlike regular pages), we do the TLB flush
    after dropping the ptl. Because of the above, we don't need to track
    force_flush like we do now. Instead we can simply call
    tlb_remove_page(), which will do the flush if needed.

    No functionality change in this patch.

    Link: http://lkml.kernel.org/r/1465049193-22197-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Pull s390 updates from Martin Schwidefsky:
    "There are a couple of new things for s390 with this merge request:

    - a new scheduling domain "drawer" is added to reflect the unusual
    topology found on z13 machines. Performance tests showed up to 8
    percent gain with the additional domain.

    - the new crc-32 checksum crypto module uses the vector-galois-field
    multiply and sum SIMD instruction to speed up crc-32 and crc-32c.

    - proper __ro_after_init support, this requires RO_AFTER_INIT_DATA in
    the generic vmlinux.lds linker script definitions.

    - kcov instrumentation support. A prerequisite for that is the
    inline assembly basic block cleanup, which is the reason for the
    net/iucv/iucv.c change.

    - support for 2GB pages is added to the hugetlbfs backend.

    Then there are two removals:

    - the oprofile hardware sampling support is dead code and is removed.
    The oprofile user space uses the perf interface nowadays.

    - the ETR clock synchronization is removed, this has been superseded
    by the STP clock synchronization. And it always has been
    "interesting" code..

    And the usual bug fixes and cleanups"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (82 commits)
    s390/pci: Delete an unnecessary check before the function call "pci_dev_put"
    s390/smp: clean up a condition
    s390/cio/chp : Remove deprecated create_singlethread_workqueue
    s390/chsc: improve channel path descriptor determination
    s390/chsc: sanitize fmt check for chp_desc determination
    s390/cio: make fmt1 channel path descriptor optional
    s390/chsc: fix ioctl CHSC_INFO_CU command
    s390/cio/device_ops: fix kernel doc
    s390/cio: allow to reset channel measurement block
    s390/console: Make preferred console handling more consistent
    s390/mm: fix gmap tlb flush issues
    s390/mm: add support for 2GB hugepages
    s390: have unique symbol for __switch_to address
    s390/cpuinfo: show maximum thread id
    s390/ptrace: clarify bits in the per_struct
    s390: stack address vs thread_info
    s390: remove pointless load within __switch_to
    s390: enable kcov support
    s390/cpumf: use basic block for ecctr inline assembly
    s390/hypfs: use basic block for diag inline assembly
    ...

    Linus Torvalds
     

15 Jul, 2016

1 commit

  • The VM_BUG_ON_PAGE in page_move_anon_rmap() is more trouble than it's
    worth: the syzkaller fuzzer hit it again. It's still wrong for some THP
    cases, because linear_page_index() was never intended to apply to
    addresses before the start of a vma.

    That's easily fixed with a signed long cast inside linear_page_index();
    and Dmitry has tested such a patch, to verify the false positive. But
    why extend linear_page_index() just for this case? when the avoidance in
    page_move_anon_rmap() has already grown ugly, and there's no reason for
    the check at all (nothing else there is using address or index).

    Remove address arg from page_move_anon_rmap(), remove VM_BUG_ON_PAGE,
    remove CONFIG_DEBUG_VM PageTransHuge adjustment.

    And one more thing: should the compound_head(page) be done inside or
    outside page_move_anon_rmap()? It's usually pushed down to the lowest
    level nowadays (and mm/memory.c shows no other explicit use of it), so I
    think it's better done in page_move_anon_rmap() than by caller.

    Fixes: 0798d3c022dc ("mm: thp: avoid false positive VM_BUG_ON_PAGE in page_move_anon_rmap()")
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1607120444540.12528@eggly.anvils
    Signed-off-by: Hugh Dickins
    Reported-by: Dmitry Vyukov
    Acked-by: Kirill A. Shutemov
    Cc: Mika Westerberg
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

25 Jun, 2016

2 commits

  • While working on s390 support for gigantic hugepages I ran into the
    following "Bad page state" warning when freeing gigantic pages:

    BUG: Bad page state in process bash pfn:580001
    page:000003d116000040 count:0 mapcount:0 mapping:ffffffff00000000 index:0x0
    flags: 0x7fffc0000000000()
    page dumped because: non-NULL mapping

    This is because page->compound_mapcount, which is part of a union with
    page->mapping, is initialized with -1 in prep_compound_gigantic_page(),
    and not cleared again during destroy_compound_gigantic_page(). Fix this
    by clearing the compound_mapcount in destroy_compound_gigantic_page()
    before clearing compound_head.

    Interestingly enough, the warning will not show up on x86_64, although
    this should not be architecture specific. Apparently there is an
    endianness issue, combined with the fact that the union contains both a
    64 bit ->mapping pointer and a 32 bit atomic_t ->compound_mapcount as
    members. The resulting bogus page->mapping on x86_64 therefore contains
    00000000ffffffff instead of ffffffff00000000 on s390, which will falsely
    trigger the PageAnon() check in free_pages_prepare() because
    page->mapping & PAGE_MAPPING_ANON is true on little-endian architectures
    like x86_64 in this case (the page is not compound anymore,
    ->compound_head was already cleared before). As a result, page->mapping
    will be cleared before doing the checks in free_pages_check().

    Not sure if the bogus "PageAnon() returning true" on x86_64 for the
    first tail page of a gigantic page (at this stage) has other theoretical
    implications, but they would also be fixed with this patch.
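
    A hedged sketch of the fix in destroy_compound_gigantic_page()
    (simplified): clear the compound_mapcount, which lives in the first
    tail page and shares storage with ->mapping, before the compound
    structure is torn down.

    static void destroy_compound_gigantic_page(struct page *page,
                                               unsigned int order)
    {
            int i;
            int nr_pages = 1 << order;
            struct page *p = page + 1;

            atomic_set(compound_mapcount_ptr(page), 0);     /* the fix */
            for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
                    clear_compound_head(p);
                    set_page_refcounted(p);
            }

            set_compound_order(page, 0);
            __ClearPageHead(page);
    }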

    Link: http://lkml.kernel.org/r/1466612719-5642-1-git-send-email-gerald.schaefer@de.ibm.com
    Signed-off-by: Gerald Schaefer
    Reviewed-by: Mike Kravetz
    Cc: Luiz Capitulino
    Cc: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: "Kirill A . Shutemov"
    Cc: Dave Hansen
    Cc: Paul Gortmaker
    Cc: "Aneesh Kumar K . V"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
    We account HugeTLB's shared page table to all processes that share it.
    The accounting happens during huge_pmd_share().

    If somebody populates the pud entry under us, we should decrease the
    page table's refcount and decrease nr_pmds of the process.

    By mistake, I increase nr_pmds again in this case. :-/ It will lead to
    "BUG: non-zero nr_pmds on freeing mm: 2" on process' exit.

    Let's fix this by increasing nr_pmds only when we're sure that the page
    table will be used.
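
    A hedged sketch of the corrected accounting in huge_pmd_share()
    (simplified): nr_pmds is only bumped when the shared page table is
    actually installed.

    spin_lock(ptl);
    if (pud_none(*pud)) {
            /* we install the shared page table: account it to this mm */
            pud_populate(mm, pud,
                         (pmd_t *)((unsigned long)spte & PAGE_MASK));
            mm_inc_nr_pmds(mm);
    } else {
            /* somebody else populated the pud: drop our extra reference */
            put_page(virt_to_page(spte));
    }
    spin_unlock(ptl);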

    Link: http://lkml.kernel.org/r/20160617122506.GC6534@node.shutemov.name
    Fixes: dc6c9a35b66b ("mm: account pmd page tables to the process")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: zhongjiang
    Reviewed-by: Mike Kravetz
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

10 Jun, 2016

1 commit

  • When creating a private mapping of a hugetlbfs file, it is possible to
    unmap pages via ftruncate or fallocate hole punch. If subsequent faults
    repopulate these mappings, the reserve counts will go negative. This is
    because the code currently assumes all faults to private mappings will
    consume reserves. The problem can be recreated as follows:

    - mmap(MAP_PRIVATE) a file in hugetlbfs filesystem
    - write fault in pages in the mapping
    - fallocate(FALLOC_FL_PUNCH_HOLE) some pages in the mapping
    - write fault in pages in the hole

    This will result in negative huge page reserve counts and negative
    subpool usage counts for the hugetlbfs. Note that this can also be
    recreated with ftruncate, but fallocate is more straightforward.

    This patch modifies the routines vma_needs_reserves and vma_has_reserves
    to examine the reserve map associated with private mappings similar to
    that for shared mappings. However, the reserve map semantics for
    private and shared mappings are very different. This results in subtly
    different code that is explained in the comments.
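
    A minimal userspace reproduction of the steps listed above, assuming a
    hugetlbfs mount at /dev/hugepages and a few free 2MB huge pages; the
    path, file name and sizes are illustrative and not taken from the
    commit.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            const size_t huge = 2UL << 20;          /* assumed huge page size */
            char *addr;
            int fd = open("/dev/hugepages/repro", O_CREAT | O_RDWR, 0600);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            addr = mmap(NULL, 4 * huge, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE, fd, 0);
            if (addr == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            memset(addr, 1, 4 * huge);              /* write fault all pages */
            fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      huge, 2 * huge);              /* punch a hole */
            memset(addr + huge, 2, 2 * huge);       /* write fault in the hole */

            /* before the fix, HugePages_Rsvd in /proc/meminfo could now be
             * seen going negative (wrapping to a huge value) */
            munmap(addr, 4 * huge);
            close(fd);
            unlink("/dev/hugepages/repro");
            return 0;
    }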

    Link: http://lkml.kernel.org/r/1464720957-15698-1-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: Kirill Shutemov
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Aneesh Kumar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

24 May, 2016

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this update was stabilized before the merge window and
    appeared in -next. The "device dax" implementation was revised this
    week in response to review feedback, and to address failures detected
    by the recently expanded ndctl unit test suite.

    Not included in this pull request are two dax topic branches (dax
    error handling, and dax radix-tree locking). These topics were
    deferred to get a few more days of -next integration testing, and to
    coordinate a branch baseline with Ted and the ext4 tree. Vishal and
    Ross will send the error handling and locking topics respectively in
    the next few days.

    This branch has received a positive build result from the kbuild robot
    across 226 configs.

    Summary:

    - Device DAX for persistent memory: Device DAX is the device-centric
    analogue of Filesystem DAX (CONFIG_FS_DAX). It allows memory
    ranges to be allocated and mapped without need of an intervening
    file system. Device DAX is strict, precise and predictable.
    Specifically this interface:

    a) Guarantees fault granularity with respect to a given page size
    (pte, pmd, or pud) set at configuration time.

    b) Enforces deterministic behavior by being strict about what
    fault scenarios are supported.

    Persistent memory is the first target, but the mechanism is also
    targeted for exclusive allocations of performance/feature
    differentiated memory ranges.

    - Support for the HPE DSM (device specific method) command formats.
    This enables management of these first generation devices until a
    unified DSM specification materializes.

    - Further ACPI 6.1 compliance with support for the common dimm
    identifier format.

    - Various fixes and cleanups across the subsystem"

    * tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (40 commits)
    libnvdimm, dax: fix deletion
    libnvdimm, dax: fix alignment validation
    libnvdimm, dax: autodetect support
    libnvdimm: release ida resources
    Revert "block: enable dax for raw block devices"
    /dev/dax, core: file operations and dax-mmap
    /dev/dax, pmem: direct access to persistent memory
    libnvdimm: stop requiring a driver ->remove() method
    libnvdimm, dax: record the specified alignment of a dax-device instance
    libnvdimm, dax: reserve space to store labels for device-dax
    libnvdimm, dax: introduce device-dax infrastructure
    nfit: add sysfs dimm 'family' and 'dsm_mask' attributes
    tools/testing/nvdimm: ND_CMD_CALL support
    nfit: disable vendor specific commands
    nfit: export subsystem ids as attributes
    nfit: fix format interface code byte order per ACPI6.1
    nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism
    nfit, libnvdimm: clarify "commands" vs "_DSMs"
    libnvdimm: increase max envelope size for ioctl
    acpi/nfit: Add sysfs "id" for NVDIMM ID
    ...

    Linus Torvalds
     

21 May, 2016

1 commit

  • The "Device DAX" core enables dax mappings of performance / feature
    differentiated memory. An open mapping or file handle keeps the backing
    struct device live, but new mappings are only possible while the device
    is enabled. Faults are handled under rcu_read_lock to synchronize
    with the enabled state of the device.

    Similar to the filesystem-dax case the backing memory may optionally
    have struct page entries. However, unlike fs-dax there is no support
    for private mappings, or mappings that are not backed by media (see
    use of zero-page in fs-dax).

    Mappings are always guaranteed to match the alignment of the dax_region.
    If the dax_region is configured to have a 2MB alignment, all mappings
    are guaranteed to be backed by a pmd entry. Contrast this determinism
    with the fs-dax case where pmd mappings are opportunistic. If userspace
    attempts to force a misaligned mapping, the driver will fail the mmap
    attempt. See dax_dev_check_vma() for other scenarios that are rejected,
    like MAP_PRIVATE mappings.

    Cc: Hannes Reinecke
    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Andrew Morton
    Cc: Dave Hansen
    Cc: Ross Zwisler
    Acked-by: "Paul E. McKenney"
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Dan Williams
     

20 May, 2016

5 commits

  • This patchset deals with some problematic sites that iterate pfn ranges.

    There is a system whose nodes' pfns overlap as follows:

    -----pfn-------->
    N0 N1 N2 N0 N1 N2

    Therefore, we need to take care of this overlap when iterating over a
    pfn range.

    I audited many iterating sites that use pfn_valid(), pfn_valid_within(),
    zone_start_pfn, etc., and the others look safe to me. This is a
    preparation step for a new CMA implementation, ZONE_CMA
    (https://lkml.org/lkml/2015/2/12/95), because it would be easily
    overlapped with other zones. But, zone overlap check is also needed for
    the general case so I send it separately.

    This patch (of 5):

    alloc_gigantic_page() uses alloc_contig_range() and this requires that
    the requested range is in a single zone. To satisfy this requirement,
    add this check to pfn_range_valid_gigantic().
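
    A hedged sketch of the check added to pfn_range_valid_gigantic()
    (simplified; the upstream version also rejects reserved, huge and
    referenced pages):

    static bool pfn_range_valid_gigantic(struct zone *z,
                                         unsigned long start_pfn,
                                         unsigned long nr_pages)
    {
            unsigned long i, end_pfn = start_pfn + nr_pages;

            for (i = start_pfn; i < end_pfn; i++) {
                    struct page *page;

                    if (!pfn_valid(i))
                            return false;
                    page = pfn_to_page(i);
                    /* with overlapping nodes, a pfn in this range may
                     * belong to another zone, which alloc_contig_range()
                     * cannot handle, so reject such ranges */
                    if (page_zone(page) != z)
                            return false;
                    /* ... further checks elided ... */
            }
            return true;
    }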

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Laura Abbott
    Cc: Minchan Kim
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: "Aneesh Kumar K.V"
    Cc: "Rafael J. Wysocki"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Instead of open-coding it.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
    When any unsupported hugepage size is specified, 'hugepagesz=' and
    'hugepages=' should be ignored during command line parsing until any
    supported hugepage size is found. But currently an incorrect number of
    hugepages is allocated when an unsupported size is specified, because
    the parser fails to ignore the 'hugepages=' option.

    Test case:

    Note that this is specific to x86 architecture.

    Boot the kernel with command line option 'hugepagesz=256M hugepages=X'.
    After boot, dmesg output shows that X number of hugepages of the size 2M
    is pre-allocated instead of 0.

    So, to handle such command line options, introduce a new routine,
    hugetlb_bad_size. The routine hugetlb_bad_size sets the global variable
    parsed_valid_hugepagesz. We use parsed_valid_hugepagesz to record that
    an unsupported hugepage size was found, so that we can ignore the
    'hugepages=' parameters after that and then reset the variable when a
    supported hugepage size is found.

    The routine hugetlb_bad_size can be called while setting 'hugepagesz='
    parameter in an architecture specific code.
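
    A hedged sketch of the mechanism described above (simplified):

    static bool parsed_valid_hugepagesz __initdata = true;

    /* called from arch code when hugepagesz= names an unsupported size */
    void __init hugetlb_bad_size(void)
    {
            parsed_valid_hugepagesz = false;
    }

    static int __init hugetlb_nrpages_setup(char *s)
    {
            if (!parsed_valid_hugepagesz) {
                    pr_warn("hugepages=%s preceded by an unsupported hugepagesz, ignoring\n", s);
                    parsed_valid_hugepagesz = true; /* only skip this one */
                    return 1;
            }
            /* ... normal parsing of the hugepages= value ... */
            return 1;
    }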

    Signed-off-by: Vaishali Thakkar
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Cc: Yaowei Bai
    Cc: Dominik Dingel
    Cc: Kirill A. Shutemov
    Cc: Paul Gortmaker
    Cc: Dave Hansen
    Cc: Benjamin Herrenschmidt
    Cc: James Hogan
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vaishali Thakkar
     
  • It was observed that minimum size accounting associated with the
    hugetlbfs min_size mount option may not perform optimally and as
    expected. As huge pages/reservations are released from the filesystem
    and given back to the global pools, they are reserved for subsequent
    filesystem use as long as the subpool reserved count is less than
    subpool minimum size. It does not take into account used pages within
    the filesystem. The filesystem size limits are not exceeded and this is
    technically not a bug. However, better behavior would be to wait for
    the number of used pages/reservations associated with the filesystem to
    drop below the minimum size before taking reservations to satisfy
    minimum size.

    An optimization is also made to the hugepage_subpool_get_pages() routine
    which is called when pages/reservations are allocated. This does not
    change behavior, but simply avoids the accounting if all reservations
    have already been taken (subpool reserved count == 0).

    Signed-off-by: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Cc: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Lots of code does

    node = next_node(node, XXX);
    if (node == MAX_NUMNODES)
            node = first_node(XXX);

    so create next_node_in() to do this and use it in various places.
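
    A hedged sketch of the helper's semantics (the kernel implementation is
    a macro plus an out-of-line __next_node_in() in lib/nodemask.c; this
    only illustrates the wrap-around behaviour):

    static inline int next_node_in(int node, nodemask_t mask)
    {
            int ret = next_node(node, mask);

            if (ret == MAX_NUMNODES)
                    ret = first_node(mask);
            return ret;
    }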

    [mhocko@suse.com: use next_node_in() helper]
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Signed-off-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Laura Abbott
    Cc: Hui Zhu
    Cc: Wang Xiaoqiang
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

05 Apr, 2016

1 commit

    PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files. I've called spatch for them manually.

    The only adjustment after coccinelle is revert of changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

18 Mar, 2016

1 commit

  • There are a mixture of pr_warning and pr_warn uses in mm. Use pr_warn
    consistently.

    Miscellanea:

    - Coalesce formats
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

10 Mar, 2016

2 commits

  • Replace ENOTSUPP with EOPNOTSUPP. If hugepages are not supported, this
    value is propagated to userspace. EOPNOTSUPP is part of uapi and is
    widely supported by libc libraries.

    It gives a nicer message to the user, rather than:

    # cat /proc/sys/vm/nr_hugepages
    cat: /proc/sys/vm/nr_hugepages: Unknown error 524

    And also LTP's proc01 test was failing because this ret code (524)
    was unexpected:

    proc01 1 TFAIL : proc01.c:396: read failed: /proc/sys/vm/nr_hugepages: errno=???(524): Unknown error 524
    proc01 2 TFAIL : proc01.c:396: read failed: /proc/sys/vm/nr_hugepages_mempolicy: errno=???(524): Unknown error 524
    proc01 3 TFAIL : proc01.c:396: read failed: /proc/sys/vm/nr_overcommit_hugepages: errno=???(524): Unknown error 524

    Signed-off-by: Jan Stancek
    Acked-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Acked-by: David Rientjes
    Acked-by: Hillf Danton
    Cc: Mike Kravetz
    Cc: Dave Hansen
    Cc: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Stancek
     
  • The warning message "killed due to inadequate hugepage pool" simply
    indicates that SIGBUS was sent, not that the process was forcibly killed.
    If the process has a signal handler installed, that does not fix the
    problem, and this message can rapidly spam the kernel log.

    On my amd64 dev machine that does not have hugepages configured, I can
    reproduce the repeated warnings easily by setting vm.nr_hugepages=2 (i.e.,
    4 megabytes of huge pages) and running something that sets a signal
    handler and forks, like

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    sig_atomic_t counter = 10;
    void handler(int signal)
    {
            if (counter-- == 0)
                    exit(0);
    }

    int main(void)
    {
            int status;
            char *addr = mmap(NULL, 4 * 1048576, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (addr == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            *addr = 'x';
            switch (fork()) {
            case -1:
                    perror("fork");
                    return 1;
            case 0:
                    signal(SIGBUS, handler);
                    *addr = 'x';
                    break;
            default:
                    *addr = 'x';
                    wait(&status);
                    if (WIFSIGNALED(status))
                            psignal(WTERMSIG(status), "child");
                    break;
            }
    }

    Signed-off-by: Geoffrey Thomas
    Cc: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geoffrey Thomas
     

19 Feb, 2016

1 commit

    Currently an incorrect default hugepage pool size is reported by
    /proc/sys/vm/nr_hugepages when the number of pages for the default huge
    page size is specified twice.

    When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
    indicates the current number of pre-allocated huge pages of the default
    size. Basically /proc/sys/vm/nr_hugepages displays
    default_hstate->max_huge_pages and, after boot time pre-allocation,
    max_huge_pages should equal the number of pre-allocated pages
    (nr_hugepages).

    Test case:

    Note that this is specific to x86 architecture.

    Boot the kernel with command line option 'default_hugepagesz=1G
    hugepages=X hugepagesz=2M hugepages=Y hugepagesz=1G hugepages=Z'. After
    boot, 'cat /proc/sys/vm/nr_hugepages' and 'sysctl -a | grep hugepages'
    returns the value X. However, dmesg output shows that Z huge pages were
    pre-allocated.

    So, the root cause of the problem here is that the global variable
    default_hstate_max_huge_pages is set if a default huge page size is
    specified (directly or indirectly) on the command line. After the command
    line processing in hugetlb_init, if default_hstate_max_huge_pages is set,
    the value is assigned to default_hstate.max_huge_pages. However,
    default_hstate.max_huge_pages may have already been set based on the
    number of pre-allocated huge pages of default_hstate size.

    The solution to this problem is that if hstate->max_huge_pages is
    already set, it should not be overridden by the global max_huge_pages
    value. Basically, if the hugepages value is set multiple times on the
    command line for a specific supported hugepagesz, the proc layer should
    report the last specified value.
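
    A hedged sketch of the corresponding change in hugetlb_init()
    (simplified): only apply the global default when the default hstate was
    not already sized by an explicit "hugepagesz=<default size> hugepages=Z"
    pair on the command line.

    if (default_hstate_max_huge_pages) {
            if (!default_hstate.max_huge_pages)
                    default_hstate.max_huge_pages =
                            default_hstate_max_huge_pages;
    }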

    Signed-off-by: Vaishali Thakkar
    Reviewed-by: Naoya Horiguchi
    Cc: Mike Kravetz
    Cc: Hillf Danton
    Cc: Kirill A. Shutemov
    Cc: Dave Hansen
    Cc: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vaishali Thakkar
     

06 Feb, 2016

1 commit

  • Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
    at runtime") has added the runtime gigantic page allocation via
    alloc_contig_range(), making this support available only when CONFIG_CMA
    is enabled. Because it doesn't depend on MIGRATE_CMA pageblocks and the
    associated infrastructure, it is possible with a few simple adjustments to
    require only CONFIG_MEMORY_ISOLATION instead of full CONFIG_CMA.

    After this patch, alloc_contig_range() and related functions are
    available and used for gigantic pages with just CONFIG_MEMORY_ISOLATION
    enabled. Note CONFIG_CMA selects CONFIG_MEMORY_ISOLATION. This allows
    supporting runtime gigantic pages without the CMA-specific checks in
    page allocator fastpaths.

    Signed-off-by: Vlastimil Babka
    Cc: Luiz Capitulino
    Cc: Kirill A. Shutemov
    Cc: Zhang Yanfei
    Cc: Yasuaki Ishimatsu
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Davidlohr Bueso
    Cc: Hillf Danton
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka