19 Apr, 2014

1 commit

  • The soft lockup when freeing gigantic hugepages, fixed in commit
    55f67141a892 ("mm: hugetlb: fix softlockup when a large number of
    hugepages are freed."), can also happen in
    return_unused_surplus_pages(), so fix it there as well.

    Signed-off-by: Masayoshi Mizuma
    Signed-off-by: Naoya Horiguchi
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Aneesh Kumar
    Cc: KOSAKI Motohiro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mizuma, Masayoshi
     

08 Apr, 2014

5 commits

  • When I decrease the value of nr_hugepages in procfs by a lot, a
    softlockup happens. This is because there is no chance for a context
    switch during the freeing process.

    On the other hand, when I allocate a large number of hugepages, there is
    some chance of a context switch, so a softlockup doesn't happen during
    that process. Hence it is necessary to add a context switch to the
    freeing process, just as in the allocating process, to avoid the
    softlockup.

    When I freed 12 TB of hugepages with kernel-2.6.32-358.el6, the freeing
    process occupied a CPU for over 150 seconds and the following softlockup
    message appeared twice or more.

    $ echo 6000000 > /proc/sys/vm/nr_hugepages
    $ cat /proc/sys/vm/nr_hugepages
    6000000
    $ grep ^Huge /proc/meminfo
    HugePages_Total: 6000000
    HugePages_Free: 6000000
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    $ echo 0 > /proc/sys/vm/nr_hugepages

    BUG: soft lockup - CPU#16 stuck for 67s! [sh:12883] ...
    Pid: 12883, comm: sh Not tainted 2.6.32-358.el6.x86_64 #1
    Call Trace:
    free_pool_huge_page+0xb8/0xd0
    set_max_huge_pages+0x128/0x190
    hugetlb_sysctl_handler_common+0x113/0x140
    hugetlb_sysctl_handler+0x1e/0x20
    proc_sys_call_handler+0x97/0xd0
    proc_sys_write+0x14/0x20
    vfs_write+0xb8/0x1a0
    sys_write+0x51/0x90
    __audit_syscall_exit+0x265/0x290
    system_call_fastpath+0x16/0x1b

    I have not confirmed this problem with upstream kernels because I am not
    able to prepare a machine equipped with 12TB of memory right now.
    However, I confirmed that the required time is directly proportional to
    the number of hugepages being decreased.

    I measured the required times on a smaller machine. It showed that
    130-145 hugepages were decreased per millisecond.

    Hugepages decreased       Required time (msec)   Decreasing rate (pages/msec)
    ---------------------------------------------------------------------------
    10,000 pages == 20GB             70 -  74               135 - 142
    30,000 pages == 60GB            208 - 229               131 - 144

    At this rate, decreasing 6TB worth of hugepages will trigger a softlockup
    with the default threshold of 20 seconds.
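
    A minimal sketch of the kind of change described above, assuming the
    freeing loop shape of set_max_huge_pages() in mm/hugetlb.c (this is an
    illustration, not the literal upstream diff): a voluntary reschedule
    point inside the loop lets other tasks run while millions of hugepages
    are handed back.

    /*
     * hugetlb_lock is a spinlock held across the loop, so
     * cond_resched_lock() is used: it drops the lock, reschedules if
     * needed, and re-acquires it before the next iteration.
     */
    while (min_count < persistent_huge_pages(h)) {
            if (!free_pool_huge_page(h, nodes_allowed, 0))
                    break;
            cond_resched_lock(&hugetlb_lock);
    }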

    Signed-off-by: Masayoshi Mizuma
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Wanpeng Li
    Cc: Aneesh Kumar
    Cc: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mizuma, Masayoshi
     
  • Signed-off-by: Choi Gi-yong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Choi Gi-yong
     
  • To increase compiler portability there is <linux/compiler.h>, which
    provides convenience macros for various gcc constructs, e.g. __weak for
    __attribute__((weak)). I've replaced all instances of gcc attributes with
    the right macro in the memory management (/mm) subsystem.
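
    For illustration, the kind of substitution this covers (the function
    name below is hypothetical):

    #include <linux/compiler.h>

    /* Before: raw gcc attribute syntax. */
    void __attribute__((weak)) my_arch_hook(void);

    /* After: the convenience macro provided by <linux/compiler.h>. */
    void __weak my_arch_hook(void);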

    [akpm@linux-foundation.org: while-we're-there consistency tweaks]
    Signed-off-by: Gideon Israel Dsouza
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
     
  • The NUMA scanning code can end up iterating over many gigabytes of
    unpopulated memory, especially in the case of a freshly started KVM
    guest with lots of memory.

    This results in the mmu notifier code being called even when there are
    no mapped pages in a virtual address range. The amount of time wasted
    can be enough to trigger soft lockup warnings with very large KVM
    guests.

    This patch moves the mmu notifier call to the pmd level, which
    represents 1GB areas of memory on x86-64. Furthermore, the mmu notifier
    code is only called from the address in the PMD where present mappings
    are first encountered.

    The hugetlbfs code is left alone for now; hugetlb mappings are not
    relocatable, and as such are left alone by the NUMA code, and should
    never trigger this problem to begin with.
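
    A simplified sketch of the idea, assuming a change_pmd_range()-style
    walk (not the exact upstream code): the invalidate_range_start
    notification is deferred until the first populated PMD is found, and
    the matching range_end is only issued if a start was ever sent.

    unsigned long mni_start = 0;    /* address of first present mapping */

    do {
            next = pmd_addr_end(addr, end);
            if (pmd_none_or_clear_bad(pmd))
                    continue;       /* empty: skip without notifying */

            if (!mni_start) {
                    mni_start = addr;
                    mmu_notifier_invalidate_range_start(mm, mni_start, end);
            }
            /* ... update protections under this pmd ... */
    } while (pmd++, addr = next, addr != end);

    if (mni_start)
            mmu_notifier_invalidate_range_end(mm, mni_start, end);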

    Signed-off-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Reported-by: Xing Gang
    Tested-by: Chegu Vinod
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • huge_pte_offset() could return NULL, so we need a NULL check to avoid
    potential NULL pointer dereferences.
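
    A minimal sketch of the pattern (the surrounding caller and return
    value here are illustrative, not the specific sites patched):

    pte_t *ptep;

    ptep = huge_pte_offset(mm, address & huge_page_mask(h));
    if (!ptep)              /* no page table for this huge address */
            return 0;       /* bail out instead of dereferencing NULL */
    entry = huge_ptep_get(ptep);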

    Signed-off-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Sasha Levin
    Cc: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

04 Apr, 2014

8 commits

  • Both prep_compound_huge_page() and prep_compound_gigantic_page() are
    only called at bootstrap and can be marked as __init.

    The __SetPageTail(page) in prep_compound_gigantic_page() happening
    before page->first_page is initialized is not concerning since this is
    bootstrap.
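
    For reference, the annotation boils down to this (a sketch of the
    declaration only, with the body elided):

    /*
     * __init places the function in .init.text, which is discarded once
     * boot completes -- safe because gigantic pages are only prepared
     * from the early "hugepages=" boot-time setup.
     */
    static void __init prep_compound_gigantic_page(struct page *page,
                                                   unsigned long order)
    {
            /* ... mark the head page and chain the tail pages ... */
    }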

    Signed-off-by: David Rientjes
    Reviewed-by: Michal Hocko
    Cc: Joonsoo Kim
    Reviewed-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The kernel can currently only handle a single hugetlb page fault at a
    time. This is due to a single mutex that serializes the entire path.
    This lock protects from spurious OOM errors under conditions of low
    availability of free hugepages. This problem is specific to hugepages,
    because it is normal to want to use every single hugepage in the system
    - with normal pages we simply assume there will always be a few spare
    pages which can be used temporarily until the race is resolved.

    Address this problem by using a table of mutexes, allowing a better
    chance of parallelization, where each hugepage is individually
    serialized. The hash key is selected depending on the mapping type: for
    shared mappings it consists of the address space and the file offset
    being faulted on, while for private mappings the mm and the virtual
    address are used. The size of the table is selected based on a
    compromise between collisions and memory footprint, evaluated over a
    series of database workloads.

    Large database workloads that make heavy use of hugepages can be
    particularly exposed to this issue, causing start-up times to be
    painfully slow. This patch reduces the startup time of a 10 Gb Oracle
    DB (with ~5000 faults) from 37.5 secs to 25.7 secs. Larger workloads
    will naturally benefit even more.

    NOTE:
    The only downside to this patch, detected by Joonsoo Kim, is that a
    small race is possible in private mappings: a child process (with its
    own mm, after cow) can instantiate a page that is already being handled
    by the parent in a cow fault. When low on pages, this can trigger
    spurious OOMs. I have not been able to think of an efficient way of
    handling this... but do we really care about such a tiny window? We
    already maintain another theoretical race with normal pages. If not, one
    possible way is to maintain a single hash for private mappings -- any
    workloads that *really* suffer from this scaling problem should
    already use shared mappings.
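
    A sketch of the hash-key selection described above (the table and
    helper names follow the changelog's wording and should be read as an
    illustration rather than the exact upstream code):

    static struct mutex *htlb_fault_mutex_table;
    static int num_fault_mutexes;   /* table size, a power of two */

    static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
                                struct vm_area_struct *vma,
                                struct address_space *mapping,
                                pgoff_t idx, unsigned long address)
    {
            unsigned long key[2];

            if (vma->vm_flags & VM_SHARED) {
                    key[0] = (unsigned long)mapping;        /* address space */
                    key[1] = idx;                           /* file offset   */
            } else {
                    key[0] = (unsigned long)mm;             /* private: mm   */
                    key[1] = address >> huge_page_shift(h); /* virtual addr  */
            }

            return jhash2((u32 *)&key, sizeof(key) / sizeof(u32), 0) &
                   (num_fault_mutexes - 1);
    }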

    [akpm@linux-foundation.org: remove stray + characters, go BUG if hugetlb_init() kmalloc fails]
    Signed-off-by: Davidlohr Bueso
    Cc: Aneesh Kumar K.V
    Cc: David Gibson
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Until now, we get a resv_map in two different ways depending on the
    mapping type. This makes the code messy and unreadable. Unify it.

    [davidlohr@hp.com: code cleanups]
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • This is a preparation patch to unify the use of vma_resv_map()
    regardless of the map type. This patch prepares it by removing
    resv_map_put(), which only works for HPAGE_RESV_OWNER's resv_map, not
    for all resv_maps.

    [davidlohr@hp.com: update changelog]
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • There is a race condition if we map the same file in different
    processes. Region tracking is protected by mmap_sem and
    hugetlb_instantiation_mutex. When we mmap, we don't grab the
    hugetlb_instantiation_mutex, only mmap_sem (exclusively). This doesn't
    prevent other tasks from modifying the region structure, so it can be
    modified by two processes concurrently.

    To solve this, introduce a spinlock to resv_map and make the region
    manipulation functions grab it before they do the actual work.
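
    A sketch of the locking change (structure layout and function shape as
    described above; simplified, not the full region-tracking code):

    struct resv_map {
            struct kref refs;
            spinlock_t lock;                /* protects the regions list */
            struct list_head regions;
    };

    /* Region manipulation now takes the resv_map and grabs its lock. */
    static long region_count(struct resv_map *resv, long f, long t)
    {
            long chg = 0;

            spin_lock(&resv->lock);
            /* ... walk resv->regions and count the pages in [f, t) ... */
            spin_unlock(&resv->lock);

            return chg;
    }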

    [davidlohr@hp.com: updated changelog]
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Joonsoo Kim
    Suggested-by: Joonsoo Kim
    Acked-by: David Gibson
    Cc: David Gibson
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • To change the protection method for region tracking to a fine grained
    one, we pass the resv_map, instead of the list_head, to the region
    manipulation functions.

    This doesn't introduce any functional change; it just prepares for the
    next step.

    [davidlohr@hp.com: update changelog]
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, to track reserved and allocated regions, we use two different
    mechanisms depending on the mapping. For MAP_SHARED, we use the
    address_mapping's private_list, while for MAP_PRIVATE we use a
    resv_map.

    Now, we are preparing to change the coarse grained lock which protects
    the region structure to a fine grained lock, and this difference hinders
    that. So, before changing it, unify region structure handling by
    consistently using a resv_map regardless of the kind of mapping.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Since put_mems_allowed() is strictly optional (it's a seqcount retry), we
    don't need to evaluate the function if the allocation was in fact
    successful, saving an smp_rmb(), some loads, and comparisons on some
    relatively fast paths.

    Since the naming get/put_mems_allowed() does suggest a mandatory
    pairing, rename the interface, as suggested by Mel, to resemble the
    seqcount interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
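
    A sketch of how the renamed interface reads on the allocator side (loop
    shape only; the allocation step is a hypothetical placeholder):

    unsigned int cpuset_mems_cookie;
    struct page *page;

    retry_cpuset:
    cpuset_mems_cookie = read_mems_allowed_begin();

    page = try_alloc_within_mems_allowed();  /* hypothetical allocation step */

    /*
     * The seqcount re-check is only paid on failure, and the return value
     * is inverted relative to put_mems_allowed(): "retry" is true when
     * mems_allowed changed underneath us.
     */
    if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
            goto retry_cpuset;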

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

24 Jan, 2014

1 commit

  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    I've recently noticed, based on requests to add a small piece of code
    that dumps the page to various VM_BUG_ON sites, that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
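
    A sketch of what the new assertion boils down to when CONFIG_DEBUG_VM
    is enabled (simplified):

    #define VM_BUG_ON_PAGE(cond, page)                                  \
            do {                                                        \
                    if (unlikely(cond)) {                               \
                            dump_page(page); /* flags, count, mapcount */ \
                            BUG();                                      \
                    }                                                   \
            } while (0)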

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

22 Jan, 2014

5 commits

  • Switch to memblock interfaces for the early memory allocator instead of
    the bootmem allocator. There is no functional change in behavior from
    the bootmem users' point of view.

    Archs already converted to NO_BOOTMEM now directly use memblock
    interfaces instead of bootmem wrappers built on top of memblock. For the
    archs which still use bootmem, these new APIs simply fall back to the
    existing bootmem APIs.

    Signed-off-by: Grygorii Strashko
    Signed-off-by: Santosh Shilimkar
    Cc: "Rafael J. Wysocki"
    Cc: Arnd Bergmann
    Cc: Christoph Lameter
    Cc: Greg Kroah-Hartman
    Cc: H. Peter Anvin
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Michal Hocko
    Cc: Paul Walmsley
    Cc: Pavel Machek
    Cc: Russell King
    Cc: Tejun Heo
    Cc: Tony Lindgren
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Grygorii Strashko
     
  • When copy_hugetlb_page_range() is called to copy a range of hugetlb
    mappings, the secondary MMUs are not notified if there is a protection
    downgrade, which breaks COW semantics in KVM.

    This patch adds the necessary MMU notifier calls.

    Signed-off-by: Andreas Sandberg
    Acked-by: Steve Capper
    Acked-by: Marc Zyngier
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Sandberg
     
  • There is no actual need for it, so keep it internal.

    Signed-off-by: Andrea Arcangeli
    Cc: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Signed-off-by: Andrea Arcangeli
    Cc: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • get_page_foll() is more optimal and is always safe to use under the PT
    lock. More so for hugetlbfs as there's no risk of race conditions with
    split_huge_page regardless of the PT lock.

    Signed-off-by: Andrea Arcangeli
    Tested-by: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

22 Nov, 2013

2 commits

  • Commit 7cb2ef56e6a8 ("mm: fix aio performance regression for database
    caused by THP") can cause dereference of a dangling pointer if
    split_huge_page runs during PageHuge() while there are updates to the
    tail_page->private field.

    Also it is repeating compound_head twice for hugetlbfs and it is running
    compound_head+compound_trans_head for THP when a single one is needed in
    both cases.

    The new code within the PageSlab() check doesn't need to verify that the
    THP page size is never bigger than the smallest hugetlbfs page size, to
    avoid memory corruption.

    A longstanding theoretical race condition was found while fixing the
    above (see the change right after the skip_unlock label, that is
    relevant for the compound_lock path too).

    By re-establishing the _mapcount tail refcounting for all compound
    pages, this also fixes the below problem:

    echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    BUG: Bad page state in process bash pfn:59a01
    page:ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0
    page flags: 0x1c00000000008000(tail)
    Modules linked in:
    CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x55/0x76
    bad_page+0xd5/0x130
    free_pages_prepare+0x213/0x280
    __free_pages+0x36/0x80
    update_and_free_page+0xc1/0xd0
    free_pool_huge_page+0xc2/0xe0
    set_max_huge_pages.part.58+0x14c/0x220
    nr_hugepages_store_common.isra.60+0xd0/0xf0
    nr_hugepages_store+0x13/0x20
    kobj_attr_store+0xf/0x20
    sysfs_write_file+0x189/0x1e0
    vfs_write+0xc5/0x1f0
    SyS_write+0x55/0xb0
    system_call_fastpath+0x16/0x1b

    Signed-off-by: Khalid Aziz
    Signed-off-by: Andrea Arcangeli
    Tested-by: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Right now, the migration code in migrate_page_copy() uses copy_huge_page()
    for hugetlbfs and thp pages:

    if (PageHuge(page) || PageTransHuge(page))
            copy_huge_page(newpage, page);

    So, yay for code reuse. But:

    void copy_huge_page(struct page *dst, struct page *src)
    {
            struct hstate *h = page_hstate(src);

    and a non-hugetlbfs page has no page_hstate(). This works 99% of the
    time because page_hstate() determines the hstate from the page order
    alone. Since the page order of a THP page matches the default hugetlbfs
    page order, it works.

    But, if you change the default huge page size on the boot command-line
    (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
    so page_hstate() returns null and copy_huge_page() oopses pretty fast
    since copy_huge_page() dereferences the hstate:

    void copy_huge_page(struct page *dst, struct page *src)
    {
            struct hstate *h = page_hstate(src);

            if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
                    ...

    Mel noticed that the migration code is really the only user of these
    functions. This moves all the copy code over to migrate.c and makes
    copy_huge_page() work for THP by checking for it explicitly.

    I believe the bug was introduced in commit b32967ff101a ("mm: numa: Add
    THP migration for the NUMA working set scanning fault case")
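
    A simplified sketch of the THP-aware copy helper described above (the
    gigantic-page special case is omitted; treat this as an illustration of
    the shape of the fix, not the exact upstream code):

    static void copy_huge_page(struct page *dst, struct page *src)
    {
            int i, nr_pages;

            if (PageHuge(src)) {
                    /* hugetlbfs page: page_hstate() is valid here */
                    struct hstate *h = page_hstate(src);

                    nr_pages = pages_per_huge_page(h);
            } else {
                    /* transparent huge page: no hstate involved at all */
                    VM_BUG_ON(!PageTransHuge(src));
                    nr_pages = hpage_nr_pages(src);
            }

            for (i = 0; i < nr_pages; i++) {
                    cond_resched();
                    copy_highpage(dst + i, src + i);
            }
    }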

    [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Reviewed-by: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: Andrea Arcangeli
    Tested-by: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

15 Nov, 2013

1 commit

  • Hugetlb supports multiple page sizes. We use split lock only for PMD
    level, but not for PUD.
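
    A sketch of the resulting lock selection (split lock for PMD-sized
    pages, the per-mm page_table_lock otherwise; an illustration, not
    necessarily the exact upstream helper):

    static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
                                               struct mm_struct *mm,
                                               pte_t *pte)
    {
            if (huge_page_size(h) == PMD_SIZE)
                    return pmd_lockptr(mm, (pmd_t *)pte);   /* split ptl */

            VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
            return &mm->page_table_lock;                    /* PUD and up */
    }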

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

17 Oct, 2013

2 commits

  • Commit 11feeb498086 ("kvm: optimize away THP checks in
    kvm_is_mmio_pfn()") introduced a memory leak when KVM is run on gigantic
    compound pages.

    That commit depends on the assumption that PG_reserved is identical for
    all head and tail pages of a compound page. So that if get_user_pages
    returns a tail page, we don't need to check the head page in order to
    know if we deal with a reserved page that requires different
    refcounting.

    The assumption that PG_reserved is the same for head and tail pages is
    certainly correct for THP and regular hugepages, but gigantic hugepages
    allocated through bootmem don't clear the PG_reserved on the tail pages
    (the clearing of PG_reserved is done later only if the gigantic hugepage
    is freed).

    This patch corrects the gigantic compound page initialization so that we
    can retain the optimization in 11feeb498086. The cacheline was already
    modified in order to set PG_tail so this won't affect the boot time of
    large memory systems.

    [akpm@linux-foundation.org: tweak comment layout and grammar]
    Signed-off-by: Andrea Arcangeli
    Reported-by: andy123
    Acked-by: Rik van Riel
    Cc: Gleb Natapov
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • We should clear the page's private flag when returning the page to the
    hugepage pool. Otherwise, the marked hugepage can be allocated to a user
    who tries to allocate a non-reserved hugepage. If this user fails to
    map the hugepage, they will try to return the page to the hugepage pool.
    Since this page has the private flag set, resv_huge_pages would
    mistakenly increase. This patch fixes that situation.

    Signed-off-by: Joonsoo Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: David Gibson
    Cc: Wanpeng Li
    Cc: Naoya Horiguchi
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

12 Sep, 2013

15 commits

  • Now that hugepage migration is enabled, although restricted to pmd-based
    hugepages for now (due to lack of testing), we should allocate
    migratable hugepages from ZONE_MOVABLE if possible.

    This patch makes the GFP flags in hugepage allocation dependent on
    migration support, not only on the value of hugepages_treat_as_movable.
    It introduces no behavioral change for architectures which do not
    support hugepage migration.
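
    A sketch of the resulting GFP selection (the helper names are
    assumptions based on the changelog, not verified against the exact
    tree):

    static inline gfp_t htlb_alloc_mask(struct hstate *h)
    {
            /* Movable only if the sysctl asks for it or migration works. */
            if (hugepages_treat_as_movable || hugepage_migration_support(h))
                    return GFP_HIGHUSER_MOVABLE;
            else
                    return GFP_HIGHUSER;
    }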

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Cc: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Until now we can't offline memory blocks which contain hugepages because
    a hugepage is considered an unmovable page. But now, with this patch
    series, a hugepage has become movable, so by using hugepage migration we
    can offline such memory blocks.

    What's different from other users of hugepage migration is that we need
    to decompose all the hugepages inside the target memory block into free
    buddy pages after hugepage migration, because otherwise free hugepages
    remaining in the memory block interfere with memory offlining. For this
    reason we introduce the new functions dissolve_free_huge_page() and
    dissolve_free_huge_pages().

    Other than that, what this patch does is straightforward: it adds
    hugepage migration code, that is, hugepage handling in the functions
    which scan over pfns and collect pages to be migrated, and a hugepage
    allocation function in alloc_migrate_target().

    As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
    over them because they are larger than a memory block. So for now we
    simply let that case fail as it does today.
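
    A sketch of what dissolving a single free hugepage amounts to
    (simplified from the description above; not the exact upstream code):

    /* Give a free hugepage back to the buddy allocator. */
    static void dissolve_free_huge_page(struct page *page)
    {
            spin_lock(&hugetlb_lock);
            if (PageHuge(page) && !page_count(page)) {
                    struct hstate *h = page_hstate(page);
                    int nid = page_to_nid(page);

                    list_del(&page->lru);
                    h->free_huge_pages--;
                    h->free_huge_pages_node[nid]--;
                    update_and_free_page(h, page);  /* back to buddy */
            }
            spin_unlock(&hugetlb_lock);
    }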

    [yongjun_wei@trendmicro.com.cn: remove duplicated include]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Cc: Hillf Danton
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Extend do_mbind() to handle vma with VM_HUGETLB set. We will be able to
    migrate hugepage with mbind(2) after applying the enablement patch which
    comes later in this series.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently hugepage migration is available only for soft offlining, but
    it's also useful for some other users of page migration (clearly because
    users of hugepage can enjoy the benefit of mempolicy and memory hotplug.)
    So this patchset tries to extend such users to support hugepage migration.

    The target of this patchset is to enable hugepage migration for NUMA
    related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and
    memory hotplug.

    This patchset does not add hugepage migration for memory compaction,
    because users of memory compaction mainly expect to construct thp by
    arranging raw pages, and there's little or no need to compact hugepages.
    CMA, another user of page migration, can have benefit from hugepage
    migration, but is not enabled to support it for now (just because of lack
    of testing and expertise in CMA.)

    Hugepage migration of non pmd-based hugepage (for example 1GB hugepage in
    x86_64, or hugepages in architectures like ia64) is not enabled for now
    (again, because of lack of testing.)

    As for how these are achieved, I extended the API (migrate_pages()) to
    handle hugepages (with patches 1 and 2) and adjusted the code of each
    caller to check for and collect movable hugepages (with patches 3-7).
    The remaining 2 patches are miscellaneous ones to avoid unexpected
    behavior. Patch 8 is about making sure that we only migrate pmd-based
    hugepages, and patch 9 is about choosing the appropriate zone for
    hugepage allocation.

    My test is mainly functional one, simply kicking hugepage migration via
    each entry point and confirm that migration is done correctly. Test code
    is available here:

    git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git

    And I always run libhugetlbfs test when changing hugetlbfs's code. With
    this patchset, no regression was found in the test.

    This patch (of 9):

    Before enabling each user of page migration to support hugepage,
    this patch enables the list of pages for migration to link not only
    LRU pages, but also hugepages. As a result, putback_movable_pages()
    and migrate_pages() can handle both of LRU pages and hugepages.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • If we fail with a reserved page, just calling put_page() is not
    sufficient, because put_page() invokes free_huge_page() as its last
    step, which doesn't know whether the page comes from a reserved pool or
    not, so it doesn't do anything related to the reserve count. This leaves
    the reserve count lower than it should be, because the reserve count was
    already decreased in dequeue_huge_page_vma(). This patch fixes that
    situation.

    Signed-off-by: Joonsoo Kim
    Cc: Aneesh Kumar
    Cc: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: David Gibson
    Cc: Wanpeng Li
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We don't need to hold the page_table_lock when we release a page. So,
    defer grabbing the page_table_lock.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Cc: Aneesh Kumar
    Reviewed-by: Davidlohr Bueso
    Cc: David Gibson
    Cc: Wanpeng Li
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • is_vma_resv_set(vma, HPAGE_RESV_OWNER) implies that the mapping is
    private, so we don't need to check whether the mapping is shared or
    not.

    This patch is just for clean-up.

    Signed-off-by: Joonsoo Kim
    Cc: Aneesh Kumar
    Cc: Naoya Horiguchi
    Reviewed-by: Davidlohr Bueso
    Cc: David Gibson
    Cc: Wanpeng Li
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • If we allocate a hugepage with avoid_reserve, we don't dequeue a
    reserved one, so we should also check the subpool counter when
    avoid_reserve is set. This patch implements that.

    Signed-off-by: Joonsoo Kim
    Cc: Aneesh Kumar
    Cc: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: David Gibson
    Cc: Wanpeng Li
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • 'reservations' is too long a name for a variable, and we use 'resv_map'
    to represent 'struct resv_map' elsewhere. To reduce confusion and
    improve readability, change it.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar
    Cc: Naoya Horiguchi
    Reviewed-by: Davidlohr Bueso
    Cc: David Gibson
    Cc: Wanpeng Li
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Don't use the reserve pool when soft offlining a hugepage. Check that we
    have free pages outside the reserve pool before we dequeue the huge
    page. Otherwise, we can steal another task's reserved page.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar
    Cc: Naoya Horiguchi
    Reviewed-by: Davidlohr Bueso
    Cc: David Gibson
    Cc: Wanpeng Li
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • If a vma with VM_NORESERVE allocates a new page for the page cache, we
    should check whether this area is reserved or not. If this address is
    already reserved by another process (the chg == 0 case), we should
    decrement the reserve count, because this allocated page will go into
    the page cache and currently there is no way to know, when releasing the
    inode, whether the page came from the reserved pool or not. This can
    introduce an over-counting problem for the reserve count. With the
    following example code, you can easily reproduce this situation.

    Assume 2MB, nr_hugepages = 100

    size = 20 * MB;
    flag = MAP_SHARED;
    p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
    if (p == MAP_FAILED) {
            fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
            return -1;
    }

    flag = MAP_SHARED | MAP_NORESERVE;
    q = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
    if (q == MAP_FAILED) {
            fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
    }
    q[0] = 'c';

    After the program finishes, run 'cat /proc/meminfo'. You will see the
    result below.

    HugePages_Free: 100
    HugePages_Rsvd: 1

    To fix this, we should check our mapping type and the tracked region. If
    the mapping is VM_NORESERVE and VM_MAYSHARE and chg is 0, this implies
    that the currently allocated page will go into the page cache, which was
    already reserved when the mapping was created. In this case, we should
    decrease the reserve count. This patch implements the check described
    above and thereby solves the problem.

    [akpm@linux-foundation.org: fix spelling in comment]
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Wanpeng Li
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now, the checking conditions of decrement_hugepage_resv_vma() and
    vma_has_reserves() are the same, so we can clean this up using
    vma_has_reserves(). Additionally, decrement_hugepage_resv_vma() has only
    one call site, so we can remove the function and embed it into
    dequeue_huge_page_vma() directly. This patch implements that.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Wanpeng Li
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • If we map a region with MAP_NORESERVE and MAP_SHARED, we skip the
    reserve counting check, and eventually we cannot be sure that a huge
    page can be allocated at fault time. With the following example code,
    you can easily reproduce this situation.

    Assume 2MB, nr_hugepages = 100

    fd = hugetlbfs_unlinked_fd();
    if (fd < 0)
            return 1;

    size = 200 * MB;
    flag = MAP_SHARED;
    p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
    if (p == MAP_FAILED) {
            fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
            return -1;
    }

    size = 2 * MB;
    flag = MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB | MAP_NORESERVE;
    p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, -1, 0);
    if (p == MAP_FAILED) {
            fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
    }
    p[0] = '0';
    sleep(10);

    During executing sleep(10), run 'cat /proc/meminfo' on another process.

    HugePages_Free: 99
    HugePages_Rsvd: 100

    The number of free pages should be higher than or equal to the number of
    reserved pages, but here it isn't. This shows that a non-reserved shared
    mapping stole a reserved page. A non-reserved shared mapping should not
    eat into reserve space.

    If we consider VM_NORESERVE in vma_has_reserves() and return 0, meaning
    that we don't have a reserved page, then we check that we have enough
    free pages in dequeue_huge_page_vma(). This prevents stealing a reserved
    page.

    With this change, the above test generates a SIGBUS, which is correct,
    because all free pages are reserved and a non-reserved shared mapping
    can't get a free page.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Wanpeng Li
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, we use a page in the page cache with a map count of 1 for the
    cow optimization: if we find this condition, we don't allocate a new
    page and copy the contents, but instead map this page directly. This can
    introduce a problem where writing to a private mapping overwrites the
    hugetlb file directly. You can reproduce this situation with the
    following code.

    size = 20 * MB;
    flag = MAP_SHARED;
    p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
    if (p == MAP_FAILED) {
            fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
            return -1;
    }
    p[0] = 's';
    fprintf(stdout, "BEFORE STEAL PRIVATE WRITE: %c\n", p[0]);
    munmap(p, size);

    flag = MAP_PRIVATE;
    p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
    if (p == MAP_FAILED) {
            fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
    }
    p[0] = 'c';
    munmap(p, size);

    flag = MAP_SHARED;
    p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
    if (p == MAP_FAILED) {
            fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
            return -1;
    }
    fprintf(stdout, "AFTER STEAL PRIVATE WRITE: %c\n", p[0]);
    munmap(p, size);

    We can see that "AFTER STEAL PRIVATE WRITE: c", not "AFTER STEAL PRIVATE
    WRITE: s". If we turn off this optimization to a page in page cache, the
    problem is disappeared.

    So, I change the trigger condition of optimization. If this page is not
    AnonPage, we don't do optimization. This makes this optimization turning
    off for a page cache.

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Hocko
    Reviewed-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Aneesh Kumar K.V
    Acked-by: Hillf Danton
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • If the list is empty, list_for_each_entry_safe() doesn't do anything.
    So, this check is redundant. Remove it.
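
    For illustration (the list name and the work done per entry are
    hypothetical):

    struct page *page, *next;
    LIST_HEAD(pages);

    /* Before: the guard adds nothing ... */
    if (!list_empty(&pages))
            list_for_each_entry_safe(page, next, &pages, lru)
                    put_page(page);

    /* ... because iterating an empty list already does zero work: */
    list_for_each_entry_safe(page, next, &pages, lru)
            put_page(page);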

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Hocko
    Reviewed-by: Wanpeng Li
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Hillf Danton
    Cc: Naoya Horiguchi
    Cc: Wanpeng Li
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim