26 Jul, 2011

1 commit


16 Jun, 2011

1 commit

  • Johannes noticed the vmstat update is already taken care of by
    khugepaged_alloc_hugepage() internally. The only places that are required
    to update the vmstat are the callers of alloc_hugepage (callers of
    khugepaged_alloc_hugepage aren't).

    Signed-off-by: Andrea Arcangeli
    Reported-by: Johannes Weiner
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 May, 2011

2 commits

  • We don't need to hold the mmmap_sem through mem_cgroup_newpage_charge(),
    the mmap_sem is only hold for keeping the vma stable and we don't need the
    vma stable anymore after we return from alloc_hugepage_vma().

    Signed-off-by: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Straightforward conversion of anon_vma->lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

29 Apr, 2011

1 commit

  • The huge_memory.c THP page fault was allowed to run if vm_ops was null
    (which would succeed for /dev/zero MAP_PRIVATE, as the f_op->mmap wouldn't
    setup a special vma->vm_ops and it would fallback to regular anonymous
    memory) but other THP logics weren't fully activated for vmas with vm_file
    not NULL (/dev/zero has a not NULL vma->vm_file).

    So this removes the vm_file checks so that /dev/zero also can safely use
    THP (the other albeit safer approach to fix this bug would have been to
    prevent the THP initial page fault to run if vm_file was set).

    After removing the vm_file checks, this also makes huge_memory.c stricter
    in khugepaged for the DEBUG_VM=y case. It doesn't replace the vm_file
    check with a is_pfn_mapping check (but it keeps checking for VM_PFNMAP
    under VM_BUG_ON) because for a is_cow_mapping() mapping VM_PFNMAP should
    only be allowed to exist before the first page fault, and in turn when
    vma->anon_vma is null (so preventing khugepaged registration). So I tend
    to think the previous comment saying if vm_file was set, VM_PFNMAP might
    have been set and we could still be registered in khugepaged (despite
    anon_vma was not NULL to be registered in khugepaged) was too paranoid.
    The is_linear_pfn_mapping check is also I think superfluous (as described
    by comment) but under DEBUG_VM it is safe to stay.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=33682

    Signed-off-by: Andrea Arcangeli
    Reported-by: Caspar Zhang
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: [2.6.38.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

15 Apr, 2011

2 commits

  • The conventional format for boolean attributes in sysfs is numeric ("0" or
    "1" followed by new-line). Any boolean attribute can then be read and
    written using a generic function. Using the strings "yes [no]", "[yes]
    no" (read), "yes" and "no" (write) will frustrate this.

    [akpm@linux-foundation.org: use kstrtoul()]
    [akpm@linux-foundation.org: test_bit() doesn't return 1/0, per Neil]
    Signed-off-by: Ben Hutchings
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Tested-by: David Rientjes
    Cc: NeilBrown
    Cc: [2.6.38.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Hutchings
     
  • I found it difficult to make sense of transparent huge pages without
    having any counters for its actions. Add some counters to vmstat for
    allocation of transparent hugepages and fallback to smaller pages.

    Optional patch, but useful for development and understanding the system.

    Contains improvements from Andrea Arcangeli and Johannes Weiner

    [akpm@linux-foundation.org: coding-style fixes]
    [hannes@cmpxchg.org: fix vmstat_text[] entries]
    Signed-off-by: Andi Kleen
    Acked-by: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

23 Mar, 2011

1 commit

  • Pass __GFP_OTHER_NODE for transparent hugepages NUMA allocations done by the
    hugepages daemon. This way the low level accounting for local versus
    remote pages works correctly.

    Contains improvements from Andrea Arcangeli

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andi Kleen
    Cc: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

14 Mar, 2011

1 commit

  • THP's collapse_huge_page() has an understandable but ugly difference
    in when its huge page is allocated: inside if NUMA but outside if not.
    It's hardly surprising that the memcg failure path forgot that, freeing
    the page in the non-NUMA case, then hitting a VM_BUG_ON in get_page()
    (or even worse, using the freed page).

    Signed-off-by: Hugh Dickins
    Reviewed-by: Minchan Kim
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

05 Mar, 2011

3 commits

  • Pass down the correct node for a transparent hugepage allocation. Most
    callers continue to use the current node, however the hugepaged daemon
    now uses the previous node of the first to be collapsed page instead.
    This ensures that khugepaged does not mess up local memory for an
    existing process which uses local policy.

    The choice of node is somewhat primitive currently: it just uses the
    node of the first page in the pmd range. An alternative would be to
    look at multiple pages and use the most popular node. I used the
    simplest variant for now which should work well enough for the case of
    all pages being on the same node.

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Andrea Arcangeli
    Signed-off-by: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • This makes a difference for LOCAL policy, where the node cannot be
    determined from the policy itself, but has to be gotten from the original
    page.

    Acked-by: Andrea Arcangeli
    Signed-off-by: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Currently alloc_pages_vma() always uses the local node as policy node for
    the LOCAL policy. Pass this node down as an argument instead.

    No behaviour change from this patch, but will be needed for followons.

    Acked-by: Andrea Arcangeli
    Signed-off-by: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

16 Feb, 2011

1 commit

  • Transparent hugepages can only be created if rmap is fully
    functional. So we must prevent hugepages to be created while
    is_vma_temporary_stack() is true.

    This also optmizes away some harmless but unnecessary setting of
    khugepaged_scan.address and it switches some BUG_ON to VM_BUG_ON.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

12 Feb, 2011

1 commit

  • mem_cgroup_uncharge_page() should be called in all failure cases after
    mem_cgroup_charge_newpage() is called in huge_memory.c::collapse_huge_page()

    [ 4209.076861] BUG: Bad page state in process khugepaged pfn:1e9800
    [ 4209.077601] page:ffffea0006b14000 count:0 mapcount:0 mapping: (null) index:0x2800
    [ 4209.078674] page flags: 0x40000000004000(head)
    [ 4209.079294] pc:ffff880214a30000 pc->flags:2146246697418756 pc->mem_cgroup:ffffc9000177a000
    [ 4209.082177] (/A)
    [ 4209.082500] Pid: 31, comm: khugepaged Not tainted 2.6.38-rc3-mm1 #1
    [ 4209.083412] Call Trace:
    [ 4209.083678] [] ? bad_page+0xe4/0x140
    [ 4209.084240] [] ? free_pages_prepare+0xd6/0x120
    [ 4209.084837] [] ? rwsem_down_failed_common+0xbd/0x150
    [ 4209.085509] [] ? __free_pages_ok+0x32/0xe0
    [ 4209.086110] [] ? free_compound_page+0x1b/0x20
    [ 4209.086699] [] ? __put_compound_page+0x1c/0x30
    [ 4209.087333] [] ? put_compound_page+0x4d/0x200
    [ 4209.087935] [] ? put_page+0x45/0x50
    [ 4209.097361] [] ? khugepaged+0x9e9/0x1430
    [ 4209.098364] [] ? autoremove_wake_function+0x0/0x40
    [ 4209.099121] [] ? khugepaged+0x0/0x1430
    [ 4209.099780] [] ? kthread+0x96/0xa0
    [ 4209.100452] [] ? kernel_thread_helper+0x4/0x10
    [ 4209.101214] [] ? kthread+0x0/0xa0
    [ 4209.101842] [] ? kernel_thread_helper+0x0/0x10

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Reviewed-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

03 Feb, 2011

1 commit

  • When the tail page of THP is poisoned, the head page will be poisoned too.
    And the wrong address, address of head page, will be sent with sigbus
    always.

    So when the poisoned page is used by Guest OS which is running on KVM,
    after the address changing(hva->gpa) by qemu, the unexpected process on
    Guest OS will be killed by sigbus.

    What we expected is that the process using the poisoned tail page could be
    killed on Guest OS, but not that the process using the healthy head page
    is killed.

    Since it is not good to poison the healthy page, avoid poisoning other
    than the page which is really poisoned.
    (While we poison all pages in a huge page in case of hugetlb,
    we can do this for THP thanks to split_huge_page().)

    Here we fix two parts:
    1. Isolate the poisoned page only to make sure
    the reported address is the address of poisoned page.
    2. make the poisoned page work as the poisoned regular page.

    [akpm@linux-foundation.org: fix spello in comment]
    Signed-off-by: Jin Dongming
    Reviewed-by: Hidetoshi Seto
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jin Dongming
     

21 Jan, 2011

2 commits

  • Now, under THP:

    at charge:
    - PageCgroupUsed bit is set to all page_cgroup on a hugepage.
    ....set to 512 pages.
    at uncharge
    - PageCgroupUsed bit is unset on the head page.

    So, some pages will remain with "Used" bit.

    This patch fixes that Used bit is set only to the head page.
    Used bits for tail pages will be set at splitting if necessary.

    This patch adds this lock order:
    compound_lock() -> page_cgroup_move_lock().

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Two users reported THP-related crashes on 32-bit x86 machines. Their oops
    reports indicated an invalid pte, and subsequent code inspection showed
    that the highpte is actually used after unmap.

    The fix is to unmap the pte only after all operations against it are
    finished.

    Signed-off-by: Johannes Weiner
    Reported-by: Ilya Dryomov
    Reported-by: werner
    Cc: Andrea Arcangeli
    Tested-by: Ilya Dryomov
    Tested-by: Steven Rostedt
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

14 Jan, 2011

21 commits

  • MADV_HUGEPAGE and MADV_NOHUGEPAGE were fully effective only if run after
    mmap and before touching the memory. While this is enough for most
    usages, it's little effort to make madvise more dynamic at runtime on an
    existing mapping by making khugepaged aware about madvise.

    MADV_HUGEPAGE: register in khugepaged immediately without waiting a page
    fault (that may not ever happen if all pages are already mapped and the
    "enabled" knob was set to madvise during the initial page faults).

    MADV_NOHUGEPAGE: skip vmas marked VM_NOHUGEPAGE in khugepaged to stop
    collapsing pages where not needed.

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Andrea Arcangeli
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add madvise MADV_NOHUGEPAGE to mark regions that are not important to be
    hugepage backed. Return -EINVAL if the vma is not of an anonymous type,
    or the feature isn't built into the kernel. Never silently return
    success.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU
    statistics, so the Active(anon) and Inactive(anon) statistics in
    /proc/meminfo are correct.

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • On small systems, the extra memory used by the anti-fragmentation memory
    reserve and simply because huge pages are smaller than large pages can
    easily outweigh the benefits of less TLB misses.

    A less obvious concern is if run on a NUMA machine with asymmetric node
    sizes and one of them is very small. The reserve could make the node
    unusable.

    In case of the crashdump kernel, OOMs have been observed due to the
    anti-fragmentation memory reserve taking up a large fraction of the
    crashdump image.

    This patch disables transparent hugepages on systems with less than 1GB of
    RAM, but the hugepage subsystem is fully initialized so administrators can
    enable THP through /sys if desired.

    Signed-off-by: Rik van Riel
    Acked-by: Avi Kiviti
    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • It's unclear why schedule friendly kernel threads can't be taken away by
    the CPU through the scheduler itself. It's safer to stop them as they can
    trigger memory allocation, if kswapd also freezes itself to avoid
    generating I/O they have too.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • For GRU and EPT, we need gup-fast to set referenced bit too (this is why
    it's correct to return 0 when shadow_access_mask is zero, it requires
    gup-fast to set the referenced bit). qemu-kvm access already sets the
    young bit in the pte if it isn't zero-copy, if it's zero copy or a shadow
    paging EPT minor fault we relay on gup-fast to signal the page is in
    use...

    We also need to check the young bits on the secondary pagetables for NPT
    and not nested shadow mmu as the data may never get accessed again by the
    primary pte.

    Without this closer accuracy, we'd have to remove the heuristic that
    avoids collapsing hugepages in hugepage virtual regions that have not even
    a single subpage in use.

    ->test_young is full backwards compatible with GRU and other usages that
    don't have young bits in pagetables set by the hardware and that should
    nuke the secondary mmu mappings when ->clear_flush_young runs just like
    EPT does.

    Removing the heuristic that checks the young bit in
    khugepaged/collapse_huge_page completely isn't so bad either probably but
    I thought it was worth it and this makes it reliable.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Archs implementing Transparent Hugepage Support must implement a function
    called has_transparent_hugepage to be sure the virtual or physical CPU
    supports Transparent Hugepages.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • An huge pmd can only be mapped if the corresponding 2M virtual range is
    fully contained in the vma. At times the VM calls split_vma twice, if the
    first split_vma succeeds and the second fail, the first split_vma remains
    in effect and it's not rolled back. For split_vma or vma_adjust to fail
    an allocation failure is needed so it's a very unlikely event (the out of
    memory killer would normally fire before any allocation failure is visible
    to kernel and userland and if an out of memory condition happens it's
    unlikely to happen exactly here). Nevertheless it's safer to ensure that
    no huge pmd can be left around if the vma is adjusted in a way that can't
    fit hugepages anymore at the new vm_start/vm_end address.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Allow to choose between the always|madvise default for page faults and
    khugepaged at config time. madvise guarantees zero risk of higher memory
    footprint for applications (applications using madvise(MADV_HUGEPAGE)
    won't risk to use any more memory by backing their virtual regions with
    hugepages).

    Initially set the default to N and don't depend on EMBEDDED.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This tries to be more friendly to filesystem in userland, with userland
    backends that allocate memory in the I/O paths and that could deadlock if
    khugepaged holds the mmap_sem write mode of the userland backend while
    allocating memory. Memory allocation may wait for writeback I/O
    completion from the daemon that may be blocked in the mmap_sem read mode
    if a page fault happens and the daemon wasn't using mlock for the memory
    required for the I/O submission and completion.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • It's mostly a matter of replacing alloc_pages with alloc_pages_vma after
    introducing alloc_pages_vma. khugepaged needs special handling as the
    allocation has to happen inside collapse_huge_page where the vma is known
    and an error has to be returned to the outer loop to sleep
    alloc_sleep_millisecs in case of failure. But it retains the more
    efficient logic of handling allocation failures in khugepaged in case of
    CONFIG_NUMA=n.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • With memory compaction in, and lumpy-reclaim disabled, it seems safe
    enough to defrag memory during the (synchronous) transparent hugepage page
    faults (TRANSPARENT_HUGEPAGE_DEFRAG_FLAG) and not only during khugepaged
    (async) hugepage allocations that was already enabled even before memory
    compaction was in (TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG).

    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • If transparent hugepage is enabled initialize min_free_kbytes to an
    optimal value by default. This moves the hugeadm algorithm in kernel.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Natively handle huge pmds when changing page tables on behalf of
    mprotect().

    I left out update_mmu_cache() because we do not need it on x86 anyway but
    more importantly the interface works on ptes, not pmds.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Handle transparent huge page pmd entries natively instead of splitting
    them into subpages.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Add khugepaged to relocate fragmented pages into hugepages if new
    hugepages become available. (this is indipendent of the defrag logic that
    will have to make new hugepages available)

    The fundamental reason why khugepaged is unavoidable, is that some memory
    can be fragmented and not everything can be relocated. So when a virtual
    machine quits and releases gigabytes of hugepages, we want to use those
    freely available hugepages to create huge-pmd in the other virtual
    machines that may be running on fragmented memory, to maximize the CPU
    efficiency at all times. The scan is slow, it takes nearly zero cpu time,
    except when it copies data (in which case it means we definitely want to
    pay for that cpu time) so it seems a good tradeoff.

    In addition to the hugepages being released by other process releasing
    memory, we have the strong suspicion that the performance impact of
    potentially defragmenting hugepages during or before each page fault could
    lead to more performance inconsistency than allocating small pages at
    first and having them collapsed into large pages later... if they prove
    themselfs to be long lived mappings (khugepaged scan is slow so short
    lived mappings have low probability to run into khugepaged if compared to
    long lived mappings).

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add hugepage stat information to /proc/vmstat and /proc/meminfo.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add memcg charge/uncharge to hugepage faults in huge_memory.c.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add madvise MADV_HUGEPAGE to mark regions that are important to be
    hugepage backed. Return -EINVAL if the vma is not of an anonymous type,
    or the feature isn't built into the kernel. Never silently return
    success.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This documents how split_huge_page is safe vs new vma inserctions into the
    anon_vma that may have already released the anon_vma->lock but not
    established pmds yet when split_huge_page starts.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Lately I've been working to make KVM use hugepages transparently without
    the usual restrictions of hugetlbfs. Some of the restrictions I'd like to
    see removed:

    1) hugepages have to be swappable or the guest physical memory remains
    locked in RAM and can't be paged out to swap

    2) if a hugepage allocation fails, regular pages should be allocated
    instead and mixed in the same vma without any failure and without
    userland noticing

    3) if some task quits and more hugepages become available in the
    buddy, guest physical memory backed by regular pages should be
    relocated on hugepages automatically in regions under
    madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
    kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
    not null)

    4) avoidance of reservation and maximization of use of hugepages whenever
    possible. Reservation (needed to avoid runtime fatal faliures) may be ok for
    1 machine with 1 database with 1 database cache with 1 database cache size
    known at boot time. It's definitely not feasible with a virtualization
    hypervisor usage like RHEV-H that runs an unknown number of virtual machines
    with an unknown size of each virtual machine with an unknown amount of
    pagecache that could be potentially useful in the host for guest not using
    O_DIRECT (aka cache=off).

    hugepages in the virtualization hypervisor (and also in the guest!) are
    much more important than in a regular host not using virtualization,
    becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
    to 19 in case only the hypervisor uses transparent hugepages, and they
    decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
    linux hypervisor and the linux guest both uses this patch (though the
    guest will limit the addition speedup to anonymous regions only for
    now...). Even more important is that the tlb miss handler is much slower
    on a NPT/EPT guest than for a regular shadow paging or no-virtualization
    scenario. So maximizing the amount of virtual memory cached by the TLB
    pays off significantly more with NPT/EPT than without (even if there would
    be no significant speedup in the tlb-miss runtime).

    The first (and more tedious) part of this work requires allowing the VM to
    handle anonymous hugepages mixed with regular pages transparently on
    regular anonymous vmas. This is what this patch tries to achieve in the
    least intrusive possible way. We want hugepages and hugetlb to be used in
    a way so that all applications can benefit without changes (as usual we
    leverage the KVM virtualization design: by improving the Linux VM at
    large, KVM gets the performance boost too).

    The most important design choice is: always fallback to 4k allocation if
    the hugepage allocation fails! This is the _very_ opposite of some large
    pagecache patches that failed with -EIO back then if a 64k (or similar)
    allocation failed...

    Second important decision (to reduce the impact of the feature on the
    existing pagetable handling code) is that at any time we can split an
    hugepage into 512 regular pages and it has to be done with an operation
    that can't fail. This way the reliability of the swapping isn't decreased
    (no need to allocate memory when we are short on memory to swap) and it's
    trivial to plug a split_huge_page* one-liner where needed without
    polluting the VM. Over time we can teach mprotect, mremap and friends to
    handle pmd_trans_huge natively without calling split_huge_page*. The fact
    it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
    (instead of the current void) we'd need to rollback the mprotect from the
    middle of it (ideally including undoing the split_vma) which would be a
    big change and in the very wrong direction (it'd likely be simpler not to
    call split_huge_page at all and to teach mprotect and friends to handle
    hugepages instead of rolling them back from the middle). In short the
    very value of split_huge_page is that it can't fail.

    The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
    incremental and it'll just be an "harmless" addition later if this initial
    part is agreed upon. It also should be noted that locking-wise replacing
    regular pages with hugepages is going to be very easy if compared to what
    I'm doing below in split_huge_page, as it will only happen when
    page_count(page) matches page_mapcount(page) if we can take the PG_lock
    and mmap_sem in write mode. collapse_huge_page will be a "best effort"
    that (unlike split_huge_page) can fail at the minimal sign of trouble and
    we can try again later. collapse_huge_page will be similar to how KSM
    works and the madvise(MADV_HUGEPAGE) will work similar to
    madvise(MADV_MERGEABLE).

    The default I like is that transparent hugepages are used at page fault
    time. This can be changed with
    /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set
    to three values "always", "madvise", "never" which mean respectively that
    hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
    or never used. /sys/kernel/mm/transparent_hugepage/defrag instead
    controls if the hugepage allocation should defrag memory aggressively
    "always", only inside "madvise" regions, or "never".

    The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
    put_page (from get_user_page users that can't use mmu notifier like
    O_DIRECT) that runs against a __split_huge_page_refcount instead was a
    pain to serialize in a way that would result always in a coherent page
    count for both tail and head. I think my locking solution with a
    compound_lock taken only after the page_first is valid and is still a
    PageHead should be safe but it surely needs review from SMP race point of
    view. In short there is no current existing way to serialize the O_DIRECT
    final put_page against split_huge_page_refcount so I had to invent a new
    one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
    returns so...). And I didn't want to impact all gup/gup_fast users for
    now, maybe if we change the gup interface substantially we can avoid this
    locking, I admit I didn't think too much about it because changing the gup
    unpinning interface would be invasive.

    If we ignored O_DIRECT we could stick to the existing compound refcounting
    code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
    (and any other mmu notifier user) would call it without FOLL_GET (and if
    FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
    current task mmu notifier list yet). But O_DIRECT is fundamental for
    decent performance of virtualized I/O on fast storage so we can't avoid it
    to solve the race of put_page against split_huge_page_refcount to achieve
    a complete hugepage feature for KVM.

    Swap and oom works fine (well just like with regular pages ;). MMU
    notifier is handled transparently too, with the exception of the young bit
    on the pmd, that didn't have a range check but I think KVM will be fine
    because the whole point of hugepages is that EPT/NPT will also use a huge
    pmd when they notice gup returns pages with PageCompound set, so they
    won't care of a range and there's just the pmd young bit to check in that
    case.

    NOTE: in some cases if the L2 cache is small, this may slowdown and waste
    memory during COWs because 4M of memory are accessed in a single fault
    instead of 8k (the payoff is that after COW the program can run faster).
    So we might want to switch the copy_huge_page (and clear_huge_page too) to
    not temporal stores. I also extensively researched ways to avoid this
    cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k
    up to 1M (I can send those patches that fully implemented prefault) but I
    concluded they're not worth it and they add an huge additional complexity
    and they remove all tlb benefits until the full hugepage has been faulted
    in, to save a little bit of memory and some cache during app startup, but
    they still don't improve substantially the cache-trashing during startup
    if the prefault happens in >4k chunks. One reason is that those 4k pte
    entries copied are still mapped on a perfectly cache-colored hugepage, so
    the trashing is the worst one can generate in those copies (cow of 4k page
    copies aren't so well colored so they trashes less, but again this results
    in software running faster after the page fault). Those prefault patches
    allowed things like a pte where post-cow pages were local 4k regular anon
    pages and the not-yet-cowed pte entries were pointing in the middle of
    some hugepage mapped read-only. If it doesn't payoff substantially with
    todays hardware it will payoff even less in the future with larger l2
    caches, and the prefault logic would blot the VM a lot. If one is
    emebdded transparent_hugepage can be disabled during boot with sysfs or
    with the boot commandline parameter transparent_hugepage=0 (or
    transparent_hugepage=2 to restrict hugepages inside madvise regions) that
    will ensure not a single hugepage is allocated at boot time. It is simple
    enough to just disable transparent hugepage globally and let transparent
    hugepages be allocated selectively by applications in the MADV_HUGEPAGE
    region (both at page fault time, and if enabled with the
    collapse_huge_page too through the kernel daemon).

    This patch supports only hugepages mapped in the pmd, archs that have
    smaller hugepages will not fit in this patch alone. Also some archs like
    power have certain tlb limits that prevents mixing different page size in
    the same regions so they will not fit in this framework that requires
    "graceful fallback" to basic PAGE_SIZE in case of physical memory
    fragmentation. hugetlbfs remains a perfect fit for those because its
    software limits happen to match the hardware limits. hugetlbfs also
    remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped
    to be found not fragmented after a certain system uptime and that would be
    very expensive to defragment with relocation, so requiring reservation.
    hugetlbfs is the "reservation way", the point of transparent hugepages is
    not to have any reservation at all and maximizing the use of cache and
    hugepages at all times automatically.

    Some performance result:

    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep
    ages3
    memset page fault 1566023
    memset tlb miss 453854
    memset second tlb miss 453321
    random access tlb miss 41635
    random access second tlb miss 41658
    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
    memset page fault 1566471
    memset tlb miss 453375
    memset second tlb miss 453320
    random access tlb miss 41636
    random access second tlb miss 41637
    vmx andrea # ./largepages3
    memset page fault 1566642
    memset tlb miss 453417
    memset second tlb miss 453313
    random access tlb miss 41630
    random access second tlb miss 41647
    vmx andrea # ./largepages3
    memset page fault 1566872
    memset tlb miss 453418
    memset second tlb miss 453315
    random access tlb miss 41618
    random access second tlb miss 41659
    vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
    vmx andrea # ./largepages3
    memset page fault 2182476
    memset tlb miss 460305
    memset second tlb miss 460179
    random access tlb miss 44483
    random access second tlb miss 44186
    vmx andrea # ./largepages3
    memset page fault 2182791
    memset tlb miss 460742
    memset second tlb miss 459962
    random access tlb miss 43981
    random access second tlb miss 43988

    ============
    #include
    #include
    #include
    #include

    #define SIZE (3UL*1024*1024*1024)

    int main()
    {
    char *p = malloc(SIZE), *p2;
    struct timeval before, after;

    gettimeofday(&before, NULL);
    memset(p, 0, SIZE);
    gettimeofday(&after, NULL);
    printf("memset page fault %Lu\n",
    (after.tv_sec-before.tv_sec)*1000000UL +
    after.tv_usec-before.tv_usec);

    gettimeofday(&before, NULL);
    memset(p, 0, SIZE);
    gettimeofday(&after, NULL);
    printf("memset tlb miss %Lu\n",
    (after.tv_sec-before.tv_sec)*1000000UL +
    after.tv_usec-before.tv_usec);

    gettimeofday(&before, NULL);
    memset(p, 0, SIZE);
    gettimeofday(&after, NULL);
    printf("memset second tlb miss %Lu\n",
    (after.tv_sec-before.tv_sec)*1000000UL +
    after.tv_usec-before.tv_usec);

    gettimeofday(&before, NULL);
    for (p2 = p; p2 < p+SIZE; p2 += 4096)
    *p2 = 0;
    gettimeofday(&after, NULL);
    printf("random access tlb miss %Lu\n",
    (after.tv_sec-before.tv_sec)*1000000UL +
    after.tv_usec-before.tv_usec);

    gettimeofday(&before, NULL);
    for (p2 = p; p2 < p+SIZE; p2 += 4096)
    *p2 = 0;
    gettimeofday(&after, NULL);
    printf("random access second tlb miss %Lu\n",
    (after.tv_sec-before.tv_sec)*1000000UL +
    after.tv_usec-before.tv_usec);

    return 0;
    }
    ============

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli