Commit 71e3aac0724ffe8918992d76acfe3aad7d8724a5

Authored by Andrea Arcangeli
Committed by Linus Torvalds
1 parent 5c3240d92e

thp: transparent hugepage core

Lately I've been working to make KVM use hugepages transparently without
the usual restrictions of hugetlbfs.  Some of the restrictions I'd like to
see removed:

1) hugepages have to be swappable or the guest physical memory remains
   locked in RAM and can't be paged out to swap

2) if a hugepage allocation fails, regular pages should be allocated
   instead and mixed in the same vma without any failure and without
   userland noticing

3) if some task quits and more hugepages become available in the
   buddy, guest physical memory backed by regular pages should be
   relocated onto hugepages automatically in regions under
   madvise(MADV_HUGEPAGE) (ideally event-driven, by waking up the
   kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
   non-empty)

4) avoidance of reservation and maximization of use of hugepages whenever
   possible. Reservation (needed to avoid runtime fatal failures) may be ok for
   1 machine with 1 database with 1 database cache with 1 database cache size
   known at boot time. It's definitely not feasible for a virtualization
   hypervisor like RHEV-H that runs an unknown number of virtual machines,
   each of unknown size, with an unknown amount of pagecache that could be
   potentially useful in the host for guests not using O_DIRECT (aka
   cache=off).

Hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization,
because with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
to 19 if only the hypervisor uses transparent hugepages, and from 19 to 15
if both the Linux hypervisor and the Linux guest use this patch (though
the guest will limit the additional speedup to anonymous regions only for
now...).  Even more important is that the tlb miss handler is much slower
on an NPT/EPT guest than in a regular shadow-paging or no-virtualization
scenario.  So maximizing the amount of virtual memory cached by the TLB
pays off significantly more with NPT/EPT than without (even if there were
no significant speedup in the tlb-miss runtime itself).
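
(A plausible way to read those numbers, assuming 4-level paging on both
sides and counting only the page-walk cacheline accesses of a nested
NPT/EPT walk, is (guest levels + 1) * (host levels + 1) - 1:

	4/4 levels:                  5 * 5 - 1 = 24
	host hugepages (3 levels):   5 * 4 - 1 = 19
	guest + host hugepages:      4 * 4 - 1 = 15

This breakdown is my reading of the figures, not something spelled out in
the patch itself.)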

The first (and more tedious) part of this work requires allowing the VM to
handle anonymous hugepages mixed with regular pages transparently on
regular anonymous vmas.  This is what this patch tries to achieve in the
least intrusive way possible.  We want hugepages and hugetlb to be used in
such a way that all applications can benefit without changes (as usual we
leverage the KVM virtualization design: by improving the Linux VM at
large, KVM gets the performance boost too).

The most important design choice is: always fall back to 4k allocation if
the hugepage allocation fails!  This is the _very_ opposite of some large
pagecache patches that failed with -EIO back then if a 64k (or similar)
allocation failed...
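
As a concrete illustration, this is how the new anonymous fault path
behaves (condensed from do_huge_pmd_anonymous_page() in the
mm/huge_memory.c part of the diff below; error handling and the huge-pmd
retry check are omitted here for brevity):

	if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
		page = alloc_hugepage(transparent_hugepage_defrag(vma));
		if (page)
			return __do_huge_pmd_anonymous_page(mm, vma, haddr,
							    pmd, page);
		/* hugepage allocation failed: fall back to regular ptes */
	}
	if (unlikely(__pte_alloc(mm, vma, pmd, address)))
		return VM_FAULT_OOM;
	pte = pte_offset_map(pmd, address);
	return handle_pte_fault(mm, vma, address, pte, pmd, flags);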

The second important decision (to reduce the impact of the feature on the
existing pagetable handling code) is that at any time we can split a
hugepage into 512 regular pages, and that this has to be done with an
operation that can't fail.  This way the reliability of swapping isn't
decreased (no need to allocate memory when we are short on memory to swap)
and it's trivial to plug a split_huge_page* one-liner where needed without
polluting the VM; see the sketch after this paragraph.  Over time we can
teach mprotect, mremap and friends to handle pmd_trans_huge natively
without calling split_huge_page*.  The fact it can't fail isn't just for
swap: if split_huge_page returned -ENOMEM (instead of the current void)
we'd need to roll back the mprotect from the middle of it (ideally
including undoing the split_vma), which would be a big change and in the
very wrong direction (it'd likely be simpler not to call split_huge_page
at all and to teach mprotect and friends to handle hugepages instead of
rolling them back from the middle).  In short, the very value of
split_huge_page is that it can't fail.
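
The "one-liner" plumbing looks like the zap path in mm/memory.c further
down: a pagetable walker that isn't huge-pmd aware simply splits the pmd
back to ptes before proceeding (condensed from the zap_pmd_range() hunk in
the diff below):

	if (pmd_trans_huge(*pmd)) {
		if (next - addr != HPAGE_PMD_SIZE)
			split_huge_page_pmd(vma->vm_mm, pmd);
		else if (zap_huge_pmd(tlb, vma, pmd))
			continue;	/* zapped the whole huge pmd natively */
		/* otherwise fall through to the regular pte loop */
	}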

The collapsing and madvise(MADV_HUGEPAGE) part will remain separate and
incremental, and it'll just be a "harmless" addition later if this initial
part is agreed upon.  It should also be noted that, locking-wise,
replacing regular pages with hugepages is going to be very easy compared
to what I'm doing below in split_huge_page, as it will only happen when
page_count(page) matches page_mapcount(page) and we can take the PG_lock
and mmap_sem in write mode.  collapse_huge_page will be a "best effort"
that (unlike split_huge_page) can fail at the slightest sign of trouble,
and we can try again later.  collapse_huge_page will be similar to how KSM
works and madvise(MADV_HUGEPAGE) will work similarly to
madvise(MADV_MERGEABLE).

The default I like is that transparent hugepages are used at page fault
time.  This can be changed with
/sys/kernel/mm/transparent_hugepage/enabled.  The control knob can be set
to one of three values, "always", "madvise" or "never", which mean
respectively that hugepages are always used, used only inside
madvise(MADV_HUGEPAGE) regions, or never used.
/sys/kernel/mm/transparent_hugepage/defrag instead controls whether the
hugepage allocation should defrag memory aggressively "always", only
inside "madvise" regions, or "never".

The pmd_trans_splitting/pmd_trans_huge locking is very solid.  The
put_page (from get_user_pages users that can't use the mmu notifier, like
O_DIRECT) that runs against a __split_huge_page_refcount instead was a
pain to serialize in a way that would always result in a coherent page
count for both tail and head.  I think my locking solution, with a
compound_lock taken only after first_page is valid and is still a
PageHead, should be safe, but it surely needs review from an SMP race
point of view.  In short there is no existing way to serialize the
O_DIRECT final put_page against __split_huge_page_refcount, so I had to
invent a new one (O_DIRECT loses knowledge of the mapping status by the
time gup_fast returns so...).  And I didn't want to impact all
gup/gup_fast users for now; maybe if we change the gup interface
substantially we can avoid this locking.  I admit I didn't think too much
about it because changing the gup unpinning interface would be invasive.

If we ignored O_DIRECT we could stick to the existing compound refcounting
code, by simply adding a get_user_pages_fast_flags(foll_flags) which KVM
(and any other mmu notifier user) would call without FOLL_GET (and if
FOLL_GET isn't set we'd just BUG_ON if nobody has registered itself in the
current task's mmu notifier list yet).  But O_DIRECT is fundamental for
decent performance of virtualized I/O on fast storage, so we can't ignore
it, and the race of put_page against split_huge_page_refcount has to be
solved to achieve a complete hugepage feature for KVM.
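
For clarity, the hypothetical interface described above would look roughly
like the declaration below; this is just my sketch of the idea in the
paragraph, not part of this patch or of the tree:

	/*
	 * Hypothetical gup_fast variant taking explicit foll_flags: an mmu
	 * notifier user like KVM could pass flags without FOLL_GET and so
	 * never hold an elevated page count across split_huge_page.
	 */
	int get_user_pages_fast_flags(unsigned long start, int nr_pages,
				      unsigned int foll_flags,
				      struct page **pages);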

Swap and oom work fine (well, just like with regular pages ;).  MMU
notifier is handled transparently too, with the exception of the young bit
on the pmd, which didn't have a range check; but I think KVM will be fine
because the whole point of hugepages is that EPT/NPT will also use a huge
pmd when they notice gup returns pages with PageCompound set, so they
won't care about a range and there's just the pmd young bit to check in
that case.

NOTE: in some cases, if the L2 cache is small, this may slow things down
and waste memory during COWs because 4M of memory are accessed in a single
fault instead of 8k (the payoff is that after COW the program can run
faster).  So we might want to switch copy_huge_page (and clear_huge_page
too) to non-temporal stores.  I also extensively researched ways to avoid
this cache thrashing with a full prefault logic that would cow in
8k/16k/32k/64k up to 1M (I can send those patches that fully implemented
prefault), but I concluded they're not worth it: they add a huge amount of
additional complexity and they remove all tlb benefits until the full
hugepage has been faulted in, all to save a little bit of memory and some
cache during app startup, and they still don't substantially improve the
cache thrashing during startup if the prefault happens in >4k chunks.  One
reason is that the copied 4k pte entries are still mapped on a perfectly
cache-colored hugepage, so the thrashing is the worst one can generate in
those copies (cow of 4k page copies aren't so well colored so they thrash
less, but again this results in software running faster after the page
fault).  Those prefault patches allowed things like a pte where post-cow
pages were local 4k regular anon pages and the not-yet-cowed pte entries
were pointing into the middle of some hugepage mapped read-only.  If it
doesn't pay off substantially with today's hardware it will pay off even
less in the future with larger L2 caches, and the prefault logic would
bloat the VM a lot.  On embedded systems, transparent hugepages can be
disabled during boot with sysfs or with the boot commandline parameter
transparent_hugepage=never (or transparent_hugepage=madvise to restrict
hugepages to madvise regions), which will ensure not a single hugepage is
allocated at boot time.  It is simple enough to just disable transparent
hugepages globally and let them be allocated selectively by applications
in MADV_HUGEPAGE regions (both at page fault time, and if enabled, through
collapse_huge_page in the kernel daemon too).

This patch supports only hugepages mapped in the pmd; archs that have
smaller hugepages will not fit in this patch alone.  Also some archs like
power have certain tlb limits that prevent mixing different page sizes in
the same region, so they will not fit in this framework, which requires
"graceful fallback" to basic PAGE_SIZE in case of physical memory
fragmentation.  hugetlbfs remains a perfect fit for those because its
software limits happen to match the hardware limits.  hugetlbfs also
remains a perfect fit for hugepage sizes like 1GByte, which cannot be
hoped to be found unfragmented after a certain system uptime and would be
very expensive to defragment with relocation, so they require reservation.
hugetlbfs is the "reservation way"; the point of transparent hugepages is
not to have any reservation at all and to maximize the use of cache and
hugepages at all times automatically.

Some performance results:

vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988

============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (3UL*1024*1024*1024)

int main(void)
{
	char *p = malloc(SIZE), *p2;
	struct timeval before, after;

	if (!p)
		return 1;

	/* first pass: the page faults populate the (hopefully huge) mappings */
	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset page fault %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	/* second and third pass: memory is already mapped, only tlb misses */
	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset second tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	/* touch one byte per 4k page: the cost is dominated by tlb misses */
	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access second tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	return 0;
}
============
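
(The program was presumably built with something along the lines of
"gcc -O2 -o largepages3 largepages3.c" and run twice with libhugetlbfs
preloaded and twice without, as in the session above; the exact build
flags are not part of the commit.)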

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 14 changed files with 1220 additions and 35 deletions

arch/x86/include/asm/pgtable_64.h
... ... @@ -286,6 +286,11 @@
286 286 return pmd_set_flags(pmd, _PAGE_RW);
287 287 }
288 288  
  289 +static inline pmd_t pmd_mknotpresent(pmd_t pmd)
  290 +{
  291 + return pmd_clear_flags(pmd, _PAGE_PRESENT);
  292 +}
  293 +
289 294 #endif /* !__ASSEMBLY__ */
290 295  
291 296 #endif /* _ASM_X86_PGTABLE_64_H */
include/linux/gfp.h
... ... @@ -109,6 +109,9 @@
109 109 __GFP_HARDWALL | __GFP_HIGHMEM | \
110 110 __GFP_MOVABLE)
111 111 #define GFP_IOFS (__GFP_IO | __GFP_FS)
  112 +#define GFP_TRANSHUGE (GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
  113 + __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
  114 + __GFP_NO_KSWAPD)
112 115  
113 116 #ifdef CONFIG_NUMA
114 117 #define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
include/linux/huge_mm.h
  1 +#ifndef _LINUX_HUGE_MM_H
  2 +#define _LINUX_HUGE_MM_H
  3 +
  4 +extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
  5 + struct vm_area_struct *vma,
  6 + unsigned long address, pmd_t *pmd,
  7 + unsigned int flags);
  8 +extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
  9 + pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
  10 + struct vm_area_struct *vma);
  11 +extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
  12 + unsigned long address, pmd_t *pmd,
  13 + pmd_t orig_pmd);
  14 +extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
  15 +extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
  16 + unsigned long addr,
  17 + pmd_t *pmd,
  18 + unsigned int flags);
  19 +extern int zap_huge_pmd(struct mmu_gather *tlb,
  20 + struct vm_area_struct *vma,
  21 + pmd_t *pmd);
  22 +
  23 +enum transparent_hugepage_flag {
  24 + TRANSPARENT_HUGEPAGE_FLAG,
  25 + TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
  26 + TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
  27 + TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
  28 +#ifdef CONFIG_DEBUG_VM
  29 + TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
  30 +#endif
  31 +};
  32 +
  33 +enum page_check_address_pmd_flag {
  34 + PAGE_CHECK_ADDRESS_PMD_FLAG,
  35 + PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
  36 + PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
  37 +};
  38 +extern pmd_t *page_check_address_pmd(struct page *page,
  39 + struct mm_struct *mm,
  40 + unsigned long address,
  41 + enum page_check_address_pmd_flag flag);
  42 +
  43 +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
  44 +#define HPAGE_PMD_SHIFT HPAGE_SHIFT
  45 +#define HPAGE_PMD_MASK HPAGE_MASK
  46 +#define HPAGE_PMD_SIZE HPAGE_SIZE
  47 +
  48 +#define transparent_hugepage_enabled(__vma) \
  49 + (transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) || \
  50 + (transparent_hugepage_flags & \
  51 + (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
  52 + (__vma)->vm_flags & VM_HUGEPAGE))
  53 +#define transparent_hugepage_defrag(__vma) \
  54 + ((transparent_hugepage_flags & \
  55 + (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) || \
  56 + (transparent_hugepage_flags & \
  57 + (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) && \
  58 + (__vma)->vm_flags & VM_HUGEPAGE))
  59 +#ifdef CONFIG_DEBUG_VM
  60 +#define transparent_hugepage_debug_cow() \
  61 + (transparent_hugepage_flags & \
  62 + (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
  63 +#else /* CONFIG_DEBUG_VM */
  64 +#define transparent_hugepage_debug_cow() 0
  65 +#endif /* CONFIG_DEBUG_VM */
  66 +
  67 +extern unsigned long transparent_hugepage_flags;
  68 +extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
  69 + pmd_t *dst_pmd, pmd_t *src_pmd,
  70 + struct vm_area_struct *vma,
  71 + unsigned long addr, unsigned long end);
  72 +extern int handle_pte_fault(struct mm_struct *mm,
  73 + struct vm_area_struct *vma, unsigned long address,
  74 + pte_t *pte, pmd_t *pmd, unsigned int flags);
  75 +extern int split_huge_page(struct page *page);
  76 +extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
  77 +#define split_huge_page_pmd(__mm, __pmd) \
  78 + do { \
  79 + pmd_t *____pmd = (__pmd); \
  80 + if (unlikely(pmd_trans_huge(*____pmd))) \
  81 + __split_huge_page_pmd(__mm, ____pmd); \
  82 + } while (0)
  83 +#define wait_split_huge_page(__anon_vma, __pmd) \
  84 + do { \
  85 + pmd_t *____pmd = (__pmd); \
  86 + spin_unlock_wait(&(__anon_vma)->root->lock); \
  87 + /* \
  88 + * spin_unlock_wait() is just a loop in C and so the \
  89 + * CPU can reorder anything around it. \
  90 + */ \
  91 + smp_mb(); \
  92 + BUG_ON(pmd_trans_splitting(*____pmd) || \
  93 + pmd_trans_huge(*____pmd)); \
  94 + } while (0)
  95 +#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
  96 +#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
  97 +#if HPAGE_PMD_ORDER > MAX_ORDER
  98 +#error "hugepages can't be allocated by the buddy allocator"
  99 +#endif
  100 +#else /* CONFIG_TRANSPARENT_HUGEPAGE */
  101 +#define HPAGE_PMD_SHIFT ({ BUG(); 0; })
  102 +#define HPAGE_PMD_MASK ({ BUG(); 0; })
  103 +#define HPAGE_PMD_SIZE ({ BUG(); 0; })
  104 +
  105 +#define transparent_hugepage_enabled(__vma) 0
  106 +
  107 +#define transparent_hugepage_flags 0UL
  108 +static inline int split_huge_page(struct page *page)
  109 +{
  110 + return 0;
  111 +}
  112 +#define split_huge_page_pmd(__mm, __pmd) \
  113 + do { } while (0)
  114 +#define wait_split_huge_page(__anon_vma, __pmd) \
  115 + do { } while (0)
  116 +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
  117 +
  118 +#endif /* _LINUX_HUGE_MM_H */
include/linux/mm.h
... ... @@ -111,6 +111,9 @@
111 111 #define VM_SAO 0x20000000 /* Strong Access Ordering (powerpc) */
112 112 #define VM_PFN_AT_MMAP 0x40000000 /* PFNMAP vma that is fully mapped at mmap time */
113 113 #define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
  114 +#if BITS_PER_LONG > 32
  115 +#define VM_HUGEPAGE 0x100000000UL /* MADV_HUGEPAGE marked this vma */
  116 +#endif
114 117  
115 118 /* Bits set in the VMA until the stack is in its final location */
116 119 #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)
... ... @@ -243,6 +246,7 @@
243 246 * files which need it (119 of them)
244 247 */
245 248 #include <linux/page-flags.h>
  249 +#include <linux/huge_mm.h>
246 250  
247 251 /*
248 252 * Methods to modify the page usage count.
include/linux/mm_inline.h
... ... @@ -20,11 +20,18 @@
20 20 }
21 21  
22 22 static inline void
23   -add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
  23 +__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
  24 + struct list_head *head)
24 25 {
25   - list_add(&page->lru, &zone->lru[l].list);
  26 + list_add(&page->lru, head);
26 27 __inc_zone_state(zone, NR_LRU_BASE + l);
27 28 mem_cgroup_add_lru_list(page, l);
  29 +}
  30 +
  31 +static inline void
  32 +add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
  33 +{
  34 + __add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
28 35 }
29 36  
30 37 static inline void
include/linux/page-flags.h
... ... @@ -410,11 +410,32 @@
410 410 #endif /* !PAGEFLAGS_EXTENDED */
411 411  
412 412 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
  413 +/*
  414 + * PageHuge() only returns true for hugetlbfs pages, but not for
  415 + * normal or transparent huge pages.
  416 + *
  417 + * PageTransHuge() returns true for both transparent huge and
  418 + * hugetlbfs pages, but not normal pages. PageTransHuge() can only be
  419 + * called only in the core VM paths where hugetlbfs pages can't exist.
  420 + */
  421 +static inline int PageTransHuge(struct page *page)
  422 +{
  423 + VM_BUG_ON(PageTail(page));
  424 + return PageHead(page);
  425 +}
  426 +
413 427 static inline int PageTransCompound(struct page *page)
414 428 {
415 429 return PageCompound(page);
416 430 }
  431 +
417 432 #else
  433 +
  434 +static inline int PageTransHuge(struct page *page)
  435 +{
  436 + return 0;
  437 +}
  438 +
418 439 static inline int PageTransCompound(struct page *page)
419 440 {
420 441 return 0;
include/linux/rmap.h
... ... @@ -198,6 +198,8 @@
198 198 };
199 199 #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
200 200  
  201 +bool is_vma_temporary_stack(struct vm_area_struct *vma);
  202 +
201 203 int try_to_unmap(struct page *, enum ttu_flags flags);
202 204 int try_to_unmap_one(struct page *, struct vm_area_struct *,
203 205 unsigned long address, enum ttu_flags flags);
include/linux/swap.h
... ... @@ -208,6 +208,8 @@
208 208 /* linux/mm/swap.c */
209 209 extern void __lru_cache_add(struct page *, enum lru_list lru);
210 210 extern void lru_cache_add_lru(struct page *, enum lru_list lru);
  211 +extern void lru_add_page_tail(struct zone* zone,
  212 + struct page *page, struct page *page_tail);
211 213 extern void activate_page(struct page *);
212 214 extern void mark_page_accessed(struct page *);
213 215 extern void lru_add_drain(void);
mm/Makefile
... ... @@ -37,6 +37,7 @@
37 37 obj-$(CONFIG_FS_XIP) += filemap_xip.o
38 38 obj-$(CONFIG_MIGRATION) += migrate.o
39 39 obj-$(CONFIG_QUICKLIST) += quicklist.o
  40 +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
40 41 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
41 42 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
42 43 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
mm/huge_memory.c
  1 +/*
  2 + * Copyright (C) 2009 Red Hat, Inc.
  3 + *
  4 + * This work is licensed under the terms of the GNU GPL, version 2. See
  5 + * the COPYING file in the top-level directory.
  6 + */
  7 +
  8 +#include <linux/mm.h>
  9 +#include <linux/sched.h>
  10 +#include <linux/highmem.h>
  11 +#include <linux/hugetlb.h>
  12 +#include <linux/mmu_notifier.h>
  13 +#include <linux/rmap.h>
  14 +#include <linux/swap.h>
  15 +#include <asm/tlb.h>
  16 +#include <asm/pgalloc.h>
  17 +#include "internal.h"
  18 +
  19 +unsigned long transparent_hugepage_flags __read_mostly =
  20 + (1<<TRANSPARENT_HUGEPAGE_FLAG);
  21 +
  22 +#ifdef CONFIG_SYSFS
  23 +static ssize_t double_flag_show(struct kobject *kobj,
  24 + struct kobj_attribute *attr, char *buf,
  25 + enum transparent_hugepage_flag enabled,
  26 + enum transparent_hugepage_flag req_madv)
  27 +{
  28 + if (test_bit(enabled, &transparent_hugepage_flags)) {
  29 + VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags));
  30 + return sprintf(buf, "[always] madvise never\n");
  31 + } else if (test_bit(req_madv, &transparent_hugepage_flags))
  32 + return sprintf(buf, "always [madvise] never\n");
  33 + else
  34 + return sprintf(buf, "always madvise [never]\n");
  35 +}
  36 +static ssize_t double_flag_store(struct kobject *kobj,
  37 + struct kobj_attribute *attr,
  38 + const char *buf, size_t count,
  39 + enum transparent_hugepage_flag enabled,
  40 + enum transparent_hugepage_flag req_madv)
  41 +{
  42 + if (!memcmp("always", buf,
  43 + min(sizeof("always")-1, count))) {
  44 + set_bit(enabled, &transparent_hugepage_flags);
  45 + clear_bit(req_madv, &transparent_hugepage_flags);
  46 + } else if (!memcmp("madvise", buf,
  47 + min(sizeof("madvise")-1, count))) {
  48 + clear_bit(enabled, &transparent_hugepage_flags);
  49 + set_bit(req_madv, &transparent_hugepage_flags);
  50 + } else if (!memcmp("never", buf,
  51 + min(sizeof("never")-1, count))) {
  52 + clear_bit(enabled, &transparent_hugepage_flags);
  53 + clear_bit(req_madv, &transparent_hugepage_flags);
  54 + } else
  55 + return -EINVAL;
  56 +
  57 + return count;
  58 +}
  59 +
  60 +static ssize_t enabled_show(struct kobject *kobj,
  61 + struct kobj_attribute *attr, char *buf)
  62 +{
  63 + return double_flag_show(kobj, attr, buf,
  64 + TRANSPARENT_HUGEPAGE_FLAG,
  65 + TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
  66 +}
  67 +static ssize_t enabled_store(struct kobject *kobj,
  68 + struct kobj_attribute *attr,
  69 + const char *buf, size_t count)
  70 +{
  71 + return double_flag_store(kobj, attr, buf, count,
  72 + TRANSPARENT_HUGEPAGE_FLAG,
  73 + TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
  74 +}
  75 +static struct kobj_attribute enabled_attr =
  76 + __ATTR(enabled, 0644, enabled_show, enabled_store);
  77 +
  78 +static ssize_t single_flag_show(struct kobject *kobj,
  79 + struct kobj_attribute *attr, char *buf,
  80 + enum transparent_hugepage_flag flag)
  81 +{
  82 + if (test_bit(flag, &transparent_hugepage_flags))
  83 + return sprintf(buf, "[yes] no\n");
  84 + else
  85 + return sprintf(buf, "yes [no]\n");
  86 +}
  87 +static ssize_t single_flag_store(struct kobject *kobj,
  88 + struct kobj_attribute *attr,
  89 + const char *buf, size_t count,
  90 + enum transparent_hugepage_flag flag)
  91 +{
  92 + if (!memcmp("yes", buf,
  93 + min(sizeof("yes")-1, count))) {
  94 + set_bit(flag, &transparent_hugepage_flags);
  95 + } else if (!memcmp("no", buf,
  96 + min(sizeof("no")-1, count))) {
  97 + clear_bit(flag, &transparent_hugepage_flags);
  98 + } else
  99 + return -EINVAL;
  100 +
  101 + return count;
  102 +}
  103 +
  104 +/*
  105 + * Currently defrag only disables __GFP_NOWAIT for allocation. A blind
  106 + * __GFP_REPEAT is too aggressive, it's never worth swapping tons of
  107 + * memory just to allocate one more hugepage.
  108 + */
  109 +static ssize_t defrag_show(struct kobject *kobj,
  110 + struct kobj_attribute *attr, char *buf)
  111 +{
  112 + return double_flag_show(kobj, attr, buf,
  113 + TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
  114 + TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
  115 +}
  116 +static ssize_t defrag_store(struct kobject *kobj,
  117 + struct kobj_attribute *attr,
  118 + const char *buf, size_t count)
  119 +{
  120 + return double_flag_store(kobj, attr, buf, count,
  121 + TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
  122 + TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
  123 +}
  124 +static struct kobj_attribute defrag_attr =
  125 + __ATTR(defrag, 0644, defrag_show, defrag_store);
  126 +
  127 +#ifdef CONFIG_DEBUG_VM
  128 +static ssize_t debug_cow_show(struct kobject *kobj,
  129 + struct kobj_attribute *attr, char *buf)
  130 +{
  131 + return single_flag_show(kobj, attr, buf,
  132 + TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
  133 +}
  134 +static ssize_t debug_cow_store(struct kobject *kobj,
  135 + struct kobj_attribute *attr,
  136 + const char *buf, size_t count)
  137 +{
  138 + return single_flag_store(kobj, attr, buf, count,
  139 + TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
  140 +}
  141 +static struct kobj_attribute debug_cow_attr =
  142 + __ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
  143 +#endif /* CONFIG_DEBUG_VM */
  144 +
  145 +static struct attribute *hugepage_attr[] = {
  146 + &enabled_attr.attr,
  147 + &defrag_attr.attr,
  148 +#ifdef CONFIG_DEBUG_VM
  149 + &debug_cow_attr.attr,
  150 +#endif
  151 + NULL,
  152 +};
  153 +
  154 +static struct attribute_group hugepage_attr_group = {
  155 + .attrs = hugepage_attr,
  156 + .name = "transparent_hugepage",
  157 +};
  158 +#endif /* CONFIG_SYSFS */
  159 +
  160 +static int __init hugepage_init(void)
  161 +{
  162 +#ifdef CONFIG_SYSFS
  163 + int err;
  164 +
  165 + err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
  166 + if (err)
  167 + printk(KERN_ERR "hugepage: register sysfs failed\n");
  168 +#endif
  169 + return 0;
  170 +}
  171 +module_init(hugepage_init)
  172 +
  173 +static int __init setup_transparent_hugepage(char *str)
  174 +{
  175 + int ret = 0;
  176 + if (!str)
  177 + goto out;
  178 + if (!strcmp(str, "always")) {
  179 + set_bit(TRANSPARENT_HUGEPAGE_FLAG,
  180 + &transparent_hugepage_flags);
  181 + clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
  182 + &transparent_hugepage_flags);
  183 + ret = 1;
  184 + } else if (!strcmp(str, "madvise")) {
  185 + clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
  186 + &transparent_hugepage_flags);
  187 + set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
  188 + &transparent_hugepage_flags);
  189 + ret = 1;
  190 + } else if (!strcmp(str, "never")) {
  191 + clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
  192 + &transparent_hugepage_flags);
  193 + clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
  194 + &transparent_hugepage_flags);
  195 + ret = 1;
  196 + }
  197 +out:
  198 + if (!ret)
  199 + printk(KERN_WARNING
  200 + "transparent_hugepage= cannot parse, ignored\n");
  201 + return ret;
  202 +}
  203 +__setup("transparent_hugepage=", setup_transparent_hugepage);
  204 +
  205 +static void prepare_pmd_huge_pte(pgtable_t pgtable,
  206 + struct mm_struct *mm)
  207 +{
  208 + assert_spin_locked(&mm->page_table_lock);
  209 +
  210 + /* FIFO */
  211 + if (!mm->pmd_huge_pte)
  212 + INIT_LIST_HEAD(&pgtable->lru);
  213 + else
  214 + list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
  215 + mm->pmd_huge_pte = pgtable;
  216 +}
  217 +
  218 +static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
  219 +{
  220 + if (likely(vma->vm_flags & VM_WRITE))
  221 + pmd = pmd_mkwrite(pmd);
  222 + return pmd;
  223 +}
  224 +
  225 +static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
  226 + struct vm_area_struct *vma,
  227 + unsigned long haddr, pmd_t *pmd,
  228 + struct page *page)
  229 +{
  230 + int ret = 0;
  231 + pgtable_t pgtable;
  232 +
  233 + VM_BUG_ON(!PageCompound(page));
  234 + pgtable = pte_alloc_one(mm, haddr);
  235 + if (unlikely(!pgtable)) {
  236 + put_page(page);
  237 + return VM_FAULT_OOM;
  238 + }
  239 +
  240 + clear_huge_page(page, haddr, HPAGE_PMD_NR);
  241 + __SetPageUptodate(page);
  242 +
  243 + spin_lock(&mm->page_table_lock);
  244 + if (unlikely(!pmd_none(*pmd))) {
  245 + spin_unlock(&mm->page_table_lock);
  246 + put_page(page);
  247 + pte_free(mm, pgtable);
  248 + } else {
  249 + pmd_t entry;
  250 + entry = mk_pmd(page, vma->vm_page_prot);
  251 + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
  252 + entry = pmd_mkhuge(entry);
  253 + /*
  254 + * The spinlocking to take the lru_lock inside
  255 + * page_add_new_anon_rmap() acts as a full memory
  256 + * barrier to be sure clear_huge_page writes become
  257 + * visible after the set_pmd_at() write.
  258 + */
  259 + page_add_new_anon_rmap(page, vma, haddr);
  260 + set_pmd_at(mm, haddr, pmd, entry);
  261 + prepare_pmd_huge_pte(pgtable, mm);
  262 + add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
  263 + spin_unlock(&mm->page_table_lock);
  264 + }
  265 +
  266 + return ret;
  267 +}
  268 +
  269 +static inline struct page *alloc_hugepage(int defrag)
  270 +{
  271 + return alloc_pages(GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT),
  272 + HPAGE_PMD_ORDER);
  273 +}
  274 +
  275 +int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
  276 + unsigned long address, pmd_t *pmd,
  277 + unsigned int flags)
  278 +{
  279 + struct page *page;
  280 + unsigned long haddr = address & HPAGE_PMD_MASK;
  281 + pte_t *pte;
  282 +
  283 + if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
  284 + if (unlikely(anon_vma_prepare(vma)))
  285 + return VM_FAULT_OOM;
  286 + page = alloc_hugepage(transparent_hugepage_defrag(vma));
  287 + if (unlikely(!page))
  288 + goto out;
  289 +
  290 + return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
  291 + }
  292 +out:
  293 + /*
  294 + * Use __pte_alloc instead of pte_alloc_map, because we can't
  295 + * run pte_offset_map on the pmd, if an huge pmd could
  296 + * materialize from under us from a different thread.
  297 + */
  298 + if (unlikely(__pte_alloc(mm, vma, pmd, address)))
  299 + return VM_FAULT_OOM;
  300 + /* if an huge pmd materialized from under us just retry later */
  301 + if (unlikely(pmd_trans_huge(*pmd)))
  302 + return 0;
  303 + /*
  304 + * A regular pmd is established and it can't morph into a huge pmd
  305 + * from under us anymore at this point because we hold the mmap_sem
  306 + * read mode and khugepaged takes it in write mode. So now it's
  307 + * safe to run pte_offset_map().
  308 + */
  309 + pte = pte_offset_map(pmd, address);
  310 + return handle_pte_fault(mm, vma, address, pte, pmd, flags);
  311 +}
  312 +
  313 +int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
  314 + pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
  315 + struct vm_area_struct *vma)
  316 +{
  317 + struct page *src_page;
  318 + pmd_t pmd;
  319 + pgtable_t pgtable;
  320 + int ret;
  321 +
  322 + ret = -ENOMEM;
  323 + pgtable = pte_alloc_one(dst_mm, addr);
  324 + if (unlikely(!pgtable))
  325 + goto out;
  326 +
  327 + spin_lock(&dst_mm->page_table_lock);
  328 + spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
  329 +
  330 + ret = -EAGAIN;
  331 + pmd = *src_pmd;
  332 + if (unlikely(!pmd_trans_huge(pmd))) {
  333 + pte_free(dst_mm, pgtable);
  334 + goto out_unlock;
  335 + }
  336 + if (unlikely(pmd_trans_splitting(pmd))) {
  337 + /* split huge page running from under us */
  338 + spin_unlock(&src_mm->page_table_lock);
  339 + spin_unlock(&dst_mm->page_table_lock);
  340 + pte_free(dst_mm, pgtable);
  341 +
  342 + wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
  343 + goto out;
  344 + }
  345 + src_page = pmd_page(pmd);
  346 + VM_BUG_ON(!PageHead(src_page));
  347 + get_page(src_page);
  348 + page_dup_rmap(src_page);
  349 + add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
  350 +
  351 + pmdp_set_wrprotect(src_mm, addr, src_pmd);
  352 + pmd = pmd_mkold(pmd_wrprotect(pmd));
  353 + set_pmd_at(dst_mm, addr, dst_pmd, pmd);
  354 + prepare_pmd_huge_pte(pgtable, dst_mm);
  355 +
  356 + ret = 0;
  357 +out_unlock:
  358 + spin_unlock(&src_mm->page_table_lock);
  359 + spin_unlock(&dst_mm->page_table_lock);
  360 +out:
  361 + return ret;
  362 +}
  363 +
  364 +/* no "address" argument so destroys page coloring of some arch */
  365 +pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
  366 +{
  367 + pgtable_t pgtable;
  368 +
  369 + assert_spin_locked(&mm->page_table_lock);
  370 +
  371 + /* FIFO */
  372 + pgtable = mm->pmd_huge_pte;
  373 + if (list_empty(&pgtable->lru))
  374 + mm->pmd_huge_pte = NULL;
  375 + else {
  376 + mm->pmd_huge_pte = list_entry(pgtable->lru.next,
  377 + struct page, lru);
  378 + list_del(&pgtable->lru);
  379 + }
  380 + return pgtable;
  381 +}
  382 +
  383 +static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
  384 + struct vm_area_struct *vma,
  385 + unsigned long address,
  386 + pmd_t *pmd, pmd_t orig_pmd,
  387 + struct page *page,
  388 + unsigned long haddr)
  389 +{
  390 + pgtable_t pgtable;
  391 + pmd_t _pmd;
  392 + int ret = 0, i;
  393 + struct page **pages;
  394 +
  395 + pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
  396 + GFP_KERNEL);
  397 + if (unlikely(!pages)) {
  398 + ret |= VM_FAULT_OOM;
  399 + goto out;
  400 + }
  401 +
  402 + for (i = 0; i < HPAGE_PMD_NR; i++) {
  403 + pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
  404 + vma, address);
  405 + if (unlikely(!pages[i])) {
  406 + while (--i >= 0)
  407 + put_page(pages[i]);
  408 + kfree(pages);
  409 + ret |= VM_FAULT_OOM;
  410 + goto out;
  411 + }
  412 + }
  413 +
  414 + for (i = 0; i < HPAGE_PMD_NR; i++) {
  415 + copy_user_highpage(pages[i], page + i,
  416 + haddr + PAGE_SHIFT*i, vma);
  417 + __SetPageUptodate(pages[i]);
  418 + cond_resched();
  419 + }
  420 +
  421 + spin_lock(&mm->page_table_lock);
  422 + if (unlikely(!pmd_same(*pmd, orig_pmd)))
  423 + goto out_free_pages;
  424 + VM_BUG_ON(!PageHead(page));
  425 +
  426 + pmdp_clear_flush_notify(vma, haddr, pmd);
  427 + /* leave pmd empty until pte is filled */
  428 +
  429 + pgtable = get_pmd_huge_pte(mm);
  430 + pmd_populate(mm, &_pmd, pgtable);
  431 +
  432 + for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
  433 + pte_t *pte, entry;
  434 + entry = mk_pte(pages[i], vma->vm_page_prot);
  435 + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
  436 + page_add_new_anon_rmap(pages[i], vma, haddr);
  437 + pte = pte_offset_map(&_pmd, haddr);
  438 + VM_BUG_ON(!pte_none(*pte));
  439 + set_pte_at(mm, haddr, pte, entry);
  440 + pte_unmap(pte);
  441 + }
  442 + kfree(pages);
  443 +
  444 + mm->nr_ptes++;
  445 + smp_wmb(); /* make pte visible before pmd */
  446 + pmd_populate(mm, pmd, pgtable);
  447 + page_remove_rmap(page);
  448 + spin_unlock(&mm->page_table_lock);
  449 +
  450 + ret |= VM_FAULT_WRITE;
  451 + put_page(page);
  452 +
  453 +out:
  454 + return ret;
  455 +
  456 +out_free_pages:
  457 + spin_unlock(&mm->page_table_lock);
  458 + for (i = 0; i < HPAGE_PMD_NR; i++)
  459 + put_page(pages[i]);
  460 + kfree(pages);
  461 + goto out;
  462 +}
  463 +
  464 +int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
  465 + unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
  466 +{
  467 + int ret = 0;
  468 + struct page *page, *new_page;
  469 + unsigned long haddr;
  470 +
  471 + VM_BUG_ON(!vma->anon_vma);
  472 + spin_lock(&mm->page_table_lock);
  473 + if (unlikely(!pmd_same(*pmd, orig_pmd)))
  474 + goto out_unlock;
  475 +
  476 + page = pmd_page(orig_pmd);
  477 + VM_BUG_ON(!PageCompound(page) || !PageHead(page));
  478 + haddr = address & HPAGE_PMD_MASK;
  479 + if (page_mapcount(page) == 1) {
  480 + pmd_t entry;
  481 + entry = pmd_mkyoung(orig_pmd);
  482 + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
  483 + if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
  484 + update_mmu_cache(vma, address, entry);
  485 + ret |= VM_FAULT_WRITE;
  486 + goto out_unlock;
  487 + }
  488 + get_page(page);
  489 + spin_unlock(&mm->page_table_lock);
  490 +
  491 + if (transparent_hugepage_enabled(vma) &&
  492 + !transparent_hugepage_debug_cow())
  493 + new_page = alloc_hugepage(transparent_hugepage_defrag(vma));
  494 + else
  495 + new_page = NULL;
  496 +
  497 + if (unlikely(!new_page)) {
  498 + ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
  499 + pmd, orig_pmd, page, haddr);
  500 + put_page(page);
  501 + goto out;
  502 + }
  503 +
  504 + copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
  505 + __SetPageUptodate(new_page);
  506 +
  507 + spin_lock(&mm->page_table_lock);
  508 + put_page(page);
  509 + if (unlikely(!pmd_same(*pmd, orig_pmd)))
  510 + put_page(new_page);
  511 + else {
  512 + pmd_t entry;
  513 + VM_BUG_ON(!PageHead(page));
  514 + entry = mk_pmd(new_page, vma->vm_page_prot);
  515 + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
  516 + entry = pmd_mkhuge(entry);
  517 + pmdp_clear_flush_notify(vma, haddr, pmd);
  518 + page_add_new_anon_rmap(new_page, vma, haddr);
  519 + set_pmd_at(mm, haddr, pmd, entry);
  520 + update_mmu_cache(vma, address, entry);
  521 + page_remove_rmap(page);
  522 + put_page(page);
  523 + ret |= VM_FAULT_WRITE;
  524 + }
  525 +out_unlock:
  526 + spin_unlock(&mm->page_table_lock);
  527 +out:
  528 + return ret;
  529 +}
  530 +
  531 +struct page *follow_trans_huge_pmd(struct mm_struct *mm,
  532 + unsigned long addr,
  533 + pmd_t *pmd,
  534 + unsigned int flags)
  535 +{
  536 + struct page *page = NULL;
  537 +
  538 + assert_spin_locked(&mm->page_table_lock);
  539 +
  540 + if (flags & FOLL_WRITE && !pmd_write(*pmd))
  541 + goto out;
  542 +
  543 + page = pmd_page(*pmd);
  544 + VM_BUG_ON(!PageHead(page));
  545 + if (flags & FOLL_TOUCH) {
  546 + pmd_t _pmd;
  547 + /*
  548 + * We should set the dirty bit only for FOLL_WRITE but
  549 + * for now the dirty bit in the pmd is meaningless.
  550 + * And if the dirty bit will become meaningful and
  551 + * we'll only set it with FOLL_WRITE, an atomic
  552 + * set_bit will be required on the pmd to set the
  553 + * young bit, instead of the current set_pmd_at.
  554 + */
  555 + _pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
  556 + set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd);
  557 + }
  558 + page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
  559 + VM_BUG_ON(!PageCompound(page));
  560 + if (flags & FOLL_GET)
  561 + get_page(page);
  562 +
  563 +out:
  564 + return page;
  565 +}
  566 +
  567 +int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
  568 + pmd_t *pmd)
  569 +{
  570 + int ret = 0;
  571 +
  572 + spin_lock(&tlb->mm->page_table_lock);
  573 + if (likely(pmd_trans_huge(*pmd))) {
  574 + if (unlikely(pmd_trans_splitting(*pmd))) {
  575 + spin_unlock(&tlb->mm->page_table_lock);
  576 + wait_split_huge_page(vma->anon_vma,
  577 + pmd);
  578 + } else {
  579 + struct page *page;
  580 + pgtable_t pgtable;
  581 + pgtable = get_pmd_huge_pte(tlb->mm);
  582 + page = pmd_page(*pmd);
  583 + pmd_clear(pmd);
  584 + page_remove_rmap(page);
  585 + VM_BUG_ON(page_mapcount(page) < 0);
  586 + add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
  587 + VM_BUG_ON(!PageHead(page));
  588 + spin_unlock(&tlb->mm->page_table_lock);
  589 + tlb_remove_page(tlb, page);
  590 + pte_free(tlb->mm, pgtable);
  591 + ret = 1;
  592 + }
  593 + } else
  594 + spin_unlock(&tlb->mm->page_table_lock);
  595 +
  596 + return ret;
  597 +}
  598 +
  599 +pmd_t *page_check_address_pmd(struct page *page,
  600 + struct mm_struct *mm,
  601 + unsigned long address,
  602 + enum page_check_address_pmd_flag flag)
  603 +{
  604 + pgd_t *pgd;
  605 + pud_t *pud;
  606 + pmd_t *pmd, *ret = NULL;
  607 +
  608 + if (address & ~HPAGE_PMD_MASK)
  609 + goto out;
  610 +
  611 + pgd = pgd_offset(mm, address);
  612 + if (!pgd_present(*pgd))
  613 + goto out;
  614 +
  615 + pud = pud_offset(pgd, address);
  616 + if (!pud_present(*pud))
  617 + goto out;
  618 +
  619 + pmd = pmd_offset(pud, address);
  620 + if (pmd_none(*pmd))
  621 + goto out;
  622 + if (pmd_page(*pmd) != page)
  623 + goto out;
  624 + VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
  625 + pmd_trans_splitting(*pmd));
  626 + if (pmd_trans_huge(*pmd)) {
  627 + VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
  628 + !pmd_trans_splitting(*pmd));
  629 + ret = pmd;
  630 + }
  631 +out:
  632 + return ret;
  633 +}
  634 +
  635 +static int __split_huge_page_splitting(struct page *page,
  636 + struct vm_area_struct *vma,
  637 + unsigned long address)
  638 +{
  639 + struct mm_struct *mm = vma->vm_mm;
  640 + pmd_t *pmd;
  641 + int ret = 0;
  642 +
  643 + spin_lock(&mm->page_table_lock);
  644 + pmd = page_check_address_pmd(page, mm, address,
  645 + PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
  646 + if (pmd) {
  647 + /*
  648 + * We can't temporarily set the pmd to null in order
  649 + * to split it, the pmd must remain marked huge at all
  650 + * times or the VM won't take the pmd_trans_huge paths
  651 + * and it won't wait on the anon_vma->root->lock to
  652 + * serialize against split_huge_page*.
  653 + */
  654 + pmdp_splitting_flush_notify(vma, address, pmd);
  655 + ret = 1;
  656 + }
  657 + spin_unlock(&mm->page_table_lock);
  658 +
  659 + return ret;
  660 +}
  661 +
  662 +static void __split_huge_page_refcount(struct page *page)
  663 +{
  664 + int i;
  665 + unsigned long head_index = page->index;
  666 + struct zone *zone = page_zone(page);
  667 +
  668 + /* prevent PageLRU to go away from under us, and freeze lru stats */
  669 + spin_lock_irq(&zone->lru_lock);
  670 + compound_lock(page);
  671 +
  672 + for (i = 1; i < HPAGE_PMD_NR; i++) {
  673 + struct page *page_tail = page + i;
  674 +
  675 + /* tail_page->_count cannot change */
  676 + atomic_sub(atomic_read(&page_tail->_count), &page->_count);
  677 + BUG_ON(page_count(page) <= 0);
  678 + atomic_add(page_mapcount(page) + 1, &page_tail->_count);
  679 + BUG_ON(atomic_read(&page_tail->_count) <= 0);
  680 +
  681 + /* after clearing PageTail the gup refcount can be released */
  682 + smp_mb();
  683 +
  684 + page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
  685 + page_tail->flags |= (page->flags &
  686 + ((1L << PG_referenced) |
  687 + (1L << PG_swapbacked) |
  688 + (1L << PG_mlocked) |
  689 + (1L << PG_uptodate)));
  690 + page_tail->flags |= (1L << PG_dirty);
  691 +
  692 + /*
  693 + * 1) clear PageTail before overwriting first_page
  694 + * 2) clear PageTail before clearing PageHead for VM_BUG_ON
  695 + */
  696 + smp_wmb();
  697 +
  698 + /*
  699 + * __split_huge_page_splitting() already set the
  700 + * splitting bit in all pmd that could map this
  701 + * hugepage, that will ensure no CPU can alter the
  702 + * mapcount on the head page. The mapcount is only
  703 + * accounted in the head page and it has to be
  704 + * transferred to all tail pages in the below code. So
  705 + * for this code to be safe, the split the mapcount
  706 + * can't change. But that doesn't mean userland can't
  707 + * keep changing and reading the page contents while
  708 + * we transfer the mapcount, so the pmd splitting
  709 + * status is achieved setting a reserved bit in the
  710 + * pmd, not by clearing the present bit.
  711 + */
  712 + BUG_ON(page_mapcount(page_tail));
  713 + page_tail->_mapcount = page->_mapcount;
  714 +
  715 + BUG_ON(page_tail->mapping);
  716 + page_tail->mapping = page->mapping;
  717 +
  718 + page_tail->index = ++head_index;
  719 +
  720 + BUG_ON(!PageAnon(page_tail));
  721 + BUG_ON(!PageUptodate(page_tail));
  722 + BUG_ON(!PageDirty(page_tail));
  723 + BUG_ON(!PageSwapBacked(page_tail));
  724 +
  725 + lru_add_page_tail(zone, page, page_tail);
  726 + }
  727 +
  728 + ClearPageCompound(page);
  729 + compound_unlock(page);
  730 + spin_unlock_irq(&zone->lru_lock);
  731 +
  732 + for (i = 1; i < HPAGE_PMD_NR; i++) {
  733 + struct page *page_tail = page + i;
  734 + BUG_ON(page_count(page_tail) <= 0);
  735 + /*
  736 + * Tail pages may be freed if there wasn't any mapping
  737 + * like if add_to_swap() is running on a lru page that
  738 + * had its mapping zapped. And freeing these pages
  739 + * requires taking the lru_lock so we do the put_page
  740 + * of the tail pages after the split is complete.
  741 + */
  742 + put_page(page_tail);
  743 + }
  744 +
  745 + /*
  746 + * Only the head page (now become a regular page) is required
  747 + * to be pinned by the caller.
  748 + */
  749 + BUG_ON(page_count(page) <= 0);
  750 +}
  751 +
  752 +static int __split_huge_page_map(struct page *page,
  753 + struct vm_area_struct *vma,
  754 + unsigned long address)
  755 +{
  756 + struct mm_struct *mm = vma->vm_mm;
  757 + pmd_t *pmd, _pmd;
  758 + int ret = 0, i;
  759 + pgtable_t pgtable;
  760 + unsigned long haddr;
  761 +
  762 + spin_lock(&mm->page_table_lock);
  763 + pmd = page_check_address_pmd(page, mm, address,
  764 + PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
  765 + if (pmd) {
  766 + pgtable = get_pmd_huge_pte(mm);
  767 + pmd_populate(mm, &_pmd, pgtable);
  768 +
  769 + for (i = 0, haddr = address; i < HPAGE_PMD_NR;
  770 + i++, haddr += PAGE_SIZE) {
  771 + pte_t *pte, entry;
  772 + BUG_ON(PageCompound(page+i));
  773 + entry = mk_pte(page + i, vma->vm_page_prot);
  774 + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
  775 + if (!pmd_write(*pmd))
  776 + entry = pte_wrprotect(entry);
  777 + else
  778 + BUG_ON(page_mapcount(page) != 1);
  779 + if (!pmd_young(*pmd))
  780 + entry = pte_mkold(entry);
  781 + pte = pte_offset_map(&_pmd, haddr);
  782 + BUG_ON(!pte_none(*pte));
  783 + set_pte_at(mm, haddr, pte, entry);
  784 + pte_unmap(pte);
  785 + }
  786 +
  787 + mm->nr_ptes++;
  788 + smp_wmb(); /* make pte visible before pmd */
  789 + /*
  790 + * Up to this point the pmd is present and huge and
  791 + * userland has the whole access to the hugepage
  792 + * during the split (which happens in place). If we
  793 + * overwrite the pmd with the not-huge version
  794 + * pointing to the pte here (which of course we could
  795 + * if all CPUs were bug free), userland could trigger
  796 + * a small page size TLB miss on the small sized TLB
  797 + * while the hugepage TLB entry is still established
  798 + * in the huge TLB. Some CPU doesn't like that. See
  799 + * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
  800 + * Erratum 383 on page 93. Intel should be safe but is
  801 + * also warns that it's only safe if the permission
  802 + * and cache attributes of the two entries loaded in
  803 + * the two TLB is identical (which should be the case
  804 + * here). But it is generally safer to never allow
  805 + * small and huge TLB entries for the same virtual
  806 + * address to be loaded simultaneously. So instead of
  807 + * doing "pmd_populate(); flush_tlb_range();" we first
  808 + * mark the current pmd notpresent (atomically because
  809 + * here the pmd_trans_huge and pmd_trans_splitting
  810 + * must remain set at all times on the pmd until the
  811 + * split is complete for this pmd), then we flush the
  812 + * SMP TLB and finally we write the non-huge version
  813 + * of the pmd entry with pmd_populate.
  814 + */
  815 + set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
  816 + flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
  817 + pmd_populate(mm, pmd, pgtable);
  818 + ret = 1;
  819 + }
  820 + spin_unlock(&mm->page_table_lock);
  821 +
  822 + return ret;
  823 +}
  824 +
  825 +/* must be called with anon_vma->root->lock hold */
  826 +static void __split_huge_page(struct page *page,
  827 + struct anon_vma *anon_vma)
  828 +{
  829 + int mapcount, mapcount2;
  830 + struct anon_vma_chain *avc;
  831 +
  832 + BUG_ON(!PageHead(page));
  833 + BUG_ON(PageTail(page));
  834 +
  835 + mapcount = 0;
  836 + list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
  837 + struct vm_area_struct *vma = avc->vma;
  838 + unsigned long addr = vma_address(page, vma);
  839 + BUG_ON(is_vma_temporary_stack(vma));
  840 + if (addr == -EFAULT)
  841 + continue;
  842 + mapcount += __split_huge_page_splitting(page, vma, addr);
  843 + }
  844 + BUG_ON(mapcount != page_mapcount(page));
  845 +
  846 + __split_huge_page_refcount(page);
  847 +
  848 + mapcount2 = 0;
  849 + list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
  850 + struct vm_area_struct *vma = avc->vma;
  851 + unsigned long addr = vma_address(page, vma);
  852 + BUG_ON(is_vma_temporary_stack(vma));
  853 + if (addr == -EFAULT)
  854 + continue;
  855 + mapcount2 += __split_huge_page_map(page, vma, addr);
  856 + }
  857 + BUG_ON(mapcount != mapcount2);
  858 +}
  859 +
  860 +int split_huge_page(struct page *page)
  861 +{
  862 + struct anon_vma *anon_vma;
  863 + int ret = 1;
  864 +
  865 + BUG_ON(!PageAnon(page));
  866 + anon_vma = page_lock_anon_vma(page);
  867 + if (!anon_vma)
  868 + goto out;
  869 + ret = 0;
  870 + if (!PageCompound(page))
  871 + goto out_unlock;
  872 +
  873 + BUG_ON(!PageSwapBacked(page));
  874 + __split_huge_page(page, anon_vma);
  875 +
  876 + BUG_ON(PageCompound(page));
  877 +out_unlock:
  878 + page_unlock_anon_vma(anon_vma);
  879 +out:
  880 + return ret;
  881 +}
  882 +
  883 +void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
  884 +{
  885 + struct page *page;
  886 +
  887 + spin_lock(&mm->page_table_lock);
  888 + if (unlikely(!pmd_trans_huge(*pmd))) {
  889 + spin_unlock(&mm->page_table_lock);
  890 + return;
  891 + }
  892 + page = pmd_page(*pmd);
  893 + VM_BUG_ON(!page_count(page));
  894 + get_page(page);
  895 + spin_unlock(&mm->page_table_lock);
  896 +
  897 + split_huge_page(page);
  898 +
  899 + put_page(page);
  900 + BUG_ON(pmd_trans_huge(*pmd));
  901 +}
mm/internal.h
... ... @@ -134,6 +134,10 @@
134 134 }
135 135 }
136 136  
  137 +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
  138 +extern unsigned long vma_address(struct page *page,
  139 + struct vm_area_struct *vma);
  140 +#endif
137 141 #else /* !CONFIG_MMU */
138 142 static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
139 143 {
mm/memory.c
... ... @@ -726,9 +726,9 @@
726 726 return 0;
727 727 }
728 728  
729   -static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
730   - pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
731   - unsigned long addr, unsigned long end)
  729 +int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
  730 + pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
  731 + unsigned long addr, unsigned long end)
732 732 {
733 733 pte_t *orig_src_pte, *orig_dst_pte;
734 734 pte_t *src_pte, *dst_pte;
... ... @@ -802,6 +802,16 @@
802 802 src_pmd = pmd_offset(src_pud, addr);
803 803 do {
804 804 next = pmd_addr_end(addr, end);
  805 + if (pmd_trans_huge(*src_pmd)) {
  806 + int err;
  807 + err = copy_huge_pmd(dst_mm, src_mm,
  808 + dst_pmd, src_pmd, addr, vma);
  809 + if (err == -ENOMEM)
  810 + return -ENOMEM;
  811 + if (!err)
  812 + continue;
  813 + /* fall through */
  814 + }
805 815 if (pmd_none_or_clear_bad(src_pmd))
806 816 continue;
807 817 if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
... ... @@ -1004,6 +1014,15 @@
1004 1014 pmd = pmd_offset(pud, addr);
1005 1015 do {
1006 1016 next = pmd_addr_end(addr, end);
  1017 + if (pmd_trans_huge(*pmd)) {
  1018 + if (next-addr != HPAGE_PMD_SIZE)
  1019 + split_huge_page_pmd(vma->vm_mm, pmd);
  1020 + else if (zap_huge_pmd(tlb, vma, pmd)) {
  1021 + (*zap_work)--;
  1022 + continue;
  1023 + }
  1024 + /* fall through */
  1025 + }
1007 1026 if (pmd_none_or_clear_bad(pmd)) {
1008 1027 (*zap_work)--;
1009 1028 continue;
1010 1029  
... ... @@ -1280,11 +1299,27 @@
1280 1299 pmd = pmd_offset(pud, address);
1281 1300 if (pmd_none(*pmd))
1282 1301 goto no_page_table;
1283   - if (pmd_huge(*pmd)) {
  1302 + if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
1284 1303 BUG_ON(flags & FOLL_GET);
1285 1304 page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
1286 1305 goto out;
1287 1306 }
  1307 + if (pmd_trans_huge(*pmd)) {
  1308 + spin_lock(&mm->page_table_lock);
  1309 + if (likely(pmd_trans_huge(*pmd))) {
  1310 + if (unlikely(pmd_trans_splitting(*pmd))) {
  1311 + spin_unlock(&mm->page_table_lock);
  1312 + wait_split_huge_page(vma->anon_vma, pmd);
  1313 + } else {
  1314 + page = follow_trans_huge_pmd(mm, address,
  1315 + pmd, flags);
  1316 + spin_unlock(&mm->page_table_lock);
  1317 + goto out;
  1318 + }
  1319 + } else
  1320 + spin_unlock(&mm->page_table_lock);
  1321 + /* fall through */
  1322 + }
1288 1323 if (unlikely(pmd_bad(*pmd)))
1289 1324 goto no_page_table;
1290 1325  
... ... @@ -3179,9 +3214,9 @@
3179 3214 * but allow concurrent faults), and pte mapped but not yet locked.
3180 3215 * We return with mmap_sem still held, but pte unmapped and unlocked.
3181 3216 */
3182   -static inline int handle_pte_fault(struct mm_struct *mm,
3183   - struct vm_area_struct *vma, unsigned long address,
3184   - pte_t *pte, pmd_t *pmd, unsigned int flags)
  3217 +int handle_pte_fault(struct mm_struct *mm,
  3218 + struct vm_area_struct *vma, unsigned long address,
  3219 + pte_t *pte, pmd_t *pmd, unsigned int flags)
3185 3220 {
3186 3221 pte_t entry;
3187 3222 spinlock_t *ptl;
3188 3223  
... ... @@ -3260,9 +3295,40 @@
3260 3295 pmd = pmd_alloc(mm, pud, address);
3261 3296 if (!pmd)
3262 3297 return VM_FAULT_OOM;
3263   - pte = pte_alloc_map(mm, vma, pmd, address);
3264   - if (!pte)
  3298 + if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
  3299 + if (!vma->vm_ops)
  3300 + return do_huge_pmd_anonymous_page(mm, vma, address,
  3301 + pmd, flags);
  3302 + } else {
  3303 + pmd_t orig_pmd = *pmd;
  3304 + barrier();
  3305 + if (pmd_trans_huge(orig_pmd)) {
  3306 + if (flags & FAULT_FLAG_WRITE &&
  3307 + !pmd_write(orig_pmd) &&
  3308 + !pmd_trans_splitting(orig_pmd))
  3309 + return do_huge_pmd_wp_page(mm, vma, address,
  3310 + pmd, orig_pmd);
  3311 + return 0;
  3312 + }
  3313 + }
  3314 +
  3315 + /*
  3316 + * Use __pte_alloc instead of pte_alloc_map, because we can't
  3317 + * run pte_offset_map on the pmd, if an huge pmd could
  3318 + * materialize from under us from a different thread.
  3319 + */
  3320 + if (unlikely(__pte_alloc(mm, vma, pmd, address)))
3265 3321 return VM_FAULT_OOM;
  3322 + /* if an huge pmd materialized from under us just retry later */
  3323 + if (unlikely(pmd_trans_huge(*pmd)))
  3324 + return 0;
  3325 + /*
  3326 + * A regular pmd is established and it can't morph into a huge pmd
  3327 + * from under us anymore at this point because we hold the mmap_sem
  3328 + * read mode and khugepaged takes it in write mode. So now it's
  3329 + * safe to run pte_offset_map().
  3330 + */
  3331 + pte = pte_offset_map(pmd, address);
3266 3332  
3267 3333 return handle_pte_fault(mm, vma, address, pte, pmd, flags);
3268 3334 }
mm/rmap.c
... ... @@ -360,7 +360,7 @@
360 360 * Returns virtual address or -EFAULT if page's index/offset is not
361 361 * within the range mapped the @vma.
362 362 */
363   -static inline unsigned long
  363 +inline unsigned long
364 364 vma_address(struct page *page, struct vm_area_struct *vma)
365 365 {
366 366 pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
... ... @@ -435,6 +435,8 @@
435 435 pmd = pmd_offset(pud, address);
436 436 if (!pmd_present(*pmd))
437 437 return NULL;
  438 + if (pmd_trans_huge(*pmd))
  439 + return NULL;
438 440  
439 441 pte = pte_offset_map(pmd, address);
440 442 /* Make a quick check before getting the lock */
... ... @@ -489,46 +491,58 @@
489 491 unsigned long *vm_flags)
490 492 {
491 493 struct mm_struct *mm = vma->vm_mm;
492   - pte_t *pte;
493   - spinlock_t *ptl;
494 494 int referenced = 0;
495 495  
496   - pte = page_check_address(page, mm, address, &ptl, 0);
497   - if (!pte)
498   - goto out;
499   -
500 496 /*
501 497 * Don't want to elevate referenced for mlocked page that gets this far,
502 498 * in order that it progresses to try_to_unmap and is moved to the
503 499 * unevictable list.
504 500 */
505 501 if (vma->vm_flags & VM_LOCKED) {
506   - *mapcount = 1; /* break early from loop */
  502 + *mapcount = 0; /* break early from loop */
507 503 *vm_flags |= VM_LOCKED;
508   - goto out_unmap;
  504 + goto out;
509 505 }
510 506  
511   - if (ptep_clear_flush_young_notify(vma, address, pte)) {
512   - /*
513   - * Don't treat a reference through a sequentially read
514   - * mapping as such. If the page has been used in
515   - * another mapping, we will catch it; if this other
516   - * mapping is already gone, the unmap path will have
517   - * set PG_referenced or activated the page.
518   - */
519   - if (likely(!VM_SequentialReadHint(vma)))
520   - referenced++;
521   - }
522   -
523 507 /* Pretend the page is referenced if the task has the
524 508 swap token and is in the middle of a page fault. */
525 509 if (mm != current->mm && has_swap_token(mm) &&
526 510 rwsem_is_locked(&mm->mmap_sem))
527 511 referenced++;
528 512  
529   -out_unmap:
  513 + if (unlikely(PageTransHuge(page))) {
  514 + pmd_t *pmd;
  515 +
  516 + spin_lock(&mm->page_table_lock);
  517 + pmd = page_check_address_pmd(page, mm, address,
  518 + PAGE_CHECK_ADDRESS_PMD_FLAG);
  519 + if (pmd && !pmd_trans_splitting(*pmd) &&
  520 + pmdp_clear_flush_young_notify(vma, address, pmd))
  521 + referenced++;
  522 + spin_unlock(&mm->page_table_lock);
  523 + } else {
  524 + pte_t *pte;
  525 + spinlock_t *ptl;
  526 +
  527 + pte = page_check_address(page, mm, address, &ptl, 0);
  528 + if (!pte)
  529 + goto out;
  530 +
  531 + if (ptep_clear_flush_young_notify(vma, address, pte)) {
  532 + /*
  533 + * Don't treat a reference through a sequentially read
  534 + * mapping as such. If the page has been used in
  535 + * another mapping, we will catch it; if this other
  536 + * mapping is already gone, the unmap path will have
  537 + * set PG_referenced or activated the page.
  538 + */
  539 + if (likely(!VM_SequentialReadHint(vma)))
  540 + referenced++;
  541 + }
  542 + pte_unmap_unlock(pte, ptl);
  543 + }
  544 +
530 545 (*mapcount)--;
531   - pte_unmap_unlock(pte, ptl);
532 546  
533 547 if (referenced)
534 548 *vm_flags |= vma->vm_flags;
... ... @@ -1202,7 +1216,7 @@
1202 1216 return ret;
1203 1217 }
1204 1218  
1205   -static bool is_vma_temporary_stack(struct vm_area_struct *vma)
  1219 +bool is_vma_temporary_stack(struct vm_area_struct *vma)
1206 1220 {
1207 1221 int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);
1208 1222  
mm/swap.c
... ... @@ -479,6 +479,43 @@
479 479  
480 480 EXPORT_SYMBOL(__pagevec_release);
481 481  
  482 +/* used by __split_huge_page_refcount() */
  483 +void lru_add_page_tail(struct zone* zone,
  484 + struct page *page, struct page *page_tail)
  485 +{
  486 + int active;
  487 + enum lru_list lru;
  488 + const int file = 0;
  489 + struct list_head *head;
  490 +
  491 + VM_BUG_ON(!PageHead(page));
  492 + VM_BUG_ON(PageCompound(page_tail));
  493 + VM_BUG_ON(PageLRU(page_tail));
  494 + VM_BUG_ON(!spin_is_locked(&zone->lru_lock));
  495 +
  496 + SetPageLRU(page_tail);
  497 +
  498 + if (page_evictable(page_tail, NULL)) {
  499 + if (PageActive(page)) {
  500 + SetPageActive(page_tail);
  501 + active = 1;
  502 + lru = LRU_ACTIVE_ANON;
  503 + } else {
  504 + active = 0;
  505 + lru = LRU_INACTIVE_ANON;
  506 + }
  507 + update_page_reclaim_stat(zone, page_tail, file, active);
  508 + if (likely(PageLRU(page)))
  509 + head = page->lru.prev;
  510 + else
  511 + head = &zone->lru[lru].list;
  512 + __add_page_to_lru_list(zone, page_tail, lru, head);
  513 + } else {
  514 + SetPageUnevictable(page_tail);
  515 + add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
  516 + }
  517 +}
  518 +
482 519 /*
483 520 * Add the passed pages to the LRU, then drop the caller's refcount
484 521 * on them. Reinitialises the caller's pagevec.