13 Jan, 2012

6 commits

  • This patch started off as a cleanup: __split_huge_page_refcount() has to
    cope with two scenarios, when the hugepage being split is already on the
    LRU and when it is not; but why does it have to split that accounting
    across three different sites? Consolidate it in lru_add_page_tail(),
    handling evictable and unevictable alike, and use the standard
    add_page_to_lru_list() when accounting is needed (when the head is not yet
    on the LRU).

    But a recent regression in -next (I guess the removal of the
    PageCgroupAcctLRU test from mem_cgroup_split_huge_fixup()) makes this a
    necessary fix now: under load, the MEM_CGROUP_ZSTAT count was wrapping to
    a huge number, messing up reclaim calculations and causing a freeze at
    rmdir of the cgroup.

    Add a VM_BUG_ON to mem_cgroup_lru_del_list() when we're about to wrap that
    count - this has not been the only such incident. Document that
    lru_add_page_tail() is for Transparent HugePages by wrapping it in an
    #ifdef. (A small illustration of the kind of wrap being guarded against
    follows this entry.)

    Signed-off-by: Hugh Dickins
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
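
    To illustrate the kind of wrap the VM_BUG_ON guards against: decrementing
    an unsigned per-lru counter below zero makes it wrap to a huge value,
    which is exactly what the MEM_CGROUP_ZSTAT freeze looked like. A minimal
    userspace sketch (not the kernel code; the names are only illustrative):

    ===
    #include <assert.h>
    #include <stdio.h>

    /* stand-in for a per-memcg, per-lru page count */
    static unsigned long lru_size;

    static void lru_del_pages(unsigned long nr)
    {
        /* analogous to the VM_BUG_ON added by the patch: catch the wrap */
        assert((long)(lru_size - nr) >= 0);
        lru_size -= nr;
    }

    int main(void)
    {
        lru_size = 1;
        lru_del_pages(1);   /* fine: 1 -> 0 */
        lru_del_pages(1);   /* would wrap 0 -> ULONG_MAX; the assert fires */
        printf("lru_size = %lu\n", lru_size);
        return 0;
    }
    ===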
     
  • When an isolated hugepage is split, put its tail subpages at the head of
    the lru reclaim list, since they should presumably be isolated next as
    well.

    For non-isolated hugepages under splitting, queue the subpages in physical
    order on the lru. That might provide some theoretical cache benefit to the
    buddy allocator later.

    Signed-off-by: Shaohua Li
    Signed-off-by: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • We have tlb_remove_tlb_entry() to indicate that a pte entry should be
    flushed from the TLB, but no corresponding API for a pmd entry. This
    isn't a problem so far because THP is currently x86-only and tlb_flush()
    on x86 flushes the entire TLB. But it is confusing and could be missed if
    THP is ported to another arch. (A sketch of the shape such a pmd variant
    takes follows this entry.)

    Also convert the tlb->need_flush = 1 to a VM_BUG_ON(!tlb->need_flush) in
    __tlb_remove_page(), as suggested by Andrea Arcangeli. The
    __tlb_remove_page() function is supposed to be called after
    tlb_remove_xxx_tlb_entry(), so we can catch any misuse.

    Signed-off-by: Shaohua Li
    Reviewed-by: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
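
    By analogy with the existing tlb_remove_tlb_entry() for ptes, a pmd
    variant only has to mark the gather structure as needing a flush before
    invoking the arch hook. A compile-and-run sketch with stand-in types (the
    real definitions live in the tlb headers; everything below is
    illustrative, not the actual kernel code):

    ===
    #include <stddef.h>
    #include <stdio.h>

    struct mmu_gather { int need_flush; };  /* stand-in for the real struct */

    /* arch hook that would record/flush the pmd entry; a no-op here */
    #define __tlb_remove_pmd_tlb_entry(tlb, pmdp, address)  ((void)0)

    /* mirrors how tlb_remove_tlb_entry() works for pte entries */
    #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)            \
        do {                                                        \
            (tlb)->need_flush = 1;                                  \
            __tlb_remove_pmd_tlb_entry(tlb, pmdp, address);         \
        } while (0)

    int main(void)
    {
        struct mmu_gather tlb = { .need_flush = 0 };

        tlb_remove_pmd_tlb_entry(&tlb, NULL, 0x200000UL);
        /* __tlb_remove_page() can now VM_BUG_ON(!tlb->need_flush) */
        printf("need_flush = %d\n", tlb.need_flush);
        return 0;
    }
    ===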
     
  • change_protection() will do a TLB flush later; there is no need for a
    duplicate flush here.

    Signed-off-by: Shaohua Li
    Reviewed-by: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Improve the error code path; for example, delete the sysfs file that
    would otherwise be left behind on failure. Also remove the #ifdef xxx to
    make the code cleaner.

    Signed-off-by: Shaohua Li
    Reviewed-by: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • In split_huge_page(), mem_cgroup_split_huge_fixup() is called to handle
    page_cgroup modifications. It takes move_lock_page_cgroup(), modifies the
    page_cgroup and does the LRU accounting, and is called HPAGE_PMD_NR - 1
    times.

    But thinking again:
    - compound_lock() is held at move_account()... so it's not necessary to
      take move_lock_page_cgroup().
    - The LRU is locked, and all tail pages will go onto the same LRU the
      head is now on.
    - page_cgroup is contiguous across the huge page range.

    This patch changes mem_cgroup_split_huge_fixup() to be called once per
    hugepage, reducing the cost of splitting. (A stand-alone sketch of the
    new shape follows this entry.)

    [akpm@linux-foundation.org: fix typo, per Michal]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Reviewed-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
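
    The shape of the change, as a stand-alone sketch: instead of invoking the
    fixup once per tail page, split_huge_page() invokes it once and the fixup
    walks the contiguous page_cgroup range itself (the structures and fields
    below are placeholders, not the real memcg code):

    ===
    #include <stdio.h>

    #define HPAGE_PMD_NR 512        /* 4k subpages per 2M huge page on x86 */

    struct page_cgroup { void *mem_cgroup; int used; };

    /* contiguous page_cgroup range backing one huge page */
    static struct page_cgroup pc[HPAGE_PMD_NR];

    /* called once per hugepage; propagates the head's state to every tail */
    static void split_huge_fixup(struct page_cgroup *head_pc)
    {
        for (int i = 1; i < HPAGE_PMD_NR; i++) {
            head_pc[i].mem_cgroup = head_pc->mem_cgroup;
            head_pc[i].used = head_pc->used;
        }
    }

    int main(void)
    {
        int dummy_cgroup;

        pc[0].mem_cgroup = &dummy_cgroup;
        pc[0].used = 1;
        split_huge_fixup(&pc[0]);   /* one call instead of 511 */
        printf("tail 511: used=%d\n", pc[HPAGE_PMD_NR - 1].used);
        return 0;
    }
    ===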
     

09 Dec, 2011

1 commit

  • khugepaged can sometimes cause suspend to fail, requiring that the user
    retry the suspend operation.

    Use wait_event_freezable_timeout() instead of
    schedule_timeout_interruptible() to avoid missing freezer wakeups. A
    try_to_freeze() would also have been needed in the khugepaged_alloc_hugepage
    tight loop, in case the allocation fails repeatedly, and
    wait_event_freezable_timeout() provides that too.

    khugepaged would still freeze just fine by trying again the next minute,
    but it's better if it freezes immediately.

    Reported-by: Jiri Slaby
    Signed-off-by: Andrea Arcangeli
    Tested-by: Jiri Slaby
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: "Srivatsa S. Bhat"
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

03 Nov, 2011

1 commit

  • Michel, while working on the working set estimation code, noticed that
    calling get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't
    safe if the pfn ended up being a tail page of a transparent hugepage
    under splitting by __split_huge_page_refcount().

    He then found that the problem could also theoretically materialize with
    page_cache_get_speculative() during the speculative radix tree lookups
    that use get_page_unless_zero() on SMP, if the radix tree page is freed
    and reallocated and get_user_pages() is called on it before
    page_cache_get_speculative() has a chance to call get_page_unless_zero().

    So the best way to fix the problem is to keep page_tail->_count zero at
    all times. This will guarantee that get_page_unless_zero() can never
    succeed on any tail page. page_tail->_mapcount is guaranteed zero and
    is unused for all tail pages of a compound page, so we can simply
    account the tail page references there and transfer them to
    tail_page->_count in __split_huge_page_refcount() (in addition to the
    head_page->_mapcount).

    While debugging this s/_count/_mapcount/ change I also noticed that
    get_page() is called by direct-io.c on pages returned by
    get_user_pages(). That wasn't entirely safe because the two atomic_inc()s
    in get_page() weren't atomic with respect to each other. Other
    get_user_pages() users, like the secondary-MMU page fault that
    establishes the shadow pagetables, would never call any superfluous
    get_page() after get_user_pages() returns. It's safer to make get_page()
    universally safe for tail pages and to use get_page_foll() within
    follow_page() (inside get_user_pages()). get_page_foll() can safely do
    the refcounting for tail pages without taking any locks because it runs
    within PT-lock-protected critical sections (the PT lock for ptes and
    page_table_lock for pmd_trans_huge).

    The standard get_page(), as invoked by direct-io, will now take the
    compound_lock, but still only for tail pages. The direct-io paths are
    usually I/O bound and the compound_lock is per THP, so very fine-grained;
    there's no risk of scalability issues with it. A simple direct-io
    benchmark with all the lockdep prove-locking and spinlock debugging
    infrastructure enabled shows identical performance and no overhead. So
    it's worth it. Ideally direct-io should stop calling get_page() on pages
    returned by get_user_pages(). The spinlock in get_page() is already
    optimized away for non-THP builds, but doing get_page() on tail pages
    returned by GUP is generally a rare operation and usually only run in I/O
    paths.

    This new refcounting in page_tail->_mapcount, in addition to avoiding new
    RCU critical sections, will also allow the working set estimation code to
    work without any further complexity associated with tail page refcounting
    under THP. (A userspace sketch of the "increment only if non-zero"
    primitive that makes zero tail counts safe follows this entry.)

    Signed-off-by: Andrea Arcangeli
    Reported-by: Michel Lespinasse
    Reviewed-by: Michel Lespinasse
    Reviewed-by: Minchan Kim
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
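
    The property the fix relies on is that get_page_unless_zero() only takes
    a reference if the count is currently non-zero, so keeping tail
    page->_count at zero makes it fail deterministically on tail pages. A
    userspace sketch of that primitive (the kernel uses atomic_inc_not_zero();
    this is only an illustration):

    ===
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* take a reference only if *count is non-zero, cf. get_page_unless_zero() */
    static bool get_ref_unless_zero(atomic_int *count)
    {
        int old = atomic_load(count);

        while (old != 0) {
            if (atomic_compare_exchange_weak(count, &old, old + 1))
                return true;    /* reference taken */
        }
        return false;           /* count was zero: not safe to pin */
    }

    int main(void)
    {
        atomic_int head_count = 1, tail_count = 0;

        printf("head: %d\n", get_ref_unless_zero(&head_count));  /* 1 */
        printf("tail: %d\n", get_ref_unless_zero(&tail_count));  /* 0 */
        return 0;
    }
    ===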
     

01 Nov, 2011

4 commits

  • There are three calls to update_mmu_cache() in the file, and the one in
    collapse_huge_page() has a typo in its last parameter, which is corrected
    here based on the other two call sites.

    Because of how update_mmu_cache() is defined on x86, currently the only
    arch that implements THP, the change has no practical effect, but it
    could save a minute or two of effort for archs that are likely to support
    THP in the future.

    Signed-off-by: Hillf Danton
    Cc: Johannes Weiner
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • The THP copy-on-write handler falls back to regular-sized pages for a
    huge page replacement upon allocation failure, or if THP has been
    individually disabled in the target VMA. The loop responsible for copying
    page-sized chunks accidentally uses multiples of PAGE_SHIFT instead of
    PAGE_SIZE as the virtual address argument for copy_user_highpage(). (A
    trivial illustration of the difference follows this entry.)

    Signed-off-by: Hillf Danton
    Acked-by: Johannes Weiner
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
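
    The bug and the fix, reduced to the address arithmetic only (constants
    assumed for x86; this is just an illustration, not the kernel loop):

    ===
    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)  /* 4096 */

    int main(void)
    {
        unsigned long haddr = 0x200000;     /* hypothetical huge-page-aligned address */

        for (int i = 0; i < 3; i++) {
            /* buggy: advances by 12 bytes per iteration */
            printf("wrong: %#lx\n", haddr + i * PAGE_SHIFT);
            /* fixed: advances by one page (4096 bytes) per iteration */
            printf("right: %#lx\n", haddr + i * PAGE_SIZE);
        }
        return 0;
    }
    ===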
     
  • Quiet the sparse noise:

    warning: symbol 'khugepaged_scan' was not declared. Should it be static?
    warning: context imbalance in 'khugepaged_scan_mm_slot' - unexpected unlock

    Signed-off-by: H Hartley Sweeten
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
  • This adds THP support to mremap (decreases the number of split_huge_page()
    calls).

    Here are also some benchmarks with a proggy like this:

    ===
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/time.h>

    #define SIZE (5UL*1024*1024*1024)

    int main()
    {
        static struct timeval oldstamp, newstamp;
        long diffsec;
        char *p, *p2, *p3, *p4;

        if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
            perror("memalign"), exit(1);

        memset(p, 0xff, SIZE);
        memset(p2, 0xff, SIZE);
        memset(p3, 0x77, 4096);
        gettimeofday(&oldstamp, NULL);
        p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
        gettimeofday(&newstamp, NULL);
        diffsec = newstamp.tv_sec - oldstamp.tv_sec;
        diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
        printf("usec %ld\n", diffsec);
        if (p == MAP_FAILED || p4 != p3)
            //if (p == MAP_FAILED)
            perror("mremap"), exit(1);
        if (memcmp(p4, p2, SIZE))
            printf("mremap bug\n"), exit(1);
        printf("ok\n");

        return 0;
    }
    ===

    THP on

    Performance counter stats for './largepage13' (3 runs):

    69195836 dTLB-loads ( +- 3.546% ) (scaled from 50.30%)
    60708 dTLB-load-misses ( +- 11.776% ) (scaled from 52.62%)
    676266476 dTLB-stores ( +- 5.654% ) (scaled from 69.54%)
    29856 dTLB-store-misses ( +- 4.081% ) (scaled from 89.22%)
    1055848782 iTLB-loads ( +- 4.526% ) (scaled from 80.18%)
    8689 iTLB-load-misses ( +- 2.987% ) (scaled from 58.20%)

    7.314454164 seconds time elapsed ( +- 0.023% )

    THP off

    Performance counter stats for './largepage13' (3 runs):

    1967379311 dTLB-loads ( +- 0.506% ) (scaled from 60.59%)
    9238687 dTLB-load-misses ( +- 22.547% ) (scaled from 61.87%)
    2014239444 dTLB-stores ( +- 0.692% ) (scaled from 60.40%)
    3312335 dTLB-store-misses ( +- 7.304% ) (scaled from 67.60%)
    6764372065 iTLB-loads ( +- 0.925% ) (scaled from 79.00%)
    8202 iTLB-load-misses ( +- 0.475% ) (scaled from 70.55%)

    9.693655243 seconds time elapsed ( +- 0.069% )

    grep thp /proc/vmstat
    thp_fault_alloc 35849
    thp_fault_fallback 0
    thp_collapse_alloc 3
    thp_collapse_alloc_failed 0
    thp_split 0

    thp_split 0 confirms no thp split despite plenty of hugepages allocated.

    The measurement of only the mremap time (so excluding the three long
    memsets and the final 10GB memory-accessing memcmp):

    THP on

    usec 14824
    usec 14862
    usec 14859

    THP off

    usec 256416
    usec 255981
    usec 255847

    With an older kernel, without the mremap optimizations (the patch below
    optimizes the non-THP version too):

    THP on

    usec 392107
    usec 390237
    usec 404124

    THP off

    usec 444294
    usec 445237
    usec 445820

    I guess with a threaded program that sends more IPIs on a large SMP
    system it would create an even larger difference.

    All debug options are off except DEBUG_VM to avoid skewing the
    results.

    The only catch is that for a native 2M mremap, as happens above, both the
    source and destination addresses must be 2M aligned or the huge pmd can't
    be moved without a split; but that is a hardware limitation.

    [akpm@linux-foundation.org: coding-style nitpicking]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

16 Jun, 2011

1 commit

  • Johannes noticed that the vmstat update is already taken care of
    internally by khugepaged_alloc_hugepage(). The only places required to
    update the vmstat are the callers of alloc_hugepage() (callers of
    khugepaged_alloc_hugepage() aren't).

    Signed-off-by: Andrea Arcangeli
    Reported-by: Johannes Weiner
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 May, 2011

2 commits

  • We don't need to hold the mmap_sem through mem_cgroup_newpage_charge();
    the mmap_sem is only held to keep the vma stable, and we no longer need
    the vma to be stable after we return from alloc_hugepage_vma().

    Signed-off-by: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Straightforward conversion of anon_vma->lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

29 Apr, 2011

1 commit

  • The huge_memory.c THP page fault was allowed to run if vm_ops was NULL
    (which would succeed for /dev/zero MAP_PRIVATE, as the f_op->mmap wouldn't
    set up a special vma->vm_ops and it would fall back to regular anonymous
    memory), but other THP logic wasn't fully activated for vmas with a
    non-NULL vm_file (/dev/zero has a non-NULL vma->vm_file).

    So this removes the vm_file checks so that /dev/zero can also safely use
    THP (the other, albeit safer, approach to fixing this bug would have been
    to prevent the initial THP page fault from running if vm_file was set).

    After removing the vm_file checks, this also makes huge_memory.c stricter
    in khugepaged for the DEBUG_VM=y case. It doesn't replace the vm_file
    check with an is_pfn_mapping() check (but it keeps checking for VM_PFNMAP
    under VM_BUG_ON) because, for an is_cow_mapping() mapping, VM_PFNMAP
    should only be allowed to exist before the first page fault, and in turn
    only while vma->anon_vma is still NULL (which prevents khugepaged
    registration). So I tend to think the previous comment - saying that if
    vm_file was set, VM_PFNMAP might have been set too and we could still be
    registered in khugepaged (despite anon_vma having to be non-NULL for
    khugepaged registration) - was too paranoid. The is_linear_pfn_mapping()
    check is also, I think, superfluous (as described by the comment), but
    under DEBUG_VM it is safe to keep.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=33682

    Signed-off-by: Andrea Arcangeli
    Reported-by: Caspar Zhang
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: [2.6.38.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

15 Apr, 2011

2 commits

  • The conventional format for boolean attributes in sysfs is numeric ("0"
    or "1", followed by a newline). Any boolean attribute can then be read
    and written using a generic function. Using the strings "yes [no]" /
    "[yes] no" (read) and "yes" / "no" (write) frustrates this. (A userland
    sketch of parsing the numeric form follows this entry.)

    [akpm@linux-foundation.org: use kstrtoul()]
    [akpm@linux-foundation.org: test_bit() doesn't return 1/0, per Neil]
    Signed-off-by: Ben Hutchings
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Tested-by: David Rientjes
    Cc: NeilBrown
    Cc: [2.6.38.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Hutchings
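
    For reference, the numeric convention means a store handler only has to
    accept "0" or "1". A userland sketch of the parsing (the kernel side uses
    kstrtoul(); the function below is purely illustrative):

    ===
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* parse a boolean sysfs-style write: "0" or "1", optional trailing newline */
    static int parse_bool_attr(const char *buf, int *value)
    {
        char *end;
        unsigned long v;

        errno = 0;
        v = strtoul(buf, &end, 10);
        if (errno || end == buf || (*end && strcmp(end, "\n")) || v > 1)
            return -EINVAL;
        *value = (int)v;
        return 0;
    }

    int main(void)
    {
        int flag;

        printf("%d\n", parse_bool_attr("1\n", &flag));    /* 0: ok, flag == 1 */
        printf("%d\n", parse_bool_attr("yes\n", &flag));  /* -EINVAL */
        return 0;
    }
    ===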
     
  • I found it difficult to make sense of transparent huge pages without
    having any counters for their actions. Add some counters to vmstat for
    allocation of transparent hugepages and fallback to smaller pages.

    Optional patch, but useful for development and understanding the system.

    Contains improvements from Andrea Arcangeli and Johannes Weiner

    [akpm@linux-foundation.org: coding-style fixes]
    [hannes@cmpxchg.org: fix vmstat_text[] entries]
    Signed-off-by: Andi Kleen
    Acked-by: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

23 Mar, 2011

1 commit

  • Pass __GFP_OTHER_NODE for transparent hugepage NUMA allocations done by
    the khugepaged daemon. This way the low-level accounting of local versus
    remote pages works correctly.

    Contains improvements from Andrea Arcangeli

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andi Kleen
    Cc: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

14 Mar, 2011

1 commit

  • THP's collapse_huge_page() has an understandable but ugly difference
    in when its huge page is allocated: inside if NUMA but outside if not.
    It's hardly surprising that the memcg failure path forgot that, freeing
    the page in the non-NUMA case, then hitting a VM_BUG_ON in get_page()
    (or even worse, using the freed page).

    Signed-off-by: Hugh Dickins
    Reviewed-by: Minchan Kim
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

05 Mar, 2011

3 commits

  • Pass down the correct node for a transparent hugepage allocation. Most
    callers continue to use the current node; however, the khugepaged daemon
    now uses the node that the first page to be collapsed already resides on
    instead. This ensures that khugepaged does not mess up local memory for
    an existing process which uses local policy.

    The choice of node is somewhat primitive currently: it just uses the
    node of the first page in the pmd range. An alternative would be to
    look at multiple pages and use the most popular node. I used the
    simplest variant for now which should work well enough for the case of
    all pages being on the same node.

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Andrea Arcangeli
    Signed-off-by: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • This makes a difference for LOCAL policy, where the node cannot be
    determined from the policy itself, but has to be gotten from the original
    page.

    Acked-by: Andrea Arcangeli
    Signed-off-by: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Currently alloc_pages_vma() always uses the local node as policy node for
    the LOCAL policy. Pass this node down as an argument instead.

    No behaviour change from this patch, but it will be needed for follow-on
    patches.

    Acked-by: Andrea Arcangeli
    Signed-off-by: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

16 Feb, 2011

1 commit

  • Transparent hugepages can only be created if rmap is fully functional.
    So we must prevent hugepages from being created while
    is_vma_temporary_stack() is true.

    This also optimizes away some harmless but unnecessary setting of
    khugepaged_scan.address, and it switches some BUG_ONs to VM_BUG_ONs.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

12 Feb, 2011

1 commit

  • mem_cgroup_uncharge_page() should be called in all failure cases after
    mem_cgroup_newpage_charge() is called in huge_memory.c::collapse_huge_page().

    [ 4209.076861] BUG: Bad page state in process khugepaged pfn:1e9800
    [ 4209.077601] page:ffffea0006b14000 count:0 mapcount:0 mapping: (null) index:0x2800
    [ 4209.078674] page flags: 0x40000000004000(head)
    [ 4209.079294] pc:ffff880214a30000 pc->flags:2146246697418756 pc->mem_cgroup:ffffc9000177a000
    [ 4209.082177] (/A)
    [ 4209.082500] Pid: 31, comm: khugepaged Not tainted 2.6.38-rc3-mm1 #1
    [ 4209.083412] Call Trace:
    [ 4209.083678] [] ? bad_page+0xe4/0x140
    [ 4209.084240] [] ? free_pages_prepare+0xd6/0x120
    [ 4209.084837] [] ? rwsem_down_failed_common+0xbd/0x150
    [ 4209.085509] [] ? __free_pages_ok+0x32/0xe0
    [ 4209.086110] [] ? free_compound_page+0x1b/0x20
    [ 4209.086699] [] ? __put_compound_page+0x1c/0x30
    [ 4209.087333] [] ? put_compound_page+0x4d/0x200
    [ 4209.087935] [] ? put_page+0x45/0x50
    [ 4209.097361] [] ? khugepaged+0x9e9/0x1430
    [ 4209.098364] [] ? autoremove_wake_function+0x0/0x40
    [ 4209.099121] [] ? khugepaged+0x0/0x1430
    [ 4209.099780] [] ? kthread+0x96/0xa0
    [ 4209.100452] [] ? kernel_thread_helper+0x4/0x10
    [ 4209.101214] [] ? kthread+0x0/0xa0
    [ 4209.101842] [] ? kernel_thread_helper+0x0/0x10

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Reviewed-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

03 Feb, 2011

1 commit

  • When a tail page of a THP is poisoned, the head page is poisoned too, and
    the wrong address - that of the head page - is always reported with the
    SIGBUS.

    So when the poisoned page is used by a guest OS running on KVM, after the
    address translation (hva->gpa) by qemu, the wrong process in the guest OS
    is killed by SIGBUS.

    What we expect is that the process using the poisoned tail page is killed
    in the guest OS, not the process using the healthy head page.

    Since it is not good to poison a healthy page, avoid poisoning any page
    other than the one that is really poisoned. (While we poison all pages in
    a huge page in the hugetlb case, we can do this for THP thanks to
    split_huge_page().)

    Here we fix two parts:
    1. Isolate only the poisoned page, to make sure the reported address is
       the address of the poisoned page.
    2. Make the poisoned page behave like a poisoned regular page.

    [akpm@linux-foundation.org: fix spello in comment]
    Signed-off-by: Jin Dongming
    Reviewed-by: Hidetoshi Seto
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jin Dongming
     

21 Jan, 2011

2 commits

  • Now, under THP:

    at charge:
    - the PageCgroupUsed bit is set on every page_cgroup of a hugepage,
      i.e. on all 512 pages.
    at uncharge:
    - the PageCgroupUsed bit is unset only on the head page.

    So some pages are left with the "Used" bit set.

    This patch fixes that by setting the Used bit only on the head page; Used
    bits for tail pages will be set at splitting, if necessary.

    This patch adds this lock order:
    compound_lock() -> page_cgroup_move_lock().

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Two users reported THP-related crashes on 32-bit x86 machines. Their oops
    reports indicated an invalid pte, and subsequent code inspection showed
    that the highpte is actually used after unmap.

    The fix is to unmap the pte only after all operations against it are
    finished.

    Signed-off-by: Johannes Weiner
    Reported-by: Ilya Dryomov
    Reported-by: werner
    Cc: Andrea Arcangeli
    Tested-by: Ilya Dryomov
    Tested-by: Steven Rostedt
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

14 Jan, 2011

11 commits

  • MADV_HUGEPAGE and MADV_NOHUGEPAGE were fully effective only if run after
    mmap and before touching the memory. While this is enough for most
    usages, it takes little effort to make madvise more dynamic at runtime on
    an existing mapping by making khugepaged aware of madvise.

    MADV_HUGEPAGE: register in khugepaged immediately, without waiting for a
    page fault (which may never happen if all pages are already mapped and
    the "enabled" knob was set to madvise during the initial page faults).

    MADV_NOHUGEPAGE: skip vmas marked VM_NOHUGEPAGE in khugepaged, to stop
    collapsing pages where it is not wanted. (A minimal example of the
    madvise() calls follows this entry.)

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Andrea Arcangeli
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
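
    From the application side, the runtime registration is just a madvise()
    call on an existing anonymous mapping. A minimal example (assumes a
    kernel built with THP and a libc that exposes MADV_HUGEPAGE and
    MADV_NOHUGEPAGE):

    ===
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define LEN (16UL * 1024 * 1024)

    int main(void)
    {
        /* an existing anonymous mapping, possibly already faulted in */
        char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            perror("mmap"), exit(1);

        /* register with khugepaged: collapse this range into hugepages */
        if (madvise(p, LEN, MADV_HUGEPAGE))
            perror("madvise(MADV_HUGEPAGE)");

        /* ... later, opt the range back out */
        if (madvise(p, LEN, MADV_NOHUGEPAGE))
            perror("madvise(MADV_NOHUGEPAGE)");

        return 0;
    }
    ===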
     
  • Add madvise MADV_NOHUGEPAGE to mark regions that are not important to be
    hugepage backed. Return -EINVAL if the vma is not of an anonymous type,
    or the feature isn't built into the kernel. Never silently return
    success.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU
    statistics, so the Active(anon) and Inactive(anon) statistics in
    /proc/meminfo are correct.

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • On small systems, the extra memory used by the anti-fragmentation memory
    reserve and simply because huge pages are smaller than large pages can
    easily outweigh the benefits of fewer TLB misses.

    A less obvious concern is if run on a NUMA machine with asymmetric node
    sizes and one of them is very small. The reserve could make the node
    unusable.

    In case of the crashdump kernel, OOMs have been observed due to the
    anti-fragmentation memory reserve taking up a large fraction of the
    crashdump image.

    This patch disables transparent hugepages on systems with less than 1GB of
    RAM, but the hugepage subsystem is fully initialized so administrators can
    enable THP through /sys if desired.

    Signed-off-by: Rik van Riel
    Acked-by: Avi Kivity
    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • It's unclear why scheduler-friendly kernel threads can't simply be taken
    off the CPU by the scheduler itself. It's safer to stop them, as they can
    trigger memory allocations; if kswapd also freezes itself to avoid
    generating I/O, they have to as well.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • For GRU and EPT, we need gup-fast to set the referenced bit too (this is
    why it's correct to return 0 when shadow_access_mask is zero: it requires
    gup-fast to set the referenced bit). qemu-kvm access already sets the
    young bit in the pte if it isn't zero-copy; if it is zero-copy, or it's a
    shadow-paging/EPT minor fault, we rely on gup-fast to signal that the
    page is in use...

    We also need to check the young bits on the secondary pagetables for NPT
    and not nested shadow mmu as the data may never get accessed again by the
    primary pte.

    Without this closer accuracy, we'd have to remove the heuristic that
    avoids collapsing hugepages in hugepage virtual regions that have not even
    a single subpage in use.

    ->test_young is fully backwards compatible with GRU and other usages that
    don't have young bits in pagetables set by the hardware and that should
    nuke the secondary mmu mappings when ->clear_flush_young runs, just like
    EPT does.

    Removing the heuristic that checks the young bit in
    khugepaged/collapse_huge_page entirely probably wouldn't be so bad
    either, but I thought it was worth keeping, and this makes it reliable.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Archs implementing Transparent Hugepage Support must implement a function
    called has_transparent_hugepage to be sure the virtual or physical CPU
    supports Transparent Hugepages.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • A huge pmd can only be mapped if the corresponding 2M virtual range is
    fully contained in the vma. At times the VM calls split_vma() twice: if
    the first split_vma() succeeds and the second fails, the first remains in
    effect and is not rolled back. For split_vma() or vma_adjust() to fail,
    an allocation failure is needed, so it's a very unlikely event (the
    out-of-memory killer would normally fire before any allocation failure is
    visible to kernel and userland, and if an out-of-memory condition happens
    it's unlikely to happen exactly here). Nevertheless it's safer to ensure
    that no huge pmd can be left around if the vma is adjusted in a way that
    can no longer fit hugepages at the new vm_start/vm_end addresses.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Allow choosing between the always|madvise default for page faults and
    khugepaged at config time. madvise guarantees zero risk of a higher
    memory footprint for applications (applications using
    madvise(MADV_HUGEPAGE) won't risk using any more memory by backing their
    virtual regions with hugepages).

    Initially set the default to N and don't depend on EMBEDDED.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This tries to be more friendly to filesystems in userland, whose userland
    backends allocate memory in the I/O paths and could deadlock if
    khugepaged held the userland backend's mmap_sem in write mode while
    allocating memory. A memory allocation may wait for writeback I/O
    completion from the daemon, which may be blocked on the mmap_sem in read
    mode if a page fault happens and the daemon wasn't using mlock for the
    memory required for I/O submission and completion.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • It's mostly a matter of replacing alloc_pages with alloc_pages_vma after
    introducing alloc_pages_vma. khugepaged needs special handling, as the
    allocation has to happen inside collapse_huge_page() where the vma is
    known, and an error has to be returned to the outer loop so it can sleep
    for alloc_sleep_millisecs in case of failure. But it retains the more
    efficient logic of handling allocation failures in khugepaged in the
    CONFIG_NUMA=n case.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli