14 Jan, 2011

6 commits

  • Replace usage of the mem_cgroup_update_file_mapped() memcg
    statistic update routine with two new routines:
    * mem_cgroup_inc_page_stat()
    * mem_cgroup_dec_page_stat()

    As before, only the file_mapped statistic is managed. However, these more
    general interfaces allow for new statistics to be more easily added. New
    statistics are added with memcg dirty page accounting.
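
    For illustration, a minimal sketch of the new interface and a call site
    (the enum item name follows the description above; the underlying
    update helper is elided):

        enum mem_cgroup_page_stat_item {
                MEMCG_NR_FILE_MAPPED,   /* new statistics get added here */
        };

        /* e.g. in page_add_file_rmap() / page_remove_rmap(): */
        mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
        mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);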

    Signed-off-by: Greg Thelen
    Signed-off-by: Andrea Righi
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • hugetlbfs was changed to allow memory failure to migrate the hugetlbfs
    pages and that broke THP, as split_huge_page was then called on
    hugetlbfs pages too.

    compound_head/order was also run unsafely on THP pages, which can be
    split at any time.

    All compound_head() invocations in memory-failure.c that are run on pages
    that aren't pinned and that can be freed and reused from under us (while
    compound_head is running) are buggy because compound_head can return a
    dangling pointer. I'm not fixing this here, as it is a generic
    memory-failure bug, not specific to THP (it applies to hugetlbfs too),
    so it can be fixed later, after THP is merged upstream.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add hugepage stat information to /proc/vmstat and /proc/meminfo.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This documents how split_huge_page is safe vs new vma insertions into
    the anon_vma that may have already released the anon_vma->lock but not
    established pmds yet when split_huge_page starts.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Lately I've been working to make KVM use hugepages transparently without
    the usual restrictions of hugetlbfs. Some of the restrictions I'd like to
    see removed:

    1) hugepages have to be swappable or the guest physical memory remains
    locked in RAM and can't be paged out to swap

    2) if a hugepage allocation fails, regular pages should be allocated
    instead and mixed in the same vma without any failure and without
    userland noticing

    3) if some task quits and more hugepages become available in the
    buddy, guest physical memory backed by regular pages should be
    relocated on hugepages automatically in regions under
    madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
    kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
    not null)

    4) avoidance of reservation and maximization of use of hugepages whenever
    possible. Reservation (needed to avoid runtime fatal failures) may be ok for
    1 machine with 1 database with 1 database cache with 1 database cache size
    known at boot time. It's definitely not feasible with a virtualization
    hypervisor usage like RHEV-H that runs an unknown number of virtual machines
    with an unknown size of each virtual machine with an unknown amount of
    pagecache that could be potentially useful in the host for guests not using
    O_DIRECT (aka cache=off).

    hugepages in the virtualization hypervisor (and also in the guest!) are
    much more important than in a regular host not using virtualization,
    because with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
    to 19 in case only the hypervisor uses transparent hugepages, and they
    decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
    linux hypervisor and the linux guest use this patch (though the
    guest will limit the additional speedup to anonymous regions only for
    now...). Even more important is that the tlb miss handler is much slower
    on an NPT/EPT guest than for a regular shadow paging or no-virtualization
    scenario. So maximizing the amount of virtual memory cached by the TLB
    pays off significantly more with NPT/EPT than without (even if there would
    be no significant speedup in the tlb-miss runtime).

    The first (and more tedious) part of this work requires allowing the VM to
    handle anonymous hugepages mixed with regular pages transparently on
    regular anonymous vmas. This is what this patch tries to achieve in the
    least intrusive possible way. We want hugepages and hugetlb to be used in
    a way so that all applications can benefit without changes (as usual we
    leverage the KVM virtualization design: by improving the Linux VM at
    large, KVM gets the performance boost too).

    The most important design choice is: always fall back to 4k allocation
    if the hugepage allocation fails! This is the _very_ opposite of some large
    pagecache patches that failed with -EIO back then if a 64k (or similar)
    allocation failed...

    Second important decision (to reduce the impact of the feature on the
    existing pagetable handling code) is that at any time we can split a
    hugepage into 512 regular pages, and it has to be done with an operation
    that can't fail. This way the reliability of the swapping isn't decreased
    (no need to allocate memory when we are short on memory to swap) and it's
    trivial to plug a split_huge_page* one-liner where needed without
    polluting the VM. Over time we can teach mprotect, mremap and friends to
    handle pmd_trans_huge natively without calling split_huge_page*. The fact
    it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
    (instead of the current void) we'd need to rollback the mprotect from the
    middle of it (ideally including undoing the split_vma) which would be a
    big change and in the very wrong direction (it'd likely be simpler not to
    call split_huge_page at all and to teach mprotect and friends to handle
    hugepages instead of rolling them back from the middle). In short the
    very value of split_huge_page is that it can't fail.
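
    A sketch of the "one-liner" pattern described above (the helper name
    split_huge_page_pmd() and its exact signature are assumptions here):

        pmd_t *pmd = pmd_offset(pud, address);

        /* no-op unless the pmd maps a transparent hugepage */
        split_huge_page_pmd(mm, pmd);

        /* from here on the walker only ever sees regular ptes */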

    The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
    incremental and it'll just be a "harmless" addition later if this initial
    part is agreed upon. It also should be noted that locking-wise replacing
    regular pages with hugepages is going to be very easy if compared to what
    I'm doing below in split_huge_page, as it will only happen when
    page_count(page) matches page_mapcount(page) if we can take the PG_lock
    and mmap_sem in write mode. collapse_huge_page will be a "best effort"
    that (unlike split_huge_page) can fail at the minimal sign of trouble and
    we can try again later. collapse_huge_page will be similar to how KSM
    works and the madvise(MADV_HUGEPAGE) will work similar to
    madvise(MADV_MERGEABLE).

    The default I like is that transparent hugepages are used at page fault
    time. This can be changed with
    /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set
    to three values "always", "madvise", "never" which mean respectively that
    hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
    or never used. /sys/kernel/mm/transparent_hugepage/defrag instead
    controls if the hugepage allocation should defrag memory aggressively
    "always", only inside "madvise" regions, or "never".

    The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
    put_page (from get_user_page users that can't use mmu notifier like
    O_DIRECT) that runs against a __split_huge_page_refcount instead was a
    pain to serialize in a way that would always result in a coherent page
    count for both tail and head. I think my locking solution with a
    compound_lock taken only after the page_first is valid and is still a
    PageHead should be safe, but it surely needs review from an SMP race
    point of view. In short there is no existing way to serialize the O_DIRECT
    final put_page against split_huge_page_refcount so I had to invent a new
    one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
    returns so...). And I didn't want to impact all gup/gup_fast users for
    now, maybe if we change the gup interface substantially we can avoid this
    locking, I admit I didn't think too much about it because changing the gup
    unpinning interface would be invasive.

    If we ignored O_DIRECT we could stick to the existing compound refcounting
    code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
    (and any other mmu notifier user) would call it without FOLL_GET (and if
    FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
    current task mmu notifier list yet). But O_DIRECT is fundamental for
    decent performance of virtualized I/O on fast storage so we can't avoid it
    to solve the race of put_page against split_huge_page_refcount to achieve
    a complete hugepage feature for KVM.

    Swap and oom work fine (well, just like with regular pages ;). MMU
    notifier is handled transparently too, with the exception of the young bit
    on the pmd, that didn't have a range check but I think KVM will be fine
    because the whole point of hugepages is that EPT/NPT will also use a huge
    pmd when they notice gup returns pages with PageCompound set, so they
    won't care of a range and there's just the pmd young bit to check in that
    case.

    NOTE: in some cases, if the L2 cache is small, this may slow things down
    and waste memory during COWs, because 4M of memory are accessed in a
    single fault instead of 8k (the payoff is that after COW the program can
    run faster). So we might want to switch the copy_huge_page (and
    clear_huge_page too) to non-temporal stores. I also extensively
    researched ways to avoid this cache thrashing with a full prefault logic
    that would cow in 8k/16k/32k/64k up to 1M (I can send those patches that
    fully implemented prefault), but I concluded they're not worth it: they
    add huge additional complexity and they remove all tlb benefits until
    the full hugepage has been faulted in, to save a little bit of memory
    and some cache during app startup, and they still don't substantially
    improve the cache thrashing during startup if the prefault happens in
    >4k chunks. One reason is that those 4k pte entries copied are still
    mapped on a perfectly cache-colored hugepage, so the thrashing is the
    worst one can generate in those copies (cows of 4k page copies aren't so
    well colored, so they thrash less, but again this results in software
    running faster after the page fault). Those prefault patches allowed
    things like a pte where post-cow pages were local 4k regular anon pages
    and the not-yet-cowed pte entries were pointing in the middle of some
    hugepage mapped read-only. If it doesn't pay off substantially with
    today's hardware, it will pay off even less in the future with larger L2
    caches, and the prefault logic would bloat the VM a lot. On embedded
    systems, transparent_hugepage can be disabled during boot with sysfs or
    with the boot commandline parameter transparent_hugepage=0 (or
    transparent_hugepage=2 to restrict hugepages to madvise regions), which
    will ensure not a single hugepage is allocated at boot time. It is
    simple enough to just disable transparent hugepages globally and let
    them be allocated selectively by applications in MADV_HUGEPAGE regions
    (both at page fault time and, if enabled, through collapse_huge_page via
    the kernel daemon).

    This patch supports only hugepages mapped in the pmd; archs that have
    smaller hugepages will not be covered by this patch alone. Also some
    archs like power have certain tlb limits that prevent mixing different
    page sizes in the same regions, so they will not fit in this framework,
    which requires "graceful fallback" to basic PAGE_SIZE in case of
    physical memory fragmentation. hugetlbfs remains a perfect fit for those
    because its software limits happen to match the hardware limits.
    hugetlbfs also remains a perfect fit for hugepage sizes like 1GByte that
    cannot be hoped to be found unfragmented after a certain system uptime
    and that would be very expensive to defragment with relocation, so
    requiring reservation.
    hugetlbfs is the "reservation way", the point of transparent hugepages is
    not to have any reservation at all and maximizing the use of cache and
    hugepages at all times automatically.

    Some performance result:

    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
    memset page fault 1566023
    memset tlb miss 453854
    memset second tlb miss 453321
    random access tlb miss 41635
    random access second tlb miss 41658
    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
    memset page fault 1566471
    memset tlb miss 453375
    memset second tlb miss 453320
    random access tlb miss 41636
    random access second tlb miss 41637
    vmx andrea # ./largepages3
    memset page fault 1566642
    memset tlb miss 453417
    memset second tlb miss 453313
    random access tlb miss 41630
    random access second tlb miss 41647
    vmx andrea # ./largepages3
    memset page fault 1566872
    memset tlb miss 453418
    memset second tlb miss 453315
    random access tlb miss 41618
    random access second tlb miss 41659
    vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
    vmx andrea # ./largepages3
    memset page fault 2182476
    memset tlb miss 460305
    memset second tlb miss 460179
    random access tlb miss 44483
    random access second tlb miss 44186
    vmx andrea # ./largepages3
    memset page fault 2182791
    memset tlb miss 460742
    memset second tlb miss 459962
    random access tlb miss 43981
    random access second tlb miss 43988

    ============
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    #define SIZE (3UL*1024*1024*1024)

    int main()
    {
            char *p = malloc(SIZE), *p2;
            struct timeval before, after;

            gettimeofday(&before, NULL);
            memset(p, 0, SIZE);
            gettimeofday(&after, NULL);
            printf("memset page fault %Lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            gettimeofday(&before, NULL);
            memset(p, 0, SIZE);
            gettimeofday(&after, NULL);
            printf("memset tlb miss %Lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            gettimeofday(&before, NULL);
            memset(p, 0, SIZE);
            gettimeofday(&after, NULL);
            printf("memset second tlb miss %Lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            gettimeofday(&before, NULL);
            for (p2 = p; p2 < p+SIZE; p2 += 4096)
                    *p2 = 0;
            gettimeofday(&after, NULL);
            printf("random access tlb miss %Lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            gettimeofday(&before, NULL);
            for (p2 = p; p2 < p+SIZE; p2 += 4096)
                    *p2 = 0;
            gettimeofday(&after, NULL);
            printf("random access second tlb miss %Lu\n",
                   (after.tv_sec-before.tv_sec)*1000000UL +
                   after.tv_usec-before.tv_usec);

            return 0;
    }
    ============

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Paging logic that splits the page before it is unmapped and added to swap
    to ensure backwards compatibility with the legacy swap code. Eventually
    swap should natively pageout the hugepages to increase performance and
    decrease seeking and fragmentation of swap space. swapoff can just skip
    over huge pmds as they cannot be part of swap yet. In add_to_swap, be
    careful to split the page only if we got a valid swap entry, so we don't
    split hugepages when swap is full.
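
    A sketch of the add_to_swap() ordering just described (simplified,
    error handling elided):

        int add_to_swap(struct page *page)
        {
                swp_entry_t entry = get_swap_page();

                if (!entry.val)
                        return 0;       /* swap full: do not split */

                if (unlikely(PageTransHuge(page)))
                        if (unlikely(split_huge_page(page))) {
                                swapcache_free(entry, NULL);
                                return 0;
                        }

                /* ... add the (now small) page to the swap cache ... */
                return 1;
        }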

    In theory we could split pages before isolating them during the lru scan,
    but for khugepaged to be safe, I'm relying on either mmap_sem write mode,
    or PG_lock taken, so split_huge_page has to run either with mmap_sem
    read/write mode or PG_lock taken. Calling it from isolate_lru_page would
    make locking more complicated, in addition to that split_huge_page would
    deadlock if called by __isolate_lru_page because it has to take the lru
    lock to add the tail pages.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

27 Oct, 2010

4 commits

  • Make anon_vma_chain_free() static. It is called only in rmap.c and the
    corresponding alloc function is already static.
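
    The function is a one-liner, so the change amounts to the visibility
    keyword (sketch):

        static void anon_vma_chain_free(struct anon_vma_chain *anon_vma_chain)
        {
                kmem_cache_free(anon_vma_chain_cachep, anon_vma_chain);
        }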

    Signed-off-by: Namhyung Kim
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • page_check_address() conditionally grabs *@ptlp when it returns
    non-NULL. Renaming it to __page_check_address() and wrapping it with
    __cond_lock() removes the following sparse warnings:

    mm/rmap.c:472:9: warning: context imbalance in 'page_mapped_in_vma' - unexpected unlock
    mm/rmap.c:524:9: warning: context imbalance in 'page_referenced_one' - unexpected unlock
    mm/rmap.c:706:9: warning: context imbalance in 'page_mkclean_one' - unexpected unlock
    mm/rmap.c:1066:9: warning: context imbalance in 'try_to_unmap_one' - unexpected unlock
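
    The wrapper pattern looks roughly like this; the double-underscore
    rename keeps the conditional lock-taking body out of sparse's sight:

        pte_t *__page_check_address(struct page *, struct mm_struct *,
                                    unsigned long, spinlock_t **, int);

        static inline pte_t *page_check_address(struct page *page,
                                                struct mm_struct *mm,
                                                unsigned long address,
                                                spinlock_t **ptlp, int sync)
        {
                pte_t *ptep;

                /* tell sparse the lock is held iff a pte was found */
                __cond_lock(*ptlp, ptep = __page_check_address(page, mm,
                                                               address,
                                                               ptlp, sync));
                return ptep;
        }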

    Signed-off-by: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • page_lock_anon_vma() conditionally grabs the RCU read lock and the
    anon_vma lock, but page_unlock_anon_vma() releases them unconditionally.
    This leads sparse to complain about a context imbalance. Annotate them.
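
    The annotations are along these lines (illustrative only; the exact
    lock expression used in the merged patch is assumed):

        void page_unlock_anon_vma(struct anon_vma *anon_vma)
                __releases(&anon_vma->root->lock)
        {
                anon_vma_unlock(anon_vma);
                rcu_read_unlock();      /* pairs with page_lock_anon_vma() */
        }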

    Signed-off-by: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (22 commits)
    Add _addr_lsb field to ia64 siginfo
    Fix migration.c compilation on s390
    HWPOISON: Remove retry loop for try_to_unmap
    HWPOISON: Turn addr_valid from bitfield into char
    HWPOISON: Disable DEBUG by default
    HWPOISON: Convert pr_debugs to pr_info
    HWPOISON: Improve comments in memory-failure.c
    x86: HWPOISON: Report correct address granuality for huge hwpoison faults
    Encode huge page size for VM_FAULT_HWPOISON errors
    Fix build error with !CONFIG_MIGRATION
    hugepage: move is_hugepage_on_freelist inside ifdef to avoid warning
    Clean up __page_set_anon_rmap
    HWPOISON, hugetlb: fix unpoison for hugepage
    HWPOISON, hugetlb: soft offlining for hugepage
    HWPOSION, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASED
    hugetlb: move refcounting in hugepage allocation inside hugetlb_lock
    HWPOISON, hugetlb: add free check to dequeue_hwpoison_huge_page()
    hugetlb: hugepage migration core
    hugetlb: redefine hugepage copy functions
    hugetlb: add allocate function for hugepage migration
    ...

    Linus Torvalds
     

08 Oct, 2010

1 commit

  • Linus asked for a cleanup of __page_set_anon_rmap to make
    it look more like the cleaner huge pages version.

    Factor out the duplicated PageAnon check into a single check
    at the beginning of the function.

    Remove obsolete comments and rewrite them into standard English.

    No functional changes.

    Signed-off-by: Andi Kleen

    Andi Kleen
     

05 Oct, 2010

1 commit

  • 2.6.36-rc1 commit 21d0d443cdc1658a8c1484fdcece4803f0f96d0e "rmap:
    resurrect page_address_in_vma anon_vma check" was right to resurrect
    that check; but now that it's comparing anon_vma->roots instead of
    just anon_vmas, there's a danger of oopsing on a NULL anon_vma.

    In most cases no NULL anon_vma ever gets here; but it turns out that
    occasionally KSM, when enabled on a forked or forking process, will
    itself call page_address_in_vma() on a "half-KSM" page left over from
    an earlier failed attempt to merge - whose page_anon_vma() is NULL.

    It's my bug that those should be getting here at all: I thought they
    were already dealt with, this oops proves me wrong, I'll fix it in
    the next release - such pages are effectively pinned until their
    process exits, since rmap cannot find their ptes (though swapoff can).

    For now just work around it by making page_address_in_vma() safe (and
    add a comment on why that check is wanted anyway). A similar check
    in __page_check_anon_rmap() is safe because do_page_add_anon_rmap()
    already excluded KSM pages.
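
    A sketch of the workaround (simplified): the root comparison in
    page_address_in_vma() now tolerates a NULL anon_vma on either side.

        if (PageAnon(page)) {
                struct anon_vma *page__anon_vma = page_anon_vma(page);

                /* half-KSM pages can have a NULL page_anon_vma() */
                if (!vma->anon_vma || !page__anon_vma ||
                    vma->anon_vma->root != page__anon_vma->root)
                        return -EFAULT;
        }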

    Signed-off-by: Hugh Dickins
    Cc: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

29 Aug, 2010

1 commit

  • After several hours, kbuild tests hang with anon_vma_prepare() spinning on
    a newly allocated anon_vma's lock - on a box with CONFIG_TREE_PREEMPT_RCU=y
    (which makes this very much more likely, but it could happen without).

    The ever-subtle page_lock_anon_vma() now needs a further twist: since
    anon_vma_prepare() and anon_vma_fork() are liable to change the ->root
    of a reused anon_vma structure at any moment, page_lock_anon_vma()
    needs to check page_mapped() again before succeeding, otherwise
    page_unlock_anon_vma() might address a different root->lock.
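
    A sketch of the twist (simplified): the function now returns with the
    lock held only while the page is demonstrably still mapped, and the
    caller drops both the lock and RCU via page_unlock_anon_vma().

        struct anon_vma *page_lock_anon_vma(struct page *page)
        {
                struct anon_vma *anon_vma;
                unsigned long anon_mapping;

                rcu_read_lock();
                anon_mapping = (unsigned long)ACCESS_ONCE(page->mapping);
                if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                        goto out;
                if (!page_mapped(page))
                        goto out;

                anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
                anon_vma_lock(anon_vma);

                /* recheck: ->root may have changed while we spun */
                if (page_mapped(page))
                        return anon_vma;

                anon_vma_unlock(anon_vma);
        out:
                rcu_read_unlock();
                return NULL;
        }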

    Signed-off-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

13 Aug, 2010

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
    hugetlb: add missing unlock in avoidcopy path in hugetlb_cow()
    hwpoison: rename CONFIG
    HWPOISON, hugetlb: support hwpoison injection for hugepage
    HWPOISON, hugetlb: detect hwpoison in hugetlb code
    HWPOISON, hugetlb: isolate corrupted hugepage
    HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error
    HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage
    HWPOISON, hugetlb: enable error handling path for hugepage
    hugetlb, rmap: add reverse mapping for hugepage
    hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h

    Fix up trivial conflicts in mm/memory-failure.c

    Linus Torvalds
     

11 Aug, 2010

2 commits

  • CONFIG_HUGETLBFS controls hugetlbfs interface code.
    OTOH, CONFIG_HUGETLB_PAGE controls hugepage management code.
    So we should use CONFIG_HUGETLB_PAGE here.
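
    The fix is essentially a one-line substitution (sketch):

        #ifdef CONFIG_HUGETLB_PAGE      /* was: #ifdef CONFIG_HUGETLBFS */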

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • This patch adds a reverse mapping feature for hugepages by introducing
    a mapcount for shared/private-mapped hugepages and an anon_vma for
    private-mapped hugepages.

    While hugepages are not currently swappable, reverse mapping is useful
    for the memory error handler.

    Without this patch, the memory error handler cannot identify the
    processes using a bad hugepage, nor unmap it from them. That is:
    - for a shared hugepage:
      we can collect the processes using it through the pagecache,
      but cannot unmap it because of the lack of a mapcount.
    - for a privately mapped hugepage:
      we can neither collect the processes nor unmap it.
    This patch solves these problems.
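
    A sketch of the anon rmap hook this adds for privately mapped
    hugepages (simplified; the __hugepage_set_anon_rmap() helper is
    elided):

        void hugepage_add_anon_rmap(struct page *page,
                                    struct vm_area_struct *vma,
                                    unsigned long address)
        {
                struct anon_vma *anon_vma = vma->anon_vma;
                int first;

                BUG_ON(!PageLocked(page));
                BUG_ON(!anon_vma);
                /* hugepages get their own mapcount, as described above */
                first = atomic_inc_and_test(&page->_mapcount);
                if (first)
                        __hugepage_set_anon_rmap(page, vma, address, 0);
        }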

    This patch includes the bug fix given by commit 23be7468e8, so it
    reverts that commit.

    Dependency:
    "hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h"

    ChangeLog since May 24.
    - create hugetlb_inline.h and move is_vm_hugetlb_page() into it.
    - move functions setting up anon_vma for hugepages into mm/rmap.c.

    ChangeLog since May 13.
    - rebased to 2.6.34
    - fix logic error (in case private mapping and shared mapping coexist)
    - move is_vm_hugetlb_page() into include/linux/mm.h to use this function
      from linear_page_index()
    - define and use linear_hugepage_index() instead of compound_order()
    - use page_move_anon_rmap() in hugetlb_cow()
    - copy the exclusive switch of __page_set_anon_rmap() into the hugepage
      counterpart
    - revert commit 23be7468 completely

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Acked-by: Fengguang Wu
    Acked-by: Mel Gorman
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     

10 Aug, 2010

8 commits

  • On swapin it is fairly common for a page to be owned exclusively by one
    process. In that case we want to add the page to the anon_vma of that
    process's VMA, instead of to the root anon_vma.

    This will reduce the amount of rmap searching that the swapout code needs
    to do.
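
    A sketch of the do_swap_page() side (simplified; "write_fault" stands
    in for the write-fault test, and do_page_add_anon_rmap() is the
    variant that takes the exclusivity hint):

        int exclusive = 0;

        if (write_fault && reuse_swap_page(page)) {
                pte = maybe_mkwrite(pte_mkdirty(pte), vma);
                exclusive = 1;  /* no other process can see this page */
        }
        /* ... */
        do_page_add_anon_rmap(page, vma, address, exclusive);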

    Signed-off-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Verify the refcounting doesn't go wrong, and resurrect the check in
    __page_check_anon_rmap as in old anon-vma code.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • With the root anon_vma it's trivial to keep doing the usual check as in
    the old anon_vma code.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Always use anon_vma->root pointer instead of anon_vma_chain.prev.

    Also optimize the map paths: if a mapping is already established, there
    is no need to overwrite it with the root anon_vma list; we can keep the
    more fine-grained anon_vma and skip the overwrite (see the PageAnon
    check in the !exclusive case). This is also the optimization that hid
    the KSM bug, as it tends to make ksm_might_need_to_copy skip the copy,
    but only the proper fix to ksm_might_need_to_copy guarantees not
    triggering the KSM bug unless KSM is in use. This is an optimization
    only...
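
    A sketch of the resulting __page_set_anon_rmap() (simplified; note the
    early return that keeps the finer-grained anon_vma):

        static void __page_set_anon_rmap(struct page *page,
                struct vm_area_struct *vma, unsigned long address,
                int exclusive)
        {
                struct anon_vma *anon_vma = vma->anon_vma;

                BUG_ON(!anon_vma);

                /*
                 * If the page isn't exclusively mapped into this vma, we
                 * must use the _oldest_ possible anon_vma for the page
                 * mapping.  But if a mapping is already established,
                 * keep the finer-grained one.
                 */
                if (!exclusive) {
                        if (PageAnon(page))
                                return;
                        anon_vma = anon_vma->root;
                }

                page->mapping = (struct address_space *)
                        ((unsigned long)anon_vma + PAGE_MAPPING_ANON);
                page->index = linear_page_index(vma, address);
        }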

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Rik van Riel
    [kamezawa.hiroyu@jp.fujitsu.com: fix false positive BUG_ON in __page_set_anon_rmap]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Make sure to always add new VMAs at the end of the list. This is
    important so rmap_walk does not miss a VMA that was created during the
    rmap_walk.

    The old code got this right most of the time due to luck, but was buggy
    when anon_vma_prepare reused a mergeable anon_vma.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • KSM reference counts can cause an anon_vma to exist after the process
    it belongs to has already exited. Because the anon_vma lock now lives in
    the root anon_vma, we need to ensure that the root anon_vma stays around
    until after all the "child" anon_vmas have been freed.

    The obvious way to do this is to have a "child" anon_vma take a reference
    to the root in anon_vma_fork. When the anon_vma is freed at munmap or
    process exit, we drop the refcount in anon_vma_unlink and possibly free
    the root anon_vma.

    The KSM anon_vma reference count function also needs to be modified to
    deal with the possibility of freeing 2 levels of anon_vma. The easiest
    way to do this is to break out the KSM magic and make it generic.

    When compiling without CONFIG_KSM, this code is compiled out.
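
    The generic drop looks roughly like this (an approximation of the
    merged code, lightly trimmed):

        void drop_anon_vma(struct anon_vma *anon_vma)
        {
                if (atomic_dec_and_lock(&anon_vma->external_refcount,
                                        &anon_vma->root->lock)) {
                        struct anon_vma *root = anon_vma->root;
                        int empty = list_empty(&anon_vma->head);
                        int last_root_user = 0;
                        int root_empty = 0;

                        /* a dying child drops the root's refcount too */
                        if (empty && anon_vma != root) {
                                last_root_user = atomic_dec_and_test(
                                                &root->external_refcount);
                                root_empty = list_empty(&root->head);
                        }
                        anon_vma_unlock(anon_vma);

                        if (empty) {
                                anon_vma_free(anon_vma);
                                if (root_empty && last_root_user)
                                        anon_vma_free(root);
                        }
                }
        }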

    Signed-off-by: Rik van Riel
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Linus Torvalds
    Tested-by: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Track the root (oldest) anon_vma in each anon_vma tree. Because we only
    take the lock on the root anon_vma, we cannot use the lock on higher-up
    anon_vmas to lock anything. This makes it impossible to do an indirect
    lookup of the root anon_vma, since the data structures could go away from
    under us.

    However, a direct pointer is safe because the root anon_vma is always the
    last one that gets freed on munmap or exit, by virtue of the same_vma list
    order and unlink_anon_vmas walking the list forward.
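
    A sketch of the structure with the direct root pointer (simplified;
    the KSM refcount field is elided):

        struct anon_vma {
                struct anon_vma *root;  /* root of this anon_vma tree */
                spinlock_t lock;        /* only the root's lock is taken */
                struct list_head head;  /* chain of private "related" vmas */
        };

        static inline void anon_vma_lock(struct anon_vma *anon_vma)
        {
                spin_lock(&anon_vma->root->lock);
        }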

    [akpm@linux-foundation.org: fix typo]
    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Substitute a direct call of spin_lock(anon_vma->lock) with an inline
    function doing exactly the same.

    This makes it easier to do the substitution to the root anon_vma lock in a
    following patch.

    We will deal with the handful of special locks (nested, dec_and_lock, etc)
    separately.
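
    The inline functions introduced here are trivial (sketch); at this
    point they still take the anon_vma's own lock, and the root
    substitution happens in the follow-up patch above:

        static inline void anon_vma_lock(struct anon_vma *anon_vma)
        {
                spin_lock(&anon_vma->lock);
        }

        static inline void anon_vma_unlock(struct anon_vma *anon_vma)
        {
                spin_unlock(&anon_vma->lock);
        }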

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

25 May, 2010

3 commits

  • …tion by not migrating temporary stacks

    Page migration requires rmap to be able to find all ptes mapping a page
    at all times, otherwise the migration entry can be instantiated, but it
    is possible to leave one behind if the second rmap_walk fails to find
    the page. If this page is later faulted, migration_entry_to_page() will
    call BUG because the page is locked, indicating that the page was
    migrated but the migration PTE was not cleaned up. For example:

    kernel BUG at include/linux/swapops.h:105!
    invalid opcode: 0000 [#1] PREEMPT SMP
    ...
    Call Trace:
    [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
    [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
    [<ffffffff813099b5>] page_fault+0x25/0x30
    [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
    [<ffffffff8111329b>] search_binary_handler+0x173/0x313
    [<ffffffff81114896>] do_execve+0x219/0x30a
    [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
    [<ffffffff8100320a>] stub_execve+0x6a/0xc0
    RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129

    There is a race between shift_arg_pages and migration that triggers this
    bug. A temporary stack is setup during exec and later moved. If
    migration moves a page in the temporary stack and the VMA is then removed
    before migration completes, the migration PTE may not be found leading to
    a BUG when the stack is faulted.

    This patch causes pages within the temporary stack during exec to be
    skipped by migration. It does this by marking the VMA covering the
    temporary stack with an otherwise impossible combination of VMA flags.
    These flags are cleared when the temporary stack is moved to its final
    location.
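
    The "impossible combination" and the test for it look roughly like
    this (details per the description above, treated as assumptions):

        /* a stack is never both random-read and sequential-read */
        #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)

        static int is_vma_temporary_stack(struct vm_area_struct *vma)
        {
                int maybe_stack = vma->vm_flags &
                                  (VM_GROWSDOWN | VM_GROWSUP);

                if (!maybe_stack)
                        return 0;

                return (vma->vm_flags & VM_STACK_INCOMPLETE_SETUP) ==
                                        VM_STACK_INCOMPLETE_SETUP;
        }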

    [kamezawa.hiroyu@jp.fujitsu.com: idea for having migration skip temporary stacks]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • For clarity of review, KSM and page migration have separate refcounts on
    the anon_vma. While clear, this is a waste of memory. This patch gets
    KSM and page migration to share their toys in a spirit of harmony.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patchset is a memory compaction mechanism that reduces external
    memory fragmentation by moving GFP_MOVABLE pages into fewer pageblocks.
    The term "compaction" was chosen as there are a number of mechanisms,
    not mutually exclusive, that can be used to defragment memory. For
    example, lumpy reclaim is a form of defragmentation, as was slub
    "defragmentation" (really a form of targeted reclaim). Hence, this is
    called "compaction" to distinguish it from other forms of
    defragmentation.

    In this implementation, a full compaction run involves two scanners
    operating within a zone - a migration and a free scanner. The migration
    scanner starts at the beginning of a zone and finds all movable pages
    within one pageblock_nr_pages-sized area and isolates them on a
    migratepages list. The free scanner begins at the end of the zone and
    searches on a per-area basis for enough free pages to migrate all the
    pages on the migratepages list. As each area is respectively migrated or
    exhausted of free pages, the scanners are advanced one area. A compaction
    run completes within a zone when the two scanners meet.
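
    A highly simplified sketch of the two-scanner loop (not the merged
    mm/compaction.c code; names follow the series):

        static void compact_zone_sketch(struct zone *zone,
                                        struct compact_control *cc)
        {
                cc->migrate_pfn = zone->zone_start_pfn;
                cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;

                while (cc->migrate_pfn < cc->free_pfn) {
                        /* migration scanner: isolate movable pages from
                         * the next pageblock-sized area */
                        isolate_migratepages(zone, cc);

                        /* the free scanner feeds compaction_alloc() with
                         * free pages taken from the end of the zone */
                        migrate_pages(&cc->migratepages, compaction_alloc,
                                      (unsigned long)cc, 0);
                }
        }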

    This method is a bit primitive but is easy to understand, and greater
    sophistication would require maintenance of counters on a per-pageblock
    basis. This would have a big impact on allocator fast paths just to
    improve compaction, which is a poor trade-off.

    It also does not try to relocate virtually contiguous pages to be physically
    contiguous. However, assuming transparent hugepages were in use, a
    hypothetical khugepaged might reuse compaction code to isolate free pages,
    split them and relocate userspace pages for promotion.

    Memory compaction can be triggered in one of three ways. It may be
    triggered explicitly by writing any value to /proc/sys/vm/compact_memory,
    which compacts all of memory. It can be triggered on a per-node basis by
    writing any value to /sys/devices/system/node/nodeN/compact where N is
    the node ID to be compacted. When a process fails to allocate a
    high-order page, it may compact memory in an attempt to satisfy the
    allocation instead of entering direct reclaim. Explicit compaction does
    not finish until the two scanners meet, while direct compaction ends as
    soon as a suitable page that meets the watermarks becomes available.
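
    For example, the global trigger can be exercised from a program as
    well as from the shell (userland sketch; any written value works, per
    the description above):

        #include <fcntl.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("/proc/sys/vm/compact_memory", O_WRONLY);

                if (fd < 0)
                        return 1;
                write(fd, "1", 1);      /* compact all of memory */
                close(fd);
                return 0;
        }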

    The series is in 14 patches. The first three are not "core" to the series
    but are important pre-requisites.

    Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
    patch, it's possible to use anon_vma after free if the caller is
    not holding a VMA or mmap_sem for the pages in question. While
    there should be no existing user that causes this problem,
    it's a requirement for memory compaction to be stable. The patch
    is at the start of the series for bisection reasons.
    Patch 2 merges the KSM and migrate counts. It could be merged with patch 1
    but would be slightly harder to review.
    Patch 3 skips over unmapped anon pages during migration as there are no
    guarantees about the anon_vma existing. There is a window between
    when a page was isolated and migration started during which anon_vma
    could disappear.
    Patch 4 notes that PageSwapCache pages can still be migrated even if they
    are unmapped.
    Patch 5 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
    Patch 6 exports a "unusable free space index" via debugfs. It's
    a measure of external fragmentation that takes the size of the
    allocation request into account. It can also be calculated from
    userspace so can be dropped if requested
    Patch 7 exports a "fragmentation index" which only has meaning when an
    allocation request fails. It determines if an allocation failure
    would be due to a lack of memory or external fragmentation.
    Patch 8 moves the definition for LRU isolation modes for use by compaction
    Patch 9 is the compaction mechanism although it's unreachable at this point
    Patch 10 adds a means of compacting all of memory with a proc trigger
    Patch 11 adds a means of compacting a specific node with a sysfs trigger
    Patch 12 adds "direct compaction" before "direct reclaim" if it is
    determined there is a good chance of success.
    Patch 13 adds a sysctl that allows tuning of the threshold at which the
    kernel will compact or direct reclaim
    Patch 14 temporarily disables compaction if an allocation failure occurs
    after compaction.

    Testing of compaction was in three stages. For the test, debugging,
    preempt, the sleep watchdog and lockdep were all enabled but nothing nasty
    popped out. min_free_kbytes was tuned as recommended by hugeadm to help
    fragmentation avoidance and high-order allocations. It was tested on X86,
    X86-64 and PPC64.

    The first test represents one of the easiest cases that can be faced for
    lumpy reclaim or memory compaction.

    1. Machine freshly booted and configured for hugepage usage with
    a) hugeadm --create-global-mounts
    b) hugeadm --pool-pages-max DEFAULT:8G
    c) hugeadm --set-recommended-min_free_kbytes
    d) hugeadm --set-recommended-shmmax

    The min_free_kbytes here is important. Anti-fragmentation works best
    when pageblocks don't mix. hugeadm knows how to calculate a value that
    will significantly reduce the worst of external-fragmentation-related
    events as reported by the mm_page_alloc_extfrag tracepoint.

    2. Load up memory
    a) Start updatedb
    b) Create X files of pagesize*128 bytes in parallel. Wait
    until the files are created. By parallel, I mean that 4096 instances
    of dd were launched, one after the other using &. The crude
    objective being to mix filesystem metadata allocations with
    the buffer cache.
    c) Delete every second file so that pageblocks are likely to
    have holes
    d) kill updatedb if it's still running

    At this point, the system is quiet, memory is full but it's full with
    clean filesystem metadata and clean buffer cache that is unmapped.
    This is readily migrated or discarded so you'd expect lumpy reclaim
    to have no significant advantage over compaction but this is at
    the POC stage.

    3. In increments, attempt to allocate 5% of memory as hugepages.
    Measure how long it took, how successful it was, how many
    direct reclaims took place and how many compactions occurred. Note that
    the compaction figures might not fully add up, as compactions
    can take place for orders other than the hugepage size.

    X86                        vanilla  compaction
    Final page count               913         916  (attempted 1002)
    Pages reclaimed              68296        9791

    X86-64                     vanilla  compaction
    Final page count               901         902  (attempted 1002)
    Total pages reclaimed       112599       53234

    PPC64                      vanilla  compaction
    Final page count                93          94  (attempted 110)
    Total pages reclaimed       103216       61838

    There was not a dramatic improvement in success rates but it wouldn't be
    expected in this case either. What was important is that fewer pages were
    reclaimed in all cases reducing the amount of IO required to satisfy a
    huge page allocation.

    The second tests were all performance related - kernbench, netperf, iozone
    and sysbench. None showed anything too remarkable.

    The last test was a high-order allocation stress test. Many kernel
    compiles are started to fill memory with a pressured mix of unmovable and
    movable allocations. During this, an attempt is made to allocate 90% of
    memory as huge pages - one at a time with small delays between attempts to
    avoid flooding the IO queue.

                                               vanilla  compaction
    Percentage of request allocated X86             98          99
    Percentage of request allocated X86-64          95          98
    Percentage of request allocated PPC64           55          70

    This patch:

    rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
    locking an anon_vma and it does not appear to have sufficient locking to
    ensure the anon_vma does not disappear from under it.

    This patch copies an approach used by KSM to take a reference on the
    anon_vma while pages are being migrated. This should prevent rmap_walk()
    running into nasty surprises later because anon_vma has been freed.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 May, 2010

1 commit

  • Currently page_address_in_vma() compares vma->anon_vma and
    page_anon_vma(page) as a parameter check, but in 2.6.34 a vma can have
    multiple anon_vmas with anon_vma_chain, so the current check does not
    work. (For an anonymous page shared by multiple processes, some
    verified (page, vma) pairs wrongly return -EFAULT.)

    We could check all anon_vmas in the "same_vma" chain, but that needs to
    meet locking requirements. Instead, we can remove the anon_vma check
    safely, because page_address_in_vma() assumes that page and vma are
    already checked to belong to the identical process.

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Rik van Riel
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

25 Apr, 2010

1 commit

  • If find_mergeable_anon_vma() succeeds but another thread installs
    ->anon_vma before we take ptl, then allocated == NULL but avc should be
    freed. Change the code to check avc != NULL to detect this case.

    Also, a couple of whitespace changes to make the critical section more
    visible.
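
    A sketch of the fixed critical section (simplified;
    anon_vma_chain_link() stands in for the list hookups):

        spin_lock(&mm->page_table_lock);
        if (likely(!vma->anon_vma)) {
                vma->anon_vma = anon_vma;
                anon_vma_chain_link(vma, avc, anon_vma);
                allocated = NULL;
                avc = NULL;             /* consumed */
        }
        spin_unlock(&mm->page_table_lock);

        if (unlikely(allocated))
                anon_vma_free(allocated);
        if (unlikely(avc))              /* lost the race: free it */
                anon_vma_chain_free(avc);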

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Pete Zaitcev
    Cc: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

20 Apr, 2010

1 commit

  • The recent anon_vma fixes cause many anonymous pages to end up
    in the parent process anon_vma, even when the page is exclusively
    owned by the current process.

    Adding exclusively owned anonymous pages to the top anon_vma
    reduces rmap scanning overhead, especially in workloads with
    forking servers.

    This patch adds a parameter to __page_set_anon_rmap that can
    be used to indicate whether or not the added page is exclusively
    owned by the current process.

    Pages added through page_add_new_anon_rmap are exclusively
    owned by the current process, and can be added to the top
    anon_vma.

    Pages added through page_add_anon_rmap can be either shared
    or exclusively owned, so we do the conservative thing and
    add it to the oldest anon_vma.

    A next step would be to add the exclusive parameter to
    page_add_anon_rmap, to be used from functions where we do
    know for sure whether a page is exclusively owned.
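
    A sketch of the call-site distinction (mapcount accounting mostly
    elided):

        void page_add_new_anon_rmap(struct page *page,
                struct vm_area_struct *vma, unsigned long address)
        {
                /* ... first mapping, counter updates ... */
                __page_set_anon_rmap(page, vma, address, 1); /* exclusive */
        }

        void page_add_anon_rmap(struct page *page,
                struct vm_area_struct *vma, unsigned long address)
        {
                int first = atomic_inc_and_test(&page->_mapcount);

                if (first)      /* may be shared: be conservative */
                        __page_set_anon_rmap(page, vma, address, 0);
        }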

    Signed-off-by: Rik van Riel
    Reviewed-by: Johannes Weiner
    Lightly-tested-by: Borislav Petkov
    Reviewed-by: Minchan Kim
    [ Edited to look nicer - Linus ]
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

13 Apr, 2010

2 commits

  • Otherwise we might be mapping in a page in a new mapping, but that page
    (through the swapcache) would later be mapped into an old mapping too.
    The page->mapping must be the case that works for everybody, not just
    the mapping that happened to page it in first.

    Here's the scenario:

    - page gets allocated/mapped by process A. Let's call the anon_vma we
    associate the page with 'A' to keep it easy to track.

    - Process A forks, creating process B. The anon_vma in B is 'B', and has
    a chain that looks like 'B' -> 'A'. Everything is fine.

    - Swapping happens. The page (with mapping pointing to 'A') gets swapped
    out (perhaps not to disk - it's enough to assume that it's just not
    mapped any more, and lives entirely in the swap-cache)

    - Process B pages it in, which goes like this:

    do_swap_page ->
    page = lookup_swap_cache(entry);
    ...
    set_pte_at(mm, address, page_table, pte);
    page_add_anon_rmap(page, vma, address);

    And think about what happens here!

    In particular, what happens is that this will now be the "first"
    mapping of that page, so page_add_anon_rmap() used to do

    if (first)
    __page_set_anon_rmap(page, vma, address);

    and notice what anon_vma it will use? It will use the anon_vma for
    process B!

    What happens then? Trivial: process 'A' also pages it in (nothing
    happens, it's not the first mapping), and then process 'B' execve's
    or exits or unmaps, making anon_vma B go away.

    End result: process A has a page that points to anon_vma B, but
    anon_vma B does not exist any more. This can go on forever. Forget
    about RCU grace periods, forget about locking, forget anything like
    that. The bug is simply that page->mapping points to an anon_vma
    that was correct at one point, but was _not_ the one that was shared
    by all users of that possible mapping.

    Changing it to always use the deepest anon_vma in the anon_vma chain
    gets us to the safest model.

    This can be improved in certain cases: if we know the page is private to
    just this particular mapping (for example, it's a new page, or it is the
    only swapcache entry), we could pick the top (most specific) anon_vma.

    But that's a future optimization. Make it _work_ reliably first.
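
    A sketch of the resulting __page_set_anon_rmap() (simplified): the
    oldest anon_vma is the last entry on the vma's same_vma chain, given
    the list ordering fixed in the commit below.

        static void __page_set_anon_rmap(struct page *page,
                struct vm_area_struct *vma, unsigned long address)
        {
                struct anon_vma_chain *avc;
                struct anon_vma *anon_vma;

                BUG_ON(!vma->anon_vma);

                /* use the _oldest_ anon_vma the page can belong to */
                avc = list_entry(vma->anon_vma_chain.prev,
                                 struct anon_vma_chain, same_vma);
                anon_vma = avc->anon_vma;

                page->mapping = (struct address_space *)
                        ((unsigned long)anon_vma + PAGE_MAPPING_ANON);
                page->index = linear_page_index(vma, address);
        }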

    Reviewed-by: Rik van Riel
    Acked-by: Johannes Weiner
    Tested-by: Borislav Petkov [ "What do you know, I think you fixed it!" ]
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • We want to walk the chain in reverse order when cloning it, so that the
    order of the result chain will be the same as the order in the source
    chain. When we add entries to the chain, they go at the head of the
    chain, so we want to add the source head last.
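
    A sketch of the reversed walk in anon_vma_clone() (error handling
    elided):

        /* list_add() inserts at the head, so walking the source
         * backwards reproduces its order in the destination */
        list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
                struct anon_vma_chain *avc = anon_vma_chain_alloc();

                if (!avc)
                        goto enomem_failure;
                anon_vma_chain_link(dst, avc, pavc->anon_vma);
        }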

    Reviewed-by: Rik van Riel
    Acked-by: Johannes Weiner
    Tested-by: Borislav Petkov [ "No, it still oopses" ]
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

06 Apr, 2010

1 commit

  • Fix a memory leak in anon_vma_fork(), where we fail to tear down the
    anon_vmas attached to the new VMA in case setting up the new anon_vma
    fails.

    This bug also has the potential to leave behind anon_vma_chain structs
    with pointers to invalid memory.
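
    The fix is essentially one extra teardown call on the error path
    (sketch):

         out_error_free_anon_vma:
                anon_vma_free(anon_vma);
         out_error:
                /* tear down what anon_vma_clone() already attached */
                unlink_anon_vmas(vma);
                return -ENOMEM;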

    Reported-by: Minchan Kim
    Signed-off-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

07 Mar, 2010

3 commits

  • The VM currently assumes that an inactive, mapped and referenced file page
    is in use and promotes it to the active list.

    However, every mapped file page starts out like this and thus a problem
    arises when workloads create a stream of such pages that are used only for
    a short time. By flooding the active list with those pages, the VM
    quickly gets into trouble finding eligible reclaim candidates. The result
    is long allocation latencies and eviction of the wrong pages.

    This patch reuses the PG_referenced page flag (used for unmapped file
    pages) to implement a usage detection that scales with the speed of LRU
    list cycling (i.e. memory pressure).

    If the scanner encounters those pages, the flag is set and the page is
    cycled again on the inactive list. Only if it returns with another page
    table reference is it activated. Otherwise it is reclaimed as 'not
    recently used cache'.
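
    A sketch of the detection logic (names follow the description;
    lumpy-reclaim and mlock special cases are elided):

        static enum page_references
        page_check_references(struct page *page, struct scan_control *sc)
        {
                unsigned long vm_flags;
                int referenced_ptes, referenced_page;

                referenced_ptes = page_referenced(page, 1, sc->mem_cgroup,
                                                  &vm_flags);
                referenced_page = TestClearPageReferenced(page);

                if (referenced_ptes) {
                        if (PageAnon(page))
                                return PAGEREF_ACTIVATE;
                        /*
                         * Mapped file pages start out referenced from the
                         * instantiating fault, so look twice: activate
                         * only on a second page table reference.
                         */
                        SetPageReferenced(page);
                        if (referenced_page)
                                return PAGEREF_ACTIVATE;
                        return PAGEREF_KEEP;    /* one more inactive cycle */
                }

                if (referenced_page)
                        return PAGEREF_RECLAIM_CLEAN;
                return PAGEREF_RECLAIM;
        }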

    This effectively changes the minimum lifetime of a used-once mapped file
    page from a full memory cycle to an inactive list cycle, which allows it
    to occur in linear streams without affecting the stable working set of the
    system.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When a VMA is in an inconsistent state during setup or teardown, the worst
    that can happen is that the rmap code will not be able to find the page.

    The mapping is in the process of being torn down (PTEs just got
    invalidated by munmap), or set up (no PTEs have been instantiated yet).

    It is also impossible for the rmap code to follow a pointer to an already
    freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
    teardown code needs to take before the VMA is removed from the anon_vma
    chain.

    Hence, we should not need the VM_LOCK_RMAP locking at all.

    Signed-off-by: Rik van Riel
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • When the parent process breaks the COW on a page, both the original
    page, which is mapped in the child, and the new page, which is mapped
    in the parent, end up in the same anon_vma. Generally this won't be a
    problem, but for some workloads it could preserve the O(N) rmap
    scanning complexity.

    A simple fix is to ensure that, when a page that is now mapped only in
    the child gets reused in do_wp_page (because the child is already its
    exclusive owner), the page gets moved to the child's own exclusive
    anon_vma.
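
    A sketch of the helper that performs the move (close to the merged
    code):

        void page_move_anon_rmap(struct page *page,
                struct vm_area_struct *vma, unsigned long address)
        {
                struct anon_vma *anon_vma = vma->anon_vma;

                VM_BUG_ON(!PageLocked(page));
                VM_BUG_ON(!anon_vma);
                VM_BUG_ON(page->index != linear_page_index(vma, address));

                anon_vma = (void *)anon_vma + PAGE_MAPPING_ANON;
                page->mapping = (struct address_space *)anon_vma;
        }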

    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel