29 Aug, 2010

1 commit

  • After several hours, kbuild tests hang with anon_vma_prepare() spinning on
    a newly allocated anon_vma's lock - on a box with CONFIG_TREE_PREEMPT_RCU=y
    (which makes this very much more likely, but it could happen without).

    The ever-subtle page_lock_anon_vma() now needs a further twist: since
    anon_vma_prepare() and anon_vma_fork() are liable to change the ->root
    of a reused anon_vma structure at any moment, page_lock_anon_vma()
    needs to check page_mapped() again before succeeding, otherwise
    page_unlock_anon_vma() might address a different root->lock.
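
    The double check described above can be sketched in userspace as follows.
    This is only an illustration under stated assumptions: the structures, the
    atomic_flag spinlock and the *_sketch function names are hypothetical
    stand-ins, not the kernel's definitions.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct anon_vma {
    struct anon_vma *root;  /* may be rewritten when the structure is reused */
    atomic_flag lock;       /* only the root's lock is ever taken */
};

struct page {
    struct anon_vma *anon_vma;  /* stands in for page->mapping */
    atomic_int mapcount;        /* > 0 while the page is mapped */
};

static bool page_mapped(struct page *page)
{
    return atomic_load(&page->mapcount) > 0;
}

/*
 * Lock the root of the page's anon_vma. Returns the anon_vma with
 * root->lock held, or NULL (nothing held) if the page is not mapped.
 */
static struct anon_vma *lock_anon_vma_sketch(struct page *page)
{
    struct anon_vma *anon_vma = page->anon_vma;
    struct anon_vma *root;

    if (!anon_vma || !page_mapped(page))
        return NULL;

    root = anon_vma->root;
    while (atomic_flag_test_and_set(&root->lock))
        ;   /* spin */

    /*
     * Second check: if the page was unmapped in the meantime, the
     * anon_vma may have been reused and ->root may have changed.
     * Back out through the pointer we actually locked.
     */
    if (!page_mapped(page)) {
        atomic_flag_clear(&root->lock);
        return NULL;
    }
    return anon_vma;
}

static void unlock_anon_vma_sketch(struct anon_vma *anon_vma)
{
    atomic_flag_clear(&anon_vma->root->lock);
}
```

    The point of the second page_mapped() test is that a successful return
    implies ->root is still the lock that was taken, so the unlock side can
    safely dereference anon_vma->root again.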

    Signed-off-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

13 Aug, 2010

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
    hugetlb: add missing unlock in avoidcopy path in hugetlb_cow()
    hwpoison: rename CONFIG
    HWPOISON, hugetlb: support hwpoison injection for hugepage
    HWPOISON, hugetlb: detect hwpoison in hugetlb code
    HWPOISON, hugetlb: isolate corrupted hugepage
    HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error
    HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage
    HWPOISON, hugetlb: enable error handling path for hugepage
    hugetlb, rmap: add reverse mapping for hugepage
    hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h

    Fix up trivial conflicts in mm/memory-failure.c

    Linus Torvalds
     

11 Aug, 2010

2 commits

  • CONFIG_HUGETLBFS controls hugetlbfs interface code.
    OTOH, CONFIG_HUGETLB_PAGE controls hugepage management code.
    So we should use CONFIG_HUGETLB_PAGE here.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • This patch adds reverse mapping feature for hugepage by introducing
    mapcount for shared/private-mapped hugepage and anon_vma for
    private-mapped hugepage.

    While hugepage is not currently swappable, reverse mapping can be useful
    for memory error handler.

    Without this patch, memory error handler cannot identify processes
    using the bad hugepage nor unmap it from them. That is:
    - for shared hugepage:
      we can collect processes using a hugepage through pagecache,
      but can not unmap the hugepage because of the lack of mapcount.
    - for privately mapped hugepage:
      we can neither collect processes nor unmap the hugepage.
    This patch solves these problems.

    This patch includes the bug fix given by commit 23be7468e8, so it reverts
    that commit.

    Dependency:
    "hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h"

    ChangeLog since May 24.
    - create hugetlb_inline.h and move is_vm_hugetlb_page() into it.
    - move functions setting up anon_vma for hugepage into mm/rmap.c.

    ChangeLog since May 13.
    - rebased to 2.6.34
    - fix logic error (in case that private mapping and shared mapping coexist)
    - move is_vm_hugetlb_page() into include/linux/mm.h to use this function
    from linear_page_index()
    - define and use linear_hugepage_index() instead of compound_order()
    - use page_move_anon_rmap() in hugetlb_cow()
    - copy exclusive switch of __set_page_anon_rmap() into hugepage counterpart.
    - revert commit 23be7468 completely

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Acked-by: Fengguang Wu
    Acked-by: Mel Gorman
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     

10 Aug, 2010

8 commits

  • On swapin it is fairly common for a page to be owned exclusively by one
    process. In that case we want to add the page to the anon_vma of that
    process's VMA, instead of to the root anon_vma.

    This will reduce the amount of rmap searching that the swapout code needs
    to do.

    Signed-off-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Verify the refcounting doesn't go wrong, and resurrect the check in
    __page_check_anon_rmap as in old anon-vma code.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • With root anon-vma it's trivial to keep doing the usual check as in
    old-anon-vma code.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Always use anon_vma->root pointer instead of anon_vma_chain.prev.

    Also optimize the map paths: if a mapping is already established there is
    no need to overwrite it with the root anon-vma list; we can keep the more
    fine-grained anon-vma and skip the overwrite (see the PageAnon check in the
    !exclusive case). This is also the optimization that hid the ksm bug, as it
    tends to make ksm_might_need_to_copy skip the copy, but only the proper fix
    to ksm_might_need_to_copy guarantees not triggering the ksm bug unless ksm
    is in use. This is an optimization only...

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Rik van Riel
    [kamezawa.hiroyu@jp.fujitsu.com: fix false positive BUG_ON in __page_set_anon_rmap]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Make sure to always add new VMAs at the end of the list. This is
    important so rmap_walk does not miss a VMA that was created during the
    rmap_walk.

    The old code got this right most of the time due to luck, but was buggy
    when anon_vma_prepare reused a mergeable anon_vma.
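
    Why the insertion point matters can be shown with a toy list. This is a
    hedged sketch: the singly linked list and the helper names are
    hypothetical, and the real code uses the kernel's list primitives under
    the anon_vma lock.

```c
#include <stddef.h>

/* Hypothetical singly linked stand-in for the anon_vma's VMA list. */
struct node {
    int id;
    struct node *next;
};

struct list {
    struct node *head;
    struct node *tail;
};

/* Append at the tail: a walker that is part-way through still reaches it. */
static void list_add_tail_sketch(struct list *l, struct node *n)
{
    n->next = NULL;
    if (l->tail)
        l->tail->next = n;
    else
        l->head = n;
    l->tail = n;
}

/* Insert at the head: a walker that already left the head never sees it. */
static void list_add_head_sketch(struct list *l, struct node *n)
{
    n->next = l->head;
    l->head = n;
    if (!l->tail)
        l->tail = n;
}

/* Count the nodes a walker visits from its current position onward. */
static int walk_from(const struct node *pos)
{
    int seen = 0;

    for (; pos; pos = pos->next)
        seen++;
    return seen;
}
```

    A walker paused on the second of two nodes will still visit a node
    appended at the tail, but will not visit one inserted at the head.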

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • KSM reference counts can cause an anon_vma to exist after the processes it
    belongs to have already exited. Because the anon_vma lock now lives in
    the root anon_vma, we need to ensure that the root anon_vma stays around
    until after all the "child" anon_vmas have been freed.

    The obvious way to do this is to have a "child" anon_vma take a reference
    to the root in anon_vma_fork. When the anon_vma is freed at munmap or
    process exit, we drop the refcount in anon_vma_unlink and possibly free
    the root anon_vma.

    The KSM anon_vma reference count function also needs to be modified to
    deal with the possibility of freeing 2 levels of anon_vma. The easiest
    way to do this is to break out the KSM magic and make it generic.

    When compiling without CONFIG_KSM, this code is compiled out.
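
    The two-level lifetime rule can be sketched with plain reference counts.
    This is an illustration only: the single refcount field, the live counter
    and the *_sketch helpers are assumptions made for the example and do not
    mirror the kernel's actual fields.

```c
#include <assert.h>
#include <stdlib.h>

static int live_anon_vmas;  /* demo-only: how many are currently allocated */

struct anon_vma {
    struct anon_vma *root;  /* points to itself for the root */
    int refcount;           /* owner plus any pinned references */
};

/* Allocate an anon_vma; a non-NULL root means "child created at fork". */
static struct anon_vma *anon_vma_alloc_sketch(struct anon_vma *root)
{
    struct anon_vma *av = malloc(sizeof(*av));

    if (!av)
        return NULL;
    av->refcount = 1;           /* the owning process */
    if (root) {
        av->root = root;
        root->refcount++;       /* the child pins the root */
    } else {
        av->root = av;
    }
    live_anon_vmas++;
    return av;
}

/* Drop one reference; freeing a child may in turn free the root. */
static void anon_vma_put_sketch(struct anon_vma *av)
{
    struct anon_vma *root;

    assert(av->refcount > 0);
    if (--av->refcount > 0)
        return;
    root = av->root;
    free(av);
    live_anon_vmas--;
    if (root != av)
        anon_vma_put_sketch(root);  /* second level: release the root */
}
```

    With this arrangement the root outlives every child even when the parent
    process has exited and only an external pin (such as KSM's) keeps a child
    alive.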

    Signed-off-by: Rik van Riel
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Linus Torvalds
    Tested-by: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Track the root (oldest) anon_vma in each anon_vma tree. Because we only
    take the lock on the root anon_vma, we cannot use the lock on higher-up
    anon_vmas to lock anything. This makes it impossible to do an indirect
    lookup of the root anon_vma, since the data structures could go away from
    under us.

    However, a direct pointer is safe because the root anon_vma is always the
    last one that gets freed on munmap or exit, by virtue of the same_vma list
    order and unlink_anon_vmas walking the list forward.

    [akpm@linux-foundation.org: fix typo]
    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Substitute a direct call of spin_lock(anon_vma->lock) with an inline
    function doing exactly the same.

    This makes it easier to do the substitution to the root anon_vma lock in a
    following patch.

    We will deal with the handful of special locks (nested, dec_and_lock, etc)
    separately.
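
    A minimal sketch of such wrappers, assuming a userspace atomic_flag in
    place of the kernel spinlock. The struct layout and the *_sketch names are
    hypothetical.

```c
#include <stdatomic.h>

struct anon_vma {
    atomic_flag lock;  /* stand-in for the spinlock */
};

/* Does exactly what the open-coded lock call did, nothing more. */
static inline void anon_vma_lock_sketch(struct anon_vma *anon_vma)
{
    while (atomic_flag_test_and_set(&anon_vma->lock))
        ;   /* spin until the flag was previously clear */
}

static inline void anon_vma_unlock_sketch(struct anon_vma *anon_vma)
{
    atomic_flag_clear(&anon_vma->lock);
}
```

    Once every caller goes through the helpers, switching to a different lock
    (for example one reached through a root pointer) only means editing the
    two helper bodies.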

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

25 May, 2010

3 commits

  • …tion by not migrating temporary stacks

    Page migration requires rmap to be able to find all ptes mapping a page
    at all times, otherwise the migration entry can be instantiated, but it
    is possible to leave one behind if the second rmap_walk fails to find
    the page. If this page is later faulted, migration_entry_to_page() will
    call BUG because the page is not locked, indicating the page was migrated
    but the migration PTE was not cleaned up. For example:

    kernel BUG at include/linux/swapops.h:105!
    invalid opcode: 0000 [#1] PREEMPT SMP
    ...
    Call Trace:
    [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
    [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
    [<ffffffff813099b5>] page_fault+0x25/0x30
    [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
    [<ffffffff8111329b>] search_binary_handler+0x173/0x313
    [<ffffffff81114896>] do_execve+0x219/0x30a
    [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
    [<ffffffff8100320a>] stub_execve+0x6a/0xc0
    RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129

    There is a race between shift_arg_pages and migration that triggers this
    bug. A temporary stack is set up during exec and later moved. If
    migration moves a page in the temporary stack and the VMA is then removed
    before migration completes, the migration PTE may not be found leading to
    a BUG when the stack is faulted.

    This patch causes pages within the temporary stack during exec to be
    skipped by migration. It does this by marking the VMA covering the
    temporary stack with an otherwise impossible combination of VMA flags.
    These flags are cleared when the temporary stack is moved to its final
    location.
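
    The marking scheme can be sketched as below. The choice of the two
    read-ahead hint flags as the "impossible" pair and their numeric values
    are assumptions made for illustration; the message above does not name
    the flags.

```c
#include <stdbool.h>

/* Illustrative flag bits; the numeric values are assumptions. */
#define VM_SEQ_READ   0x00008000UL  /* hint: sequential access */
#define VM_RAND_READ  0x00010000UL  /* hint: random access */

/* Both hints at once is contradictory, so the pair can act as a marker. */
#define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)

struct vma {
    unsigned long vm_flags;
};

/* True only while both marker bits are set. */
static bool is_temporary_stack(const struct vma *vma)
{
    return (vma->vm_flags & VM_STACK_INCOMPLETE_SETUP) ==
           VM_STACK_INCOMPLETE_SETUP;
}

/* Called once the stack has been moved to its final location. */
static void finish_stack_setup(struct vma *vma)
{
    vma->vm_flags &= ~VM_STACK_INCOMPLETE_SETUP;
}
```

    A migration path would test is_temporary_stack() and skip the VMA while
    it returns true. A VMA carrying only one of the two hints is not treated
    as a temporary stack.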

    [kamezawa.hiroyu@jp.fujitsu.com: idea for having migration skip temporary stacks]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • For clarity of review, KSM and page migration have separate refcounts on
    the anon_vma. While clear, this is a waste of memory. This patch gets
    KSM and page migration to share their toys in a spirit of harmony.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patchset is a memory compaction mechanism that reduces external
    memory fragmentation by moving GFP_MOVABLE pages into a smaller number of
    pageblocks. The term "compaction" was chosen as there are a number of
    mechanisms that are not mutually exclusive that can be used to defragment
    memory. For example, lumpy reclaim is a form of defragmentation as was
    slub "defragmentation" (really a form of targeted reclaim). Hence, this
    is called "compaction" to distinguish it from other forms of
    defragmentation.

    In this implementation, a full compaction run involves two scanners
    operating within a zone - a migration and a free scanner. The migration
    scanner starts at the beginning of a zone and finds all movable pages
    within one pageblock_nr_pages-sized area and isolates them on a
    migratepages list. The free scanner begins at the end of the zone and
    searches on a per-area basis for enough free pages to migrate all the
    pages on the migratepages list. As each area is respectively migrated or
    exhausted of free pages, the scanners are advanced one area. A compaction
    run completes within a zone when the two scanners meet.
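
    The two-scanner idea can be sketched on a toy zone. This is a deliberate
    simplification under stated assumptions: it works page by page instead of
    one pageblock-sized area at a time, it "migrates" by swapping array
    entries, and all names are hypothetical.

```c
#include <stddef.h>

enum page_state { PAGE_FREE, PAGE_MOVABLE, PAGE_UNMOVABLE };

/*
 * Toy compaction over one zone: the migration scanner walks up from
 * index 0 looking for movable pages, the free scanner walks down from
 * the end looking for free slots. Returns the number of pages moved.
 */
static size_t compact_zone_sketch(enum page_state *zone, size_t n)
{
    size_t migrate = 0;   /* migration scanner position */
    size_t free_pos = n;  /* free scanner position (one past its slot) */
    size_t moved = 0;

    while (migrate < free_pos) {
        if (zone[migrate] != PAGE_MOVABLE) {
            migrate++;
            continue;
        }
        /* Advance the free scanner down to the next free slot. */
        while (free_pos > migrate + 1 && zone[free_pos - 1] != PAGE_FREE)
            free_pos--;
        if (free_pos <= migrate + 1)
            break;        /* scanners met: the run is complete */

        zone[free_pos - 1] = PAGE_MOVABLE;
        zone[migrate] = PAGE_FREE;
        free_pos--;
        migrate++;
        moved++;
    }
    return moved;
}
```

    For a zone laid out as movable, free, unmovable, movable, free, free, the
    sketch moves two pages and leaves the movable pages packed at the top of
    the zone, with the unmovable page untouched.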

    This method is a bit primitive but is easy to understand and greater
    sophistication would require maintenance of counters on a per-pageblock
    basis. This would have a big impact on allocator fast-paths to improve
    compaction which is a poor trade-off.

    It also does not try to relocate virtually contiguous pages to be physically
    contiguous. However, assuming transparent hugepages were in use, a
    hypothetical khugepaged might reuse compaction code to isolate free pages,
    split them and relocate userspace pages for promotion.

    Memory compaction can be triggered in one of three ways. It may be
    triggered explicitly by writing any value to /proc/sys/vm/compact_memory
    and compacting all of memory. It can be triggered on a per-node basis by
    writing any value to /sys/devices/system/node/nodeN/compact where N is the
    node ID to be compacted. When a process fails to allocate a high-order
    page, it may compact memory in an attempt to satisfy the allocation
    instead of entering direct reclaim. Explicit compaction does not finish
    until the two scanners meet and direct compaction ends if a suitable page
    becomes available that would meet watermarks.

    The series is in 14 patches. The first three are not "core" to the series
    but are important pre-requisites.

    Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
    patch, it's possible to use anon_vma after free if the caller is
    not holding a VMA or mmap_sem for the pages in question. While
    there should be no existing user that causes this problem,
    it's a requirement for memory compaction to be stable. The patch
    is at the start of the series for bisection reasons.
    Patch 2 merges the KSM and migrate counts. It could be merged with patch 1
    but would be slightly harder to review.
    Patch 3 skips over unmapped anon pages during migration as there are no
    guarantees about the anon_vma existing. There is a window between
    when a page was isolated and migration started during which anon_vma
    could disappear.
    Patch 4 notes that PageSwapCache pages can still be migrated even if they
    are unmapped.
    Patch 5 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
    Patch 6 exports a "unusable free space index" via debugfs. It's
    a measure of external fragmentation that takes the size of the
    allocation request into account. It can also be calculated from
    userspace so can be dropped if requested
    Patch 7 exports a "fragmentation index" which only has meaning when an
    allocation request fails. It determines if an allocation failure
    would be due to a lack of memory or external fragmentation.
    Patch 8 moves the definition for LRU isolation modes for use by compaction
    Patch 9 is the compaction mechanism although it's unreachable at this point
    Patch 10 adds a means of compacting all of memory with a proc trigger
    Patch 11 adds a means of compacting a specific node with a sysfs trigger
    Patch 12 adds "direct compaction" before "direct reclaim" if it is
    determined there is a good chance of success.
    Patch 13 adds a sysctl that allows tuning of the threshold at which the
    kernel will compact or direct reclaim
    Patch 14 temporarily disables compaction if an allocation failure occurs
    after compaction.
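
    One plausible reading of the "unusable free space index" from patch 6 is
    the share of free memory sitting in blocks too small for the request. The
    sketch below assumes that definition, a per-order free block count as
    input, and a result scaled to thousandths; none of these details are
    stated in the message above.

```c
#include <stddef.h>

#define MAX_ORDER_SKETCH 11  /* assumption: number of buddy orders tracked */

/*
 * free_blocks[i] is the number of free blocks of 2^i pages.
 * Returns the share of free memory (in thousandths) that cannot serve
 * an allocation of 2^order pages: 0 = all usable, 1000 = none usable.
 */
static unsigned unusable_free_index_sketch(const unsigned long *free_blocks,
                                           unsigned order)
{
    unsigned long free_pages = 0, usable_pages = 0;

    for (unsigned i = 0; i < MAX_ORDER_SKETCH; i++) {
        unsigned long pages = free_blocks[i] << i;

        free_pages += pages;
        if (i >= order)
            usable_pages += pages;  /* block is big enough for the request */
    }
    if (free_pages == 0)
        return 1000;                /* nothing free at all */
    return (unsigned)((free_pages - usable_pages) * 1000UL / free_pages);
}
```

    With eight free order-0 pages and one free order-3 block, half of the
    free memory is unusable for an order-3 request, so the sketch returns
    500. It returns 0 for an order-0 request and 1000 for order 4.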

    Testing of compaction was in three stages. For the test, debugging,
    preempt, the sleep watchdog and lockdep were all enabled but nothing nasty
    popped out. min_free_kbytes was tuned as recommended by hugeadm to help
    fragmentation avoidance and high-order allocations. It was tested on X86,
    X86-64 and PPC64.

    The first test represents one of the easiest cases that can be faced for
    lumpy reclaim or memory compaction.

    1. Machine freshly booted and configured for hugepage usage with
    a) hugeadm --create-global-mounts
    b) hugeadm --pool-pages-max DEFAULT:8G
    c) hugeadm --set-recommended-min_free_kbytes
    d) hugeadm --set-recommended-shmmax

    The min_free_kbytes here is important. Anti-fragmentation works best
    when pageblocks don't mix. hugeadm knows how to calculate a value that
    will significantly reduce the worst of external-fragmentation-related
    events as reported by the mm_page_alloc_extfrag tracepoint.

    2. Load up memory
    a) Start updatedb
    b) Create in parallel X files of pagesize*128 in size. Wait
    until files are created. By parallel, I mean that 4096 instances
    of dd were launched, one after the other using &. The crude
    objective being to mix filesystem metadata allocations with
    the buffer cache.
    c) Delete every second file so that pageblocks are likely to
    have holes
    d) kill updatedb if it's still running

    At this point, the system is quiet, memory is full but it's full with
    clean filesystem metadata and clean buffer cache that is unmapped.
    This is readily migrated or discarded so you'd expect lumpy reclaim
    to have no significant advantage over compaction but this is at
    the POC stage.

    3. In increments, attempt to allocate 5% of memory as hugepages.
    Measure how long it took, how successful it was, how many
    direct reclaims took place and how many compactions. Note
    the compaction figures might not fully add up as compactions
    can take place for orders other than the hugepage size

    X86                        vanilla   compaction
    Final page count:              913          916   (attempted 1002)
    Total pages reclaimed:       68296         9791

    X86-64                     vanilla   compaction
    Final page count:              901          902   (attempted 1002)
    Total pages reclaimed:      112599        53234

    PPC64                      vanilla   compaction
    Final page count:               93           94   (attempted 110)
    Total pages reclaimed:      103216        61838

    There was not a dramatic improvement in success rates but it wouldn't be
    expected in this case either. What was important is that fewer pages were
    reclaimed in all cases reducing the amount of IO required to satisfy a
    huge page allocation.

    The second tests were all performance related - kernbench, netperf, iozone
    and sysbench. None showed anything too remarkable.

    The last test was a high-order allocation stress test. Many kernel
    compiles are started to fill memory with a pressured mix of unmovable and
    movable allocations. During this, an attempt is made to allocate 90% of
    memory as huge pages - one at a time with small delays between attempts to
    avoid flooding the IO queue.

    Percentage of request allocated   vanilla   compaction
    X86                                    98           99
    X86-64                                 95           98
    PPC64                                  55           70

    This patch:

    rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
    locking an anon_vma and it does not appear to have sufficient locking to
    ensure the anon_vma does not disappear from under it.

    This patch copies an approach used by KSM to take a reference on the
    anon_vma while pages are being migrated. This should prevent rmap_walk()
    running into nasty surprises later because anon_vma has been freed.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 May, 2010

1 commit

  • Currently page_address_in_vma() compares vma->anon_vma and
    page_anon_vma(page) for parameter check, but in 2.6.34 a vma can have
    multiple anon_vmas with anon_vma_chain, so current check does not work.
    (For anonymous page shared by multiple processes, some verified (page,vma)
    pairs return -EFAULT wrongly.)

    We can go to checking all anon_vmas in the "same_vma" chain, but it needs
    to meet lock requirement. Instead, we can remove anon_vma check safely
    because page_address_in_vma() assumes that page and vma are already
    checked to belong to the identical process.

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Rik van Riel
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

25 Apr, 2010

1 commit

  • If find_mergeable_anon_vma() succeeds but another thread installs
    ->anon_vma before we take ptl, then allocated == NULL but avc should be
    freed. Change the code to check avc != NULL to detect this case.
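
    The leak and its fix can be sketched in userspace. This is an assumption
    laden illustration: the racer argument simulates another thread
    installing ->anon_vma first, the chains_live counter exists only for the
    demo, and the page table lock is represented by comments.

```c
#include <stdlib.h>

struct anon_vma { int placeholder; };
struct anon_vma_chain { struct anon_vma *anon_vma; };
struct vma {
    struct anon_vma *anon_vma;
    struct anon_vma_chain *avc;
};

static int chains_live;  /* demo-only leak counter */

static int anon_vma_prepare_sketch(struct vma *vma,
                                   struct anon_vma *mergeable,
                                   struct anon_vma *racer)
{
    struct anon_vma_chain *avc = malloc(sizeof(*avc));
    struct anon_vma *anon_vma = mergeable;
    struct anon_vma *allocated = NULL;

    if (!avc)
        return -1;
    chains_live++;

    if (!anon_vma) {            /* nothing to merge with: allocate one */
        anon_vma = malloc(sizeof(*anon_vma));
        if (!anon_vma) {
            free(avc);
            chains_live--;
            return -1;
        }
        allocated = anon_vma;
    }

    if (racer)                  /* simulate another thread winning the race */
        vma->anon_vma = racer;

    /* --- critical section begins --- */
    if (!vma->anon_vma) {
        avc->anon_vma = anon_vma;
        vma->anon_vma = anon_vma;
        vma->avc = avc;
        allocated = NULL;       /* ownership moved to the VMA */
        avc = NULL;
    }
    /* --- critical section ends --- */

    free(allocated);            /* free(NULL) is a no-op */
    if (avc) {                  /* test avc, not allocated */
        free(avc);
        chains_live--;
    }
    return 0;
}
```

    In the merge-plus-race case allocated is NULL while avc is not, so a
    cleanup guarded by allocated would skip freeing avc. Testing avc directly
    covers that case.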

    Also, a couple of whitespace changes to make the critical section more
    visible.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Pete Zaitcev
    Cc: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

20 Apr, 2010

1 commit

  • The recent anon_vma fixes cause many anonymous pages to end up
    in the parent process anon_vma, even when the page is exclusively
    owned by the current process.

    Adding exclusively owned anonymous pages to the top anon_vma
    reduces rmap scanning overhead, especially in workloads with
    forking servers.

    This patch adds a parameter to __page_set_anon_rmap that can
    be used to indicate whether or not the added page is exclusively
    owned by the current process.

    Pages added through page_add_new_anon_rmap are exclusively
    owned by the current process, and can be added to the top
    anon_vma.

    Pages added through page_add_anon_rmap can be either shared
    or exclusively owned, so we do the conservative thing and
    add it to the oldest anon_vma.

    A next step would be to add the exclusive parameter to
    page_add_anon_rmap, to be used from functions where we do
    know for sure whether a page is exclusively owned.
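
    A minimal sketch of the selection logic, assuming the VMA's anon_vma chain
    is available as a small array ordered from the VMA's own (top) anon_vma to
    the oldest. The array representation and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct anon_vma { int id; };

struct vma {
    struct anon_vma *chain[4];  /* chain[0] is the VMA's own (top) anon_vma */
    size_t nr;                  /* chain[nr - 1] is the oldest */
};

struct page { struct anon_vma *mapping; };

/* Pick the anon_vma a newly mapped page should point at. */
static void page_set_anon_rmap_sketch(struct page *page,
                                      const struct vma *vma, bool exclusive)
{
    assert(vma->nr > 0);
    if (exclusive)
        page->mapping = vma->chain[0];            /* private: top anon_vma */
    else
        page->mapping = vma->chain[vma->nr - 1];  /* maybe shared: oldest */
}
```

    A caller that knows the page is brand new would pass exclusive = true.
    A caller that cannot tell passes false and gets the conservative choice.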

    Signed-off-by: Rik van Riel
    Reviewed-by: Johannes Weiner
    Lightly-tested-by: Borislav Petkov
    Reviewed-by: Minchan Kim
    [ Edited to look nicer - Linus ]
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

13 Apr, 2010

2 commits

  • Otherwise we might be mapping in a page in a new mapping, but that page
    (through the swapcache) would later be mapped into an old mapping too.
    The page->mapping must be the case that works for everybody, not just
    the mapping that happened to page it in first.

    Here's the scenario:

    - page gets allocated/mapped by process A. Let's call the anon_vma we
    associate the page with 'A' to keep it easy to track.

    - Process A forks, creating process B. The anon_vma in B is 'B', and has
    a chain that looks like 'B' -> 'A'. Everything is fine.

    - Swapping happens. The page (with mapping pointing to 'A') gets swapped
    out (perhaps not to disk - it's enough to assume that it's just not
    mapped any more, and lives entirely in the swap-cache)

    - Process B pages it in, which goes like this:

        do_swap_page ->
          page = lookup_swap_cache(entry);
          ...
          set_pte_at(mm, address, page_table, pte);
          page_add_anon_rmap(page, vma, address);

    And think about what happens here!

    In particular, what happens is that this will now be the "first"
    mapping of that page, so page_add_anon_rmap() used to do

        if (first)
            __page_set_anon_rmap(page, vma, address);

    and notice what anon_vma it will use? It will use the anon_vma for
    process B!

    What happens then? Trivial: process 'A' also pages it in (nothing
    happens, it's not the first mapping), and then process 'B' execve's
    or exits or unmaps, making anon_vma B go away.

    End result: process A has a page that points to anon_vma B, but
    anon_vma B does not exist any more. This can go on forever. Forget
    about RCU grace periods, forget about locking, forget anything like
    that. The bug is simply that page->mapping points to an anon_vma
    that was correct at one point, but was _not_ the one that was shared
    by all users of that possible mapping.

    Changing it to always use the deepest anon_vma in the anon_vma chain gets
    us to the safest model.
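
    The scenario can be replayed in a small model. This is a sketch under
    assumptions: an alive flag stands in for "the anon_vma has not been
    freed", the chain is a simple ->older pointer, and the function names are
    hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct anon_vma {
    bool alive;               /* false once the owning mapping is gone */
    struct anon_vma *older;   /* next entry in the chain; NULL for the oldest */
};

struct page {
    struct anon_vma *mapping;
    int mapcount;
};

static struct anon_vma *oldest_anon_vma(struct anon_vma *av)
{
    while (av->older)
        av = av->older;
    return av;
}

/* Old behaviour: the first mapper's own anon_vma is recorded. */
static void add_rmap_first_mapper(struct page *page, struct anon_vma *mapper)
{
    if (page->mapcount++ == 0)
        page->mapping = mapper;
}

/* Fixed behaviour: always record the oldest anon_vma in the mapper's chain. */
static void add_rmap_oldest(struct page *page, struct anon_vma *mapper)
{
    if (page->mapcount++ == 0)
        page->mapping = oldest_anon_vma(mapper);
}
```

    Replaying "B pages in, A pages in, B exits" with the first helper leaves
    the page pointing at the dead 'B'. With the second helper the page points
    at 'A', which is still alive.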

    This can be improved in certain cases: if we know the page is private to
    just this particular mapping (for example, it's a new page, or it is the
    only swapcache entry), we could pick the top (most specific) anon_vma.

    But that's a future optimization. Make it _work_ reliably first.

    Reviewed-by: Rik van Riel
    Acked-by: Johannes Weiner
    Tested-by: Borislav Petkov [ "What do you know, I think you fixed it!" ]
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • We want to walk the chain in reverse order when cloning it, so that the
    order of the result chain will be the same as the order in the source
    chain. When we add entries to the chain, they go at the head of the
    chain, so we want to add the source head last.
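
    The ordering argument can be checked with a toy chain. The fixed-size
    array and the helper names are assumptions for illustration; the real
    chain is a kernel linked list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHAIN_MAX 8

struct chain {
    int ids[CHAIN_MAX];  /* ids[0] is the head */
    size_t len;
};

/* New entries always go at the head. */
static void chain_add_head(struct chain *c, int id)
{
    assert(c->len < CHAIN_MAX);
    memmove(&c->ids[1], &c->ids[0], c->len * sizeof(c->ids[0]));
    c->ids[0] = id;
    c->len++;
}

/* Walk the source tail-to-head so head insertion reproduces its order. */
static void chain_clone_reverse(struct chain *dst, const struct chain *src)
{
    dst->len = 0;
    for (size_t i = src->len; i-- > 0; )
        chain_add_head(dst, src->ids[i]);
}

/* Forward walk, for contrast: the copy comes out reversed. */
static void chain_clone_forward(struct chain *dst, const struct chain *src)
{
    dst->len = 0;
    for (size_t i = 0; i < src->len; i++)
        chain_add_head(dst, src->ids[i]);
}
```

    Cloning the chain 1, 2, 3 with the reverse walk yields 1, 2, 3 again,
    while the forward walk yields 3, 2, 1.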

    Reviewed-by: Rik van Riel
    Acked-by: Johannes Weiner
    Tested-by: Borislav Petkov [ "No, it still oopses" ]
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

06 Apr, 2010

1 commit

  • Fix a memory leak in anon_vma_fork(), where we fail to tear down the
    anon_vmas attached to the new VMA in case setting up the new anon_vma
    fails.

    This bug also has the potential to leave behind anon_vma_chain structs
    with pointers to invalid memory.

    Reported-by: Minchan Kim
    Signed-off-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

07 Mar, 2010

7 commits

  • The VM currently assumes that an inactive, mapped and referenced file page
    is in use and promotes it to the active list.

    However, every mapped file page starts out like this and thus a problem
    arises when workloads create a stream of such pages that are used only for
    a short time. By flooding the active list with those pages, the VM
    quickly gets into trouble finding eligible reclaim candidates. The result
    is long allocation latencies and eviction of the wrong pages.

    This patch reuses the PG_referenced page flag (used for unmapped file
    pages) to implement a usage detection that scales with the speed of LRU
    list cycling (i.e. memory pressure).

    If the scanner encounters those pages, the flag is set and the page cycled
    again on the inactive list. Only if it returns with another page table
    reference is it activated. Otherwise it is reclaimed as 'not recently
    used cache'.

    This effectively changes the minimum lifetime of a used-once mapped file
    page from a full memory cycle to an inactive list cycle, which allows it
    to occur in linear streams without affecting the stable working set of the
    system.
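
    The decision made at each scan can be sketched as a small state machine.
    The two booleans and the enum are hypothetical stand-ins for the page
    table accessed bit, the PG_referenced flag and the scanner's verdict.

```c
#include <stdbool.h>

enum scan_result { PAGE_RECLAIM, PAGE_KEEP_INACTIVE, PAGE_ACTIVATE };

struct page {
    bool pte_referenced;  /* page table reference seen since the last scan */
    bool pg_referenced;   /* software flag remembered across one cycle */
};

/* Decide what the scanner does with a mapped file page on the inactive list. */
static enum scan_result check_references_sketch(struct page *page)
{
    bool referenced_pte = page->pte_referenced;
    bool referenced_page = page->pg_referenced;

    page->pte_referenced = false;  /* test-and-clear, as a scanner would */
    page->pg_referenced = false;

    if (referenced_pte) {
        if (referenced_page)
            return PAGE_ACTIVATE;    /* second reference: working set */
        page->pg_referenced = true;  /* first sighting: go around again */
        return PAGE_KEEP_INACTIVE;
    }
    return PAGE_RECLAIM;             /* no new reference since last pass */
}
```

    A page touched once is kept for one more inactive list cycle and then
    reclaimed. A page touched again before the scanner returns is activated.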

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When a VMA is in an inconsistent state during setup or teardown, the worst
    that can happen is that the rmap code will not be able to find the page.

    The mapping is in the process of being torn down (PTEs just got
    invalidated by munmap), or set up (no PTEs have been instantiated yet).

    It is also impossible for the rmap code to follow a pointer to an already
    freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
    teardown code needs to take before the VMA is removed from the anon_vma
    chain.

    Hence, we should not need the VM_LOCK_RMAP locking at all.

    Signed-off-by: Rik van Riel
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • When the parent process breaks the COW on a page, both the original, which
    is mapped in the child, and the new page, which is mapped in the parent,
    end up in the same anon_vma. Generally this won't be a problem, but for
    some workloads it could preserve the O(N) rmap scanning complexity.

    A simple fix is to ensure that, when a page which is mapped in the child
    gets reused in do_wp_page because we are already the exclusive owner, the
    page gets moved to our own exclusive child's anon_vma.

    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • When an anonymous page is inherited from a parent process, the
    vma->anon_vma can differ from the page anon_vma. This can trip up
    __page_check_anon_rmap, which is indirectly called from do_swap_page().

    Remove that obsolete check to prevent an oops.

    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach into the tens of
    thousands. Real workloads are still a factor 10 less process intensive
    than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parents' anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.
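
    The "O(N) to 2" figure follows from a weighted average, sketched below.
    The model is an assumption drawn from the paragraph above: a fraction
    (N-1)/N of pages costs one VMA visit each, and the remaining 1/N costs N
    visits each.

```c
/* Old scheme: every page requires walking all N processes. */
static double avg_scan_cost_old(double n)
{
    return n;
}

/* New scheme: (N-1)/N of the pages cost 1 visit, 1/N of them cost N. */
static double avg_scan_cost_new(double n)
{
    return (n - 1.0) / n * 1.0 + (1.0 / n) * n;
}
```

    For N = 1000 the new average is just under 2 visits per page, against
    1000 for the old scheme.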

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time; there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • A frequent question from users about memory management is how many swap
    entries are used by each process. This information will also give some
    hints to the oom-killer.

    Although we can count the number of swapents per process by scanning
    /proc//smaps, this is very slow and not suitable for a usual process
    information handler that works like 'ps' or 'top'. (ps and top are
    already slow enough.)

    This patch adds a counter of swapents to mm_counter and updates it at each
    swap event. The information is exported via the /proc//status file as:

    [kamezawa@bluextal memory]$ cat /proc/self/status
    Name: cat
    State: R (running)
    Tgid: 2910
    Pid: 2910
    PPid: 2823
    TracerPid: 0
    Uid: 500 500 500 500
    Gid: 500 500 500 500
    FDSize: 256
    Groups: 500
    VmPeak: 82696 kB
    VmSize: 82696 kB
    VmLck: 0 kB
    VmHWM: 432 kB
    VmRSS: 432 kB
    VmData: 172 kB
    VmStk: 84 kB
    VmExe: 48 kB
    VmLib: 1568 kB
    VmPTE: 40 kB
    VmSwap: 0 kB
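
    A monitoring tool could pick up the new field with a few lines of
    userspace C. The following is a minimal sketch, not part of the patch;
    the helper names parse_vmswap_kb() and read_self_vmswap_kb() are
    hypothetical, and the 8 kB buffer size is an assumption about how large
    the status file gets.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the VmSwap value (in kB) found in the text of a status file,
 * or -1 if there is no VmSwap line.
 */
static inline long parse_vmswap_kb(const char *status_text)
{
	const char *line = status_text;

	while (line && *line) {
		long kb;

		/* Whitespace in the format matches tabs and spaces alike. */
		if (sscanf(line, "VmSwap: %ld kB", &kb) == 1)
			return kb;
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline */
	}
	return -1;
}

/* Read /proc/self/status and return its VmSwap value, or -1 on failure. */
static inline long read_self_vmswap_kb(void)
{
	char buf[8192];	/* assumed large enough for the status file */
	size_t n;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_vmswap_kb(buf);
}
```
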
    Reviewed-by: Minchan Kim
    Reviewed-by: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, the per-mm statistics counters are defined by macros in sched.h.

    This patch modifies them to
    - be defined in mm.h as inline functions
    - use an array instead of macro-generated names.

    This patch reduces the size of a future patch that modifies the
    implementation of the per-mm counters.
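
    The shape of the change can be sketched in userspace as follows. This is
    a simplified model, not the code from mm.h: struct mm_model, the enum
    members, and the accessor names are illustrative assumptions, and the
    model ignores locking and atomicity entirely.

```c
/* Userspace model of array-indexed per-mm counters; not kernel code. */
enum mm_counter_idx {
	MM_FILEPAGES,	/* file-backed pages */
	MM_ANONPAGES,	/* anonymous pages */
	MM_SWAPENTS,	/* swap entries */
	NR_MM_COUNTERS	/* array size; keep last */
};

struct mm_model {
	long counters[NR_MM_COUNTERS];
};

static inline long get_mm_counter(const struct mm_model *mm, int member)
{
	return mm->counters[member];
}

static inline void add_mm_counter(struct mm_model *mm, int member, long value)
{
	mm->counters[member] += value;
}

static inline void inc_mm_counter(struct mm_model *mm, int member)
{
	add_mm_counter(mm, member, 1);
}

static inline void dec_mm_counter(struct mm_model *mm, int member)
{
	add_mm_counter(mm, member, -1);
}
```

    Because callers only go through the accessors, a later change to how the
    counters are stored would touch these few functions rather than every
    call site.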

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

16 Dec, 2009

12 commits

  • In global VM, FILE_MAPPED is used but memcg uses MAPPED_FILE. This makes
    grep difficult. Replace memcg's MAPPED_FILE with FILE_MAPPED.

    Also, in global VM, mapped shared memory is accounted into FILE_MAPPED,
    but memcg doesn't do so. Fix it.
    Note:
    page_is_file_cache() only checks whether the page is SwapBacked,
    so we also need to check PageAnon.

    Cc: Balbir Singh
    Reviewed-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • SWAP_MLOCK means "We marked the page as PG_MLOCK, please move it to
    unevictable-lru". So the following code is easily misread.

    if (vma->vm_flags & VM_LOCKED) {
            ret = SWAP_MLOCK;
            goto out_unmap;
    }

    Plus, if the VMA doesn't have VM_LOCKED, we don't need to check
    whether mlock_vma_page() needs to be called.

    Also, add some commentary to try_to_unmap_one().

    Acked-by: Hugh Dickins
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • A side-effect of making ksm pages swappable is that they have to be placed
    on the LRUs: which then exposes them to isolate_lru_page() and hence to
    page migration.

    Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
    rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some
    consolidation with existing code is possible, but don't attempt that yet
    (try_to_unmap needs to handle nonlinears, but migration pte removal does
    not).

    rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
    remove_anon_migration_ptes() which it replaces, avoids calling
    page_lock_anon_vma(), because that includes a page_mapped() test which
    fails when all migration ptes are in place. That was valid when NUMA page
    migration was introduced (holding mmap_sem provided the missing guarantee
    that anon_vma's slab had not already been destroyed), but I believe not
    valid in the memory hotremove case added since.

    For now do the same as before, and consider the best way to fix that
    unlikely race later on. When fixed, we can probably use rmap_walk() on
    hwpoisoned ksm pages too: for now, they remain among hwpoison's various
    exceptions (its PageKsm test comes before the page is locked, but its
    page_lock_anon_vma fails safely if an anon gets upgraded).

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When ksm pages were unswappable, it made no sense to include them in mem
    cgroup accounting; but now that they are swappable (although I see no
    strict logical connection) the principle of least surprise implies that
    they should be accounted (with the usual dissatisfaction, that a shared
    page is accounted to only one of the cgroups using it).

    This patch was intended to add mem cgroup accounting where necessary; but
    turned inside out, it now avoids allocating a ksm page, instead upgrading
    an anon page to ksm - which brings its existing mem cgroup accounting with
    it. Thus mem cgroups don't appear in the patch at all.

    This upgrade from PageAnon to PageKsm takes place under page lock (via a
    somewhat hacky NULL kpage interface), and audit showed only one place
    which needed to cope with the race - page_referenced() is sometimes used
    without page lock, so page_lock_anon_vma() needs an ACCESS_ONCE() to be
    sure of getting anon_vma and flags together (no problem if the page goes
    ksm an instant after, the integrity of that anon_vma list is unaffected).
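
    The point of reading page->mapping exactly once can be sketched in
    userspace. This is an illustration under assumptions, not the kernel
    source: the ACCESS_ONCE() definition below is the usual GNU C idiom
    (it relies on __typeof__), the flag values assume the two low bits of
    the pointer, and page_model and snapshot_anon_vma() are hypothetical
    names.

```c
#include <stddef.h>
#include <stdint.h>

/* Force a single load of x; GNU C idiom, relies on __typeof__. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

#define PAGE_MAPPING_ANON	((uintptr_t)1)
#define PAGE_MAPPING_KSM	((uintptr_t)2)
#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)

struct anon_vma_model { int placeholder; };
struct page_model { void *mapping; };

/*
 * Take one snapshot of page->mapping and derive both the flag check and
 * the pointer from it. Two separate reads could see an anon page on the
 * first read and a ksm page on the second.
 */
static inline struct anon_vma_model *snapshot_anon_vma(struct page_model *page)
{
	uintptr_t mapping = (uintptr_t)ACCESS_ONCE(page->mapping);

	if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		return NULL;	/* not a plain anon page at snapshot time */
	return (struct anon_vma_model *)(mapping - PAGE_MAPPING_ANON);
}
```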

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • For full functionality, page_referenced_one() and try_to_unmap_one() need
    to know the vma: to pass vma down to arch-dependent flushes, or to observe
    VM_LOCKED or VM_EXEC. But KSM keeps no record of vma: nor can it, since
    vmas get split and merged without its knowledge.

    Instead, note page's anon_vma in its rmap_item when adding to stable tree:
    all the vmas which might map that page are listed by its anon_vma.

    page_referenced_ksm() and try_to_unmap_ksm() then traverse the anon_vma,
    first to find the probable vma, that which matches rmap_item's mm; but if
    that is not enough to locate all instances, traverse again to try the
    others. This catches those occasions when fork has duplicated a pte of a
    ksm page, but ksmd has not yet come around to assign it an rmap_item.

    But each rmap_item in the stable tree which refers to an anon_vma needs to
    take a reference to it. Andrea's anon_vma design cleverly avoided a
    reference count (an anon_vma was free when its list of vmas was empty),
    but KSM now needs to add that. Is a 32-bit count sufficient? I believe
    so - the anon_vma is only free when both count is 0 and list is empty.
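
    The freeing rule stated above (count is 0 and list is empty) can be
    modelled in a few lines. This is a userspace sketch with hypothetical
    names; a plain counter stands in for the list of vmas, and no locking or
    atomics are modelled.

```c
#include <stdbool.h>

/* Userspace model of an anon_vma with an added reference count. */
struct anon_vma_model {
	unsigned int refcount;	/* references held by stable-tree rmap_items */
	unsigned int nr_vmas;	/* stand-in for the length of the vma list */
};

/* Free only when nothing references it and no vma is linked to it. */
static inline bool anon_vma_model_may_free(const struct anon_vma_model *av)
{
	return av->refcount == 0 && av->nr_vmas == 0;
}

/* An rmap_item takes a reference. */
static inline void anon_vma_model_get(struct anon_vma_model *av)
{
	av->refcount++;
}

/* An rmap_item drops its reference; returns true if av may now be freed. */
static inline bool anon_vma_model_put(struct anon_vma_model *av)
{
	av->refcount--;
	return anon_vma_model_may_free(av);
}

/* A vma is unlinked; returns true if av may now be freed. */
static inline bool anon_vma_model_unlink_vma(struct anon_vma_model *av)
{
	av->nr_vmas--;
	return anon_vma_model_may_free(av);
}
```

    Whichever of the two events happens last, the final put or the final
    unlink, is the one that reports the structure as free.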

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Initial implementation for swapping out KSM's shared pages: add
    page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
    faced with a PageKsm page.

    Most of what's needed can be got from the rmap_items listed from the
    stable_node of the ksm page, without discovering the actual vma: so in
    this patch just fake up a struct vma for page_referenced_one() or
    try_to_unmap_one(), then refine that in the next patch.

    Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
    implicit there (being only set with VM_SHARED, already excluded), but
    let's make it explicit, to help justify the lack of nonlinear unmap.

    Rely on the page lock to protect against concurrent modifications to that
    page's node of the stable tree.

    The awkward part is not swapout but swapin: do_swap_page() and
    page_add_anon_rmap() now have to allow for new possibilities - perhaps a
    ksm page still in swapcache, perhaps a swapcache page associated with one
    location in one anon_vma now needed for another location or anon_vma.
    (And the vma might even be no longer VM_MERGEABLE when that happens.)

    ksm_might_need_to_copy() checks for that case, and supplies a duplicate
    page when necessary, simply leaving it to a subsequent pass of ksmd to
    rediscover the identity and merge them back into one ksm page.
    Disappointingly primitive: but the alternative would have to accumulate
    unswappable info about the swapped out ksm pages, limiting swappability.

    Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
    particular case it was handling, so just use it instead.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KSM swapping will know where page_referenced_one() and try_to_unmap_one()
    should look. It could hack page->index to get them to do what it wants,
    but it seems cleaner now to pass the address down to them.

    Make the same change to page_mkclean_one(), since it follows the same
    pattern; but there's no real need in its case.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove three degrees of obfuscation, left over from when we had
    CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
    CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o) are only
    built when CONFIG_MMU, so don't need such conditions at all.

    Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
    169 defconfigs: leave those to evolve in due course.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's contorted mlock/munlock handling in try_to_unmap_anon() and
    try_to_unmap_file(), which we'd prefer not to repeat for KSM swapping.
    Simplify it by moving it all down into try_to_unmap_one().

    One thing is then lost, try_to_munlock()'s distinction between when no vma
    holds the page mlocked, and when a vma does mlock it, but we could not get
    mmap_sem to set the page flag. But its only caller takes no interest in
    that distinction (and is better testing SWAP_MLOCK anyway), so let's keep
    the code simple and return SWAP_AGAIN for both cases.

    try_to_unmap_file()'s TTU_MUNLOCK nonlinear handling was particularly
    amusing: once unravelled, it turns out to have been choosing between two
    different ways of doing the same nothing. Ah, no, one way was actually
    returning SWAP_FAIL when it meant to return SWAP_SUCCESS.

    [kosaki.motohiro@jp.fujitsu.com: comment adding to mlocking in try_to_unmap_one]
    [akpm@linux-foundation.org: remove test of MLOCK_PAGES]
    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: KOSAKI Motohiro
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set
    in page->mapping, with the higher bits a pointer to the anon_vma; and have
    defined PageKsm(page) as that with NULL anon_vma.

    But KSM swapping will need to store a pointer there: so in preparation for
    that, now define PAGE_MAPPING_FLAGS as the low two bits, including
    PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
    other use for the bit emerges).

    Declare page_rmapping(page) to return the pointer part of page->mapping,
    and page_anon_vma(page) to return the anon_vma pointer when that's what it
    is. Use these in a few appropriate places: notably, unuse_vma() has been
    testing page->mapping, but is better to be testing page_anon_vma() (cases
    may be added in which flag bits are set without any pointer).
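
    The encoding described above can be sketched as a userspace model. It is
    an illustration, not the header from the patch: the bit values assume
    that PAGE_MAPPING_ANON is bit 0 and PAGE_MAPPING_KSM is bit 1, and
    page_model and the *_model helper names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_MAPPING_ANON	((uintptr_t)1)
#define PAGE_MAPPING_KSM	((uintptr_t)2)
#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)

/* Only the mapping word of a page is modelled here. */
struct page_model { uintptr_t mapping; };

/* Anon pages have the low bit set. */
static inline bool page_is_anon(const struct page_model *page)
{
	return (page->mapping & PAGE_MAPPING_ANON) != 0;
}

/* KSM pages have both low bits set. */
static inline bool page_is_ksm(const struct page_model *page)
{
	return (page->mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_FLAGS;
}

/* The pointer part of mapping, with the flag bits masked off. */
static inline void *page_rmapping_model(const struct page_model *page)
{
	return (void *)(page->mapping & ~PAGE_MAPPING_FLAGS);
}

/* The anon_vma pointer, or NULL when mapping is not a plain anon one. */
static inline void *page_anon_vma_model(const struct page_model *page)
{
	if ((page->mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		return NULL;
	return page_rmapping_model(page);
}
```

    The scheme depends on the pointed-to structures being aligned to at
    least four bytes, so that the two low bits of a real pointer are always
    zero.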

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When the code jumps to `out', `referenced' is still zero, so there is
    no need to check it.

    Signed-off-by: Huang Shijie
    Acked-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • Just simplify the code when `mlocked' is true.

    Signed-off-by: Huang Shijie
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie