27 Aug, 2009

1 commit

  • An mlocked page might lose the isolation race. This causes the page to
    clear PG_mlocked while it remains in a VM_LOCKED vma. This means it can
    be put onto the [in]active list. We can rescue it by using try_to_unmap()
    in shrink_page_list().

    But now, as Wu Fengguang pointed out, vmscan has a bug. If the page has
    PG_referenced, it can't reach try_to_unmap() in shrink_page_list() but is
    put onto the active list. If the page is referenced repeatedly, it can
    remain on the [in]active list without being moved to the unevictable
    list.

    This patch fixes it.
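
    The decision described above can be sketched as a small user-space
    model. This is only an illustration under stated assumptions: the
    struct, the enum and the *_model/shrink_decision names are hypothetical
    and do not come from the patch; only VM_LOCKED, PG_referenced and the
    activate-versus-try_to_unmap() split are taken from the text.

```c
#include <stdbool.h>

/* Hypothetical flag value and outcome names, used only for this model. */
#define VM_LOCKED 0x1u

enum reclaim_action { ACTIVATE, TRY_UNMAP };

struct vma_model {
    unsigned int vm_flags;
    bool referenced;            /* this mapping touched the page recently */
};

/* Count references and report whether any mapping vma is VM_LOCKED. */
static int page_referenced_model(const struct vma_model *vmas, int n,
                                 unsigned int *vm_flags)
{
    int refs = 0;

    *vm_flags = 0;
    for (int i = 0; i < n; i++) {
        if (vmas[i].vm_flags & VM_LOCKED)
            *vm_flags |= VM_LOCKED;
        if (vmas[i].referenced)
            refs++;
    }
    return refs;
}

/*
 * A referenced page is activated unless a VM_LOCKED vma maps it. In that
 * case it continues to the unmap step, where it can be culled to the
 * unevictable list instead of bouncing between the [in]active lists.
 */
static enum reclaim_action shrink_decision(const struct vma_model *vmas, int n)
{
    unsigned int vm_flags;
    int refs = page_referenced_model(vmas, n, &vm_flags);

    if (refs > 0 && !(vm_flags & VM_LOCKED))
        return ACTIVATE;
    return TRY_UNMAP;
}
```

    In this model a page that is referenced through one vma and mapped by
    another, VM_LOCKED vma yields TRY_UNMAP rather than ACTIVATE.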

    Reported-by: Wu Fengguang
    Signed-off-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

19 Jun, 2009

1 commit

  • Add file RSS tracking per memory cgroup

    We currently don't track file RSS; the RSS we report is actually anon RSS.
    All file-mapped pages come in through the page cache and get accounted
    there. This patch adds support for accounting file RSS pages. It should:

    1. Help improve the metrics reported by the memory resource controller.
    2. Form the basis for a future shared memory accounting heuristic
       that has been proposed by Kamezawa.

    Unfortunately, we cannot rename the existing "rss" keyword used in
    memory.stat to "anon_rss". We do, however, add "mapped_file" data and hope
    to educate the end user through documentation.
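
    As a rough illustration of the split between the two counters, here is
    a minimal user-space sketch. The struct and function names are
    hypothetical, and the real controller may update its statistics at
    different points; the only facts taken from the text are that "rss"
    means anon RSS and that "mapped_file" is a separate counter.

```c
#include <stdbool.h>

/* Hypothetical per-cgroup statistics; field names mirror memory.stat keys. */
struct memcg_stat_model {
    long rss;           /* anonymous pages only */
    long mapped_file;   /* mapped file pages */
};

/* Apply a delta (+1 or -1 page) to the counter that matches the page type. */
static void account_page_model(struct memcg_stat_model *s, bool page_is_anon,
                               int delta)
{
    if (page_is_anon)
        s->rss += delta;
    else
        s->mapped_file += delta;
}
```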

    [hugh.dickins@tiscali.co.uk: fix mem_cgroup_update_mapped_file_stat oops]
    Signed-off-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Dhaval Giani
    Cc: Daisuke Nishimura
    Cc: YAMAMOTO Takashi
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

17 Jun, 2009

2 commits

  • Collect vma->vm_flags of the VMAs that actually referenced the page.

    This is preparing for more informed reclaim heuristics, eg. to protect
    executable file pages more aggressively. For now only the VM_EXEC bit
    will be used by the caller.
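
    A minimal sketch of the idea, assuming a simplified model where each
    mapping carries its vm_flags and a "young" bit. The struct and function
    names are hypothetical; only VM_EXEC and the notion of collecting flags
    from the VMAs that actually referenced the page come from the text.

```c
#define VM_EXEC 0x4ul   /* flag value chosen for this model */

struct mapping_model {
    unsigned long vm_flags;
    int young;              /* non-zero if this mapping referenced the page */
};

/*
 * Return the number of referencing mappings and OR together the vm_flags
 * of only those mappings, so the caller can test e.g. VM_EXEC afterwards.
 */
static int page_referenced_flags_model(const struct mapping_model *maps, int n,
                                       unsigned long *vm_flags)
{
    int referenced = 0;

    *vm_flags = 0;
    for (int i = 0; i < n; i++) {
        if (maps[i].young) {
            referenced++;
            *vm_flags |= maps[i].vm_flags;
        }
    }
    return referenced;
}
```

    Note that an executable mapping which did not touch the page leaves
    VM_EXEC clear in the collected flags.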

    Thanks to Johannes, Peter and Minchan for all the good tips.

    Acked-by: Peter Zijlstra
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: Johannes Weiner
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this
    configurability is unnecessary.

    Signed-off-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: Andi Kleen
    Acked-by: Minchan Kim
    Cc: David Woodhouse
    Cc: Matt Mackall
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

22 May, 2009

1 commit

  • My old address will shut down in a few days time: remove it from the tree,
    and add a tmpfs (shmem filesystem) maintainer entry with the new address.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Feb, 2009

1 commit

  • When I tested the following program, I found that the mlocked counter
    behaves strangely: some mlocked pages cannot be freed.

    It is because try_to_unmap_file() doesn't check real
    page mappings in vmas.

    That is because the goal of an address_space for a file is to find all
    processes into which the file's specific interval is mapped. It is
    related to the file's interval, not to pages.

    Even if the page isn't really mapped by the vma, it returns SWAP_MLOCK
    since the vma has VM_LOCKED, then calls try_to_mlock_page. After this the
    mlocked counter is increased again.

    A COWed anon page in a file-backed vma could be such a case. This patch
    resolves it.

    -- my test program --

    #include <sys/mman.h>

    int main(void)
    {
            mlockall(MCL_CURRENT);
            return 0;
    }

    -- before --

    root@barrios-target-linux:~# cat /proc/meminfo | egrep 'Mlo|Unev'
    Unevictable: 0 kB
    Mlocked: 0 kB

    -- after --

    root@barrios-target-linux:~# cat /proc/meminfo | egrep 'Mlo|Unev'
    Unevictable: 8 kB
    Mlocked: 8 kB

    Signed-off-by: MinChan Kim
    Acked-by: Lee Schermerhorn
    Acked-by: KOSAKI Motohiro
    Tested-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    MinChan Kim
     

07 Jan, 2009

7 commits

  • Remove page_remove_rmap()'s vma arg, which was only for the Eeek message.
    And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's
    page_dup_rmap(): we're trying to be more resilient about that than BUGs.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Now that bad pages are kept out of circulation, there is no need for the
    infamous page_remove_rmap() BUG() - once that page is freed, its negative
    mapcount will issue a "Bad page state" message and the page won't be
    freed. Removing the BUG() allows more info, on subsequent pages, to be
    gathered.

    We do have more info about the page at this point than bad_page() can know
    - notably, what the pmd is, which might pinpoint something like low 64kB
    corruption - but page_remove_rmap() isn't given the address to find that.

    In practice, there is only one call to page_remove_rmap() which has ever
    reported anything, that from zap_pte_range() (usually on exit, sometimes
    on munmap). It has all the info, so remove page_remove_rmap()'s "Eeek"
    message and leave it all to zap_pte_range().

    mm/memory.c already has a hardly used print_bad_pte() function, showing
    some of the appropriate info: extend it to show what we want for the rmap
    case: pte info, page info (when there is a page) and vma info to compare.
    zap_pte_range() already knows the pmd, but print_bad_pte() is easier to
    use if it works that out for itself.

    Some of this info is also shown in bad_page()'s "Bad page state" message.
    Keep them separate, but adjust them to match each other as far as
    possible. Say "Bad page map" in print_bad_pte(), and add a TAINT_BAD_PAGE
    there too.

    print_bad_pte() shows current->comm unconditionally (though it should get
    repeated in the usually irrelevant stack trace): sorry, I misled Nick
    Piggin to make it conditional on vm_mm == current->mm, but current->mm is
    already NULL in the exit case. Usually current->comm is good, though
    exceptionally it may not be that of the mm (when "swapoff" for example).

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Moving lru_cache_add_active_or_unevictable() into page_add_new_anon_rmap()
    was good but stupid: we can and should SetPageSwapBacked() there too; and
    we know for sure that this anonymous, swap-backed page is not file cache.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_lock_anon_vma() and page_unlock_anon_vma() were made available to
    show_page_path() in vmscan.c; but now that has been removed, make them
    static in rmap.c again, they're better kept private if possible.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • lru_cache_add_active_or_unevictable() and page_add_new_anon_rmap() always
    appear together. Save some symbol table space and some jumping around by
    removing lru_cache_add_active_or_unevictable(), folding its code into
    page_add_new_anon_rmap(): like how we add file pages to lru just after
    adding them to page cache.

    Remove the nearby "TODO: is this safe?" comments (yes, it is safe), and
    change page_add_new_anon_rmap()'s address BUG_ON to VM_BUG_ON as
    originally intended.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • File pages mapped only in sequentially read mappings are perfect reclaim
    candidates.

    This patch makes these mappings behave like weak references: their pages
    will be reclaimed unless they have a strong reference from a normal
    mapping as well.

    It changes the reclaim and the unmap path where they check if the page has
    been referenced. In both cases, accesses through sequentially read
    mappings will be ignored.
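
    The "weak reference" behaviour can be sketched as follows. This is a
    user-space model under assumptions: the struct, the function name and
    the flag value are hypothetical, and VM_SEQ_READ stands in for whatever
    marks a mapping as sequentially read.

```c
#define VM_SEQ_READ 0x1u    /* hypothetical value for this model */

struct mapping_model {
    unsigned int vm_flags;
    int young;              /* non-zero if accessed through this mapping */
};

/* Count only accesses made through mappings that are not sequential-read. */
static int strong_references_model(const struct mapping_model *maps, int n)
{
    int refs = 0;

    for (int i = 0; i < n; i++) {
        if (maps[i].vm_flags & VM_SEQ_READ)
            continue;       /* weak reference: access is ignored */
        if (maps[i].young)
            refs++;
    }
    return refs;
}
```

    A page whose only accesses came through sequential-read mappings ends
    up with zero strong references and is therefore a reclaim candidate.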

    Benchmark results from KOSAKI Motohiro:

    http://marc.info/?l=linux-mm&m=122485301925098&w=2

    Signed-off-by: Johannes Weiner
    Signed-off-by: Rik van Riel
    Acked-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • #ifdefs in *.c files decrease source readability a bit; removing them is
    better.

    This patch doesn't have any functional change.

    Signed-off-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

20 Oct, 2008

5 commits

  • This patch makes the needlessly global anon_vma_cachep static.

    Signed-off-by: Adrian Bunk
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • There are not-on-LRU pages which can be mapped, and they are not worth
    accounting (because we can't shrink them and would need dirty code to
    handle the special case). We'd like to make use of the usual
    objrmap/radix-tree protocol and don't want to account pages that are
    outside the VM's control.

    When special_mapping_fault() is called, page->mapping tends to be NULL
    and the page is charged as an anonymous page. insert_page() also handles
    some special pages from drivers.

    This patch avoids accounting special pages.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch adds a function to scan individual or all zones' unevictable
    lists and move any pages that have become evictable onto the respective
    zone's inactive list, where shrink_inactive_list() will deal with them.
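
    A simplified sketch of such a scan, assuming a flat array in place of
    the per-zone LRU lists. All names here are hypothetical, and the
    evictability test is reduced to "not mlocked" for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

struct page_model {
    bool mlocked;
    bool on_unevictable;    /* false means the page is on the inactive list */
};

/* Reduced evictability check for this model. */
static bool page_evictable_model(const struct page_model *p)
{
    return !p->mlocked;
}

/* Move every page that has become evictable back to the inactive list. */
static size_t scan_unevictable_model(struct page_model *pages, size_t n)
{
    size_t moved = 0;

    for (size_t i = 0; i < n; i++) {
        if (pages[i].on_unevictable && page_evictable_model(&pages[i])) {
            pages[i].on_unevictable = false;
            moved++;
        }
    }
    return moved;
}
```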

    Adds a sysctl to scan all nodes, and per-node attributes to scan
    individual nodes' zones.

    Kosaki: If evictable pages are found on the unevictable LRU when writing
    to /proc/sys/vm/scan_unevictable_pages, print the filename and file
    offset of these pages.

    [akpm@linux-foundation.org: fix one CONFIG_MMU=n build error]
    [kosaki.motohiro@jp.fujitsu.com: adapt vmscan-unevictable-lru-scan-sysctl.patch to new sysfs API]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Make sure that mlocked pages also live on the unevictable LRU, so kswapd
    will not scan them over and over again.

    This is achieved through various strategies:

    1) add yet another page flag--PG_mlocked--to indicate that
    the page is locked for efficient testing in vmscan and,
    optionally, fault path. This allows early culling of
    unevictable pages, preventing them from getting to
    page_referenced()/try_to_unmap(). Also allows separate
    accounting of mlock'd pages, as Nick's original patch
    did.

    Note: Nick's original mlock patch used a PG_mlocked
    flag. I had removed this in favor of the PG_unevictable
    flag + an mlock_count [new page struct member]. I
    restored the PG_mlocked flag to eliminate the new
    count field.

    2) add the mlock/unevictable infrastructure to mm/mlock.c,
    with internal APIs in mm/internal.h. This is a rework
    of Nick's original patch to these files, taking into
    account that mlocked pages are now kept on unevictable
    LRU list.

    3) update vmscan.c:page_evictable() to check PageMlocked()
    and, if vma passed in, the vm_flags. Note that the vma
    will only be passed in for new pages in the fault path;
    and then only if the "cull unevictable pages in fault
    path" patch is included.

    4) add try_to_unlock() to rmap.c to walk a page's rmap and
    ClearPageMlocked() if no other vmas have it mlocked.
    Reuses as much of try_to_unmap() as possible. This
    effectively replaces the use of one of the lru list links
    as an mlock count. If this mechanism lets pages in mlocked
    vmas leak through w/o PG_mlocked set [I don't know that it
    does], we should catch them later in try_to_unmap(). One
    hopes this will be rare, as it will be relatively expensive.
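
    Strategy 4 can be sketched as a walk over the flags of every vma that
    maps the page. This is a user-space model: the names and the array
    standing in for the rmap are hypothetical.

```c
#include <stdbool.h>

#define VM_LOCKED 0x1u      /* flag value chosen for this model */

struct page_model {
    bool mlocked;           /* stands in for PG_mlocked */
};

/*
 * Walk the flags of all vmas mapping the page. If none is VM_LOCKED,
 * clear the page's mlocked state and report success; otherwise leave
 * the page mlocked because another vma still holds it.
 */
static bool try_to_unlock_model(struct page_model *page,
                                const unsigned int *vma_flags, int n)
{
    for (int i = 0; i < n; i++) {
        if (vma_flags[i] & VM_LOCKED)
            return false;
    }
    page->mlocked = false;
    return true;
}
```

    The walk replaces a per-page mlock count: the cost is paid at munlock
    time instead of in struct page.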

    Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
    Signed-off-by: Nick Piggin

    splitlru: introduce __get_user_pages():

    New munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS, because
    the current get_user_pages() can't grab PROT_NONE pages and therefore
    PROT_NONE pages can't be munlocked.

    [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
    [akpm@linux-foundation.org: untangle patch interdependencies]
    [akpm@linux-foundation.org: fix things after out-of-order merging]
    [hugh@veritas.com: fix page-flags mess]
    [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
    [kosaki.motohiro@jp.fujitsu.com: build fix]
    [kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
    [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Matt Mackall
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The anon_vma code is very subtle, and we end up doing optimistic lookups
    of anon_vmas under RCU in page_lock_anon_vma() with no locking. Other
    CPUs can also see the newly allocated entry immediately after we've
    exposed it by setting "vma->anon_vma" to the new value.

    We protect against the anon_vma being destroyed by having the SLAB
    marked as SLAB_DESTROY_BY_RCU, so the RCU lookup can depend on the
    allocation not being destroyed - but it might still be free'd and
    re-allocated here to a new vma.

    As a result, we should not do the anon_vma list ops on a newly allocated
    vma without proper locking.

    Acked-by: Nick Piggin
    Acked-by: Hugh Dickins
    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Aug, 2008

2 commits

  • There is a race with dirty page accounting where a page may not properly
    be accounted for.

    clear_page_dirty_for_io() calls page_mkclean; then TestClearPageDirty.

    page_mkclean walks the rmaps for that page, and for each one it cleans and
    write protects the pte if it was dirty. It uses page_check_address to
    find the pte. That function has a shortcut to avoid the ptl if the pte is
    not present. Unfortunately, the pte can be switched to not-present then
    back to present by other code while holding the page table lock -- this
    should not be a signal for page_mkclean to ignore that pte, because it may
    be dirty.

    For example, powerpc64's set_pte_at will clear a previously present pte
    before setting it to the desired value. There may also be other code in
    core mm or in arch which does similar things.

    The consequence of the bug is loss of data integrity due to msync, and
    loss of dirty page accounting accuracy. XIP's __xip_unmap could easily
    also be unreliable (depending on the exact XIP locking scheme), which can
    lead to data corruption.

    Fix this by having an option to always take ptl to check the pte in
    page_check_address.

    It's possible to retain this optimization for page_referenced and
    try_to_unmap.
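
    The shape of the fix can be sketched like this. It is a single-threaded
    model under assumptions: the struct, the function name and the "sync"
    parameter name are hypothetical, and a counter stands in for the page
    table lock.

```c
#include <stdbool.h>
#include <stddef.h>

struct pte_model {
    bool present;
    bool dirty;
    int lock_depth;     /* stands in for holding the page table lock */
};

/*
 * Return the pte with the "lock" held, or NULL if it is not present.
 * With sync == false the lockless shortcut is allowed, which is only
 * safe for callers that can tolerate missing a transiently cleared pte.
 */
static struct pte_model *check_address_model(struct pte_model *pte, bool sync)
{
    if (!sync && !pte->present)
        return NULL;            /* shortcut: lock never taken */

    pte->lock_depth++;          /* take the lock, then re-check */
    if (pte->present)
        return pte;

    pte->lock_depth--;          /* not present after all: drop the lock */
    return NULL;
}
```

    A caller like page_mkclean would pass sync == true, so that it always
    serializes against code that clears and re-sets the pte under the lock.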

    Signed-off-by: Nick Piggin
    Cc: Jared Hulbert
    Cc: Carsten Otte
    Cc: Hugh Dickins
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Add a comment to s390's page_test_dirty/page_clear_dirty/page_set_dirty
    dance in page_remove_rmap(): I was wrong to think the PageSwapCache test
    could be avoided, and would like a comment in there to remind me. And
    mention s390, to help us remember that this block is not really common.

    Also move down the "It would be tidy to reset PageAnon" comment: it does
    not belong to s390's block, and it would be unwise to reset PageAnon
    before we're done with testing it.

    Signed-off-by: Hugh Dickins
    Acked-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

05 Aug, 2008

1 commit

  • Converting page lock to new locking bitops requires a change of page flag
    operation naming, so we might as well convert it to something nicer
    (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

    This also facilitates lockdeping of page lock.
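
    The inverted sense of the new name can be sketched with C11 atomics.
    This is a user-space analogy, not the kernel implementation; the
    *_model names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct page_model {
    atomic_flag locked;
};

/*
 * Test-and-set returns the previous value, so "true" means somebody else
 * already holds the lock. trylock negates that: true means we got it.
 */
static bool trylock_page_model(struct page_model *page)
{
    return !atomic_flag_test_and_set_explicit(&page->locked,
                                              memory_order_acquire);
}

static void unlock_page_model(struct page_model *page)
{
    atomic_flag_clear_explicit(&page->locked, memory_order_release);
}
```

    Callers that used to write "if (!TestSetPageLocked(page))" can then
    read as "if (trylock_page(page))", with no negation at the call site.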

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

01 Aug, 2008

1 commit

  • For anonymous pages without a swap cache backing, the check in
    page_remove_rmap for the physical dirty bit is unnecessary. The
    instructions that are used to check and reset the dirty bit are
    expensive. Removing the check noticeably speeds up process exit.
    In addition the clearing of the dirty bit in __SetPageUptodate is
    pointless as well. With these two changes there is no storage key
    operation for an anonymous page anymore if it does not hit the swap
    space.

    The micro benchmark which repeatedly executes an empty shell script
    gets about 5% faster.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

29 Jul, 2008

1 commit

  • With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
    There are secondary MMUs (with secondary sptes and secondary tlbs) too.
    sptes in the kvm case are shadow pagetables, but when I say spte in
    mmu-notifier context, I mean "secondary pte". In GRU case there's no
    actual secondary pte and there's only a secondary tlb because the GRU
    secondary MMU has no knowledge about sptes and every secondary tlb miss
    event in the MMU always generates a page fault that has to be resolved by
    the CPU (this is not the case of KVM where a secondary tlb miss will
    walk sptes in hardware and it will refill the secondary tlb transparently
    to software if the corresponding spte is present). The same way
    zap_page_range has to invalidate the pte before freeing the page, the spte
    (and secondary tlb) must also be invalidated before any page is freed and
    reused.

    Currently we take a page_count pin on every page mapped by sptes, but that
    means the pages can't be swapped whenever they're mapped by any spte
    because they're part of the guest working set. Furthermore a spte unmap
    event can immediately lead to a page being freed when the pin is released
    (so requiring the same complex and relatively slow tlb_gather smp safe
    logic we have in zap_page_range and that can be avoided completely if the
    spte unmap event doesn't require an unpin of the page previously mapped in
    the secondary MMU).

    The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
    when the VM is swapping or freeing or doing anything on the primary MMU so
    that the secondary MMU code can drop sptes before the pages are freed,
    avoiding all page pinning and allowing 100% reliable swapping of guest
    physical address space. Furthermore it spares the code that tears down the
    mappings of the secondary MMU from implementing logic like tlb_gather in
    zap_page_range, which would require many IPIs to flush other cpu tlbs for
    each fixed number of sptes unmapped.

    To make an example: if what happens on the primary MMU is a protection
    downgrade (from writeable to wrprotect) the secondary MMU mappings will be
    invalidated, and the next secondary-mmu-page-fault will call
    get_user_pages and trigger a do_wp_page through get_user_pages if it
    called get_user_pages with write=1, and it'll re-establish an updated
    spte or secondary-tlb-mapping on the copied page. Or it will setup a
    readonly spte or readonly tlb mapping if it's a guest-read, if it calls
    get_user_pages with write=0. This is just an example.

    This allows mapping any page pointed to by any pte (and in turn visible in
    the primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU, or a
    full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
    with kvm), or a remote DMA in software like XPMEM (hence the need to
    schedule in XPMEM code to send the invalidate to the remote node, while
    there is no need to schedule in kvm/gru as it's an immediate event like
    invalidating a primary-mmu pte).

    At least for KVM without this patch it's impossible to swap guests
    reliably. And having this feature and removing the page pin allows
    several other optimizations that simplify life considerably.

    Dependencies:

    1) mm_take_all_locks() to register the mmu notifier when the whole VM
    isn't doing anything with "mm". This allows mmu notifier users to keep
    track if the VM is in the middle of the invalidate_range_begin/end
    critical section with an atomic counter increased in range_begin and
    decreased in range_end. No secondary MMU page fault is allowed to map
    any spte or secondary tlb reference, while the VM is in the middle of
    range_begin/end as any page returned by get_user_pages in that critical
    section could later immediately be freed without any further
    ->invalidate_page notification (invalidate_range_begin/end works on
    ranges and ->invalidate_page isn't called immediately before freeing
    the page). To stop all page freeing and pagetable overwrites the
    mmap_sem must be taken in write mode and all other anon_vma/i_mmap
    locks must be taken too.

    2) It'd be a waste to add branches in the VM if nobody could possibly
    run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
    CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
    mmu notifiers, but this already allows compiling a KVM external module
    against a kernel with mmu notifiers enabled and from the next pull from
    kvm.git we'll start using them. And GRU/XPMEM will also be able to
    continue the development by enabling KVM=m in their config, until they
    submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
    also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
    This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
    are all =n.
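
    The counter scheme from dependency 1) can be sketched in user space
    with C11 atomics. The struct and function names are hypothetical. The
    sequence number is an extra assumption of this sketch (the text above
    only mentions the counter); it is there so that a fault which raced
    with a complete begin/end pair can also be detected.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct secondary_mmu_model {
    atomic_int invalidate_count;    /* > 0 while inside range_begin/end */
    atomic_ulong invalidate_seq;    /* assumption: bumped on every range_end */
};

static void range_begin_model(struct secondary_mmu_model *m)
{
    atomic_fetch_add(&m->invalidate_count, 1);
}

static void range_end_model(struct secondary_mmu_model *m)
{
    atomic_fetch_add(&m->invalidate_seq, 1);
    atomic_fetch_sub(&m->invalidate_count, 1);
}

/*
 * Secondary MMU fault path: the caller samples invalidate_seq before it
 * looks up the page, then asks here whether the mapping may be installed.
 * It may not if an invalidate is in flight or one completed in between.
 */
static bool may_install_spte_model(struct secondary_mmu_model *m,
                                   unsigned long seq_before)
{
    if (atomic_load(&m->invalidate_count) > 0)
        return false;
    if (atomic_load(&m->invalidate_seq) != seq_before)
        return false;
    return true;
}
```

    A fault that is refused simply retries: it samples the sequence number
    again and repeats the page lookup.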

    The mmu_notifier_register call can fail because mm_take_all_locks may be
    interrupted by a signal and return -EINTR. Because mmu_notifier_register
    is used when a driver starts up, a failure can be gracefully handled. Here
    is an example of the change applied to kvm to register the mmu notifiers.
    Usually when a driver starts up other allocations are required anyway and
    -ENOMEM failure paths exist already.

    struct kvm *kvm_arch_create_vm(void)
    {
            struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    +       int err;

            if (!kvm)
                    return ERR_PTR(-ENOMEM);

            INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    +       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    +       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    +       if (err) {
    +               kfree(kvm);
    +               return ERR_PTR(err);
    +       }
    +
            return kvm;
    }

    mmu_notifier_unregister returns void and it's reliable.

    The patch also adds a few needed but missing includes whose absence would
    prevent the kernel from compiling after these changes on non-x86 archs
    (x86 didn't need them by luck).

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
    [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

27 Jul, 2008

1 commit

  • The kmem cache passed to a constructor is only needed for constructors that
    are themselves multiplexers. Nobody uses this "feature", nor does anybody
    use the passed kmem cache in a non-trivial way, so pass only a pointer to
    the object.
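
    The resulting constructor shape can be sketched with a toy cache in
    user space. Everything here is hypothetical and simplified: the toy
    runs the constructor on every allocation, which need not match when
    the real slab allocators invoke it.

```c
#include <stdlib.h>

/* Toy cache: the constructor receives only the object pointer. */
struct cache_model {
    size_t size;
    void (*ctor)(void *obj);
};

static void *cache_alloc_model(const struct cache_model *c)
{
    void *obj = malloc(c->size);

    if (obj && c->ctor)
        c->ctor(obj);
    return obj;
}

struct foo_model {
    int refcount;
};

/* A constructor no longer needs to know which cache it was called for. */
static void foo_ctor_model(void *obj)
{
    struct foo_model *f = obj;

    f->refcount = 1;
}
```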

    Non-trivial places are:
    arch/powerpc/mm/init_64.c
    arch/powerpc/mm/hugetlbpage.c

    This is flag day, yes.

    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Jon Tollefson
    Cc: Nick Piggin
    Cc: Matt Mackall
    [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
    [akpm@linux-foundation.org: fix mm/slab.c]
    [akpm@linux-foundation.org: fix ubifs]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

26 Jul, 2008

1 commit

  • memcg: performance improvements

    Patch Description
    1/5 ... remove refcnt from page_cgroup patch (shmem handling is fixed)
    2/5 ... swapcache handling patch
    3/5 ... add helper function for shmem's memory reclaim patch
    4/5 ... optimize by likely/unlikely patch
    5/5 ... remove redundant check patch (shmem handling is fixed.)

    Unix bench result.

    == 2.6.26-rc2-mm1 + memory resource controller ==
    Execl Throughput 2915.4 lps (29.6 secs, 3 samples)
    C Compiler Throughput 1019.3 lpm (60.0 secs, 3 samples)
    Shell Scripts (1 concurrent) 5796.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (8 concurrent) 1097.7 lpm (60.0 secs, 3 samples)
    Shell Scripts (16 concurrent) 565.3 lpm (60.0 secs, 3 samples)
    File Read 1024 bufsize 2000 maxblocks 1022128.0 KBps (30.0 secs, 3 samples)
    File Write 1024 bufsize 2000 maxblocks 544057.0 KBps (30.0 secs, 3 samples)
    File Copy 1024 bufsize 2000 maxblocks 346481.0 KBps (30.0 secs, 3 samples)
    File Read 256 bufsize 500 maxblocks 319325.0 KBps (30.0 secs, 3 samples)
    File Write 256 bufsize 500 maxblocks 148788.0 KBps (30.0 secs, 3 samples)
    File Copy 256 bufsize 500 maxblocks 99051.0 KBps (30.0 secs, 3 samples)
    File Read 4096 bufsize 8000 maxblocks 2058917.0 KBps (30.0 secs, 3 samples)
    File Write 4096 bufsize 8000 maxblocks 1606109.0 KBps (30.0 secs, 3 samples)
    File Copy 4096 bufsize 8000 maxblocks 854789.0 KBps (30.0 secs, 3 samples)
    Dc: sqrt(2) to 99 decimal places 126145.2 lpm (30.0 secs, 3 samples)

    INDEX VALUES
    TEST                                    BASELINE     RESULT      INDEX

    Execl Throughput                            43.0     2915.4      678.0
    File Copy 1024 bufsize 2000 maxblocks     3960.0   346481.0      875.0
    File Copy 256 bufsize 500 maxblocks       1655.0    99051.0      598.5
    File Copy 4096 bufsize 8000 maxblocks     5800.0   854789.0     1473.8
    Shell Scripts (8 concurrent)                 6.0     1097.7     1829.5
                                                                 =========
    FINAL SCORE                                                      991.3

    == 2.6.26-rc2-mm1 + this set ==
    Execl Throughput 3012.9 lps (29.9 secs, 3 samples)
    C Compiler Throughput 981.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (1 concurrent) 5872.0 lpm (60.0 secs, 3 samples)
    Shell Scripts (8 concurrent) 1120.3 lpm (60.0 secs, 3 samples)
    Shell Scripts (16 concurrent) 578.0 lpm (60.0 secs, 3 samples)
    File Read 1024 bufsize 2000 maxblocks 1003993.0 KBps (30.0 secs, 3 samples)
    File Write 1024 bufsize 2000 maxblocks 550452.0 KBps (30.0 secs, 3 samples)
    File Copy 1024 bufsize 2000 maxblocks 347159.0 KBps (30.0 secs, 3 samples)
    File Read 256 bufsize 500 maxblocks 314644.0 KBps (30.0 secs, 3 samples)
    File Write 256 bufsize 500 maxblocks 151852.0 KBps (30.0 secs, 3 samples)
    File Copy 256 bufsize 500 maxblocks 101000.0 KBps (30.0 secs, 3 samples)
    File Read 4096 bufsize 8000 maxblocks 2033256.0 KBps (30.0 secs, 3 samples)
    File Write 4096 bufsize 8000 maxblocks 1611814.0 KBps (30.0 secs, 3 samples)
    File Copy 4096 bufsize 8000 maxblocks 847979.0 KBps (30.0 secs, 3 samples)
    Dc: sqrt(2) to 99 decimal places 128148.7 lpm (30.0 secs, 3 samples)

    INDEX VALUES
    TEST                                    BASELINE     RESULT      INDEX

    Execl Throughput                            43.0     3012.9      700.7
    File Copy 1024 bufsize 2000 maxblocks     3960.0   347159.0      876.7
    File Copy 256 bufsize 500 maxblocks       1655.0   101000.0      610.3
    File Copy 4096 bufsize 8000 maxblocks     5800.0   847979.0     1462.0
    Shell Scripts (8 concurrent)                 6.0     1120.3     1867.2
                                                                 =========
    FINAL SCORE                                                     1004.6

    This patch:

    Remove refcnt from page_cgroup().

    After this,

    * A page is charged only when !page_mapped() && no page_cgroup is assigned.
      * Anon page is newly mapped.
      * File page is added to mapping->tree.

    * A page is uncharged only when
      * Anon page is fully unmapped.
      * File page is removed from LRU.

    There is no change in behavior from user's view.

    This patch also removes unnecessary calls in rmap.c which were used only
    for refcnt management.

    [akpm@linux-foundation.org: fix warning]
    [hugh@veritas.com: fix shmem_unuse_inode charging]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: Hugh Dickins
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

28 Apr, 2008

1 commit

  • Nothing in the tree uses nopage any more. Remove support for it in the
    core mm code and documentation (and a few stray references to it in
    comments).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

27 Apr, 2008

1 commit

  • This patch changes the s390 memory management definitions to use the pgste field
    for dirty and reference bit tracking of host and guest code. Usually on s390,
    dirty and referenced are tracked in storage keys, which belong to the physical
    page. This changes with virtualization: The guest and host dirty/reference bits
    are defined to be the logical OR of the values for the mapping and the physical
    page. This patch implements the necessary changes in pgtable.h for s390.

    There is a common code change in mm/rmap.c: the call to
    page_test_and_clear_young must be moved. This is a no-op for all
    architectures but s390. page_referenced checks the referenced bits for
    the physical page and for all mappings:
    o The physical page is checked with page_test_and_clear_young.
    o The mappings are checked with ptep_test_and_clear_young and friends.

    Without pgstes (the current implementation on Linux s390) the physical page
    check is implemented but the mapping callbacks are no-ops because dirty
    and referenced are not tracked in the s390 page tables. The pgstes introduce
    guest and host dirty and reference bits for s390 in the host mapping. These
    mappings must be checked before page_test_and_clear_young resets the reference
    bit.
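
    The check order described above can be sketched in plain C. This is only
    an illustration, not the rmap.c code: page_referenced_sketch(),
    test_and_clear() and the bool arrays are hypothetical stand-ins for the
    per-mapping young bits and the storage-key reference bit.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return the old value of a bit and reset it (models *_test_and_clear_young). */
static bool test_and_clear(bool *bit)
{
	bool was_set = *bit;

	*bit = false;
	return was_set;
}

/*
 * Hypothetical model of the ordering: every mapping is examined first,
 * and only then is the physical page's reference bit tested and reset.
 */
static int page_referenced_sketch(bool *pte_young, size_t nr_mappings,
				  bool *key_referenced)
{
	int referenced = 0;
	size_t i;

	for (i = 0; i < nr_mappings; i++)
		if (test_and_clear(&pte_young[i]))
			referenced++;

	/* Physical page check comes last, after all mappings were seen. */
	if (test_and_clear(key_referenced))
		referenced++;

	return referenced;
}
```

    The sketch only fixes the order of the two kinds of checks; how the pgste
    bits are combined with the storage key is left out.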

    Signed-off-by: Heiko Carstens
    Signed-off-by: Christian Borntraeger
    Acked-by: Martin Schwidefsky
    Acked-by: Andrew Morton
    Signed-off-by: Carsten Otte
    Signed-off-by: Avi Kivity

    Christian Borntraeger
     

20 Mar, 2008

1 commit


05 Mar, 2008

1 commit

  • vm_match_cgroup is a perverse name for a macro to match mm with cgroup: rename
    it mm_match_cgroup, matching mm_init_cgroup and mm_free_cgroup.

    Signed-off-by: Hugh Dickins
    Acked-by: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

10 Feb, 2008

1 commit

  • mm_cgroup() is exclusively used to test whether an mm's mem_cgroup pointer
    is pointing to a specific cgroup. Instead of returning the pointer, we can
    just do the test itself in a new macro:

    vm_match_cgroup(mm, cgroup)

    returns non-zero if the mm's mem_cgroup points to cgroup. Otherwise it
    returns zero.

    Signed-off-by: David Rientjes
    Cc: Balbir Singh
    Cc: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

08 Feb, 2008

2 commits

  • Make page_referenced() cgroup aware. Without this patch, page_referenced()
    can cause a page to be skipped while reclaiming pages. This patch ensures
    that other cgroups do not hold pages in a particular cgroup hostage. It
    is required to ensure that shared pages are freed from a cgroup when they
    are not actively referenced from the cgroup that brought them in.

    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add the accounting hooks. The accounting is carried out for RSS and Page
    Cache (unmapped) pages. There is now a common limit and accounting for both.
    RSS is accounted at page_add_*_rmap() and page_remove_rmap()
    time. Page cache is accounted at add_to_page_cache() and
    __delete_from_page_cache(). Swap cache is also accounted for.

    Each page's page_cgroup is protected with the last bit of the
    page_cgroup pointer; this makes handling of race conditions involving
    simultaneous mappings of a page easier. A reference count is kept in the
    page_cgroup to deal with cases where a page might be unmapped from the RSS
    of all tasks, but still lives in the page cache.
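
    A minimal sketch of keeping a lock flag in the low bit of a pointer,
    assuming "last bit" means bit 0 and that the pointed-to object is at
    least 2-byte aligned. The pc_* helpers are hypothetical names, and the
    sketch is not atomic; real code would need atomic bit operations.

```c
#include <stdint.h>

/* Assumption: bit 0 of an aligned pointer is free to act as a lock flag. */
#define PC_LOCK_BIT ((uintptr_t)1)

/* Combine a pointer and a lock flag into one word. */
static inline uintptr_t pc_pack(const void *pc, int locked)
{
	return (uintptr_t)pc | (locked ? PC_LOCK_BIT : 0);
}

/* Strip the flag to recover the original pointer. */
static inline void *pc_pointer(uintptr_t word)
{
	return (void *)(word & ~PC_LOCK_BIT);
}

/* Non-zero if the lock flag is set. */
static inline int pc_locked(uintptr_t word)
{
	return (word & PC_LOCK_BIT) != 0;
}
```

    Storing the flag in the pointer word means the lock and the pointer can
    be read and updated together, which is what simplifies the race handling.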

    Credits go to Vaidyanathan Srinivasan for helping with reference counting work
    of the page cgroup. Almost all of the page cache accounting code has help
    from Vaidyanathan Srinivasan.

    [hugh@veritas.com: fix swapoff breakage]
    [akpm@linux-foundation.org: fix locking]
    Signed-off-by: Vaidyanathan Srinivasan
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc:
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

06 Feb, 2008

2 commits

  • try_to_unmap always fails on a page found in a VM_LOCKED vma (unless
    migrating), and recycles it back to the active list. But if it's an
    anonymous page, we've already allocated swap to it: just wasting swap.
    Spot locked pages in page_referenced_one and treat them as referenced.

    Signed-off-by: Hugh Dickins
    Tested-by: KAMEZAWA Hiroyuki
    Cc: Ethan Solomita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Most pagecache (and some other) radix tree insertions have the great
    opportunity to preallocate a few nodes with relaxed gfp flags. But the
    preallocation is squandered: when it comes time to allocate a node, we
    default to first attempting a GFP_ATOMIC allocation -- that doesn't
    normally fail, but it can eat into atomic memory reserves that we don't
    need to be using.

    Another upshot of this is that it removes the sometimes highly contended
    zone->lock from underneath tree_lock. Pagecache insertions are always
    performed with a radix tree preload, and after this change, such a
    situation will never fall back to kmem_cache_alloc within
    radix_tree_node_alloc.

    David Miller reports seeing this allocation fail on a highly threaded
    sparc64 system:

    [527319.459981] dd: page allocation failure. order:0, mode:0x20
    [527319.460403] Call Trace:
    [527319.460568] [00000000004b71e0] __slab_alloc+0x1b0/0x6a8
    [527319.460636] [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8
    [527319.460698] [000000000055309c] radix_tree_node_alloc+0x20/0x90
    [527319.460763] [0000000000553238] radix_tree_insert+0x12c/0x260
    [527319.460830] [0000000000495cd0] add_to_page_cache+0x38/0xb0
    [527319.460893] [00000000004e4794] mpage_readpages+0x6c/0x134
    [527319.460955] [000000000049c7fc] __do_page_cache_readahead+0x170/0x280
    [527319.461028] [000000000049cc88] ondemand_readahead+0x208/0x214
    [527319.461094] [0000000000496018] do_generic_mapping_read+0xe8/0x428
    [527319.461152] [0000000000497948] generic_file_aio_read+0x108/0x170
    [527319.461217] [00000000004badac] do_sync_read+0x88/0xd0
    [527319.461292] [00000000004bb5cc] vfs_read+0x78/0x10c
    [527319.461361] [00000000004bb920] sys_read+0x34/0x60
    [527319.461424] [0000000000406294] linux_sparc_syscall32+0x3c/0x40

    The calltrace is significant: __do_page_cache_readahead allocates a number
    of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
    memory to satisfy GFP_ATOMIC allocations. However after the list of pages
    goes to mpage_readpages, there can be significant intervals (including disk
    IO) before all the pages are inserted into the radix-tree. So the reserves
    can easily be depleted at that point. The patch is confirmed to fix the
    problem.

    Signed-off-by: Nick Piggin
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

20 Nov, 2007

1 commit

  • page_mkclean used to call page_clear_dirty for every given page. This
    is different to all other architectures, where the dirty bit in the
    PTEs is only reset if page_mapping() returns a non-NULL pointer.
    We can move the page_test_dirty/page_clear_dirty sequence into the
    2nd if to avoid unnecessary iske/sske sequences, which are expensive.

    This change also helps kvm for s390 as the host must transfer the
    dirty bit into the guest status bits. By moving the page_clear_dirty
    operation into the 2nd if, the vm will only call page_clear_dirty
    for pages where it walks the mapping anyway. There it calls
    ptep_clear_flush for writable ptes, so we can transfer the dirty bit
    to the guest.

    Signed-off-by: Christian Borntraeger
    Signed-off-by: Martin Schwidefsky

    Christian Borntraeger
     

15 Nov, 2007

1 commit

  • We hit the BUG_ON() in mm/rmap.c:vma_address() when trying to migrate via
    mbind(MPOL_MF_MOVE) a non-anon region that spans multiple vmas. For
    anon-regions, we just fail to migrate any pages beyond the 1st vma in the
    range.

    This occurs because do_mbind() collects a list of pages to migrate by
    calling check_range(). check_range() walks the task's mm, spanning vmas as
    necessary, to collect the migratable pages into a list. Then, do_mbind()
    calls migrate_pages() passing the list of pages, a function to allocate new
    pages based on vma policy [new_vma_page()], and a pointer to the first vma
    of the range.

    For each page in the list, new_vma_page() calls page_address_in_vma()
    passing the page and the vma [first in range] to obtain the address to get
    for alloc_page_vma(). The page address is needed to get interleaving
    policy correct. If the pages in the list come from multiple vmas,
    eventually, new_page_address() will pass that page to page_address_in_vma()
    with the incorrect vma. For !PageAnon pages, this will result in a bug
    check in rmap.c:vma_address(). For anon pages, vma_address() will just
    return EFAULT and fail the migration.

    This patch modifies new_vma_page() to check the return value from
    page_address_in_vma(). If the return value is EFAULT, new_vma_page()
    searches forward via vm_next for the vma that maps the page--i.e., that does
    not return EFAULT. This assumes that the pages in the list handed to
    migrate_pages() are in address order. This is currently the case. The patch
    documents this assumption in a new comment block for new_vma_page().

    If new_vma_page() cannot locate the vma mapping the page in a forward
    search in the mm, it will pass a NULL vma to alloc_page_vma(). This will
    result in the allocation using the task policy, if any, else system default
    policy. This situation is unlikely, but the patch documents this behavior
    with a comment.
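
    The forward search can be sketched as follows. This is a simplified
    model, not the mempolicy code: struct vma_sketch, address_in_vma() and
    find_vma_for_page() are hypothetical, and a page is represented only by
    the user address it is mapped at.

```c
#include <stddef.h>

/* Stand-in for the EFAULT error value returned for a page outside the vma. */
#define SKETCH_EFAULT ((unsigned long)-14)

struct vma_sketch {
	unsigned long vm_start;		/* first address in the vma */
	unsigned long vm_end;		/* first address past the vma */
	struct vma_sketch *vm_next;	/* next vma in address order */
};

/* Models page_address_in_vma(): the address, or an error if not in vma. */
static unsigned long address_in_vma(unsigned long page_addr,
				    const struct vma_sketch *vma)
{
	if (page_addr < vma->vm_start || page_addr >= vma->vm_end)
		return SKETCH_EFAULT;
	return page_addr;
}

/*
 * Walk forward from the first vma of the range until one maps the page.
 * A NULL result means no vma was found; the caller would then allocate
 * with a NULL vma, i.e. task policy or system default policy.
 */
static const struct vma_sketch *
find_vma_for_page(unsigned long page_addr, const struct vma_sketch *first)
{
	const struct vma_sketch *vma = first;

	while (vma && address_in_vma(page_addr, vma) == SKETCH_EFAULT)
		vma = vma->vm_next;
	return vma;
}
```

    Because the walk restarts at the first vma on every call, the cost grows
    with the number of vmas in the range.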

    Note, this patch results in restarting from the first vma in a multi-vma
    range each time new_vma_page() is called. If this is not acceptable, we
    can make the vma argument a pointer, both in new_vma_page() and its caller
    unmap_and_move() so that the value held by the loop in migrate_pages()
    always passes down the last vma in which a page was found. This will
    require changes to all new_page_t functions passed to migrate_pages(). Is
    this necessary?

    For this patch to work, we can't bug check in vma_address() for pages
    outside the argument vma. This patch removes the BUG_ON(). All other
    callers [besides new_vma_page()] already check the return status.

    Tested on x86_64, 4 node NUMA platform.

    Signed-off-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

17 Oct, 2007

3 commits

  • zone->lock is quite an "inner" lock and mostly constrained to page alloc as
    well, so like slab locks, it probably isn't something that is critically
    important to document here. However unlike slab locks, zone lock could be
    used more widely in future, and page_alloc.c might possibly have more
    business to do tricky things with pagecache than does slab. So... I don't
    think it hurts to document it.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Slab constructors currently have a flags parameter that is never used. And
    the order of the arguments is opposite to other slab functions. The object
    pointer is placed before the kmem_cache pointer.

    Convert

    ctor(void *object, struct kmem_cache *s, unsigned long flags)

    to

    ctor(struct kmem_cache *s, void *object)

    throughout the kernel
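
    As an illustration of the conversion, here is the same constructor
    written against both signatures. struct foo and the foo_ctor* names are
    hypothetical, and struct kmem_cache is left opaque since the sketch
    never dereferences it.

```c
#include <string.h>

struct kmem_cache;	/* opaque here; the constructors do not look inside */

/* Hypothetical object type managed by a slab cache. */
struct foo {
	int refcount;
	char name[16];
};

/* Old signature: object first, cache second, plus an unused flags word. */
static void foo_ctor_old(void *object, struct kmem_cache *s,
			 unsigned long flags)
{
	(void)s;
	(void)flags;
	memset(object, 0, sizeof(struct foo));
}

/* New signature: cache first, object second, no flags. */
static void foo_ctor(struct kmem_cache *s, void *object)
{
	(void)s;
	memset(object, 0, sizeof(struct foo));
}
```

    Only the parameter list changes; the body of a typical constructor stays
    the same.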

    [akpm@linux-foundation.org: coupla fixes]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The current ia64 kernel flushes the icache by lazy_mmu_prot_update() *after*
    set_pte(). This is too late. This patch removes lazy_mmu_prot_update and
    adds a modified set_pte() for flushing if necessary.

    This patch flushes the icache of a page when
    the new pte has the exec bit
    && the new pte has the present bit
    && the new pte is a user's page
    && (the old *ptep is not present
    || the new pte's pfn is not the same as the old *ptep's pfn)
    && the new pte's page has no PG_arch_1 bit.
    PG_arch_1 is set when a page is cache consistent.
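
    The condition above can be written as a single predicate. This is a
    sketch with hypothetical types: struct pte_sketch and
    needs_icache_flush() are not the ia64 definitions, and the page's
    arch_1 flag is passed in as a plain bool.

```c
#include <stdbool.h>

/* Hypothetical flattened view of the pte bits the condition looks at. */
struct pte_sketch {
	bool present;
	bool exec;
	bool user;
	unsigned long pfn;
};

/*
 * True when the icache must be flushed before installing new_pte,
 * following the list of conditions in the commit message.
 */
static bool needs_icache_flush(struct pte_sketch old_pte,
			       struct pte_sketch new_pte,
			       bool page_arch_1)
{
	return new_pte.exec &&
	       new_pte.present &&
	       new_pte.user &&
	       (!old_pte.present || new_pte.pfn != old_pte.pfn) &&
	       !page_arch_1;	/* arch_1 set means already cache consistent */
}
```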

    I think these condition checks are much easier to understand than considering
    "Where sync_icache_dcache() should be inserted ?".

    pte_user() for ia64 was removed by http://lkml.org/lkml/2007/6/12/67 as
    clean-up. So, I added it again.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Luck, Tony"
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

20 Jul, 2007

1 commit

  • Slab destructors were no longer supported after Christoph's
    c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
    BUGs for both slab and slub, and slob never supported them
    either.

    This rips out support for the dtor pointer from kmem_cache_create()
    completely and fixes up every single callsite in the kernel (there were
    about 224, not including the slab allocator definitions themselves,
    or the documentation references).

    Signed-off-by: Paul Mundt

    Paul Mundt