28 Apr, 2008

1 commit

  • Nothing in the tree uses nopage any more. Remove support for it in the
    core mm code and documentation (and a few stray references to it in
    comments).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

27 Apr, 2008

1 commit

  • This patch changes the s390 memory management defintions to use the pgste field
    for dirty and reference bit tracking of host and guest code. Usually on s390,
    dirty and referenced are tracked in storage keys, which belong to the physical
    page. This changes with virtualization: The guest and host dirty/reference bits
    are defined to be the logical OR of the values for the mapping and the physical
    page. This patch implements the necessary changes in pgtable.h for s390.

    There is a common code change in mm/rmap.c, the call to
    page_test_and_clear_young must be moved. This is a no-op for all
    architecture but s390. page_referenced checks the referenced bits for
    the physiscal page and for all mappings:
    o The physical page is checked with page_test_and_clear_young.
    o The mappings are checked with ptep_test_and_clear_young and friends.

    Without pgstes (the current implementation on Linux s390) the physical page
    check is implemented but the mapping callbacks are no-ops because dirty
    and referenced are not tracked in the s390 page tables. The pgstes introduces
    guest and host dirty and reference bits for s390 in the host mapping. These
    mapping must be checked before page_test_and_clear_young resets the reference
    bit.

    Signed-off-by: Heiko Carstens
    Signed-off-by: Christian Borntraeger
    Acked-by: Martin Schwidefsky
    Acked-by: Andrew Morton
    Signed-off-by: Carsten Otte
    Signed-off-by: Avi Kivity

    Christian Borntraeger
     

20 Mar, 2008

1 commit


05 Mar, 2008

1 commit

  • vm_match_cgroup is a perverse name for a macro to match mm with cgroup: rename
    it mm_match_cgroup, matching mm_init_cgroup and mm_free_cgroup.

    Signed-off-by: Hugh Dickins
    Acked-by: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

10 Feb, 2008

1 commit

  • mm_cgroup() is exclusively used to test whether an mm's mem_cgroup pointer
    is pointing to a specific cgroup. Instead of returning the pointer, we can
    just do the test itself in a new macro:

    vm_match_cgroup(mm, cgroup)

    returns non-zero if the mm's mem_cgroup points to cgroup. Otherwise it
    returns zero.

    Signed-off-by: David Rientjes
    Cc: Balbir Singh
    Cc: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

08 Feb, 2008

2 commits

  • Make page_referenced() cgroup aware. Without this patch, page_referenced()
    can cause a page to be skipped while reclaiming pages. This patch ensures
    that other cgroups do not hold pages in a particular cgroup hostage. It
    is required to ensure that shared pages are freed from a cgroup when they
    are not actively referenced from the cgroup that brought them in

    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add the accounting hooks. The accounting is carried out for RSS and Page
    Cache (unmapped) pages. There is now a common limit and accounting for both.
    The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
    time. Page cache is accounted at add_to_page_cache(),
    __delete_from_page_cache(). Swap cache is also accounted for.

    Each page's page_cgroup is protected with the last bit of the
    page_cgroup pointer, this makes handling of race conditions involving
    simultaneous mappings of a page easier. A reference count is kept in the
    page_cgroup to deal with cases where a page might be unmapped from the RSS
    of all tasks, but still lives in the page cache.

    Credits go to Vaidyanathan Srinivasan for helping with reference counting work
    of the page cgroup. Almost all of the page cache accounting code has help
    from Vaidyanathan Srinivasan.

    [hugh@veritas.com: fix swapoff breakage]
    [akpm@linux-foundation.org: fix locking]
    Signed-off-by: Vaidyanathan Srinivasan
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc:
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

06 Feb, 2008

2 commits

  • try_to_unmap always fails on a page found in a VM_LOCKED vma (unless
    migrating), and recycles it back to the active list. But if it's an
    anonymous page, we've already allocated swap to it: just wasting swap.
    Spot locked pages in page_referenced_one and treat them as referenced.

    Signed-off-by: Hugh Dickins
    Tested-by: KAMEZAWA Hiroyuki
    Cc: Ethan Solomita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Most pagecache (and some other) radix tree insertions have the great
    opportunity to preallocate a few nodes with relaxed gfp flags. But the
    preallocation is squandered when it comes time to allocate a node, we
    default to first attempting a GFP_ATOMIC allocation -- that doesn't
    normally fail, but it can eat into atomic memory reserves that we don't
    need to be using.

    Another upshot of this is that it removes the sometimes highly contended
    zone->lock from underneath tree_lock. Pagecache insertions are always
    performed with a radix tree preload, and after this change, such a
    situation will never fall back to kmem_cache_alloc within
    radix_tree_node_alloc.

    David Miller reports seeing this allocation fail on a highly threaded
    sparc64 system:

    [527319.459981] dd: page allocation failure. order:0, mode:0x20
    [527319.460403] Call Trace:
    [527319.460568] [00000000004b71e0] __slab_alloc+0x1b0/0x6a8
    [527319.460636] [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8
    [527319.460698] [000000000055309c] radix_tree_node_alloc+0x20/0x90
    [527319.460763] [0000000000553238] radix_tree_insert+0x12c/0x260
    [527319.460830] [0000000000495cd0] add_to_page_cache+0x38/0xb0
    [527319.460893] [00000000004e4794] mpage_readpages+0x6c/0x134
    [527319.460955] [000000000049c7fc] __do_page_cache_readahead+0x170/0x280
    [527319.461028] [000000000049cc88] ondemand_readahead+0x208/0x214
    [527319.461094] [0000000000496018] do_generic_mapping_read+0xe8/0x428
    [527319.461152] [0000000000497948] generic_file_aio_read+0x108/0x170
    [527319.461217] [00000000004badac] do_sync_read+0x88/0xd0
    [527319.461292] [00000000004bb5cc] vfs_read+0x78/0x10c
    [527319.461361] [00000000004bb920] sys_read+0x34/0x60
    [527319.461424] [0000000000406294] linux_sparc_syscall32+0x3c/0x40

    The calltrace is significant: __do_page_cache_readahead allocates a number
    of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
    memory to satisfy GFP_ATOMIC allocations. However after the list of pages
    goes to mpage_readpages, there can be significant intervals (including disk
    IO) before all the pages are inserted into the radix-tree. So the reserves
    can easily be depleted at that point. The patch is confirmed to fix the
    problem.

    Signed-off-by: Nick Piggin
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

20 Nov, 2007

1 commit

  • page_mkclean used to call page_clear_dirty for every given page. This
    is different to all other architectures, where the dirty bit in the
    PTEs is only resetted, if page_mapping() returns a non-NULL pointer.
    We can move the page_test_dirty/page_clear_dirty sequence into the
    2nd if to avoid unnecessary iske/sske sequences, which are expensive.

    This change also helps kvm for s390 as the host must transfer the
    dirty bit into the guest status bits. By moving the page_clear_dirty
    operation into the 2nd if, the vm will only call page_clear_dirty
    for pages where it walks the mapping anyway. There it calls
    ptep_clear_flush for writable ptes, so we can transfer the dirty bit
    to the guest.

    Signed-off-by: Christian Borntraeger
    Signed-off-by: Martin Schwidefsky

    Christian Borntraeger
     

15 Nov, 2007

1 commit

  • We hit the BUG_ON() in mm/rmap.c:vma_address() when trying to migrate via
    mbind(MPOL_MF_MOVE) a non-anon region that spans multiple vmas. For
    anon-regions, we just fail to migrate any pages beyond the 1st vma in the
    range.

    This occurs because do_mbind() collects a list of pages to migrate by
    calling check_range(). check_range() walks the task's mm, spanning vmas as
    necessary, to collect the migratable pages into a list. Then, do_mbind()
    calls migrate_pages() passing the list of pages, a function to allocate new
    pages based on vma policy [new_vma_page()], and a pointer to the first vma
    of the range.

    For each page in the list, new_vma_page() calls page_address_in_vma()
    passing the page and the vma [first in range] to obtain the address to get
    for alloc_page_vma(). The page address is needed to get interleaving
    policy correct. If the pages in the list come from multiple vmas,
    eventually, new_page_address() will pass that page to page_address_in_vma()
    with the incorrect vma. For !PageAnon pages, this will result in a bug
    check in rmap.c:vma_address(). For anon pages, vma_address() will just
    return EFAULT and fail the migration.

    This patch modifies new_vma_page() to check the return value from
    page_address_in_vma(). If the return value is EFAULT, new_vma_page()
    searchs forward via vm_next for the vma that maps the page--i.e., that does
    not return EFAULT. This assumes that the pages in the list handed to
    migrate_pages() is in address order. This is currently case. The patch
    documents this assumption in a new comment block for new_vma_page().

    If new_vma_page() cannot locate the vma mapping the page in a forward
    search in the mm, it will pass a NULL vma to alloc_page_vma(). This will
    result in the allocation using the task policy, if any, else system default
    policy. This situation is unlikely, but the patch documents this behavior
    with a comment.

    Note, this patch results in restarting from the first vma in a multi-vma
    range each time new_vma_page() is called. If this is not acceptable, we
    can make the vma argument a pointer, both in new_vma_page() and it's caller
    unmap_and_move() so that the value held by the loop in migrate_pages()
    always passes down the last vma in which a page was found. This will
    require changes to all new_page_t functions passed to migrate_pages(). Is
    this necessary?

    For this patch to work, we can't bug check in vma_address() for pages
    outside the argument vma. This patch removes the BUG_ON(). All other
    callers [besides new_vma_page()] already check the return status.

    Tested on x86_64, 4 node NUMA platform.

    Signed-off-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

17 Oct, 2007

3 commits

  • zone->lock is quite an "inner" lock and mostly constrained to page alloc as
    well, so like slab locks, it probably isn't something that is critically
    important to document here. However unlike slab locks, zone lock could be
    used more widely in future, and page_alloc.c might possibly have more
    business to do tricky things with pagecache than does slab. So... I don't
    think it hurts to document it.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Slab constructors currently have a flags parameter that is never used. And
    the order of the arguments is opposite to other slab functions. The object
    pointer is placed before the kmem_cache pointer.

    Convert

    ctor(void *object, struct kmem_cache *s, unsigned long flags)

    to

    ctor(struct kmem_cache *s, void *object)

    throughout the kernel

    [akpm@linux-foundation.org: coupla fixes]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Current ia64 kernel flushes icache by lazy_mmu_prot_update() *after*
    set_pte(). This is too late. This patch removes lazy_mmu_prot_update and
    add modfied set_pte() for flushing if necessary.

    This patch flush icache of a page when
    new pte has exec bit.
    && new pte has present bit
    && new pte is user's page.
    && (old *ptep is not present
    || new pte's pfn is not same to old *ptep's ptn)
    && new pte's page has no Pg_arch_1 bit.
    Pg_arch_1 is set when a page is cache consistent.

    I think this condition checks are much easier to understand than considering
    "Where sync_icache_dcache() should be inserted ?".

    pte_user() for ia64 was removed by http://lkml.org/lkml/2007/6/12/67 as
    clean-up. So, I added it again.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Luck, Tony"
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

20 Jul, 2007

2 commits

  • Slab destructors were no longer supported after Christoph's
    c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
    BUGs for both slab and slub, and slob never supported them
    either.

    This rips out support for the dtor pointer from kmem_cache_create()
    completely and fixes up every single callsite in the kernel (there were
    about 224, not including the slab allocator definitions themselves,
    or the documentation references).

    Signed-off-by: Paul Mundt

    Paul Mundt
     
  • Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
    the virtual address -> file offset differently from linear mappings.

    ->populate is a layering violation because the filesystem/pagecache code
    should need to know anything about the virtual memory mapping. The hitch here
    is that the ->nopage handler didn't pass down enough information (ie. pgoff).
    But it is more logical to pass pgoff rather than have the ->nopage function
    calculate it itself anyway (because that's a similar layering violation).

    Having the populate handler install the pte itself is likewise a nasty thing
    to be doing.

    This patch introduces a new fault handler that replaces ->nopage and
    ->populate and (later) ->nopfn. Most of the old mechanism is still in place
    so there is a lot of duplication and nice cleanups that can be removed if
    everyone switches over.

    The rationale for doing this in the first place is that nonlinear mappings are
    subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
    to duplicate the synchronisation logic rather than just consolidate the two.

    After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
    pagecache. Seems like a fringe functionality anyway.

    NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no
    users have hit mainline yet.

    [akpm@linux-foundation.org: cleanup]
    [randy.dunlap@oracle.com: doc. fixes for readahead]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Randy Dunlap
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

29 Jun, 2007

1 commit

  • validate_anon_vma gave a useful check on the integrity of the anon_vma list
    when Andrea was developing obj rmap; but it was not enabled in SLES9
    itself, nor in mainline, until Nick changed commented-out RMAP_DEBUG to
    configurable CONFIG_DEBUG_VM in 2.6.17. Now Petr Vandrovec reports that
    its BUG_ON(mapcount > 100000) can easily crash a CONFIG_DEBUG_VM=y system.

    That limit was just an arbitrary number to protect against an infinite
    loop. We could raise it to something enormous (depending on sizeof struct
    vma and size of memory?); but I rather think validate_anon_vma has outlived
    its usefulness, and is better just removed - which gives a magnificent
    performance boost to anything like Petr's test program ;)

    Of course, a very long anon_vma list is bad news for preemption latency,
    and I believe there has been one recent report of such: let's not forget
    that, but validate_anon_vma only makes it worse not better.

    Signed-off-by: Hugh Dickins
    Cc: Petr Vandrovec
    Acked-by: Nick Piggin
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 May, 2007

2 commits

  • Re-introduce rmap verification patches that Hugh removed when he removed
    PG_map_lock. PG_map_lock actually isn't needed to synchronise access to
    anonymous pages, because PG_locked and PTL together already do.

    These checks were important in discovering and fixing a rare rmap corruption
    in SLES9.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

    Signed-off-by: Christoph Lameter
    Cc: David Howells
    Cc: Jens Axboe
    Cc: Steven French
    Cc: Michael Halcrow
    Cc: OGAWA Hirofumi
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Roman Zippel
    Cc: David Woodhouse
    Cc: Dave Kleikamp
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Anton Altaparmakov
    Cc: Mark Fasheh
    Cc: Paul Mackerras
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: David Chinner
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

09 May, 2007

1 commit

  • This implements deferred IO support in fbdev. Deferred IO is a way to delay
    and repurpose IO. This implementation is done using mm's page_mkwrite and
    page_mkclean hooks in order to detect, delay and then rewrite IO. This
    functionality is used by hecubafb.

    [adaplas]
    This is useful for graphics hardware with no directly addressable/mappable
    framebuffer. Implementing this will allow the "framebuffer" to be accesible
    from user space via mmap().

    Signed-off-by: Jaya Kumar
    Signed-off-by: Antonino Daplas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jaya Kumar
     

08 May, 2007

1 commit

  • I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by
    SLAB.

    I think its purpose was to have a callback after an object has been freed
    to verify that the state is the constructor state again? The callback is
    performed before each freeing of an object.

    I would think that it is much easier to check the object state manually
    before the free. That also places the check near the code object
    manipulation of the object.

    Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
    compiled with SLAB debugging on. If there would be code in a constructor
    handling SLAB_DEBUG_INITIAL then it would have to be conditional on
    SLAB_DEBUG otherwise it would just be dead code. But there is no such code
    in the kernel. I think SLUB_DEBUG_INITIAL is too problematic to make real
    use of, difficult to understand and there are easier ways to accomplish the
    same effect (i.e. add debug code before kfree).

    There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
    clear in fs inode caches. Remove the pointless checks (they would even be
    pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors.

    This is the last slab flag that SLUB did not support. Remove the check for
    unimplemented flags from SLUB.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

27 Apr, 2007

1 commit

  • The page_test_and_clear_dirty primitive really consists of two
    operations, page_test_dirty and the page_clear_dirty. The combination
    of the two is not an atomic operation, so it makes more sense to have
    two separate operations instead of one.
    In addition to the improved readability of the s390 version of
    SetPageUptodate, it now avoids the page_test_dirty operation which is
    an insert-storage-key-extended (iske) instruction which is an expensive
    operation.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

04 Apr, 2007

1 commit

  • The git commit c2fda5fed81eea077363b285b66eafce20dfd45a which
    added the page_test_and_clear_dirty call to page_mkclean and the
    git commit 7658cc289288b8ae7dd2c2224549a048431222b3 which fixes
    the "nasty and subtle race in shared mmap'ed page writeback"
    problem in clear_page_dirty_for_io cause data corruption on s390.

    The effect of the two changes is that for every call to
    clear_page_dirty_for_io a page_test_and_clear_dirty is done. If
    the per page dirty bit is set set_page_dirty is called. Strangly
    clear_page_dirty_for_io is called for not-uptodate pages, e.g.
    over this call-chain:

    [] clear_page_dirty_for_io+0x12a/0x130
    [] generic_writepages+0x258/0x3e0
    [] do_writepages+0x76/0x7c
    [] __writeback_single_inode+0xba/0x3e4
    [] sync_sb_inodes+0x23e/0x398
    [] writeback_inodes+0x12e/0x140
    [] wb_kupdate+0xd2/0x178
    [] pdflush+0x162/0x23c

    The bad news now is that page_test_and_clear_dirty might claim
    that a not-uptodate page is dirty since SetPageUptodate which
    resets the per page dirty bit has not yet been called. The page
    writeback that follows clobbers the data on disk.

    The simplest solution to this problem is to move the call to
    page_test_and_clear_dirty under the "if (page_mapped(page))".
    If a file backed page is mapped it is uptodate.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

02 Mar, 2007

1 commit

  • page_lock_anon_vma() uses spin_lock() to block RCU. This doesn't work with
    PREEMPT_RCU, we have to do rcu_read_lock() explicitely. Otherwise, it is
    theoretically possible that slab returns anon_vma's memory to the system
    before we do spin_unlock(&anon_vma->lock).

    [ Hugh points out that this only matters for PREEMPT_RCU, which isn't merged
    yet, and may never be. Regardless, this patch is conceptually the
    right thing to do, even if it doesn't matter at this point. - Linus ]

    Signed-off-by: Oleg Nesterov
    Cc: Paul McKenney
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

31 Dec, 2006

1 commit


23 Dec, 2006

2 commits


21 Oct, 2006

1 commit


12 Oct, 2006

1 commit

  • We have a persistent dribble of reports of this BUG triggering. Its extended
    diagnostics were recently made conditional on CONFIG_DEBUG_VM, which was a bad
    idea - we want to know about it.

    Signed-off-by: Dave Jones
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     

26 Sep, 2006

1 commit

  • Tracking of dirty pages in shared writeable mmap()s.

    The idea is simple: write protect clean shared writeable pages, catch the
    write-fault, make writeable and set dirty. On page write-back clean all the
    PTE dirty bits and write protect them once again.

    The implementation is a tad harder, mainly because the default
    backing_dev_info capabilities were too loosely maintained. Hence it is not
    enough to test the backing_dev_info for cap_account_dirty.

    The current heuristic is as follows, a VMA is eligible when:
    - its shared writeable
    (vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)
    - it is not a 'special' mapping
    (vm_flags & (VM_PFNMAP|VM_INSERTPAGE)) == 0
    - the backing_dev_info is cap_account_dirty
    mapping_cap_account_dirty(vma->vm_file->f_mapping)
    - f_op->mmap() didn't change the default page protection

    Page from remap_pfn_range() are explicitly excluded because their COW
    semantics are already horrid enough (see vm_normal_page() in do_wp_page()) and
    because they don't have a backing store anyway.

    mprotect() is taught about the new behaviour as well. However it overrides
    the last condition.

    Cleaning the pages on write-back is done with page_mkclean() a new rmap call.
    It can be called on any page, but is currently only implemented for mapped
    pages, if the page is found the be of a VMA that accounts dirty pages it will
    also wrprotect the PTE.

    Finally, in fs/buffers.c:try_to_free_buffers(); remove clear_page_dirty() from
    under ->private_lock. This seems to be safe, since ->private_lock is used to
    serialize access to the buffers, not the page itself. This is needed because
    clear_page_dirty() will call into page_mkclean() and would thereby violate
    locking order.

    [dhowells@redhat.com: Provide a page_mkclean() implementation for NOMMU]
    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

01 Jul, 2006

2 commits

  • The current NR_FILE_MAPPED is used by zone reclaim and the dirty load
    calculation as the number of mapped pagecache pages. However, that is not
    true. NR_FILE_MAPPED includes the mapped anonymous pages. This patch
    separates those and therefore allows an accurate tracking of the anonymous
    pages per zone.

    It then becomes possible to determine the number of unmapped pages per zone
    and we can avoid scanning for unmapped pages if there are none.

    Also it may now be possible to determine the mapped/unmapped ratio in
    get_dirty_limit. Isnt the number of anonymous pages irrelevant in that
    calculation?

    Note that this will change the meaning of the number of mapped pages reported
    in /proc/vmstat /proc/meminfo and in the per node statistics. This may affect
    user space tools that monitor these counters! NR_FILE_MAPPED works like
    NR_FILE_DIRTY. It is only valid for pagecache pages.

    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • nr_mapped is important because it allows a determination of how many pages of
    a zone are not mapped, which would allow a more efficient means of determining
    when we need to reclaim memory in a zone.

    We take the nr_mapped field out of the page state structure and define a new
    per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
    from NR_MAPPED in the next patch).

    We replace the use of nr_mapped in various kernel locations. This avoids the
    looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
    zone reclaim).

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

26 Jun, 2006

1 commit

  • Hugh clarified the role of VM_LOCKED. So we can now implement page
    migration for mlocked pages.

    Allow the migration of mlocked pages. This means that try_to_unmap must
    unmap mlocked pages in the migration case.

    Signed-off-by: Christoph Lameter
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

23 Jun, 2006

5 commits

  • This implements the use of migration entries to preserve ptes of file backed
    pages during migration. Processes can therefore be migrated back and forth
    without loosing their connection to pagecache pages.

    Note that we implement the migration entries only for linear mappings.
    Nonlinear mappings still require the unmapping of the ptes for migration.

    And another writepage() ugliness shows up. writepage() can drop the page
    lock. Therefore we have to remove migration ptes before calling writepages()
    in order to avoid having migration entries point to unlocked pages.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If we install a migration entry then the rss not really decreases since the
    page is just moved somewhere else. We can save ourselves the work of
    decrementing and later incrementing which will just eventually cause cacheline
    bouncing.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Rip the page migration logic out.

    Remove all code that has to do with swapping during page migration.

    This also guts the ability to migrate pages to swap. No one used that so lets
    let it go for good.

    Page migration should be a bit broken after this patch.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Implement read/write migration ptes

    We take the upper two swapfiles for the two types of migration ptes and define
    a series of macros in swapops.h.

    The VM is modified to handle the migration entries. migration entries can
    only be encountered when the page they are pointing to is locked. This limits
    the number of places one has to fix. We also check in copy_pte_range and in
    mprotect_pte_range() for migration ptes.

    We check for migration ptes in do_swap_cache and call a function that will
    then wait on the page lock. This allows us to effectively stop all accesses
    to apge.

    Migration entries are created by try_to_unmap if called for migration and
    removed by local functions in migrate.c

    From: Hugh Dickins

    Several times while testing swapless page migration (I've no NUMA, just
    hacking it up to migrate recklessly while running load), I've hit the
    BUG_ON(!PageLocked(p)) in migration_entry_to_page.

    This comes from an orphaned migration entry, unrelated to the current
    correctly locked migration, but hit by remove_anon_migration_ptes as it
    checks an address in each vma of the anon_vma list.

    Such an orphan may be left behind if an earlier migration raced with fork:
    copy_one_pte can duplicate a migration entry from parent to child, after
    remove_anon_migration_ptes has checked the child vma, but before it has
    removed it from the parent vma. (If the process were later to fault on this
    orphaned entry, it would hit the same BUG from migration_entry_wait.)

    This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
    not. There's no such problem with file pages, because vma_prio_tree_add
    adds child vma after parent vma, and the page table locking at each end is
    enough to serialize. Follow that example with anon_vma: add new vmas to the
    tail instead of the head.

    (There's no corresponding problem when inserting migration entries,
    because a missed pte will leave the page count and mapcount high, which is
    allowed for. And there's no corresponding problem when migrating via swap,
    because a leftover swap entry will be correctly faulted. But the swapless
    method has no refcounting of its entries.)

    From: Ingo Molnar

    pte_unmap_unlock() takes the pte pointer as an argument.

    From: Hugh Dickins

    Several times while testing swapless page migration, gcc has tried to exec
    a pointer instead of a string: smells like COW mappings are not being
    properly write-protected on fork.

    The protection in copy_one_pte looks very convincing, until at last you
    realize that the second arg to make_migration_entry is a boolean "write",
    and SWP_MIGRATION_READ is 30.

    Anyway, it's better done like in change_pte_range, using
    is_write_migration_entry and make_migration_entry_read.

    From: Hugh Dickins

    Remove unnecessary obfuscation from sys_swapon's range check on swap type,
    which blew up causing memory corruption once swapless migration made
    MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.

    Signed-off-by: Hugh Dickins
    Acked-by: Martin Schwidefsky
    Signed-off-by: Hugh Dickins
    Signed-off-by: Christoph Lameter
    Signed-off-by: Ingo Molnar
    From: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • migrate is a better name since it is only used by page migration.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

22 Mar, 2006

2 commits