17 May, 2007

2 commits

  • Re-introduce rmap verification patches that Hugh removed when he removed
    PG_map_lock. PG_map_lock actually isn't needed to synchronise access to
    anonymous pages, because PG_locked and PTL together already do.

    These checks were important in discovering and fixing a rare rmap corruption
    in SLES9.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

    Signed-off-by: Christoph Lameter
    Cc: David Howells
    Cc: Jens Axboe
    Cc: Steven French
    Cc: Michael Halcrow
    Cc: OGAWA Hirofumi
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Roman Zippel
    Cc: David Woodhouse
    Cc: Dave Kleikamp
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Anton Altaparmakov
    Cc: Mark Fasheh
    Cc: Paul Mackerras
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: David Chinner
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
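
    A minimal stand-alone sketch of the kind of cleanup this describes;
    foo_inode and the foo_ctor_* functions are hypothetical stand-ins, not the
    patched kernel code:

      #include <stdio.h>

      #define SLAB_CTOR_CONSTRUCTOR 0x1UL   /* always passed by the allocator */

      struct foo_inode { int initialized; };

      /* Before: the flag test is pointless because the flag is always set. */
      static void foo_ctor_old(void *obj, unsigned long flags)
      {
          struct foo_inode *inode = obj;

          if (flags & SLAB_CTOR_CONSTRUCTOR)   /* never false in practice */
              inode->initialized = 1;
      }

      /* After: initialise unconditionally. */
      static void foo_ctor_new(void *obj, unsigned long flags)
      {
          struct foo_inode *inode = obj;

          (void)flags;
          inode->initialized = 1;
      }

      int main(void)
      {
          struct foo_inode a = { 0 }, b = { 0 };

          foo_ctor_old(&a, SLAB_CTOR_CONSTRUCTOR);
          foo_ctor_new(&b, 0);
          printf("%d %d\n", a.initialized, b.initialized);
          return 0;
      }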
     

09 May, 2007

1 commit

  • This implements deferred IO support in fbdev. Deferred IO is a way to delay
    and repurpose IO. This implementation is done using mm's page_mkwrite and
    page_mkclean hooks in order to detect, delay and then rewrite IO. This
    functionality is used by hecubafb.

    [adaplas]
    This is useful for graphics hardware with no directly addressable/mappable
    framebuffer. Implementing this will allow the "framebuffer" to be accessible
    from user space via mmap().

    Signed-off-by: Jaya Kumar
    Signed-off-by: Antonino Daplas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jaya Kumar
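
    The mechanism lends itself to a small userspace model: writes to a mapping
    are trapped, the touched pages are remembered, and the real "device IO"
    happens later in one batch. The sketch below only illustrates that idea,
    with SIGSEGV plus mprotect() standing in for the page_mkwrite/page_mkclean
    hooks; it is not the fbdev implementation:

      #define _GNU_SOURCE
      #include <signal.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <unistd.h>

      #define NPAGES 4

      static char *fb;                 /* fake framebuffer */
      static long pagesize;
      static int dirty[NPAGES];        /* pages touched since the last flush */

      /* First write to a read-only page lands here: remember the page and
       * make it writable so the write can proceed (mprotect() in a signal
       * handler is fine for a demo, not for production code). */
      static void wrfault(int sig, siginfo_t *si, void *ctx)
      {
          long idx = ((char *)si->si_addr - fb) / pagesize;

          (void)sig; (void)ctx;
          if (idx < 0 || idx >= NPAGES)
              _exit(1);                /* a genuine stray fault */
          dirty[idx] = 1;
          mprotect(fb + idx * pagesize, pagesize, PROT_READ | PROT_WRITE);
      }

      /* Deferred "IO": write the collected pages out, then write-protect
       * everything again so the next write is trapped, much as page_mkclean
       * write-protects the real mapping. */
      static void flush(void)
      {
          for (int i = 0; i < NPAGES; i++)
              if (dirty[i]) {
                  printf("writing page %d to the device\n", i);
                  dirty[i] = 0;
              }
          mprotect(fb, NPAGES * pagesize, PROT_READ);
      }

      int main(void)
      {
          struct sigaction sa = { 0 };

          sa.sa_sigaction = wrfault;
          sa.sa_flags = SA_SIGINFO;
          sigemptyset(&sa.sa_mask);
          sigaction(SIGSEGV, &sa, NULL);

          pagesize = sysconf(_SC_PAGESIZE);
          fb = mmap(NULL, NPAGES * pagesize, PROT_READ,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (fb == MAP_FAILED)
              return 1;

          fb[0] = 'x';                 /* trapped: marks page 0 dirty */
          fb[2 * pagesize] = 'y';      /* trapped: marks page 2 dirty */
          flush();                     /* the deferred "IO" for both pages */
          return 0;
      }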
     

08 May, 2007

1 commit

  • I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by
    SLAB.

    I think its purpose was to have a callback that verifies an object is back
    in its constructor state before it is freed? The callback is performed
    before each freeing of an object.

    I would think that it is much easier to check the object state manually
    before the free. That also places the check near the code that manipulates
    the object.

    Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
    compiled with SLAB debugging on. If there were code in a constructor
    handling SLAB_DEBUG_INITIAL then it would have to be conditional on
    SLAB_DEBUG; otherwise it would just be dead code. But there is no such code
    in the kernel. I think SLAB_DEBUG_INITIAL is too problematic to make real
    use of, difficult to understand, and there are easier ways to accomplish
    the same effect (i.e. add debug code before kfree).

    There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
    clear in fs inode caches. Remove the pointless checks (they would even be
    pointless without removal of SLAB_DEBUG_INITIAL) from the fs constructors.

    This is the last slab flag that SLUB did not support. Remove the check for
    unimplemented flags from SLUB.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
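
    A small sketch of the alternative suggested above: verify the object state
    right before freeing it, next to the code that manipulates the object.
    my_object and my_object_ok are hypothetical names, not kernel code:

      #include <assert.h>
      #include <stdlib.h>

      struct my_object {
          int refcount;
          void *private;
      };

      /* "Constructor state": no users, nothing attached. */
      static int my_object_ok(const struct my_object *obj)
      {
          return obj->refcount == 0 && obj->private == NULL;
      }

      static void my_object_free(struct my_object *obj)
      {
          /* Debug check placed right where the object is released,
           * instead of relying on a debug-only constructor callback. */
          assert(my_object_ok(obj));
          free(obj);
      }

      int main(void)
      {
          struct my_object *obj = calloc(1, sizeof(*obj));

          my_object_free(obj);
          return 0;
      }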
     

27 Apr, 2007

1 commit

  • The page_test_and_clear_dirty primitive really consists of two
    operations, page_test_dirty and the page_clear_dirty. The combination
    of the two is not an atomic operation, so it makes more sense to have
    two separate operations instead of one.
    In addition to the improved readability of the s390 version of
    SetPageUptodate, it now avoids the page_test_dirty operation, which uses
    the expensive insert-storage-key-extended (iske) instruction.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
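
    A self-contained model of the split (the "storage key" dirty bit is just a
    flag here, and skey_test_dirty/skey_clear_dirty stand in for the expensive
    test and the cheap clear; none of this is the real s390 code):

      #include <stdbool.h>
      #include <stdio.h>

      struct page { bool hw_dirty; bool sw_dirty; };

      static bool skey_test_dirty(struct page *p)  { return p->hw_dirty; }   /* expensive */
      static void skey_clear_dirty(struct page *p) { p->hw_dirty = false; }  /* cheap */
      static void set_page_dirty(struct page *p)   { p->sw_dirty = true; }

      /* Old combined primitive: always pays for the test. */
      static bool test_and_clear_dirty(struct page *p)
      {
          bool dirty = skey_test_dirty(p);

          skey_clear_dirty(p);
          return dirty;
      }

      /* With the two operations separated, a path such as SetPageUptodate
       * can clear the bit without testing it first. */
      static void mark_uptodate(struct page *p)
      {
          skey_clear_dirty(p);             /* no skey_test_dirty needed */
      }

      int main(void)
      {
          struct page p = { .hw_dirty = true };

          if (test_and_clear_dirty(&p))    /* old-style caller */
              set_page_dirty(&p);
          mark_uptodate(&p);               /* new-style cheap path */
          printf("sw_dirty=%d hw_dirty=%d\n", p.sw_dirty, p.hw_dirty);
          return 0;
      }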
     

04 Apr, 2007

1 commit

  • The git commit c2fda5fed81eea077363b285b66eafce20dfd45a which
    added the page_test_and_clear_dirty call to page_mkclean and the
    git commit 7658cc289288b8ae7dd2c2224549a048431222b3 which fixes
    the "nasty and subtle race in shared mmap'ed page writeback"
    problem in clear_page_dirty_for_io cause data corruption on s390.

    The effect of the two changes is that for every call to
    clear_page_dirty_for_io a page_test_and_clear_dirty is done. If
    the per-page dirty bit is set, set_page_dirty is called. Strangely,
    clear_page_dirty_for_io is called for not-uptodate pages, e.g.
    via this call chain:

    clear_page_dirty_for_io+0x12a/0x130
    generic_writepages+0x258/0x3e0
    do_writepages+0x76/0x7c
    __writeback_single_inode+0xba/0x3e4
    sync_sb_inodes+0x23e/0x398
    writeback_inodes+0x12e/0x140
    wb_kupdate+0xd2/0x178
    pdflush+0x162/0x23c

    The bad news now is that page_test_and_clear_dirty might claim
    that a not-uptodate page is dirty, since SetPageUptodate, which
    resets the per-page dirty bit, has not yet been called. The page
    writeback that follows clobbers the data on disk.

    The simplest solution to this problem is to move the call to
    page_test_and_clear_dirty under the "if (page_mapped(page))".
    If a file backed page is mapped it is uptodate.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
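
    A simplified stand-alone model of the fix: the storage-key dirty bit is
    only consulted for mapped pages, because a mapped file-backed page is
    known to be uptodate. The struct and helpers below are stand-ins, not the
    kernel function:

      #include <stdbool.h>
      #include <stdio.h>

      struct page {
          bool uptodate;   /* PageUptodate  */
          bool mapped;     /* page_mapped() */
          bool sw_dirty;   /* PageDirty     */
          bool hw_dirty;   /* storage-key dirty bit */
      };

      static bool test_and_clear_hw_dirty(struct page *p)
      {
          bool d = p->hw_dirty;

          p->hw_dirty = false;
          return d;
      }

      /* Returns true if the page must be written back. */
      static bool clear_page_dirty_for_io(struct page *p)
      {
          bool was_dirty;

          if (p->mapped) {
              /* Safe here: mapped file pages are uptodate, so the
               * storage-key dirty bit really reflects user writes.
               * Before the fix this ran unconditionally, so a page that
               * was not yet uptodate could be reported dirty and its
               * stale contents written over good data on disk. */
              if (test_and_clear_hw_dirty(p))
                  p->sw_dirty = true;
          }
          was_dirty = p->sw_dirty;
          p->sw_dirty = false;
          return was_dirty;
      }

      int main(void)
      {
          struct page not_uptodate = { .uptodate = false, .hw_dirty = true };

          printf("writeback needed: %d\n",
                 (int)clear_page_dirty_for_io(&not_uptodate));
          return 0;
      }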
     

02 Mar, 2007

1 commit

  • page_lock_anon_vma() uses spin_lock() to block RCU. This doesn't work with
    PREEMPT_RCU; we have to do rcu_read_lock() explicitly. Otherwise, it is
    theoretically possible that slab returns anon_vma's memory to the system
    before we do spin_unlock(&anon_vma->lock).

    [ Hugh points out that this only matters for PREEMPT_RCU, which isn't merged
    yet, and may never be. Regardless, this patch is conceptually the
    right thing to do, even if it doesn't matter at this point. - Linus ]

    Signed-off-by: Oleg Nesterov
    Cc: Paul McKenney
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
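
    A schematic of the change, with stand-in stubs so it is self-contained
    (the real lookup goes through page->mapping and PAGE_MAPPING_ANON, and the
    real primitives are the kernel's RCU and spinlock calls):

      #include <stddef.h>

      /* Stand-in stubs; in the kernel these are the real primitives. */
      static void rcu_read_lock(void)   {}
      static void rcu_read_unlock(void) {}
      static void spin_lock(int *l)     { (void)l; }
      static void spin_unlock(int *l)   { (void)l; }

      struct anon_vma { int lock; };
      struct page     { struct anon_vma *anon_vma; };

      /* The read-side critical section now explicitly covers the window
       * between looking the anon_vma up and taking its lock, so under
       * PREEMPT_RCU slab cannot hand the memory back to the system while
       * we are still dereferencing it. */
      static struct anon_vma *page_lock_anon_vma(struct page *page)
      {
          struct anon_vma *anon_vma;

          rcu_read_lock();
          anon_vma = page->anon_vma;        /* simplified lookup */
          if (!anon_vma) {
              rcu_read_unlock();
              return NULL;
          }
          spin_lock(&anon_vma->lock);
          return anon_vma;                  /* read-side section still held */
      }

      static void page_unlock_anon_vma(struct anon_vma *anon_vma)
      {
          spin_unlock(&anon_vma->lock);
          rcu_read_unlock();
      }

      int main(void)
      {
          struct anon_vma av = { 0 };
          struct page page = { &av };
          struct anon_vma *locked = page_lock_anon_vma(&page);

          if (locked)
              page_unlock_anon_vma(locked);
          return 0;
      }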
     

31 Dec, 2006

1 commit


23 Dec, 2006

2 commits


21 Oct, 2006

1 commit


12 Oct, 2006

1 commit

  • We have a persistent dribble of reports of this BUG triggering. Its extended
    diagnostics were recently made conditional on CONFIG_DEBUG_VM, which was a bad
    idea - we want to know about it.

    Signed-off-by: Dave Jones
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     

26 Sep, 2006

1 commit

  • Tracking of dirty pages in shared writeable mmap()s.

    The idea is simple: write protect clean shared writeable pages, catch the
    write-fault, make writeable and set dirty. On page write-back clean all the
    PTE dirty bits and write protect them once again.

    The implementation is a tad harder, mainly because the default
    backing_dev_info capabilities were too loosely maintained. Hence it is not
    enough to test the backing_dev_info for cap_account_dirty.

    The current heuristic is as follows; a VMA is eligible when:
    - it is shared writeable:
      (vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)
    - it is not a 'special' mapping:
      (vm_flags & (VM_PFNMAP|VM_INSERTPAGE)) == 0
    - the backing_dev_info is cap_account_dirty:
      mapping_cap_account_dirty(vma->vm_file->f_mapping)
    - f_op->mmap() didn't change the default page protection

    Pages from remap_pfn_range() are explicitly excluded because their COW
    semantics are already horrid enough (see vm_normal_page() in do_wp_page()) and
    because they don't have a backing store anyway.

    mprotect() is taught about the new behaviour as well. However it overrides
    the last condition.

    Cleaning the pages on write-back is done with page_mkclean(), a new rmap
    call. It can be called on any page, but is currently only implemented for
    mapped pages; if the page is found to be in a VMA that accounts dirty pages,
    it will also wrprotect the PTE.

    Finally, in fs/buffer.c:try_to_free_buffers(), remove clear_page_dirty() from
    under ->private_lock. This seems to be safe, since ->private_lock is used to
    serialize access to the buffers, not the page itself. This is needed because
    clear_page_dirty() will call into page_mkclean() and would thereby violate
    locking order.

    [dhowells@redhat.com: Provide a page_mkclean() implementation for NOMMU]
    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
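
    The eligibility heuristic listed above condenses into a small predicate.
    The sketch below uses stand-in types and illustrative flag values so it
    compiles on its own; the real test lives in the mmap/mprotect paths and
    also honours the page-protection condition:

      #include <stdbool.h>

      #define VM_WRITE      0x00000002UL
      #define VM_SHARED     0x00000008UL
      #define VM_PFNMAP     0x00000400UL
      #define VM_INSERTPAGE 0x02000000UL

      struct backing_dev_info { bool cap_account_dirty; };
      struct file             { struct backing_dev_info *bdi; };
      struct vm_area_struct   { unsigned long vm_flags; struct file *vm_file; };

      static bool vma_wants_dirty_tracking(const struct vm_area_struct *vma)
      {
          /* shared and writeable */
          if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) != (VM_WRITE | VM_SHARED))
              return false;
          /* not a 'special' mapping */
          if (vma->vm_flags & (VM_PFNMAP | VM_INSERTPAGE))
              return false;
          /* the backing device accounts dirty pages */
          return vma->vm_file && vma->vm_file->bdi &&
                 vma->vm_file->bdi->cap_account_dirty;
      }

      int main(void)
      {
          struct backing_dev_info bdi = { .cap_account_dirty = true };
          struct file f = { &bdi };
          struct vm_area_struct vma = { VM_WRITE | VM_SHARED, &f };

          return vma_wants_dirty_tracking(&vma) ? 0 : 1;
      }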
     

01 Jul, 2006

2 commits

  • The current NR_FILE_MAPPED is used by zone reclaim and the dirty load
    calculation as the number of mapped pagecache pages. However, that is not
    true. NR_FILE_MAPPED includes the mapped anonymous pages. This patch
    separates those and therefore allows an accurate tracking of the anonymous
    pages per zone.

    It then becomes possible to determine the number of unmapped pages per zone
    and we can avoid scanning for unmapped pages if there are none.

    Also it may now be possible to determine the mapped/unmapped ratio in
    get_dirty_limit. Isn't the number of anonymous pages irrelevant in that
    calculation?

    Note that this will change the meaning of the number of mapped pages reported
    in /proc/vmstat, /proc/meminfo and in the per-node statistics. This may affect
    user space tools that monitor these counters! NR_FILE_MAPPED works like
    NR_FILE_DIRTY. It is only valid for pagecache pages.

    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • nr_mapped is important because it allows a determination of how many pages of
    a zone are not mapped, which would allow a more efficient means of determining
    when we need to reclaim memory in a zone.

    We take the nr_mapped field out of the page state structure and define a new
    per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
    from NR_MAPPED in the next patch).

    We replace the use of nr_mapped in various kernel locations. This avoids the
    looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
    zone reclaim).

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

26 Jun, 2006

1 commit

  • Hugh clarified the role of VM_LOCKED. So we can now implement page
    migration for mlocked pages.

    Allow the migration of mlocked pages. This means that try_to_unmap must
    unmap mlocked pages in the migration case.

    Signed-off-by: Christoph Lameter
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

23 Jun, 2006

5 commits

  • This implements the use of migration entries to preserve ptes of file backed
    pages during migration. Processes can therefore be migrated back and forth
    without losing their connection to pagecache pages.

    Note that we implement the migration entries only for linear mappings.
    Nonlinear mappings still require the unmapping of the ptes for migration.

    And another writepage() ugliness shows up. writepage() can drop the page
    lock. Therefore we have to remove migration ptes before calling writepages()
    in order to avoid having migration entries point to unlocked pages.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If we install a migration entry then the rss does not really decrease, since
    the page is just moved somewhere else. We can save ourselves the work of
    decrementing and later incrementing, which would just eventually cause
    cacheline bouncing.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Rip the page migration logic out.

    Remove all code that has to do with swapping during page migration.

    This also guts the ability to migrate pages to swap. No one used that, so
    let's let it go for good.

    Page migration should be a bit broken after this patch.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Implement read/write migration ptes

    We take the upper two swapfiles for the two types of migration ptes and define
    a series of macros in swapops.h.

    The VM is modified to handle the migration entries. Migration entries can
    only be encountered when the page they are pointing to is locked. This limits
    the number of places one has to fix. We also check in copy_pte_range and in
    mprotect_pte_range() for migration ptes.

    We check for migration ptes in do_swap_page and call a function that will
    then wait on the page lock. This allows us to effectively stop all accesses
    to a page.

    Migration entries are created by try_to_unmap if called for migration and
    removed by local functions in migrate.c

    From: Hugh Dickins

    Several times while testing swapless page migration (I've no NUMA, just
    hacking it up to migrate recklessly while running load), I've hit the
    BUG_ON(!PageLocked(p)) in migration_entry_to_page.

    This comes from an orphaned migration entry, unrelated to the current
    correctly locked migration, but hit by remove_anon_migration_ptes as it
    checks an address in each vma of the anon_vma list.

    Such an orphan may be left behind if an earlier migration raced with fork:
    copy_one_pte can duplicate a migration entry from parent to child, after
    remove_anon_migration_ptes has checked the child vma, but before it has
    removed it from the parent vma. (If the process were later to fault on this
    orphaned entry, it would hit the same BUG from migration_entry_wait.)

    This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
    not. There's no such problem with file pages, because vma_prio_tree_add
    adds child vma after parent vma, and the page table locking at each end is
    enough to serialize. Follow that example with anon_vma: add new vmas to the
    tail instead of the head.

    (There's no corresponding problem when inserting migration entries,
    because a missed pte will leave the page count and mapcount high, which is
    allowed for. And there's no corresponding problem when migrating via swap,
    because a leftover swap entry will be correctly faulted. But the swapless
    method has no refcounting of its entries.)

    From: Ingo Molnar

    pte_unmap_unlock() takes the pte pointer as an argument.

    From: Hugh Dickins

    Several times while testing swapless page migration, gcc has tried to exec
    a pointer instead of a string: smells like COW mappings are not being
    properly write-protected on fork.

    The protection in copy_one_pte looks very convincing, until at last you
    realize that the second arg to make_migration_entry is a boolean "write",
    and SWP_MIGRATION_READ is 30.

    Anyway, it's better done like in change_pte_range, using
    is_write_migration_entry and make_migration_entry_read.

    From: Hugh Dickins

    Remove unnecessary obfuscation from sys_swapon's range check on swap type,
    which blew up causing memory corruption once swapless migration made
    MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.

    Signed-off-by: Hugh Dickins
    Acked-by: Martin Schwidefsky
    Signed-off-by: Hugh Dickins
    Signed-off-by: Christoph Lameter
    Signed-off-by: Ingo Molnar
    From: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
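
    The encoding described above is easy to model: migration entries reuse the
    top two swap types, one for read mappings and one for write mappings, with
    small helpers to convert between them. The names mirror those in swapops.h,
    but the layout below is an illustrative stand-in, not the kernel's:

      #include <stdbool.h>
      #include <stdio.h>

      #define MAX_SWAPFILES_SHIFT 5
      #define SWP_TYPE_SHIFT      (sizeof(unsigned long) * 8 - MAX_SWAPFILES_SHIFT)
      #define SWP_MIGRATION_READ  30      /* the upper two swap types */
      #define SWP_MIGRATION_WRITE 31

      typedef struct { unsigned long val; } swp_entry_t;

      static swp_entry_t swp_entry(unsigned long type, unsigned long offset)
      {
          swp_entry_t e = { (type << SWP_TYPE_SHIFT) | offset };

          return e;
      }
      static unsigned long swp_type(swp_entry_t e)   { return e.val >> SWP_TYPE_SHIFT; }
      static unsigned long swp_offset(swp_entry_t e) { return e.val & ((1UL << SWP_TYPE_SHIFT) - 1); }

      static swp_entry_t make_migration_entry(unsigned long pfn, bool write)
      {
          return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ, pfn);
      }
      static bool is_migration_entry(swp_entry_t e)
      {
          return swp_type(e) == SWP_MIGRATION_READ || swp_type(e) == SWP_MIGRATION_WRITE;
      }
      static bool is_write_migration_entry(swp_entry_t e)
      {
          return swp_type(e) == SWP_MIGRATION_WRITE;
      }
      static void make_migration_entry_read(swp_entry_t *e)
      {
          *e = swp_entry(SWP_MIGRATION_READ, swp_offset(*e));
      }

      int main(void)
      {
          swp_entry_t e = make_migration_entry(12345, true);

          /* Downgrade to a read entry, as the COW-on-fork fix does via
           * is_write_migration_entry()/make_migration_entry_read(). */
          make_migration_entry_read(&e);
          printf("migration=%d write=%d pfn=%lu\n",
                 is_migration_entry(e), is_write_migration_entry(e), swp_offset(e));
          return 0;
      }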
     
  • migrate is a better name since it is only used by page migration.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

22 Mar, 2006

2 commits


10 Mar, 2006

1 commit

  • Remove two early-development BUG_ONs from page_add_file_rmap.

    The pfn_valid test (originally useful for checking that nobody passed an
    artificial struct page) comes too late, since we already have the struct
    page.

    The PageAnon test (useful when anon was first distinguished from file rmap)
    prevents ->nopage implementations from reusing ->mapping, which would
    otherwise be available.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

01 Mar, 2006

1 commit

  • remove_from_swap() currently attempts to use page_lock_anon_vma to obtain
    an anon_vma lock. That is not working since the page may have been
    remapped via swap ptes in order to move the page.

    However, do_migrate_pages() obtains the mmap_sem lock, which guarantees that
    the anonymous vma will not vanish from under us. There is therefore no need
    to use page_lock_anon_vma.

    Signed-off-by: Christoph Lameter
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

02 Feb, 2006

3 commits

  • Migrate a page with buffers without requiring writeback

    This introduces a new address space operation migratepage() that may be used
    by a filesystem to implement its own version of page migration.

    A version is provided that migrates buffers attached to pages. Some
    filesystems (ext2, ext3, xfs) are modified to utilize this feature.

    The swapper address space operations are modified so that a regular
    migrate_page() will occur for anonymous pages without writeback (migrate_pages
    forces every anonymous page to have a swap entry).

    Signed-off-by: Mike Kravetz
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add remove_from_swap

    remove_from_swap() allows the restoration of the pte entries that existed
    before page migration occurred for anonymous pages by walking the reverse
    maps. This reduces swap use and establishes regular pte's without the need
    for page faults.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add direct migration support with fall back to swap.

    Direct migration support on top of the swap based page migration facility.

    This allows the direct migration of anonymous pages and the migration of file
    backed pages by dropping the associated buffers (requires writeout).

    Fall back to swap out if necessary.

    The patch is based on lots of patches from the hotplug project but the code
    was restructured, documented and simplified as much as possible.

    Note that an additional patch that defines the migrate_page() method for
    filesystems is necessary in order to avoid writeback for anonymous and file
    backed pages.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Mike Kravetz
    Signed-off-by: Christoph Lameter
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

19 Jan, 2006

1 commit

  • Migration code currently does not take a reference to the target page
    properly, so between unlocking the pte and trying to take a new
    reference to the page with isolate_lru_page, anything could happen to
    it.

    Fix this by holding the pte lock until we get a chance to elevate the
    refcount.

    Other small cleanups while we're here.

    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

10 Jan, 2006

1 commit


09 Jan, 2006

1 commit


07 Jan, 2006

2 commits

  • Optimise page_state manipulations by introducing interrupt unsafe accessors
    to page_state fields. Callers must provide their own locking (either
    disable interrupts or not update from interrupt context).

    Switch over the hot callsites that can easily be moved under interrupts off
    sections.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
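
    The idea reduces to a pair of accessors with different locking contracts:
    a fully interrupt-safe one, and a cheaper __-prefixed one whose caller
    must already have interrupts off (or must never update the counter from
    interrupt context). The sketch below uses a single counter and no-op irq
    stubs; it is a model of the contract, not the kernel's per-CPU page_state
    code:

      #include <stdio.h>

      static unsigned long nr_mapped;             /* one counter, one "CPU" */

      static unsigned long irq_save(void)         { return 0; }   /* stand-ins for the  */
      static void irq_restore(unsigned long f)    { (void)f; }    /* irq on/off toggles */

      /* Interrupt-safe accessor: usable anywhere, pays for the irq toggle. */
      static void inc_page_state(void)
      {
          unsigned long flags = irq_save();

          nr_mapped++;
          irq_restore(flags);
      }

      /* Interrupt-unsafe accessor: the caller guarantees interrupts are
       * already disabled, or that interrupt context never touches this
       * counter, so the toggle can be skipped on hot paths. */
      static void __inc_page_state(void)
      {
          nr_mapped++;
      }

      int main(void)
      {
          inc_page_state();
          __inc_page_state();
          printf("%lu\n", nr_mapped);
          return 0;
      }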
     
  • Optimise rmap functions by minimising atomic operations when we know there
    will be no concurrent modifications.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

30 Nov, 2005

1 commit


29 Nov, 2005

2 commits

  • Some users (hi Zwane) have seen a problem when running a workload that
    eats nearly all of physical memory - the system does an OOM kill, even
    when there is still a lot of swap free.

    The problem appears to be a very big task that is holding the swap
    token, and the VM has a very hard time finding any other page in the
    system that is swappable.

    Instead of ignoring the swap token when sc->priority reaches 0, we could
    simply take the swap token away from the memory hog and make sure we
    don't give it back to the memory hog for a few seconds.

    This patch resolves the problem Zwane ran into.

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • This replaces the (in my opinion horrible) VM_UNMAPPED logic with very
    explicit support for a "remapped page range" aka VM_PFNMAP. It allows a
    VM area to contain an arbitrary range of page table entries that the VM
    never touches, and never considers to be normal pages.

    Any user of "remap_pfn_range()" automatically gets this new
    functionality, and doesn't even have to mark the pages reserved or
    indeed mark them any other way. It just works. As a side effect, doing
    mmap() on /dev/mem works for arbitrary ranges.

    Sparc update from David in the next commit.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Nov, 2005

2 commits

  • copy_one_pte needs to copy the anonymous COWed pages in a VM_UNPAGED area,
    zap_pte_range needs to free them, do_wp_page needs to COW them: just like
    ordinary pages, not like the unpaged.

    But recognizing them is a little subtle: because PageReserved is no longer a
    condition for remap_pfn_range, we can now mmap all of /dev/mem (whether the
    distro permits, and whether it's advisable on this or that architecture, is
    another matter). So if we can see a PageAnon, it may not be ours to mess with
    (or may be ours from elsewhere in the address space). I suspect there's an
    entertaining insoluble self-referential problem here, but the page_is_anon
    function does a good practical job, and MAP_PRIVATE PROT_WRITE VM_UNPAGED will
    always be an odd choice.

    In updating the comment on page_address_in_vma, noticed a potential NULL
    dereference, in a path we don't actually take, but fixed it.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's one peculiar use of VM_RESERVED which the previous patch left behind:
    because VM_NONLINEAR's try_to_unmap_cluster uses vm_private_data as a swapout
    cursor, but should never meet VM_RESERVED vmas, it was a way of extending
    VM_NONLINEAR to VM_RESERVED vmas using vm_private_data for some other purpose.
    But that's an empty set - they don't have the populate function required. So
    just throw away those VM_RESERVED tests.

    But one more interesting test in rmap.c has to go too: try_to_unmap_one will
    want to swap out an anonymous page from a VM_RESERVED or VM_UNPAGED area.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

30 Oct, 2005

2 commits

  • Updated several references to page_table_lock in common code comments.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A couple of oddities were guarded by page_table_lock, and are no longer
    properly guarded when that is split.

    The mm_counters of file_rss and anon_rss: make those an atomic_t, or an
    atomic64_t if the architecture supports it, in such a case. Definitions by
    courtesy of Christoph Lameter: who spent considerable effort on more scalable
    ways of counting, but found insufficient benefit in practice.

    And adding an mm with swap to the mmlist for swapoff: the list is well-
    guarded by its own lock, but the list_empty check now has to be repeated
    inside it.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins