07 Jan, 2009

37 commits

  • Moving lru_cache_add_active_or_unevictable() into page_add_new_anon_rmap()
    was good but stupid: we can and should SetPageSwapBacked() there too; and
    we know for sure that this anonymous, swap-backed page is not file cache.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
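
    For orientation, a rough sketch of what the combined helper ends up
    looking like after this change plus the fold-in described two entries
    below; details may differ slightly from the mainline function.

        void page_add_new_anon_rmap(struct page *page,
                                    struct vm_area_struct *vma,
                                    unsigned long address)
        {
                VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
                SetPageSwapBacked(page);
                atomic_set(&page->_mapcount, 0); /* mapcount starts at -1 */
                __page_set_anon_rmap(page, vma, address);
                if (page_evictable(page, vma))
                        lru_cache_add_lru(page, LRU_ACTIVE_ANON);
                else
                        add_page_to_unevictable_list(page);
        }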
     
  • page_lock_anon_vma() and page_unlock_anon_vma() were made available to
    show_page_path() in vmscan.c; but now that show_page_path() has been
    removed, make them static in rmap.c again; they're better kept private
    if possible.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • lru_cache_add_active_or_unevictable() and page_add_new_anon_rmap() always
    appear together. Save some symbol table space and some jumping around by
    removing lru_cache_add_active_or_unevictable(), folding its code into
    page_add_new_anon_rmap(): like how we add file pages to lru just after
    adding them to page cache.

    Remove the nearby "TODO: is this safe?" comments (yes, it is safe), and
    change page_add_new_anon_rmap()'s address BUG_ON to VM_BUG_ON as
    originally intended.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The swap code is over-provisioned with BUG_ONs on assorted page flags,
    mostly dating back to 2.3. They're good documentation, and guard against
    developer error, but a waste of space on most systems: change them to
    VM_BUG_ONs, conditional on CONFIG_DEBUG_VM. Just delete the PagePrivate
    ones: they're later, from 2.5.69, but even less interesting now.

    Signed-off-by: Hugh Dickins
    Reviewed-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
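
    For reference, VM_BUG_ON() is (approximately) defined so that the check
    disappears entirely on non-CONFIG_DEBUG_VM builds, which is what makes
    this conversion a space saving rather than a behaviour change:

        /* include/linux/mmdebug.h, roughly */
        #ifdef CONFIG_DEBUG_VM
        #define VM_BUG_ON(cond) BUG_ON(cond)
        #else
        #define VM_BUG_ON(cond) do { } while (0)
        #endif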
     
  • If we add NOOP stubs for SetPageSwapCache() and ClearPageSwapCache(), then
    we can remove the #ifdef CONFIG_SWAPs from mm/migrate.c.

    Signed-off-by: Hugh Dickins
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
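
    A minimal sketch of the idea; the real header builds these with its
    page-flags macros rather than open-coded inlines:

        /* Hypothetical open-coded form; mainline uses PAGEFLAG-style macros. */
        #ifndef CONFIG_SWAP
        static inline void SetPageSwapCache(struct page *page) { }
        static inline void ClearPageSwapCache(struct page *page) { }
        #endif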
     
  • GFP_HIGHUSER_PAGECACHE is just an alias for GFP_HIGHUSER_MOVABLE, making
    that harder to track down: remove it, and its out-of-work brothers
    GFP_NOFS_PAGECACHE and GFP_USER_PAGECACHE.

    Since we're making that improvement to hotremove_migrate_alloc(), I think
    we can now also remove one of the "o"s from its comment.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • cgroup_mm_owner_callbacks() was brought in to support the memrlimit
    controller, but sneaked into mainline ahead of it. That controller has
    now been shelved, and the mm_owner_changed() args were inadequate for it
    anyway (they needed an mm pointer instead of a task pointer).

    Remove the dead code, and restore mm_update_next_owner() locking to how it
    was before: taking mmap_sem there does nothing for memcontrol.c, now the
    only user of mm->owner.

    Signed-off-by: Hugh Dickins
    Cc: Paul Menage
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It is known that buffer_mapped() is false in this code path.

    Signed-off-by: Franck Bui-Huu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
  • Make the pte-level function in apply_to_range be called in lazy mmu mode,
    so that any pagetable modifications can be batched.

    Signed-off-by: Jeremy Fitzhardinge
    Cc: Johannes Weiner
    Cc: Nick Piggin
    Cc: Venkatesh Pallipadi
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
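
    A sketch of what the pte-level loop looks like with the lazy-MMU
    bracketing (the loop body is paraphrased from apply_to_pte_range()), so
    a paravirt backend such as Xen can batch the resulting updates:

        arch_enter_lazy_mmu_mode();
        do {
                err = fn(pte, token, addr, data);  /* caller's pte callback */
                if (err)
                        break;
        } while (pte++, addr += PAGE_SIZE, addr != end);
        arch_leave_lazy_mmu_mode();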
     
  • Lazy unmapping in the vmalloc code has now opened the possibility for
    use-after-free bugs to go undetected. We can catch those by forcing an
    unmap and flush (which is going to be slow, but that's what happens).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
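
    A hedged sketch of the debug idea (the exact guard and call sites are
    assumptions here): with page-allocation debugging enabled, unmap and
    flush the area eagerly instead of lazily, so a stale access faults
    immediately:

        #ifdef CONFIG_DEBUG_PAGEALLOC
                vunmap_page_range(va->va_start, va->va_end);
                flush_tlb_kernel_range(va->va_start, va->va_end);
        #endif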
     
  • The vmalloc purge lock can be a mutex so we can sleep while a purge is
    going on (purge involves a global kernel TLB invalidate, so it can take
    quite a while).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • As it stands, the output of files like /proc/vmallocinfo shows things
    like "vmalloc_32", "vmalloc_user", or whoever the intermediate caller
    was, as the caller. This info is not as useful as the real caller of
    the allocation.

    So the proposal is to call __vmalloc_node() directly, with matching
    parameters, to preserve the caller information.

    Signed-off-by: Glauber Costa
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
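
    A sketch under the 2.6.29-era __vmalloc_node() signature (the flags
    shown are illustrative): have the variant pass its own return address
    down, so /proc/vmallocinfo records the function that called
    vmalloc_32() rather than vmalloc_32 itself:

        void *vmalloc_32(unsigned long size)
        {
                return __vmalloc_node(size, GFP_KERNEL, PAGE_KERNEL,
                                      -1, __builtin_return_address(0));
        }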
     
  • If we can't service a vmalloc allocation, show the size of the
    allocation that actually failed. Useful for debugging.

    Signed-off-by: Glauber Costa
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • File pages mapped only in sequentially read mappings are perfect reclaim
    candidates.

    This patch makes these mappings behave like weak references, their pages
    will be reclaimed unless they have a strong reference from a normal
    mapping as well.

    It changes the reclaim and the unmap path where they check if the page has
    been referenced. In both cases, accesses through sequentially read
    mappings will be ignored.

    Benchmark results from KOSAKI Motohiro:

    http://marc.info/?l=linux-mm&m=122485301925098&w=2

    Signed-off-by: Johannes Weiner
    Signed-off-by: Rik van Riel
    Acked-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
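
    The reclaim-side half of this looks roughly like the following in
    page_referenced_one(): a young pte in a sequential-read mapping simply
    doesn't count as a reference:

        if (ptep_clear_flush_young_notify(vma, address, pte)) {
                /*
                 * Don't treat a reference through a sequentially read
                 * mapping as a reference in its own right; if the page is
                 * used through another mapping, that mapping will catch it.
                 */
                if (likely(!VM_SequentialReadHint(vma)))
                        referenced++;
        }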
     
  • #ifdefs in *.c files decrease source readability a bit; removing them is
    better.

    This patch doesn't have any functional change.

    Signed-off-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The speculative page references patch (commit
    e286781d5f2e9c846e012a39653a166e9d31777d) removed the last
    pagevec_release_nonlru() caller.

    So this function can be removed now.

    This patch doesn't have any functional change.

    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Don't print the size of the zone's memmap array if it does not have one.

    Impact: cleanup

    Signed-off-by: Yinghai Lu
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • Show node to memory section relationship with symlinks in sysfs

    Add /sys/devices/system/node/nodeX/memoryY symlinks for all
    the memory sections located on nodeX. For example:
    /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
    indicates that memory section 135 resides on node1.

    Also revise the documentation to cover this change, and update
    Documentation/ABI/testing/sysfs-devices-memory to describe the memory
    hotremove files 'phys_device', 'phys_index', and 'state' that were
    previously not described there.

    In addition to it always being a good policy to provide users with
    the maximum possible amount of physical location information for
    resources that can be hot-added and/or hot-removed, the following
    are some (but likely not all) of the user benefits provided by
    this change.
    Immediate:
    - Provides information needed to determine the specific node on which a
      defective DIMM is located. This will reduce system downtime when the
      node or defective DIMM is swapped out.
    - Prevents unintended onlining of a memory section that was previously
      offlined due to a defective DIMM. This could happen during node hot-add
      when the user or node hot-add assist script onlines _all_ offlined
      sections due to user or script inability to identify the specific
      memory sections located on the hot-added node. The consequences of
      reintroducing the defective memory could be ugly.
    - Provides information needed to vary the amount and distribution of
      memory on specific nodes for testing or debugging purposes.
    Future:
    - Will provide information needed to identify the memory sections that
      need to be offlined prior to physical removal of a specific node.

    Symlink creation during boot was tested on 2-node x86_64, 2-node
    ppc64, and 2-node ia64 systems. Symlink creation during physical
    memory hot-add tested on a 2-node x86_64 system.

    Signed-off-by: Gary Hade
    Signed-off-by: Badari Pulavarty
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gary Hade
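
    The symlink itself is plain sysfs plumbing; a minimal sketch (helper
    name hypothetical, and the driver code differs in detail), linking a
    memory section's kobject under its node:

        static int link_mem_section_to_node(struct node *node,
                                            struct memory_block *mem_blk)
        {
                return sysfs_create_link(&node->sysdev.kobj,
                                         &mem_blk->sysdev.kobj,
                                         kobject_name(&mem_blk->sysdev.kobj));
        }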
     
  • Chris Mason noticed that do_sync_mapping_range didn't actually ask for
    data integrity writeout. Unfortunately, it is advertised as being usable
    for data integrity operations.

    This is a data integrity bug.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
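
    The entry doesn't spell out the fix, but the likely shape (an assumption
    on my part) is that the SYNC_FILE_RANGE_WRITE step must request
    WB_SYNC_ALL rather than WB_SYNC_NONE, so no dirty pages in the range can
    be silently skipped:

        /* assumed fix: use data-integrity writeback mode for the WRITE phase */
        ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
                                         WB_SYNC_ALL);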
     
  • Now that we have the early-termination logic in place, it makes sense to
    bail out early in all other cases where done is set to 1.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Terminate the write_cache_pages loop upon encountering the first page past
    end, without locking the page. Pages cannot have their index change when
    we have a reference on them (truncate, eg truncate_inode_pages_range
    performs the same check without the page lock).

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
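
    Concretely, inside the pagevec loop this becomes an unconditional early
    break as soon as a returned page lies beyond the requested range, before
    any lock_page():

        if (page->index > end) {
                /* index can't change while we hold a reference */
                done = 1;
                break;
        }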
     
  • In write_cache_pages, if we get stuck behind another process that is
    cleaning pages, we will be forced to wait for them to finish, then perform
    our own writeout (if it was redirtied during the long wait), then wait for
    that.

    If a page under writeout is still clean, we can skip waiting for it (if
    we're part of a data integrity sync, we'll be waiting for all writeout
    pages afterwards, so we'll still be waiting for the other guy's write
    that's cleaned the page).

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
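
    In code form the skip looks roughly like this: only a data-integrity
    sync waits for writeback already in flight, everyone else moves on to
    the next page:

        if (PageWriteback(page)) {
                if (wbc->sync_mode != WB_SYNC_NONE)
                        wait_on_page_writeback(page);
                else
                        /* label inside write_cache_pages(): someone else
                         * is already cleaning this page, skip it */
                        goto continue_unlock;
        }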
     
  • Get rid of some complex expressions from flow control statements, add a
    comment, remove some duplicate code.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, nr_to_write is heeded even for data-integrity syncs,
    so the function will return success after writing out nr_to_write pages,
    even if that was not sufficient to guarantee data integrity.

    The callers tend to set it to values that could break data integrity
    semantics easily in practice. For example, nr_to_write can be set to
    mapping->nrpages * 2; however, if a file has a single, dirty page, then
    fsync is called, subsequent pages might be concurrently added and dirtied,
    then write_cache_pages might write out two of these newly dirty pages,
    while not writing out the old page that should have been written out.

    Fix this by ignoring nr_to_write if it is a data integrity sync.

    This is a data integrity bug.

    The reason this has been done in the past is to avoid stalling sync
    operations behind page dirtiers.

    "If a file has one dirty page at offset 1000000000000000 then someone
    does an fsync() and someone else gets in first and starts madly writing
    pages at offset 0, we want to write that page at 1000000000000000.
    Somehow."

    What we do today is return success after an arbitrary amount of pages are
    written, whether or not we have provided the data-integrity semantics that
    the caller has asked for. Even this doesn't actually fix all stall cases
    completely: in the above situation, if the file has a huge number of pages
    in pagecache (but not dirty), then mapping->nrpages is going to be huge,
    even if pages are being dirtied.

    This change does indeed make the possibility of long stalls larger, and
    that's not a good thing, but lying about data integrity is even worse. We
    have to either perform the sync, or return -ELINUXISLAME so at least the
    caller knows what has happened.

    There are subsequent competing approaches in the works to solve the stall
    problems properly, without compromising data integrity.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
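
    The resulting check in write_cache_pages() is roughly the following:
    nr_to_write still terminates opportunistic (WB_SYNC_NONE) writeback, but
    an integrity sync keeps going:

        if (wbc->nr_to_write > 0) {
                wbc->nr_to_write--;
                if (wbc->nr_to_write == 0 &&
                    wbc->sync_mode == WB_SYNC_NONE) {
                        /* only non-integrity writeback stops here */
                        done = 1;
                        break;
                }
        }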
     
  • In write_cache_pages, if ret signals a real error, but we still have some
    pages left in the pagevec, done would be set to 1, but the remaining pages
    would continue to be processed and ret would be overwritten in the
    process.

    It could easily be overwritten with success, and thus success would be
    returned even if there is an error. Thus the caller is told all writes
    succeeded, whereas in reality some did not.

    Fix this by bailing immediately if there is an error, and retaining the
    first error code.

    This is a data integrity bug.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
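
    A sketch of the error handling after the ->writepage call:
    AOP_WRITEPAGE_ACTIVATE is still not treated as an error, anything else
    ends the loop with the first real error preserved in ret:

        ret = (*writepage)(page, wbc, data);
        if (unlikely(ret)) {
                if (ret == AOP_WRITEPAGE_ACTIVATE) {
                        unlock_page(page);
                        ret = 0;
                } else {
                        /* stop here; keep this first error as the result */
                        done_index = page->index + 1;
                        done = 1;
                        break;
                }
        }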
     
  • We'd like to break out of the loop early in many situations, however the
    existing code has been setting mapping->writeback_index past the final
    page in the pagevec lookup for cyclic writeback. This is a problem if we
    don't process all pages up to the final page.

    Currently the code mostly keeps writeback_index reasonable and hacks
    around this by not breaking out of the loop or writing pages outside the
    range in these cases. Keep track of a real "done index" that enables us
    to terminate the loop in a much more flexible manner.

    Needed by the subsequent patch to preserve writepage errors, and then
    further patches to break out of the loop early for other reasons. However
    there are no functional changes with this patch alone.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, scanned == 1 is supposed to mean that cyclic
    writeback has circled through zero, thus we should not circle again.
    However it gets set to 1 after the first successful pagevec lookup. This
    leads to cases where not enough data gets written.

    Counterexample: file with the first 10 pages dirty, writeback_index == 5,
    nr_to_write == 10. Then the last 5 pages will be found, and scanned will
    be set to 1; after writing those out, we will not cycle back to get the
    first 5.

    Rework this logic: now we'll always cycle unless we started off from index
    0. When cycling, only write out as far as one page before the start page
    from the first cycle (so we don't write parts of the file twice).

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
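
    The control flow ends up roughly as below: cycled records whether we
    started at index 0, and at most one wrap-around pass is made, stopping
    one page short of the original starting point:

        if (wbc->range_cyclic) {
                writeback_index = mapping->writeback_index; /* prev offset */
                index = writeback_index;
                cycled = (index == 0);
                end = -1;
        }
        retry:
        /* ... pagevec_lookup_tag() loop runs from index up to end ... */
        if (!cycled && !done) {
                cycled = 1;
                index = 0;
                end = writeback_index - 1;
                goto retry;
        }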
     
  • While tracing I/O patterns with blktrace (a great tool) a few weeks ago I
    identified a minor issue in fs/mpage.c

    As the comment above mpage_readpages() says, a fs's get_block function
    will set BH_Boundary when it maps a block just before a block for which
    extra I/O is required.

    Since get_block() can map a range of pages, for all these pages the
    BH_Boundary flag will be set. But we only need to push what I/O we have
    accumulated at the last block of this range.

    This makes do_mpage_readpage() send out the largest possible bio instead
    of a bunch of page-sized ones in the BH_Boundary case.

    Signed-off-by: Miquel van Smoorenburg
    Cc: Nick Piggin
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miquel van Smoorenburg
     
  • When cpusets are enabled, it's necessary to print the triggering task's
    set of allowable nodes so the subsequently printed meminfo can be
    interpreted correctly.

    We also print the task's cpuset name for informational purposes.

    [rientjes@google.com: task lock current before dereferencing cpuset]
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • zone_scan_mutex is actually a spinlock, so name it appropriately.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Rather than have the pagefault handler kill a process directly if it gets
    a VM_FAULT_OOM, have it call into the OOM killer.

    With increasingly sophisticated oom behaviour (cpusets, memory cgroups,
    oom killing throttling, oom priority adjustment or selective disabling,
    panic on oom, etc), it's silly to unconditionally kill the faulting
    process at page fault time. Create a hook for pagefault oom path to call
    into instead.

    Only converted x86 and uml so far.

    [akpm@linux-foundation.org: make __out_of_memory() static]
    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Nick Piggin
    Cc: Jeff Dike
    Acked-by: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
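
    On the arch side the conversion is small; the OOM path of x86's page
    fault handler becomes roughly:

        /* in the VM_FAULT_OOM path of the arch page fault handler */
        up_read(&mm->mmap_sem);
        pagefault_out_of_memory();
        return;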
     
  • pp->page is never used when not set to the right page, so there is no need
    to set it to ZERO_PAGE(0) by default.

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • Rework do_pages_move() to work by page-sized chunks of struct page_to_node
    that are passed to do_move_page_to_node_array(). We now only have to
    allocate a single page instead of a possibly very large vmalloc area to
    store all page_to_node entries.

    As a result, new_page_node() will now have a very small lookup, hiding
    much of the overall sys_move_pages() overhead.

    Signed-off-by: Brice Goglin
    Signed-off-by: Nathalie Furmento
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • Following "mm: don't mark_page_accessed in fault path", which now
    places a mark_page_accessed() in zap_pte_range(), we should remove
    the mark_page_accessed() from shmem_fault().

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at
    unmap-time if the pte is young has a number of problems.

    mark_page_accessed is supposed to be roughly the equivalent of a young pte
    for unmapped references. Unfortunately it doesn't come with any context:
    after being called, reclaim doesn't know who or why the page was touched.

    So calling mark_page_accessed not only adds extra lru or PG_referenced
    manipulations for pages that are already going to have pte_young ptes anyway,
    but it also adds these references which are difficult to work with from the
    context of vma specific references (eg. MADV_SEQUENTIAL pte_young may not
    wish to contribute to the page being referenced).

    Then, simply doing SetPageReferenced when zapping a pte and finding it is
    young, is not a really good solution either. SetPageReferenced does not
    correctly promote the page to the active list for example. So after removing
    mark_page_accessed from the fault path, several mmap()+touch+munmap() would
    have a very different result from several read(2) calls for example, which
    is not really desirable.

    Signed-off-by: Nick Piggin
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
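
    The unmap-side replacement is a one-liner in zap_pte_range(); note how
    the sequential-read hint from the entry earlier in this digest plugs in
    here:

        if (pte_young(ptent) && likely(!VM_SequentialReadHint(vma)))
                mark_page_accessed(page);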
     
  • The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the
    kernel to back a VMA. This matches the size used by the MMU in the
    majority of cases. However, one counter-example occurs on PPC64 kernels,
    where a kernel using 64K as a base pagesize may still use 4K pages for
    the MMU on older processors. To distinguish these cases, this patch
    reports MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps.

    Signed-off-by: Mel Gorman
    Cc: "KOSAKI Motohiro"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • It is useful to verify a hugepage-aware application is using the expected
    pagesizes for its memory regions. This patch creates an entry called
    KernelPageSize in /proc/pid/smaps that is the size of page used by the
    kernel to back a VMA. The entry is not called PageSize as it is possible
    the MMU uses a different size. This extension should not break any sensible
    parser that skips lines containing unrecognised information.

    Signed-off-by: Mel Gorman
    Acked-by: "KOSAKI Motohiro"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
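
    Together, the two entries above boil down to two extra lines of
    show_smap()-style output, using the helpers these patches introduce
    (vma_kernel_pagesize() and vma_mmu_pagesize()):

        seq_printf(m,
                   "KernelPageSize: %8lu kB\n"
                   "MMUPageSize:    %8lu kB\n",
                   vma_kernel_pagesize(vma) >> 10,
                   vma_mmu_pagesize(vma) >> 10);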
     

06 Jan, 2009

3 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
    dm snapshot: extend exception store functions
    dm snapshot: split out exception store implementations
    dm snapshot: rename struct exception_store
    dm snapshot: separate out exception store interface
    dm mpath: move trigger_event to system workqueue
    dm: add name and uuid to sysfs
    dm table: rework reference counting
    dm: support barriers on simple devices
    dm request: extend target interface
    dm request: add caches
    dm ioctl: allow dm_copy_name_and_uuid to return only one field
    dm log: ensure log bitmap fits on log device
    dm log: move region_size validation
    dm log: avoid reinitialising io_req on every operation
    dm: consolidate target deregistration error handling
    dm raid1: fix error count
    dm log: fix dm_io_client leak on error paths
    dm snapshot: change yield to msleep
    dm table: drop reference at unbind

    Linus Torvalds
     
  • Supply dm_add_exception as a callback to the read_metadata function.
    Add a status function ready for a later patch and name the functions
    consistently.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • Move the existing snapshot exception store implementations out into
    separate files. Later patches will place these behind a new
    interface in preparation for alternative implementations.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon