07 Jan, 2009

31 commits

  • cgroup_mm_owner_callbacks() was brought in to support the memrlimit
    controller, but sneaked into mainline ahead of it. That controller has
    now been shelved, and the mm_owner_changed() args were inadequate for it
    anyway (they needed an mm pointer instead of a task pointer).

    Remove the dead code, and restore mm_update_next_owner() locking to how it
    was before: taking mmap_sem there does nothing for memcontrol.c, now the
    only user of mm->owner.

    Signed-off-by: Hugh Dickins
    Cc: Paul Menage
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It is known that buffer_mapped() is false in this code path.

    Signed-off-by: Franck Bui-Huu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
  • Make the pte-level function in apply_to_range be called in lazy mmu mode,
    so that any pagetable modifications can be batched.
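
    A minimal sketch of the idea, simplified from mm/memory.c of this era
    (the locking and the pte_fn_t details are paraphrased):

        static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
                                      unsigned long addr, unsigned long end,
                                      pte_fn_t fn, void *data)
        {
                spinlock_t *ptl;
                pte_t *pte;
                int err = 0;

                pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
                if (!pte)
                        return -ENOMEM;

                arch_enter_lazy_mmu_mode();     /* start batching PTE updates */
                do {
                        err = fn(pte++, pmd_pgtable(*pmd), addr, data);
                        if (err)
                                break;
                } while (addr += PAGE_SIZE, addr != end);
                arch_leave_lazy_mmu_mode();     /* issue the batched updates */

                pte_unmap_unlock(pte - 1, ptl);
                return err;
        }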

    Signed-off-by: Jeremy Fitzhardinge
    Cc: Johannes Weiner
    Cc: Nick Piggin
    Cc: Venkatesh Pallipadi
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
     
  • Lazy unmapping in the vmalloc code has now opened the possibility for
    use-after-free bugs to go undetected. We can catch those by forcing an
    unmap and flush (which is going to be slow, but that's what happens).
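
    The debug hook, paraphrased from mm/vmalloc.c (it only does anything
    when CONFIG_DEBUG_PAGEALLOC is set):

        static void vmap_debug_free_range(unsigned long start, unsigned long end)
        {
        #ifdef CONFIG_DEBUG_PAGEALLOC
                /*
                 * Unmap the page tables and flush the TLB right away rather
                 * than lazily, so a use-after-free access faults immediately
                 * instead of silently hitting the stale mapping.
                 */
                vunmap_page_range(start, end);
                flush_tlb_kernel_range(start, end);
        #endif
        }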

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The vmalloc purge lock can be a mutex so we can sleep while a purge is
    going on (purge involves a global kernel TLB invalidate, so it can take
    quite a while).
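
    A sketch of the change (paraphrased; in the real code the lock lives
    inside __purge_vmap_area_lazy()):

        static DEFINE_MUTEX(purge_lock);        /* was a spinlock */

        static void __purge_vmap_area_lazy(unsigned long *start,
                                           unsigned long *end,
                                           int sync, int force_flush)
        {
                if (sync)
                        mutex_lock(&purge_lock);   /* sleep until the purge is done */
                else if (!mutex_trylock(&purge_lock))
                        return;                    /* someone else is purging */

                /* ... free the lazy areas, then flush_tlb_kernel_range() ... */

                mutex_unlock(&purge_lock);
        }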

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • If the vmalloc convenience wrappers simply forward to __vmalloc, output
    of files like /proc/vmallocinfo will show things like "vmalloc_32",
    "vmalloc_user", or whichever wrapper was the caller, as the caller.
    This info is not as useful as the real caller of the allocation.

    So, the proposal is to call __vmalloc_node directly, with matching
    parameters, to preserve the caller information.
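
    For example (matching mm/vmalloc.c of this era, lightly trimmed):

        void *vmalloc_32(unsigned long size)
        {
                /*
                 * Call __vmalloc_node directly so /proc/vmallocinfo records
                 * the wrapper's caller rather than "vmalloc_32" itself.
                 */
                return __vmalloc_node(size, GFP_VMALLOC32, PAGE_KERNEL,
                                      -1, __builtin_return_address(0));
        }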

    Signed-off-by: Glauber Costa
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • If we can't service a vmalloc allocation, show size of the allocation that
    actually failed. Useful for debugging.
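
    Something along these lines (a hedged sketch, not the exact hunk):

        if (!area) {
                if (printk_ratelimit())
                        printk(KERN_WARNING
                               "vmalloc: allocation failure: %lu bytes\n",
                               size);
                return NULL;
        }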

    Signed-off-by: Glauber Costa
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • File pages mapped only in sequentially read mappings are perfect reclaim
    candidates.

    This patch makes these mappings behave like weak references, their pages
    will be reclaimed unless they have a strong reference from a normal
    mapping as well.

    It changes the reclaim and the unmap path where they check if the page has
    been referenced. In both cases, accesses through sequentially read
    mappings will be ignored.
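
    The reclaim-side check, paraphrased from page_referenced_one() in
    mm/rmap.c:

        if (ptep_clear_flush_young_notify(vma, address, pte)) {
                /*
                 * Don't count accesses through VM_SEQ_READ mappings; pages
                 * only referenced this way stay easy reclaim targets.
                 */
                if (likely(!VM_SequentialReadHint(vma)))
                        referenced++;
        }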

    Benchmark results from KOSAKI Motohiro:

    http://marc.info/?l=linux-mm&m=122485301925098&w=2

    Signed-off-by: Johannes Weiner
    Signed-off-by: Rik van Riel
    Acked-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • #ifdefs in *.c files decrease source readability a bit, so removing
    them is better.

    This patch doesn't have any functional change.

    Signed-off-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The speculative page references patch (commit
    e286781d5f2e9c846e012a39653a166e9d31777d) removed the last
    pagevec_release_nonlru() caller.

    So this function can be removed now.

    This patch doesn't have any functional change.

    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Don't print the size of the zone's memmap array if it does not have one.

    Impact: cleanup

    Signed-off-by: Yinghai Lu
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • Show node to memory section relationship with symlinks in sysfs

    Add /sys/devices/system/node/nodeX/memoryY symlinks for all
    the memory sections located on nodeX. For example:
    /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
    indicates that memory section 135 resides on node1.
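
    A sketch of how the link is created (the function name is the one added
    by this change; the body is paraphrased and only the node-to-section
    link is shown):

        static int register_mem_sect_under_node(struct memory_block *mem_blk,
                                                int nid)
        {
                /* node_devices[] is the array of registered node sysdevs */
                return sysfs_create_link_nowarn(
                                &node_devices[nid].sysdev.kobj,
                                &mem_blk->sysdev.kobj,
                                kobject_name(&mem_blk->sysdev.kobj));
        }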

    Also revises documentation to cover this change as well as updating
    Documentation/ABI/testing/sysfs-devices-memory to include descriptions
    of memory hotremove files 'phys_device', 'phys_index', and 'state'
    that were previously not described there.

    In addition to it always being a good policy to provide users with
    the maximum possible amount of physical location information for
    resources that can be hot-added and/or hot-removed, the following
    are some (but likely not all) of the user benefits provided by
    this change.

    Immediate:
    - Provides information needed to determine the specific node on which a
      defective DIMM is located. This will reduce system downtime when the
      node or defective DIMM is swapped out.
    - Prevents unintended onlining of a memory section that was previously
      offlined due to a defective DIMM. This could happen during node
      hot-add when the user or node hot-add assist script onlines _all_
      offlined sections due to user or script inability to identify the
      specific memory sections located on the hot-added node. The
      consequences of reintroducing the defective memory could be ugly.
    - Provides information needed to vary the amount and distribution of
      memory on specific nodes for testing or debugging purposes.

    Future:
    - Will provide information needed to identify the memory sections that
      need to be offlined prior to physical removal of a specific node.

    Symlink creation during boot was tested on 2-node x86_64, 2-node
    ppc64, and 2-node ia64 systems. Symlink creation during physical
    memory hot-add tested on a 2-node x86_64 system.

    Signed-off-by: Gary Hade
    Signed-off-by: Badari Pulavarty
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gary Hade
     
  • Chris Mason noticed that do_sync_mapping_range didn't actually ask for
    data integrity writeout. Unfortunately, it is advertised as being usable
    for data integrity operations.

    This is a data integrity bug.
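
    The fix is essentially one flag, paraphrased from do_sync_mapping_range()
    in fs/sync.c:

        if (flags & SYNC_FILE_RANGE_WRITE) {
                ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
                                                 WB_SYNC_ALL);  /* was WB_SYNC_NONE */
                if (ret < 0)
                        goto out;
        }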

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Now that we have the early-termination logic in place, it makes sense to
    bail out early in all other cases where done is set to 1.
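
    In outline, the scan now checks done at every level (paraphrased):

        while (!done && (index <= end)) {
                int i, nr_pages;

                nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
                                              PAGECACHE_TAG_DIRTY,
                                              PAGEVEC_SIZE);
                if (nr_pages == 0)
                        break;

                for (i = 0; i < nr_pages; i++) {
                        /* ... lock, recheck and write the page; errors and
                           nr_to_write exhaustion set done = 1 ... */
                        if (done)
                                break;  /* bail out immediately */
                }
                pagevec_release(&pvec);
                cond_resched();
        }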

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Terminate the write_cache_pages loop upon encountering the first page
    past end, without locking the page. Pages cannot have their index change
    while we have a reference on them (e.g. truncate_inode_pages_range
    performs the same check without the page lock).
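
    Paraphrased from the loop body:

        struct page *page = pvec.pages[i];

        /*
         * The index is stable while we hold a reference on the page,
         * so this test needs no page lock.
         */
        if (page->index > end) {
                done = 1;
                break;
        }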

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, if we get stuck behind another process that is
    cleaning pages, we will be forced to wait for them to finish, then perform
    our own writeout (if it was redirtied during the long wait), then wait for
    that.

    If a page under writeout is still clean, we can skip waiting for it (if
    we're part of a data integrity sync, we'll be waiting for all writeout
    pages afterwards, so we'll still be waiting for the other guy's write
    that's cleaned the page).
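
    Roughly (paraphrased):

        lock_page(page);

        if (PageWriteback(page)) {
                if (wbc->sync_mode != WB_SYNC_NONE)
                        wait_on_page_writeback(page);  /* integrity: must wait */
                else
                        goto continue_unlock;          /* still clean: skip it */
        }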

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Get rid of some complex expressions from flow control statements, add a
    comment, remove some duplicate code.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, nr_to_write is heeded even for data-integrity syncs,
    so the function will return success after writing out nr_to_write pages,
    even if that was not sufficient to guarantee data integrity.

    The callers tend to set it to values that could easily break data
    integrity semantics in practice. For example, nr_to_write can be set to
    mapping->nrpages * 2; however, if a file has a single dirty page and
    fsync is called, subsequent pages might be concurrently added and
    dirtied, and write_cache_pages might write out two of these newly
    dirtied pages while not writing out the old page that should have been
    written out.

    Fix this by ignoring nr_to_write if it is a data integrity sync.

    This is a data integrity bug.

    The reason this has been done in the past is to avoid stalling sync
    operations behind page dirtiers.

    "If a file has one dirty page at offset 1000000000000000 then someone
    does an fsync() and someone else gets in first and starts madly writing
    pages at offset 0, we want to write that page at 1000000000000000.
    Somehow."

    What we do today is return success after an arbitrary number of pages
    are written, whether or not we have provided the data-integrity
    semantics that the caller has asked for. Even this doesn't actually fix
    all stall cases completely: in the above situation, if the file has a
    huge number of pages in pagecache (but not dirty), then mapping->nrpages
    is going to be huge, even if pages are being dirtied.

    This change does indeed make the possibility of long stalls larger, and
    that's not a good thing, but lying about data integrity is even worse.
    We have to either perform the sync, or return -ELINUXISLAME so at least
    the caller knows what has happened.

    There are subsequent competing approaches in the works to solve the stall
    problems properly, without compromising data integrity.
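
    The check becomes conditional on the sync mode (paraphrased):

        if (nr_to_write > 0) {
                nr_to_write--;
                if (nr_to_write == 0 && wbc->sync_mode == WB_SYNC_NONE) {
                        /*
                         * Only honour nr_to_write for non-integrity
                         * writeback; an integrity sync must keep going.
                         */
                        done = 1;
                        break;
                }
        }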

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, if ret signals a real error but we still have some
    pages left in the pagevec, done would be set to 1, yet the remaining
    pages would continue to be processed and ret would be overwritten in the
    process.

    It could easily be overwritten with success, and thus success will be
    returned even if there is an error. Thus the caller is told all writes
    succeeded, whereas in reality some did not.

    Fix this by bailing immediately if there is an error, and retaining the
    first error code.

    This is a data integrity bug.
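
    Paraphrased:

        ret = (*writepage)(page, wbc, data);
        if (unlikely(ret)) {
                if (ret == AOP_WRITEPAGE_ACTIVATE) {
                        unlock_page(page);
                        ret = 0;
                } else {
                        /* a real error: stop here and preserve this ret */
                        done = 1;
                        break;
                }
        }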

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • We'd like to break out of the loop early in many situations, however the
    existing code has been setting mapping->writeback_index past the final
    page in the pagevec lookup for cyclic writeback. This is a problem if we
    don't process all pages up to the final page.

    Currently the code mostly keeps writeback_index reasonable and works
    around this by not breaking out of the loop or writing pages outside the
    range in these cases. Keep track of a real "done index" that enables us
    to terminate the loop in a much more flexible manner.

    Needed by the subsequent patch to preserve writepage errors, and then
    further patches to break out of the loop early for other reasons. However
    there are no functional changes with this patch alone.
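
    In outline (paraphrased):

        /* inside the loop, as each page is taken off the pagevec: */
        done_index = page->index + 1;   /* where a later scan should resume */

        /* after the loop, for cyclic writeback: */
        if (wbc->range_cyclic)
                mapping->writeback_index = done_index;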

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, scanned == 1 is supposed to mean that cyclic
    writeback has circled through zero, thus we should not circle again.
    However it gets set to 1 after the first successful pagevec lookup. This
    leads to cases where not enough data gets written.

    Counterexample: file with first 10 pages dirty, writeback_index == 5,
    nr_to_write == 10. Then the 5 last pages will be found, and scanned will
    be set to 1, after writing those out, we will not cycle back to get the
    first 5.

    Rework this logic: now we'll always cycle unless we started off from
    index 0. When cycling, only write out as far as one page before the
    start page from the first cycle (so we don't write parts of the file
    twice).
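
    The reworked control flow, in outline (paraphrased):

        if (wbc->range_cyclic) {
                writeback_index = mapping->writeback_index;
                index = writeback_index;
                cycled = (index == 0);  /* starting at 0: no wrap needed */
                end = -1;
        } else {
                index = wbc->range_start >> PAGE_CACHE_SHIFT;
                end = wbc->range_end >> PAGE_CACHE_SHIFT;
                cycled = 1;             /* ignore the cyclic tests */
        }
        retry:
        /* ... the main scan loop runs from index to end ... */

        if (wbc->range_cyclic && !cycled && !done) {
                /* wrap around once, stopping just short of the first start */
                cycled = 1;
                index = 0;
                end = writeback_index - 1;
                goto retry;
        }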

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • While tracing I/O patterns with blktrace (a great tool) a few weeks ago I
    identified a minor issue in fs/mpage.c

    As the comment above mpage_readpages() says, a filesystem's get_block
    function will set BH_Boundary when it maps a block just before a block
    for which extra I/O is required.

    Since get_block() can map a range of pages, for all these pages the
    BH_Boundary flag will be set. But we only need to push what I/O we have
    accumulated at the last block of this range.

    This makes do_mpage_readpage() send out the largest possible bio instead
    of a bunch of page-sized ones in the BH_Boundary case.
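
    The gist, heavily paraphrased from do_mpage_readpage():

        /*
         * get_block() may have mapped several blocks at once; the
         * BH_Boundary hint only matters for the last block of that
         * extent, so remember where the extent ends and submit the
         * accumulated bio only once we actually reach that block.
         */
        if (buffer_boundary(map_bh))
                boundary_block = map_bh->b_blocknr +
                                 (map_bh->b_size >> blkbits) - 1;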

    Signed-off-by: Miquel van Smoorenburg
    Cc: Nick Piggin
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miquel van Smoorenburg
     
  • When cpusets are enabled, it's necessary to print the triggering task's
    set of allowable nodes so the subsequently printed meminfo can be
    interpreted correctly.

    We also print the task's cpuset name for informational purposes.
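
    A simplified sketch of the helper this adds (the real version formats
    the nodemask into a static buffer under a spinlock; nodemask_to_str is
    a hypothetical stand-in here):

        void cpuset_print_task_mems_allowed(struct task_struct *tsk)
        {
                task_lock(tsk);         /* pin tsk's cpuset before dereferencing */
                printk(KERN_INFO "%s cpuset=%s mems_allowed=%s\n",
                       tsk->comm,
                       task_cs(tsk)->css.cgroup->dentry->d_name.name,
                       nodemask_to_str(tsk->mems_allowed)); /* hypothetical */
                task_unlock(tsk);
        }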

    [rientjes@google.com: task lock current before dereferencing cpuset]
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • zone_scan_mutex is actually a spinlock, so name it appropriately.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Rather than have the pagefault handler kill a process directly if it gets
    a VM_FAULT_OOM, have it call into the OOM killer.

    With increasingly sophisticated oom behaviour (cpusets, memory cgroups,
    oom killing throttling, oom priority adjustment or selective disabling,
    panic on oom, etc), it's silly to unconditionally kill the faulting
    process at page fault time. Create a hook for pagefault oom path to call
    into instead.

    Only converted x86 and uml so far.
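
    A simplified sketch of the new hook (paraphrased from mm/oom_kill.c of
    this era):

        void pagefault_out_of_memory(void)
        {
                unsigned long freed = 0;

                blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
                if (freed > 0)
                        return;         /* a notifier freed some memory */

                if (sysctl_panic_on_oom)
                        panic("out of memory from page fault\n");

                read_lock(&tasklist_lock);
                __out_of_memory(0, 0);  /* gfp_mask and order are unknown here */
                read_unlock(&tasklist_lock);

                /* give the victim a chance to exit before the fault retries */
                if (!test_thread_flag(TIF_MEMDIE))
                        schedule_timeout_uninterruptible(1);
        }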

    [akpm@linux-foundation.org: make __out_of_memory() static]
    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Nick Piggin
    Cc: Jeff Dike
    Acked-by: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • pp->page is never used when not set to the right page, so there is no need
    to set it to ZERO_PAGE(0) by default.

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • Rework do_pages_move() to work by page-sized chunks of struct
    page_to_node that are passed to do_move_page_to_node_array(). We now
    only have to allocate a single page instead of a possibly very large
    vmalloc area to store all page_to_node entries.

    As a result, new_page_node() will now have a very small lookup, hiding
    much of the overall sys_move_pages() overhead.
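
    In outline (paraphrased; DO_PAGES_PER_CHUNK stands in for the real
    chunk-size computation):

        struct page_to_node *pm;
        unsigned long chunk_start, chunk_nr;
        int err = -ENOMEM;

        pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
        if (!pm)
                goto out;

        for (chunk_start = 0;
             chunk_start < nr_pages;
             chunk_start += chunk_nr) {
                chunk_nr = min(nr_pages - chunk_start,
                               (unsigned long)DO_PAGES_PER_CHUNK);

                /* copy this chunk of user requests into pm[], resolve
                   the target nodes, then migrate just this chunk */
                err = do_move_page_to_node_array(mm, pm,
                                                 flags & MPOL_MF_MOVE_ALL);
                if (err)
                        break;
        }
        free_page((unsigned long)pm);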

    Signed-off-by: Brice Goglin
    Signed-off-by: Nathalie Furmento
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • Following "mm: don't mark_page_accessed in fault path", which now
    places a mark_page_accessed() in zap_pte_range(), we should remove
    the mark_page_accessed() from shmem_fault().

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at
    unmap-time if the pte is young has a number of problems.

    mark_page_accessed is supposed to be roughly the equivalent of a young pte
    for unmapped references. Unfortunately it doesn't come with any context:
    after being called, reclaim doesn't know who or why the page was touched.

    So calling mark_page_accessed not only adds extra lru or PG_referenced
    manipulations for pages that are already going to have pte_young ptes
    anyway, but it also adds these references which are difficult to work
    with from the context of vma-specific references (e.g. MADV_SEQUENTIAL
    pte_young may not wish to contribute to the page being referenced).

    Then again, simply doing SetPageReferenced when zapping a pte and
    finding it is young is not a really good solution either.
    SetPageReferenced does not correctly promote the page to the active
    list, for example. So after removing mark_page_accessed from the fault
    path, several mmap()+touch+munmap() sequences would have a very
    different result from several read(2) calls, which is not really
    desirable.
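
    The replacement in the unmap path looks like this (paraphrased from
    zap_pte_range() in mm/memory.c):

        if (pte_young(ptent) && likely(!VM_SequentialReadHint(vma)))
                mark_page_accessed(page);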

    Signed-off-by: Nick Piggin
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the
    kernel to back a VMA. This matches the size used by the MMU in the
    majority of cases. However, one counter-example occurs on PPC64 kernels,
    whereby a kernel using 64K as a base pagesize may still use 4K pages for
    the MMU on older processors. To distinguish the two, this patch reports
    MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps.
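
    Together with the KernelPageSize entry introduced below, the output is
    produced roughly like so (paraphrased from fs/proc/task_mmu.c;
    vma_kernel_pagesize() and vma_mmu_pagesize() are the helpers these
    patches add):

        seq_printf(m, "KernelPageSize: %8lu kB\n",
                   vma_kernel_pagesize(vma) >> 10);
        seq_printf(m, "MMUPageSize:    %8lu kB\n",
                   vma_mmu_pagesize(vma) >> 10);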

    Signed-off-by: Mel Gorman
    Cc: "KOSAKI Motohiro"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • It is useful to verify a hugepage-aware application is using the expected
    pagesizes for its memory regions. This patch creates an entry called
    KernelPageSize in /proc/pid/smaps that is the size of page used by the
    kernel to back a VMA. The entry is not called PageSize as it is possible
    the MMU uses a different size. This extension should not break any sensible
    parser that skips lines containing unrecognised information.

    Signed-off-by: Mel Gorman
    Acked-by: "KOSAKI Motohiro"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

06 Jan, 2009

9 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
    dm snapshot: extend exception store functions
    dm snapshot: split out exception store implementations
    dm snapshot: rename struct exception_store
    dm snapshot: separate out exception store interface
    dm mpath: move trigger_event to system workqueue
    dm: add name and uuid to sysfs
    dm table: rework reference counting
    dm: support barriers on simple devices
    dm request: extend target interface
    dm request: add caches
    dm ioctl: allow dm_copy_name_and_uuid to return only one field
    dm log: ensure log bitmap fits on log device
    dm log: move region_size validation
    dm log: avoid reinitialising io_req on every operation
    dm: consolidate target deregistration error handling
    dm raid1: fix error count
    dm log: fix dm_io_client leak on error paths
    dm snapshot: change yield to msleep
    dm table: drop reference at unbind

    Linus Torvalds
     
  • Supply dm_add_exception as a callback to the read_metadata function.
    Add a status function ready for a later patch and name the functions
    consistently.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • Move the existing snapshot exception store implementations out into
    separate files. Later patches will place these behind a new
    interface in preparation for alternative implementations.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Rename struct exception_store to dm_exception_store.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • Pull structures that bridge the gap between snapshot and
    exception store out of dm-snap.h and put them in a new
    .h file - dm-exception-store.h. This file will define the
    API for new exception stores.

    Ultimately, dm-snap.h is unnecessary, since only dm-snap.c
    should be using it.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • The same workqueue is used both for sending uevents and processing
    queued I/O. Deadlock has been reported in RHEL5 when sending a uevent
    was blocked waiting for the queued I/O to be processed. Use
    schedule_work() for the asynchronous uevents instead.
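
    The change itself is tiny (paraphrased from drivers/md/dm-mpath.c):

        /* was: queue_work(kmultipathd, &m->trigger_event); */
        schedule_work(&m->trigger_event);       /* system workqueue instead */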

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Implement simple read-only sysfs entry for device-mapper block device.

    This patch adds a simple sysfs directory named "dm" under block device
    properties and implements
    - name attribute (string containing mapped device name)
    - uuid attribute (string containing UUID, or empty string if not set)

    The kobject is embedded in mapped_device struct, so no additional
    memory allocation is needed for initializing sysfs entry.

    During the processing of a sysfs attribute we need to lock the mapped
    device, which is done by a new function, dm_get_from_kobj, which returns
    the md associated with the kobject and increases the usage count.

    Each 'show attribute' function is responsible for its own locking.
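
    For example, the name attribute can be implemented like this
    (paraphrased from drivers/md/dm-sysfs.c):

        static ssize_t dm_attr_name_show(struct mapped_device *md, char *buf)
        {
                if (dm_copy_name_and_uuid(md, buf, NULL))
                        return -EIO;

                strcat(buf, "\n");
                return strlen(buf);
        }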

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • Rework table reference counting.

    The existing code uses a reference counter. When the last reference is
    dropped and the counter reaches zero, the table destructor is called.
    Table reference counters are acquired/released from upcalls from other
    kernel code (dm_any_congested, dm_merge_bvec, dm_unplug_all).
    If the reference counter reaches zero in one of the upcalls, the table
    destructor is called from almost random kernel code.

    This leads to various problems:
    * dm_any_congested being called under a spinlock, which calls the
    destructor, which calls some sleeping function.
    * the destructor attempting to take a lock that is already taken by the
    same process.
    * stale reference from some other kernel code keeps the table
    constructed, which keeps some devices open, even after successful
    return from "dmsetup remove". This can confuse lvm and prevent closing
    of underlying devices or reusing device minor numbers.

    The patch changes reference counting so that the table destructor can be
    called only at predetermined places.

    The table has always exactly one reference from either mapped_device->map
    or hash_cell->new_map. After this patch, this reference is not counted
    in table->holders. A pair of dm_create_table/dm_destroy_table functions
    is used for table creation/destruction.

    Temporary references from the other code increase table->holders. A pair
    of dm_table_get/dm_table_put functions is used to manipulate it.

    When the table is about to be destroyed, we wait for table->holders to
    reach 0. Then, we call the table destructor. We use active waiting with
    msleep(1), because the situation happens rarely (to one user in 5 years)
    and removing the device isn't a performance-critical task: the user
    doesn't care if it takes one tick more or not.

    This way, the destructor is called only at specific points
    (dm_table_destroy function) and the above problems associated with lazy
    destruction can't happen.

    Finally remove the temporary protection added to dm_any_congested().
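
    The destruction path, in outline (paraphrased; the freeing logic itself
    is elided):

        void dm_table_put(struct dm_table *t)
        {
                if (t)
                        atomic_dec(&t->holders);
        }

        void dm_table_destroy(struct dm_table *t)
        {
                /* wait out transient holders; rare, so polling is fine */
                while (atomic_read(&t->holders))
                        msleep(1);
                smp_mb();

                /* ... free targets, release devices, kfree(t) ... */
        }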

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Implement barrier support for single device DM devices

    This patch implements barrier support in DM for the common case of dm linear
    just remapping a single underlying device. In this case we can safely
    pass the barrier through because there can be no reordering between
    devices.

    NB. Any DM device might cease to support barriers if it gets
    reconfigured so code must continue to allow for a possible
    -EOPNOTSUPP on every barrier bio submitted. - agk
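
    In outline, the table-level check might look like this (a hedged
    sketch: the DM_TARGET_SUPPORTS_BARRIERS feature flag on target_type is
    assumed here; the real patch may gate on the specific target type
    instead):

        static bool dm_table_barrier_ok(struct dm_table *t)
        {
                unsigned i, num = dm_table_get_num_targets(t);

                for (i = 0; i < num; i++) {
                        struct dm_target *ti = dm_table_get_target(t, i);

                        if (!(ti->type->features & DM_TARGET_SUPPORTS_BARRIERS))
                                return false;
                }
                return true;
        }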

    Signed-off-by: Andi Kleen
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Andi Kleen