17 Oct, 2007

40 commits

  • Signed-off-by: Nick Piggin
    Acked-by: Anders Larsen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: Tigran Aivazian
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Rework the generic block "cont" routines to handle the new aops. Supporting
    cont_prepare_write would take quite a lot of code, so remove it instead
    (the filesystems that used it are converted to the new routines later in
    the series).

    write_begin gets passed AOP_FLAG_CONT_EXPAND when called from
    generic_cont_expand, so filesystems can avoid the old hacks they used.

    Signed-off-by: Nick Piggin
    Cc: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Cc: Nick Piggin
    Cc: Steven Whitehouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Whitehouse
     
  • Signed-off-by: Nick Piggin
    Cc: David Chinner
    Cc: Timothy Shimmin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Convert ext4 to use write_begin()/write_end() methods.

    Signed-off-by: Badari Pulavarty
    Signed-off-by: Nick Piggin
    Cc: Dmitriy Monakhov
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Various fixes and improvements

    Signed-off-by: Badari Pulavarty
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Implement new aops for some of the simpler filesystems.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Restore the KERNEL_DS optimisation, especially helpful to the 2copy write
    path.

    This may be a pretty questionable gain in most cases, especially after the
    legacy 2copy write path is removed, but it doesn't cost much.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Partial writes can easily be supported in LO_CRYPT_NONE mode, but not in the
    LO_CRYPT_CRYPTOAPI case, because of its block nature. I don't know who still
    uses cryptoapi, but theoretically it is possible, so let's leave things as
    they are. The loop device didn't support partial writes before Nick's
    "write_begin/write_end" patch set, and it behaves the same way after.

    Signed-off-by: Dmitriy Monakhov
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Monakhov
     
  • These are intended to replace prepare_write and commit_write with more
    flexible alternatives that are also able to avoid the buffered write
    deadlock problems efficiently (which prepare_write is unable to do).
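
    For reference, the replacement hooks in struct address_space_operations take
    roughly the following shape (a sketch of the merged interface; exact
    parameter names and types in this particular patch may differ):

        int (*write_begin)(struct file *file, struct address_space *mapping,
                           loff_t pos, unsigned len, unsigned flags,
                           struct page **pagep, void **fsdata);
        int (*write_end)(struct file *file, struct address_space *mapping,
                         loff_t pos, unsigned len, unsigned copied,
                         struct page *page, void *fsdata);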

    [mark.fasheh@oracle.com: API design contributions, code review and fixes]
    [akpm@linux-foundation.org: various fixes]
    [dmonakhov@sw.ru: new aop block_write_begin fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Mark Fasheh
    Signed-off-by: Dmitriy Monakhov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • New buffers against uptodate pages are simply marked uptodate, while the
    buffer_new bit remains set. This causes error-case code to zero out parts of
    those buffers because it thinks they contain stale data: wrong, they are
    actually uptodate so this is a data loss situation.

    Fix this by actually clearing buffer_new and marking the buffer dirty. It
    makes sense to always clear buffer_new before setting a buffer uptodate.
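
    In the uptodate-page case the new buffer then ends up being handled roughly
    like this in __block_prepare_write (a sketch, not the verbatim hunk):

        if (PageUptodate(page)) {
                clear_buffer_new(bh);       /* not "new" in any useful sense */
                set_buffer_uptodate(bh);
                mark_buffer_dirty(bh);      /* make sure the block reaches disk */
                continue;
        }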

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Add an iterator data structure to operate over an iovec. Add usercopy
    operators needed by generic_file_buffered_write, and convert that function
    over.
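
    Conceptually the iterator just wraps the iovec array with a running
    position, roughly (a sketch; field names may differ slightly from the
    patch):

        struct iov_iter {
                const struct iovec *iov;    /* current segment */
                unsigned long nr_segs;      /* segments remaining */
                size_t iov_offset;          /* offset within the current segment */
                size_t count;               /* bytes remaining in total */
        };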

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Modify the core write() code so that it won't take a pagefault while holding a
    lock on the pagecache page. There are a number of different deadlocks possible
    if we try to do such a thing:

    1. generic_buffered_write
    2. lock_page
    3. prepare_write
    4. unlock_page+vmtruncate
    5. copy_from_user
    6. mmap_sem(r)
    7. handle_mm_fault
    8. lock_page (filemap_nopage)
    9. commit_write
    10. unlock_page

    a. sys_munmap / sys_mlock / others
    b. mmap_sem(w)
    c. make_pages_present
    d. get_user_pages
    e. handle_mm_fault
    f. lock_page (filemap_nopage)

    2,8 - recursive deadlock if page is same
    2,8;2,8 - ABBA deadlock if page is different
    2,6;b,f - ABBA deadlock if page is same

    The solution is as follows:
    1. If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

    1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish non-present pages
    in a readable mapping from the lack of a readable mapping.

    2. If we find the destination page is non uptodate, unlock it (this could be
    made slightly more optimal), then allocate a temporary page to copy the
    source data into. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    from the pinned temporary page via the kernel address space.

    (also, rename maxlen to seglen, because it was confusing)

    This increases the CPU/memory copy cost by almost 50% on the affected
    workloads. That will be solved by introducing a new set of pagecache write
    aops in a subsequent patch.
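
    A condensed sketch of the resulting copy-loop body (illustrative only: the
    copy helpers filemap_copy_from_user_atomic() and filemap_copy_from_page()
    are placeholder names, and all error handling is elided):

        /* 'page' is the locked pagecache page for (mapping, index). */
        if (PageUptodate(page)) {
                /* Case 1: safe to copy in place.  The atomic usercopy cannot
                 * fault, and a short copy leaves already-uptodate data behind,
                 * so the full length can still be committed. */
                copied = filemap_copy_from_user_atomic(page, offset, buf, bytes);
        } else {
                /* Case 2: stage the user data in a temporary page while the
                 * destination is unlocked, then relock and copy through the
                 * kernel mapping so no pagefault is taken under the lock. */
                unlock_page(page);
                copy_from_user(page_address(src_page), buf, bytes);
                lock_page(page);
                filemap_copy_from_page(page, offset, src_page, bytes);
        }
        status = a_ops->commit_write(file, page, offset, offset + bytes);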

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Hide some of the open-coded nr_segs tests into the iovec helpers. This is all
    to simplify generic_file_buffered_write, because that gets more complex in the
    next patch.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Quite a bit of code is used in maintaining these "cached pages" that are
    probably pretty unlikely to get used. It would require a narrow race where
    the page is inserted concurrently while this process is allocating a page
    in order to create the spare page, followed by a multi-page write into an
    uncached part of the file to make use of it.

    Next, the buffered write path (and others) uses its own LRU pagevec when it
    should be just using the per-CPU LRU pagevec (which will cut down on both data
    and code size cacheline footprint). Also, these private LRU pagevecs are
    emptied after just a very short time, in contrast with the per-CPU pagevecs
    that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
    to add the pages to pagecache for a bulk write (in 4K chunks).

    [this gets rid of some cond_resched() calls in readahead.c and mpage.c due
    to clashes in -mm. What put them there, and why? ]

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
    we may have failed the write operation despite prepare_write having
    instantiated blocks past i_size. Fix this, and consolidate the trimming into
    one place.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Allow CONFIG_DEBUG_VM to switch off the prefaulting logic; this makes the
    race much easier to hit.

    This is useful for demonstration and testing purposes, but is removed in a
    subsequent patch.
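
    The change amounts to compiling the prefault out, roughly (a sketch of the
    relevant spot in generic_file_buffered_write):

        #ifndef CONFIG_DEBUG_VM
                /* Prefault the user page so the later in-lock atomic copy is
                 * unlikely to fail; skipping it makes the race easy to hit. */
                fault_in_pages_readable(buf, bytes);
        #endif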

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Rename some variables and fix some types.

    Signed-off-by: Andrew Morton
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • This reverts commit 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which
    fixed the following bug:

    When prefaulting in the pages in generic_file_buffered_write(), we only
    faulted in the pages for the first segment of the iovec. If the second or a
    successive segment described an mmapping of the page into which we're
    write()ing, and that page is not up-to-date, the fault handler tries to lock
    the already-locked page (to bring it up to date) and deadlocks.

    An exploit for this bug is in writev-deadlock-demo.c, in
    http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

    (These demos assume blocksize < PAGE_CACHE_SIZE).

    The problem with this fix is that it takes the kernel back to doing a single
    prepare_write()/commit_write() per iovec segment. So in the worst case we'll
    run prepare_write+commit_write 1024 times where we previously would have run
    it once. The other problem with the fix is that it didn't fix all the
    locking problems.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • This reverts commit 81b0c8713385ce1b1b9058e916edcf9561ad76d6, which was
    a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83 ("[PATCH]
    generic_file_buffered_write(): deadlock on vectored write"), which we
    also revert.

    Signed-off-by: Andrew Morton
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Revert the patch from Neil Brown to optimise NFSD writev handling.

    Cc: Neil Brown
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • While running some memory intensive load, system response deteriorated just
    after swap-out started.

    The cause of this problem is that when a PG_reclaim page is moved to the tail
    of the inactive LRU list in rotate_reclaimable_page(), the lru_lock spinlock
    is acquired for every page writeback. This deteriorates system performance
    and lengthens the interrupt hold-off time when swap-out starts.

    The following patch solves this problem by using a pagevec when rotating
    reclaimable pages, to mitigate lru_lock contention and reduce the interrupt
    hold-off time.

    I ran a test that allocates and touches pages in multiple processes while
    pinging the test machine in flood mode, to measure response under memory
    intensive load.

    The test result is:

    -2.6.23-rc5
    --- testmachine ping statistics ---
    3000 packets transmitted, 3000 received, 0% packet loss, time 53222ms
    rtt min/avg/max/mdev = 0.074/0.652/172.228/7.176 ms, pipe 11, ipg/ewma
    17.746/0.092 ms

    -2.6.23-rc5-patched
    --- testmachine ping statistics ---
    3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
    rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma
    17.314/0.091 ms

    Max round-trip-time was improved.

    The test machine spec: 4 CPUs (3.16GHz, Hyper-Threading enabled), 8GB memory,
    8GB swap.

    I ran the ping test again to observe the performance deterioration caused by
    taking a page reference.

    -2.6.23-rc6-with-modifiedpatch
    --- testmachine ping statistics ---
    3000 packets transmitted, 3000 received, 0% packet loss, time 53386ms
    rtt min/avg/max/mdev = 0.074/0.110/4.716/0.147 ms, pipe 2, ipg/ewma 17.801/0.129 ms

    The result for my original patch is as follows.

    -2.6.23-rc5-with-originalpatch
    --- testmachine ping statistics ---
    3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
    rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 17.314/0.091 ms

    The influence on response time was small.

    [akpm@linux-foundation.org: fix uninitalised var warning]
    [hugh@veritas.com: fix locking]
    [randy.dunlap@oracle.com: fix function declaration]
    [hugh@veritas.com: fix BUG at include/linux/mm.h:220!]
    [hugh@veritas.com: kill redundancy in rotate_reclaimable_page]
    [hugh@veritas.com: move_tail_pages into lru_add_drain]
    Signed-off-by: Hisashi Hifumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hisashi Hifumi
     
  • Allow an application to query the memories allowed by its context.

    Updated numa_memory_policy.txt to mention that applications can use this to
    obtain allowed memories for constructing valid policies.

    TODO: update out-of-tree libnuma wrapper[s], or maybe add a new
    wrapper--e.g., numa_get_mems_allowed() ?

    Also, update numa syscall man pages.

    Tested with memtoy V>=0.13.
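
    Until a libnuma wrapper exists, an application can reach the new query
    directly through the syscall. A minimal user-space sketch (assuming the flag
    name MPOL_F_MEMS_ALLOWED with value (1 << 2); error handling is reduced to
    perror()):

        #include <stdio.h>
        #include <unistd.h>
        #include <sys/syscall.h>

        #ifndef MPOL_F_MEMS_ALLOWED
        #define MPOL_F_MEMS_ALLOWED (1 << 2)
        #endif

        int main(void)
        {
                unsigned long mask[16] = { 0 };         /* generously sized nodemask */

                if (syscall(SYS_get_mempolicy, NULL, mask, sizeof(mask) * 8,
                            NULL, MPOL_F_MEMS_ALLOWED) != 0) {
                        perror("get_mempolicy");
                        return 1;
                }
                printf("mems allowed (first word): 0x%lx\n", mask[0]);
                return 0;
        }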

    Signed-off-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • The current VM can get itself into trouble fairly easily on systems with a
    small ZONE_HIGHMEM, which is common on i686 computers with 1GB of memory.

    On one side, page_alloc() will allocate down to zone->pages_low, while on
    the other side, kswapd() and balance_pgdat() will try to free memory from
    every zone, until every zone has more free pages than zone->pages_high.

    Highmem can be filled up to zone->pages_low with page tables, ramfs,
    vmalloc allocations and other unswappable things quite easily and without
    many bad side effects, since we still have a huge ZONE_NORMAL to do future
    allocations from.

    However, as long as the number of free pages in the highmem zone is below
    zone->pages_high, kswapd will continue swapping things out from
    ZONE_NORMAL, too!

    Sami Farin managed to get his system into a state where kswapd had freed
    about 700MB of low memory and was still "going strong".

    The attached patch will make kswapd stop paging out data from zones when
    there is more than enough memory free. We do go above zone->pages_high in
    order to keep pressure between zones equal in normal circumstances, but the
    patch should prevent the kind of excesses that made Sami's computer totally
    unusable.

    Signed-off-by: Rik van Riel
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • vmalloc() returns a void pointer, so there's no need to cast its
    return value in mm/page_alloc.c::zone_wait_table_init().

    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • A while back, Nick Piggin introduced a patch to reduce the node memory
    usage for small files (commit cfd9b7df4abd3257c9e381b0e445817b26a51c0c):

    -#define RADIX_TREE_MAP_SHIFT 6
    +#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)

    Unfortunately, he didn't take into account the fact that the
    calculation of the maximum path was based on an assumption of having
    to round up:

    #define RADIX_TREE_MAX_PATH (RADIX_TREE_INDEX_BITS/RADIX_TREE_MAP_SHIFT + 2)

    So, if CONFIG_BASE_SMALL is set, you will end up with a
    RADIX_TREE_MAX_PATH that is one greater than necessary. The practical
    upshot of this is just a bit of wasted memory (one long in the
    height_to_maxindex array, an extra pre-allocated radix tree node per
    cpu, and extra stack usage in a couple of functions), but it seems
    worth getting right.

    It's also worth noting that I never build with CONFIG_BASE_SMALL.
    What I did to test this was duplicate the code in a small user-space
    program and check the results of the calculations for max path and the
    contents of the height_to_maxindex array.
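
    A user-space sketch of that arithmetic (assuming the intended path length is
    ceil(RADIX_TREE_INDEX_BITS / RADIX_TREE_MAP_SHIFT) + 1, which is what the
    rounding argument above implies):

        #include <stdio.h>

        #define RADIX_TREE_INDEX_BITS (8 * sizeof(unsigned long))

        static void show(unsigned int shift)
        {
                unsigned long bits = RADIX_TREE_INDEX_BITS;
                unsigned long macro = bits / shift + 2;                 /* current macro */
                unsigned long need = (bits + shift - 1) / shift + 1;    /* assumed intent */

                printf("shift %u: macro gives %lu, %lu needed\n", shift, macro, need);
        }

        int main(void)
        {
                show(6);        /* CONFIG_BASE_SMALL=n: the two agree */
                show(4);        /* CONFIG_BASE_SMALL=y: macro is one too large */
                return 0;
        }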

    Signed-off-by: Jeff Moyer
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     
  • nobh mode error handling is not just pretty slack, it's wrong.

    One cannot zero out the whole page to ensure new blocks are zeroed, because
    it just brings the whole page "uptodate" with zeroes even if that may not
    be the correct uptodate data. Also, other parts of the page may already
    contain dirty data which would get lost by zeroing it out. Thirdly, the
    writeback of zeroes to the new blocks will also erase existing blocks. All
    these conditions are pagecache and/or filesystem corruption.

    The problem comes about because we didn't keep track of which buffers
    actually are new or old. However it is not enough to keep only this state,
    because at the point we start dirtying parts of the page (new blocks, with
    zeroes), the handling of IO errors becomes impossible without buffers: the
    page may be only partially uptodate, in which case the page flags alone
    cannot capture the state of the parts of the page.

    So allocate all buffers for the page upfront, but leave them unattached so
    that they don't pick up any other references and can be freed when we're
    done. If the error path is hit, then zero the new buffers as the regular
    buffer path does, then attach the buffers to the page so that it can
    actually be written out correctly and be subject to the normal IO error
    handling paths.

    As an upshot, we save 1K of kernel stack on ia64 or powerpc 64K page
    systems.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Move duplicated code from the end_buffer_read_XXX methods into a separate
    helper function.

    Signed-off-by: Dmitry Monakhov
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Monakhov
     
  • A NULL pointer means that the object was not allocated. One cannot
    determine the size of an object that has not been allocated. Currently we
    return 0 but we really should BUG() on attempts to determine the size of
    something nonexistent.

    krealloc() interprets NULL to mean a zero sized object. Handle that
    separately in krealloc().

    Signed-off-by: Christoph Lameter
    Acked-by: Pekka Enberg
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The calculation of pgoff in do_linear_fault() should use PAGE_SHIFT and not
    PAGE_CACHE_SHIFT since vma->vm_pgoff is in units of PAGE_SIZE and not
    PAGE_CACHE_SIZE. At the moment linux/pagemap.h has PAGE_CACHE_SHIFT
    defined as PAGE_SHIFT, but should that ever change this calculation would
    break.
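
    The calculation in question is essentially the following (a sketch, not the
    verbatim kernel source):

        /* vma->vm_pgoff is in PAGE_SIZE units, so the offset derived from the
         * faulting address must be shifted by PAGE_SHIFT, not PAGE_CACHE_SHIFT. */
        pgoff = (((address & PAGE_MASK) - vma->vm_start) >> PAGE_SHIFT)
                        + vma->vm_pgoff;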

    Signed-off-by: Dean Nelson
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dean Nelson
     
  • kfree(NULL) would normally occur only in error paths, and
    kfree(ZERO_SIZE_PTR) is uncommon as well, so let's use unlikely() for the
    condition check in SLUB's and SLOB's kfree() to optimize for the common
    case. SLAB has this already.
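
    The check in question then looks roughly like this at the top of kfree()
    (a sketch; ZERO_OR_NULL_PTR covers both NULL and the zero-size dummy pointer):

        void kfree(const void *x)
        {
                if (unlikely(ZERO_OR_NULL_PTR(x)))
                        return;
                /* ... normal slab freeing path ... */
        }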

    Signed-off-by: Satyam Sharma
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Satyam Sharma
     
  • Move the definitions of struct mm_struct and struct vma_area_struct to
    include/mm_types.h. This allows more functions in asm/pgtable.h and friends
    to be defined with inline assemblies instead of macros. Compile tested on
    i386, powerpc, powerpc64, s390-32, s390-64 and x86_64.

    [aurelien@aurel32.net: build fix]
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Aurelien Jarno
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     
  • Rather than sign direct radix-tree pointers with a special bit, sign the
    indirect one that hangs off the root. This means that, given a lookup_slot
    operation, the invalid result will be differentiated from the valid
    (previously, valid results could have the bit either set or clear).

    This does not affect slot lookups which occur under lock -- they can never
    return an invalid result. It is needed in future for the lockless pagecache.
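
    The trick itself is ordinary low-bit pointer tagging. A self-contained
    user-space sketch of the idea (not the kernel's radix-tree code):

        #include <assert.h>
        #include <stdint.h>
        #include <stdio.h>

        #define INDIRECT_BIT 1UL        /* nodes are word-aligned, so bit 0 is free */

        /* Tag a pointer that refers to an internal node rather than a data item. */
        static void *tag_indirect(void *node)
        {
                return (void *)((uintptr_t)node | INDIRECT_BIT);
        }

        static int is_indirect(const void *p)
        {
                return ((uintptr_t)p & INDIRECT_BIT) != 0;
        }

        static void *untag(void *p)
        {
                return (void *)((uintptr_t)p & ~INDIRECT_BIT);
        }

        int main(void)
        {
                long item = 42;
                void *slot = tag_indirect(&item);

                assert(is_indirect(slot));
                printf("*untagged = %ld\n", *(long *)untag(slot));
                return 0;
        }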

    Signed-off-by: Nick Piggin
    Acked-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin