17 Oct, 2007

40 commits

  • prepare_write/commit_write no longer return AOP_TRUNCATED_PAGE now that
    OCFS2 and GFS2 have been converted to the new aops, so some related
    simplifications can be made.

    [michal.k.k.piotrowski@gmail.com: fix warning]
    Signed-off-by: Nick Piggin
    Cc: Michael Halcrow
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Signed-off-by: Michal Piotrowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Implement nobh in new aops. This is a bit tricky. FWIW, nobh_truncate is
    now implemented in a way that does not create blocks in sparse regions,
    which is a silly thing for it to have been doing (isn't it?)

    ext2 survives fsx and fsstress. jfs is converted as well... ext3
    should be easy to do (but not done yet).

    [akpm@linux-foundation.org: coding-style fixes]
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Plug ocfs2 into the ->write_begin and ->write_end aops.

    A bunch of custom code is now gone - the iovec iteration stuff during write
    and the ocfs2 splice write actor.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Cc: Roman Zippel
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Acked-by: Russell King
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Acked-by: Dave Kleikamp
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: Andries Brouwer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Convert udf to new aops. This also seems to fix pagecache corruption in
    udf_adinicb_commit_write, where the page was marked uptodate when it was
    not. It also fixes the silly setup where prepare_write did a kmap to be
    used in commit_write: just do kmap_atomic in write_end. Use libfs helpers
    to make this easier.

    Signed-off-by: Nick Piggin
    Cc:
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • This also gets rid of a lot of useless read_file stuff. And also
    optimises the full page write case by marking a !uptodate page uptodate.

    Signed-off-by: Nick Piggin
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • [mszeredi]
    - don't send zero length write requests
    - it is not legal for the filesystem to return with zero written bytes

    Signed-off-by: Nick Piggin
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • [akpm@linux-foundation.org: fix against git-nfs]
    [peterz@infradead.org: fix against git-nfs]
    Signed-off-by: Nick Piggin
    Acked-by: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • This patch makes reiserfs use AOP_FLAG_CONT_EXPAND
    in order to get rid of the special generic_cont_expand routine.

    Signed-off-by: Vladimir Saveliev
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Saveliev
     
  • Convert reiserfs to new aops

    Signed-off-by: Vladimir Saveliev
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Saveliev
     
  • Make reiserfs write via generic routines.
    The original reiserfs write path, optimized for big writes, is deadlock prone.

    Signed-off-by: Vladimir Saveliev
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Saveliev
     
  • Signed-off-by: Nick Piggin
    Acked-by: Anders Larsen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: Tigran Aivazian
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Cc: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Rework the generic block "cont" routines to handle the new aops. Supporting
    cont_prepare_write alongside them would take quite a lot of code, so remove
    it instead (all filesystems are later converted to the new routines).

    write_begin gets passed AOP_FLAG_CONT_EXPAND when called from
    generic_cont_expand, so filesystems can avoid the old hacks they used.

    Signed-off-by: Nick Piggin
    Cc: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Cc: Nick Piggin
    Cc: Steven Whitehouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Whitehouse
     
  • Signed-off-by: Nick Piggin
    Cc: David Chinner
    Cc: Timothy Shimmin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Convert ext4 to use write_begin()/write_end() methods.

    Signed-off-by: Badari Pulavarty
    Signed-off-by: Nick Piggin
    Cc: Dmitriy Monakhov
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Various fixes and improvements

    Signed-off-by: Badari Pulavarty
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Implement new aops for some of the simpler filesystems.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • These are intended to replace prepare_write and commit_write with more
    flexible alternatives that are also able to avoid the buffered write
    deadlock problems efficiently (which prepare_write is unable to do).

    [mark.fasheh@oracle.com: API design contributions, code review and fixes]
    [akpm@linux-foundation.org: various fixes]
    [dmonakhov@sw.ru: new aop block_write_begin fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Mark Fasheh
    Signed-off-by: Dmitriy Monakhov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
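
A rough userspace model of the calling pattern these operations allow may help here. It is only a sketch under stated assumptions: `model_write_begin`, `model_write_end`, `model_copy`, the 16-byte page size and the `max_copy` knob are all hypothetical, and the real kernel operations take more arguments (file, mapping, flags, fsdata). What it tries to show is that the end hook is told how many bytes were actually copied, so a short copy can be committed and the loop can carry on from there.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 16u  /* tiny page size, assumed for illustration */
#define MODEL_FILE_SIZE 64u  /* fixed backing store, assumed for illustration */

struct model_file {
    unsigned char data[MODEL_FILE_SIZE]; /* stands in for the pagecache */
    size_t size;                         /* stands in for i_size */
};

/* Hypothetical begin hook: check the range and hand back the page to fill. */
int model_write_begin(struct model_file *f, size_t pos, size_t len,
                      unsigned char **pagep)
{
    if (pos + len > MODEL_FILE_SIZE)
        return -1;
    *pagep = f->data + (pos - pos % MODEL_PAGE_SIZE);
    return 0;
}

/* Hypothetical end hook: commit only the bytes that were really copied. */
size_t model_write_end(struct model_file *f, size_t pos, size_t len,
                       size_t copied)
{
    (void)len; /* the requested length is known, but only `copied` is committed */
    if (pos + copied > f->size)
        f->size = pos + copied;
    return copied;
}

/* A copy that may come up short; max_copy models a fault in the source. */
size_t model_copy(unsigned char *dst, const unsigned char *src, size_t len,
                  size_t max_copy)
{
    size_t n = len < max_copy ? len : max_copy;
    memcpy(dst, src, n);
    return n;
}

/* Page-at-a-time write loop: begin, copy, end, then advance by what landed. */
long model_buffered_write(struct model_file *f, size_t pos,
                          const unsigned char *buf, size_t count,
                          size_t max_copy)
{
    size_t written = 0;

    while (written < count) {
        size_t offset = (pos + written) % MODEL_PAGE_SIZE;
        size_t len = MODEL_PAGE_SIZE - offset;
        unsigned char *page;
        size_t copied;

        if (len > count - written)
            len = count - written;
        if (model_write_begin(f, pos + written, len, &page) != 0)
            return written ? (long)written : -1;
        copied = model_copy(page + offset, buf + written, len, max_copy);
        copied = model_write_end(f, pos + written, len, copied);
        if (copied == 0)
            break; /* no progress: stop instead of spinning */
        written += copied;
    }
    return (long)written;
}
```

With `max_copy` smaller than a page, every iteration is a short copy, yet the loop still makes progress because each pass commits exactly the bytes that arrived.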
     
  • New buffers against uptodate pages are simply marked uptodate, while the
    buffer_new bit remains set. This causes error-case code to zero out parts of
    those buffers because it thinks they contain stale data: wrong, they are
    actually uptodate so this is a data loss situation.

    Fix this by actually clearing buffer_new and marking the buffer dirty. It
    makes sense to always clear buffer_new before setting a buffer uptodate.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
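
The failure mode in this entry can be mimicked with a small userspace model. This is a sketch only: `struct model_buffer` and the `model_*` functions are hypothetical stand-ins for the buffer state bits, not kernel code, and the 8-byte block size is an arbitrary assumption.

```c
#include <string.h>

#define MODEL_BLOCK_SIZE 8u /* tiny block size, assumed for illustration */

struct model_buffer {
    int is_new;      /* models the buffer_new bit */
    int is_uptodate; /* models the buffer uptodate bit */
    int is_dirty;    /* models the buffer dirty bit */
    unsigned char data[MODEL_BLOCK_SIZE];
};

/* Old behaviour as described above: uptodate is set, the new bit stays. */
void model_mark_uptodate_buggy(struct model_buffer *bh)
{
    bh->is_uptodate = 1;
}

/* Fixed behaviour: clear the new bit first, then mark uptodate and dirty. */
void model_mark_uptodate_fixed(struct model_buffer *bh)
{
    bh->is_new = 0;
    bh->is_uptodate = 1;
    bh->is_dirty = 1;
}

/* Error path: zero anything that still claims to be new. Returns 1 if the
 * buffer contents were wiped, 0 if they were left alone. */
int model_error_path_zero_new(struct model_buffer *bh)
{
    if (!bh->is_new)
        return 0;
    memset(bh->data, 0, sizeof(bh->data));
    bh->is_new = 0;
    return 1;
}
```

In this model a buffer marked with the buggy helper is wiped by the error path even though its contents were valid, while one marked with the fixed helper keeps its data and is flagged dirty.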
     
  • Quite a bit of code is used in maintaining these "cached pages" that are
    probably pretty unlikely to get used. It would require a narrow race where
    the page is inserted concurrently while this process is allocating a page
    in order to create the spare page, followed by a multi-page write into an
    uncached part of the file to make use of it.

    Next, the buffered write path (and others) uses its own LRU pagevec when it
    should just be using the per-CPU LRU pagevec (which will cut down on both the
    data and code cacheline footprint). Also, these private LRU pagevecs are
    emptied after just a very short time, in contrast with the per-CPU pagevecs
    that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
    to add the pages to pagecache for a bulk write (in 4K chunks).

    [this gets rid of some cond_resched() calls in readahead.c and mpage.c due
    to clashes in -mm. What put them there, and why? ]

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • nobh mode error handling is not just pretty slack, it's wrong.

    One cannot zero out the whole page to ensure new blocks are zeroed, because
    it just brings the whole page "uptodate" with zeroes even if that may not
    be the correct uptodate data. Also, other parts of the page may already
    contain dirty data which would get lost by zeroing it out. Thirdly, the
    writeback of zeroes to the new blocks will also erase existing blocks. All
    these conditions are pagecache and/or filesystem corruption.

    The problem comes about because we didn't keep track of which buffers
    actually are new or old. However it is not enough just to keep only this
    state, because at the point we start dirtying parts of the page (new
    blocks, with zeroes), the handling of IO errors becomes impossible without
    buffers because the page may only be partially uptodate, in which case the
    page flags alone cannot capture the state of the parts of the page.

    So allocate all buffers for the page upfront, but leave them unattached so
    that they don't pick up any other references and can be freed when we're
    done. If the error path is hit, then zero the new buffers as the regular
    buffer path does, then attach the buffers to the page so that it can
    actually be written out correctly and be subject to the normal IO error
    handling paths.

    As an upshot, we save 1K of kernel stack on ia64 or powerpc 64K page
    systems.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Move duplicated code from end_buffer_read_XXX methods to separate helper
    function.

    Signed-off-by: Dmitry Monakhov
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Monakhov
     
  • The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note

    A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
    (and thus mapcounted and count towards shared rss). These writes to
    the struct page could cause excessive cacheline bouncing on big
    systems. There are a number of ways this could be addressed if it is
    an issue.

    And indeed this cacheline bouncing has shown up on large SGI systems.
    There was a situation where an Altix system was essentially livelocked
    tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
    This situation can be avoided in userspace, but it does highlight the
    potential scalability problem with refcounting ZERO_PAGE, and corner
    cases where it can really hurt (we don't want the system to livelock!).

    There are several broad ways to fix this problem:
    1. add back some special casing to avoid refcounting ZERO_PAGE
    2. per-node or per-cpu ZERO_PAGES
    3. remove the ZERO_PAGE completely

    I will argue for 3. The others should also fix the problem, but they
    result in more complex code than does 3, with little or no real benefit
    that I can see.

    Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
    false optimisation: if an application is performance critical, it would
    not be doing many read faults of new memory, or at least it could be
    expected to write to that memory soon afterwards. If cache or memory use
    is critical, it should not be working with a significant number of
    ZERO_PAGEs anyway (a more compact representation of zeroes should be
    used).

    As a sanity check -- measuring on my desktop system, there are never many
    mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
    increase much without it.

    When running a make -j4 kernel compile on my dual core system, there are
    about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
    ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
    is torn down without being COWed). So removing ZERO_PAGE will save 1,000
    page faults per second when running kbuild, while keeping it only saves
    less than 1 page clearing operation per second. 1 page clear is cheaper
    than a thousand faults, presumably, so there isn't an obvious loss.

    Neither the logical argument nor these basic tests give a guarantee of no
    regressions. However, this is a reasonable opportunity to try to remove
    the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
    we can reintroduce it and just avoid refcounting it.

    The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked. I don't see
    much use to them except on benchmarks. All other users of ZERO_PAGE are
    converted just to use ZERO_PAGE(0) for simplicity. We can look at
    replacing them all and maybe ripping out ZERO_PAGE completely when we are
    more satisfied with this solution.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus "snif" Torvalds

    Nick Piggin
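
The kbuild arithmetic in this entry can be written out as a small cost model. This is a sketch under assumptions: the `model_*` names are hypothetical, the figures plugged in below (1,000 read faults and 999 COW faults per second) are round illustrative values standing in for the "about 1,000" and "less than 1" quoted above, and the model only counts faults and page clears, nothing else.

```c
struct model_fault_cost {
    long faults;      /* page faults taken per second */
    long page_clears; /* pages zeroed per second */
};

/* With ZERO_PAGE: a read fault maps the shared zero page, and a later write
 * takes a second (COW) fault, at which point a fresh page is zeroed. */
struct model_fault_cost model_with_zero_page(long read_faults, long cow_faults)
{
    struct model_fault_cost c = { read_faults + cow_faults, cow_faults };
    return c;
}

/* Without ZERO_PAGE: every read fault zeroes a private page right away, so
 * a later write to it needs no further fault. */
struct model_fault_cost model_without_zero_page(long read_faults)
{
    struct model_fault_cost c = { read_faults, read_faults };
    return c;
}
```

Plugging in 1,000 read faults and 999 COW faults, the model without ZERO_PAGE takes 999 fewer faults per second and performs 1 more page clear per second, which is the trade-off the message describes.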
     
  • Combine the file_ra_state members
        unsigned long prev_index
        unsigned int prev_offset
    into
        loff_t prev_pos

    It is more consistent and better supports huge files.

    Thanks to Peter for the nice proposal!

    [akpm@linux-foundation.org: fix shift overflow]
    Cc: Peter Zijlstra
    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
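
How a page index and an in-page offset fold into a single byte position can be sketched in userspace. Everything here is an assumption for illustration: the `model_*` names are hypothetical, pages are taken to be 4K, and `uint32_t` stands in for a 32-bit `unsigned long`. The second helper shows one plausible way such a shift can overflow when the widening cast is missing; whether that is the exact issue behind the "fix shift overflow" note is not stated in the message.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12u /* assumed 4K pages */
#define MODEL_PAGE_MASK ((1u << MODEL_PAGE_SHIFT) - 1u)

typedef int64_t model_loff_t; /* stands in for loff_t */

/* Widen before shifting so indexes past 4GB survive. */
model_loff_t model_make_prev_pos(uint32_t prev_index, uint32_t prev_offset)
{
    return ((model_loff_t)prev_index << MODEL_PAGE_SHIFT) | prev_offset;
}

/* Shift in 32 bits first: the high bits are lost before the widening. */
model_loff_t model_make_prev_pos_truncating(uint32_t prev_index,
                                            uint32_t prev_offset)
{
    return (model_loff_t)((prev_index << MODEL_PAGE_SHIFT) | prev_offset);
}

/* Recover the page index from a byte position. */
uint32_t model_pos_to_index(model_loff_t pos)
{
    return (uint32_t)(pos >> MODEL_PAGE_SHIFT);
}

/* Recover the offset within the page from a byte position. */
uint32_t model_pos_to_offset(model_loff_t pos)
{
    return (uint32_t)(pos & MODEL_PAGE_MASK);
}
```

For page index 0x100000 (the first page past 4GB with 4K pages) and offset 5, the widened form keeps the full position and round-trips back to the same index and offset, while the truncating form collapses to 5.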