24 Nov, 2013

1 commit

  • Immutable biovecs are going to require an explicit iterator. To
    implement immutable bvecs, a later patch is going to add a bi_bvec_done
    member to this struct; for now, this patch effectively just renames
    things.

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Geert Uytterhoeven
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Ed L. Cashin"
    Cc: Nick Piggin
    Cc: Lars Ellenberg
    Cc: Jiri Kosina
    Cc: Matthew Wilcox
    Cc: Geoff Levand
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Joshua Morris
    Cc: Philip Kelleher
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux390@de.ibm.com
    Cc: Boaz Harrosh
    Cc: Benny Halevy
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: "Nicholas A. Bellinger"
    Cc: Alexander Viro
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: Prasad Joshi
    Cc: Trond Myklebust
    Cc: KONISHI Ryusuke
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: xfs@oss.sgi.com
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Len Brown
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Herton Ronaldo Krzesinski
    Cc: Ben Hutchings
    Cc: Andrew Morton
    Cc: Guo Chao
    Cc: Tejun Heo
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Wei Yongjun
    Cc: "Roger Pau Monné"
    Cc: Jan Beulich
    Cc: Stefano Stabellini
    Cc: Ian Campbell
    Cc: Sebastian Ott
    Cc: Christian Borntraeger
    Cc: Minchan Kim
    Cc: Jiang Liu
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Joe Perches
    Cc: Peng Tao
    Cc: Andy Adamson
    Cc: fanchaoting
    Cc: Jie Liu
    Cc: Sunil Mushran
    Cc: "Martin K. Petersen"
    Cc: Namjae Jeon
    Cc: Pankaj Kumar
    Cc: Dan Magenheimer
    Cc: Mel Gorman

    Kent Overstreet
     

10 Sep, 2013

1 commit

  • Not using the return value can in the generic case be racy, so it's
    in general good practice to check the return value instead.

    This also resolved the warning caused on ARM and other architectures:

    fs/direct-io.c: In function 'sb_init_dio_done_wq':
    fs/direct-io.c:557:2: warning: value computed is not used [-Wunused-value]

    Signed-off-by: Olof Johansson
    Reviewed-by: Jan Kara
    Cc: Geert Uytterhoeven
    Cc: Stephen Rothwell
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Russell King
    Cc: H Peter Anvin
    Signed-off-by: Linus Torvalds

    Olof Johansson
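
    Why ignoring the result is racy can be shown with a userspace sketch. It
    assumes the call in question is a compare-and-exchange that publishes a
    freshly allocated workqueue (the message above does not name the call).
    The names struct workqueue, shared_wq and init_shared_wq() are
    hypothetical, and C11 atomics stand in for the kernel primitive.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct workqueue { int id; };            /* hypothetical stand-in type */

static _Atomic(struct workqueue *) shared_wq;   /* starts out NULL */

/*
 * Publish a workqueue at most once. If another caller installed one
 * first, the compare-exchange fails; only by checking its result do we
 * learn that our copy must be freed instead of leaked.
 */
struct workqueue *init_shared_wq(int id)
{
    struct workqueue *wq = malloc(sizeof(*wq));
    struct workqueue *expected = NULL;

    if (!wq)
        return NULL;
    wq->id = id;

    if (!atomic_compare_exchange_strong(&shared_wq, &expected, wq)) {
        free(wq);          /* lost the race: drop our copy */
        return expected;   /* now holds the pointer that won */
    }
    return wq;
}
```

    Without the check, a second caller would leak its allocation and could
    go on using a workqueue that was never installed.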
     

04 Sep, 2013

2 commits

  • Call generic_write_sync() from the deferred I/O completion handler if
    O_DSYNC is set for a write request. Also make sure various callers
    don't call generic_write_sync if the direct I/O code returns
    -EIOCBQUEUED.

    Based on an earlier patch from Jan Kara with updates from
    Jeff Moyer and Darrick J. Wong.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Add support to the core direct-io code to defer AIO completions to user
    context using a workqueue. This replaces opencoded and less efficient
    code in XFS and ext4 (we save a memory allocation for each direct IO)
    and will be needed to properly support O_(D)SYNC for AIO.

    The communication between the filesystem and the direct I/O code requires
    a new buffer head flag, which is a bit ugly but not avoidable until the
    direct I/O code stops abusing the buffer_head structure for communicating
    with the filesystems.

    Currently this creates a per-superblock unbound workqueue for these
    completions, which is taken from an earlier patch by Jan Kara. I'm
    not really convinced about this use and would prefer a "normal" global
    workqueue with a high concurrency limit, but this needs further discussion.

    JK: Fixed ext4 part, dynamic allocation of the workqueue.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Christoph Hellwig
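
    The caller-side rule from the O_DSYNC commit above (do not call
    generic_write_sync() when the direct I/O code returns -EIOCBQUEUED) can
    be sketched as a small predicate. This is an illustration only:
    caller_should_sync() is a hypothetical helper, and the numeric value of
    EIOCBQUEUED is copied from the kernel's internal errno list as an
    assumption.

```c
#include <stdbool.h>

#define EIOCBQUEUED 529   /* assumed value; kernel-internal "iocb queued" code */

/*
 * Decide whether the caller of the direct I/O path should sync now.
 * A positive return means bytes were written synchronously, so the
 * caller syncs. -EIOCBQUEUED means the request is still in flight and
 * the deferred completion handler is responsible for the sync.
 */
bool caller_should_sync(long ret)
{
    return ret > 0;
}
```

    Errors and queued requests both fall through without a sync; only a
    completed write triggers one in the caller.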
     

09 May, 2013

1 commit

  • Pull block core updates from Jens Axboe:

    - Major bit is Kent's prep work for immutable bio vecs.

    - Stable candidate fix for a scheduling-while-atomic in the queue
    bypass operation.

    - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
    discard bios.

    - Tejun's changes to convert the writeback thread pool to the generic
    workqueue mechanism.

    - Runtime PM framework, SCSI patches exist on top of these in James'
    tree.

    - A few random fixes.

    * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
    relay: move remove_buf_file inside relay_close_buf
    partitions/efi.c: replace useless kzalloc's by kmalloc's
    fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
    block: fix max discard sectors limit
    blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
    Documentation: cfq-iosched: update documentation help for cfq tunables
    writeback: expose the bdi_wq workqueue
    writeback: replace custom worker pool implementation with unbound workqueue
    writeback: remove unused bdi_pending_list
    aoe: Fix unitialized var usage
    bio-integrity: Add explicit field for owner of bip_buf
    block: Add an explicit bio flag for bios that own their bvec
    block: Add bio_alloc_pages()
    block: Convert some code to bio_for_each_segment_all()
    block: Add bio_for_each_segment_all()
    bounce: Refactor __blk_queue_bounce to not use bi_io_vec
    raid1: use bio_copy_data()
    pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
    pktcdvd: use bio_copy_data()
    block: Add bio_copy_data()
    ...

    Linus Torvalds
     

08 May, 2013

1 commit

  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     

30 Apr, 2013

2 commits

  • Currently, dio_send_cur_page() submits bio before current page and cached
    sdio->cur_page is added to the bio if sdio->boundary is set. This is
    actually wrong because sdio->boundary means the current buffer is the last
    one before metadata needs to be read. So we should rather submit the bio
    after the current page is added to it.

    Signed-off-by: Jan Kara
    Reported-by: Kazuya Mio
    Tested-by: Kazuya Mio
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • When we read/write a file sequentially, we will read/write not only the
    data blocks but also the indirect blocks that may not be physically
    adjacent to the data blocks. So filesystems set the BH_Boundary flag to
    submit the previous I/O before reading/writing an indirect block.

    However the generic direct IO code mishandles buffer_boundary(), setting
    sdio->boundary before each submit_page_section() call which results in
    sending only one page bios as underlying code thinks this page is the last
    in the contiguous extent. So fix the problem by setting sdio->boundary
    only if the current page is really the last one in the mapped extent.

    With this patch and "direct-io: submit bio after boundary buffer is added
    to it" I've measured about 10% throughput improvement of direct IO reads
    on ext3 with SATA harddrive (from 90 MB/s to 100 MB/s). With ramdisk, the
    improvement was about 3-fold (from 350 MB/s to 1.2 GB/s). For other
    filesystems (such as ext4), the improvements won't be as visible because
    the frequency of BH_Boundary flag being set is much smaller.

    Signed-off-by: Jan Kara
    Reported-by: Kazuya Mio
    Tested-by: Kazuya Mio
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

24 Mar, 2013

1 commit

  • More prep work for immutable bvecs:

    A few places in the code were either open coding or using the wrong
    version - fix.

    After we introduce the bvec iter, it'll no longer be possible to modify
    the biovec through bio_for_each_segment_all() - it doesn't increment a
    pointer to the current bvec, you pass in a struct bio_vec (not a
    pointer) which is updated with what the current biovec would be (taking
    into account bi_bvec_done and bi_size).

    So because of that it's more worthwhile to be consistent about
    bio_for_each_segment()/bio_for_each_segment_all() usage.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown
    CC: Alasdair Kergon
    CC: dm-devel@redhat.com
    CC: Alexander Viro

    Kent Overstreet
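
    The by-value iteration described above can be sketched in plain C. The
    types and names here (toy_bvec, toy_iter, toy_iter_bvec()) are
    hypothetical simplifications, not the kernel's structures; only the
    idea of adjusting a copy by bvec_done and size follows the text.

```c
#include <stddef.h>

struct toy_bvec {
    size_t len;      /* bytes in this segment */
    size_t offset;   /* starting offset within the page */
};

struct toy_iter {
    size_t idx;        /* index of the current segment */
    size_t bvec_done;  /* bytes already consumed in that segment */
    size_t size;       /* bytes remaining in the whole request */
};

/*
 * Return the current segment by value, adjusted for what the iterator
 * has already consumed. The caller gets a copy, so writing to it cannot
 * change the underlying array.
 */
struct toy_bvec toy_iter_bvec(const struct toy_bvec *vecs, struct toy_iter it)
{
    struct toy_bvec cur = vecs[it.idx];

    cur.offset += it.bvec_done;
    cur.len -= it.bvec_done;
    if (cur.len > it.size)
        cur.len = it.size;
    return cur;
}
```

    Code that needs to modify the stored segments therefore has to walk the
    array itself, which is the role the message assigns to
    bio_for_each_segment_all().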
     

23 Feb, 2013

1 commit

  • A running AIO pins the inode in memory through a file reference. Once
    the AIO is completed using aio_complete(), the file reference is put and
    the inode can be freed from memory. So we have to be sure that calling
    aio_complete() is the last thing we do with the inode.

    CC: Christoph Hellwig
    CC: Jens Axboe
    CC: Jeff Moyer
    CC: stable@vger.kernel.org
    Acked-by: Jeff Moyer
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

30 Nov, 2012

1 commit

  • Since directio can work on a raw block device, and the block size of the
    device can change under it, we need to do the same thing that
    fs/buffer.c now does: read the block size a single time, using
    ACCESS_ONCE().

    Reading it multiple times can get different results, which will then
    confuse the code because it actually encodes the i_blksize in
    relationship to the underlying logical blocksize.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
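
    A minimal sketch of the single-read pattern, assuming a GCC or Clang
    compiler for __typeof__. The macro has the same shape as the kernel's
    ACCESS_ONCE() of that era; struct toy_inode and round_down_to_block()
    are hypothetical names used only for illustration.

```c
/* Force exactly one load of x by going through a volatile lvalue. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct toy_inode {
    unsigned int i_blkbits;   /* log2 of the block size; may change under us */
};

/*
 * Round an offset down to the start of its block. Both shifts use the
 * same snapshot of i_blkbits, so a concurrent change cannot make the
 * two halves of the calculation disagree.
 */
unsigned long long round_down_to_block(struct toy_inode *inode,
                                       unsigned long long offset)
{
    unsigned int blkbits = ACCESS_ONCE(inode->i_blkbits);

    return (offset >> blkbits) << blkbits;
}
```

    Reading inode->i_blkbits separately for each shift would allow the two
    reads to observe different values.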
     

09 Aug, 2012

1 commit

  • Move unplugging for direct I/O from around ->direct_IO() down to
    do_blockdev_direct_IO(). This implicitly adds plugging for direct
    writes.

    CC: Li Shaohua
    Acked-by: Jeff Moyer
    Signed-off-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Fengguang Wu
     

14 Jul, 2012

1 commit

  • READ is 0, so the result of the bit-and operation is 0. Rewrite with == as
    done elsewhere in the same file.

    This problem was found using Coccinelle (http://coccinelle.lip6.fr/).

    Signed-off-by: Julia Lawall
    Reviewed-by: Jeff Moyer
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Julia Lawall
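
    The bug class is easy to reproduce outside the kernel. In this sketch
    READ and WRITE are defined locally with the values the message implies
    (READ is 0), and the two helper names are hypothetical.

```c
#include <stdbool.h>

#define READ  0   /* as stated in the commit message */
#define WRITE 1   /* assumed counterpart value */

/* Buggy form: anything ANDed with 0 is 0, so this is never true. */
bool is_read_buggy(int rw)
{
    return (rw & READ) != 0;
}

/* Fixed form: compare against READ directly. */
bool is_read_fixed(int rw)
{
    return rw == READ;
}
```

    The buggy test silently turns the guarded branch into dead code, which
    is why a semantic tool such as Coccinelle is useful for finding it.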
     

01 Jun, 2012

1 commit


24 Feb, 2012

1 commit

  • With kernel 3.1, Christoph removed i_alloc_sem and replaced it with
    calls (namely inode_dio_wait() and inode_dio_done()) which are
    EXPORT_SYMBOL_GPL() thus they cannot be used by non-GPL file systems and
    further inode_dio_wait() was pushed from notify_change() into the file
    system ->setattr() method but no non-GPL file system can make this call.

    That means non-GPL file systems cannot exist any more unless they do not
    use any VFS functionality related to reading/writing as far as I can
    tell or at least as long as they want to implement direct i/o.

    Both Linus and Al (and others) have said on LKML that this breakage of
    the VFS API should not have happened and that the change was simply
    missed as it was not documented in the change logs of the patches that
    did those changes.

    This patch changes the two function exports in question to be
    EXPORT_SYMBOL() thus restoring the VFS API as it used to be - accessible
    for all modules.

    Christoph, who introduced the two functions and exported them GPL-only
    is CC-ed on this patch to give him the opportunity to object to the
    symbols being changed in this manner if he did indeed intend them to be
    GPL-only and does not want them to become available to all modules.

    Signed-off-by: Anton Altaparmakov
    CC: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Anton Altaparmakov
     

13 Jan, 2012

2 commits

  • Some investigation of a transaction processing workload showed that a
    major consumer of cycles in __blockdev_direct_IO is the cache miss while
    accessing the block size. This is because it has to walk the chain from
    block_dev to gendisk to queue.

    The block size is needed early on to check alignment and sizes. It's only
    done if the check for the inode block size fails. But the costly block
    device state is unconditionally fetched.

    - Reorganize the code to only fetch block dev state when actually
    needed.

    Then do a prefetch on the block dev early on in the direct IO path. This
    is worth it, because there is substantial code run before we actually
    touch the block dev now.

    - I also added some unlikelies to make it clear to the compiler that the
    block device fetch code is not normally executed.

    This gave a small, but measurable improvement on a large database
    benchmark (about 0.3%)

    [akpm@linux-foundation.org: coding-style fixes]
    [sfr@canb.auug.org.au: using prefetch requires including prefetch.h]
    Signed-off-by: Andi Kleen
    Cc: Jeff Moyer
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • In get_more_blocks(), we use dio_count to calculate fs_count and do some
    tricky things to increase fs_count if dio_count isn't aligned. But
    actually it still has some corner cases that can't be covered. See the
    following example:

    dio_write foo -s 1024 -w 4096

    (direct write 4096 bytes at offset 1024). The same goes if the offset
    isn't aligned to fs_blocksize.

    In this case, the old calculation counts fs_count to be 1, but actually we
    will write into 2 different blocks (if fs_blocksize=4096). The old code
    just works, since it will call get_block twice (and may have to allocate
    and create extents twice for filesystems like ext4). So we'd better call
    get_block just once with the proper fs_count.

    Signed-off-by: Tao Ma
    Cc: "Theodore Ts'o"
    Cc: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tao Ma
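
    The arithmetic behind the example can be checked with a short function.
    This is a sketch of the idea rather than the patch itself:
    blocks_spanned() is a hypothetical helper that works on byte offsets,
    and it assumes len is greater than zero.

```c
#include <stdint.h>

/*
 * Number of filesystem blocks touched by a request of len bytes that
 * starts at byte offset, for a block size of 1 << blkbits.
 */
uint64_t blocks_spanned(uint64_t offset, uint64_t len, unsigned blkbits)
{
    uint64_t first = offset >> blkbits;              /* block of first byte */
    uint64_t last = (offset + len - 1) >> blkbits;   /* block of last byte */

    return last - first + 1;
}
```

    For the case above (offset 1024, length 4096, 4096-byte blocks) the
    first byte is in block 0 and the last byte, 5119, is in block 1, so the
    request spans two blocks.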
     

28 Oct, 2011

7 commits

  • This doesn't change anything for the compiler, but hch thought it would
    make the code clearer.

    I moved the reference counting into its own little inline.

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • Add inlines to all the submission path functions. While this increases
    code size it also gives gcc a lot of optimization opportunities
    in this critical hotpath.

    In particular -- together with some other changes -- this
    allows gcc to get rid of the unnecessary clearing of
    sdio at the beginning and optimize the messy parameter passing.
    Any non inlining of a function which takes a sdio parameter
    would break this optimization because they cannot be done if the
    address of a structure is taken.

    Note that benefits are only seen with CONFIG_OPTIMIZE_INLINING
    and CONFIG_CC_OPTIMIZE_FOR_SIZE both set to off.

    This gives about 2.2% improvement on a large database benchmark
    with a high IOPS rate.

    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • Only a single b_private field in the map_bh buffer head is needed after
    the submission path. Move map_bh separately to avoid storing
    this information in the long term slab.

    This avoids the weird 104 byte hole in struct dio_submit which also needed
    to be memseted early.

    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • A direct slab call is slightly faster than kmalloc and can be better cached
    per CPU. It also avoids rounding to the next kmalloc slab.

    In addition this enforces cache line alignment for struct dio to avoid
    any false sharing.

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • Fix most problems reported by pahole.

    There is still a weird 104 byte hole after map_bh. I'm not sure what
    causes this.

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • There's nothing on the stack, even before my changes.

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • This large, but largely mechanical, patch moves all fields in struct dio
    that are only used in the submission path into a separate on stack
    data structure. This has the advantage that the memory is very likely
    cache hot, which is not guaranteed for memory fresh out of kmalloc.

    This also gives gcc more optimization potential because it can more
    easily determine that there are no external aliases for these variables.

    The sdio initialization is an initialization now instead of a memset.
    This allows gcc to break sdio into individual fields and optimize
    away unnecessary zeroing (after all the functions are inlined).

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     

27 Jul, 2011

1 commit

  • This allows us to move duplicated code in <asm/atomic.h>
    (atomic_inc_not_zero() for now) to <linux/atomic.h>.

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     

21 Jul, 2011

4 commits

  • For filesystems that delay their end_io processing we should keep our
    i_dio_count until the processing is done. Enable this by moving
    the inode_dio_done call to the end_io handler if one exists. Note that
    the actual move to the workqueue for ext4 and XFS is not done in
    this patch yet, but left to the filesystem maintainers. At least
    for XFS it's not needed yet either as XFS has an internal equivalent
    to i_dio_count.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
    This allows these filesystems to also protect truncate against direct I/O requests
    by using common code. Right now the only non-DIO_LOCKING filesystem that
    appears to do so is XFS, which uses an opencoded variant of the i_dio_count
    scheme.

    Behaviour doesn't change for filesystems never calling inode_dio_wait.
    For ext4 behaviour changes when using the dioread_nonlock option, which
    previously was missing any protection between truncate and direct I/O reads.
    For ocfs2 the handcrafted i_dio_count manipulations are replaced with
    the common code now enabled.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • i_alloc_sem is a rather special rw_semaphore. It's the last one that may
    be released by a non-owner, and its write side is always mirrored by
    real exclusion. Its intended use is to wait for all pending direct I/O
    requests to finish before starting a truncate.

    Replace it with a hand-grown construct:

    - exclusion for truncates is already guaranteed by i_mutex, so it can
    simply fall away
    - the reader side is replaced by an i_dio_count member in struct inode
    that counts the number of pending direct I/O requests. Truncate can't
    proceed as long as it's non-zero
    - when i_dio_count reaches zero we wake up a pending truncate using
    wake_up_bit on a new bit in i_flags
    - new references to i_dio_count can't appear while we are waiting for
    it to read zero because the direct I/O count always needs i_mutex
    (or an equivalent like XFS's i_iolock) for starting a new operation.

    This scheme is much simpler, and saves the space of a spinlock_t and a
    struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
    system).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Reject zero sized reads as soon as we know our I/O length, and don't
    bother with locks or allocations that might have to be cleaned up
    otherwise.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
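
    The counting scheme from the i_alloc_sem replacement above can be
    sketched in userspace with C11 atomics. Everything here is a
    hypothetical simplification: struct toy_inode and the toy_* helpers
    stand in for the kernel code, and the actual sleep and wake-up
    (wake_up_bit in the message) is reduced to a boolean result.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_inode {
    atomic_int dio_count;   /* number of pending direct I/O requests */
};

/* A direct I/O request starts; the caller is assumed to hold the inode lock. */
void toy_dio_begin(struct toy_inode *inode)
{
    atomic_fetch_add(&inode->dio_count, 1);
}

/*
 * A direct I/O request finishes. Returns true when the count dropped to
 * zero, which is the moment a waiting truncate would be woken up.
 */
bool toy_dio_done(struct toy_inode *inode)
{
    return atomic_fetch_sub(&inode->dio_count, 1) == 1;
}

/* Truncate may proceed only while no direct I/O is pending. */
bool toy_truncate_may_proceed(struct toy_inode *inode)
{
    return atomic_load(&inode->dio_count) == 0;
}
```

    Because new requests need the same lock that truncate holds while it
    waits, the count can only fall during the wait, so waiting for zero is
    enough.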
     

25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

10 Mar, 2011

2 commits

  • With the plugging now being explicitly controlled by the
    submitter, callers need not pass down unplugging hints
    to the block layer. If they want to unplug, it's because they
    manually plugged on their own - in which case, they should just
    unplug at will.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So let's kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Feb, 2011

1 commit


21 Jan, 2011

1 commit

  • When using devices that support max_segments > BIO_MAX_PAGES (256), direct
    IO tries to allocate a bio with more pages than allowed, which leads to an
    oops in dio_bio_alloc(). Clamp the request to the supported maximum, and
    change dio_bio_alloc() to reflect that bio_alloc() will always return a
    bio when called with __GFP_WAIT and a valid number of vectors.

    [akpm@linux-foundation.org: remove redundant BUG_ON()]
    Signed-off-by: David Dillow
    Reviewed-by: Jeff Moyer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Dillow
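
    The clamp itself is a one-line bound. In this sketch BIO_MAX_PAGES is
    defined locally with the value quoted in the message, and
    dio_clamp_nr_pages() is a hypothetical helper name.

```c
#define BIO_MAX_PAGES 256   /* limit quoted in the commit message */

/*
 * Bound the number of pages requested for one bio so the allocation is
 * never asked for more vectors than a bio can hold.
 */
int dio_clamp_nr_pages(int nr_pages)
{
    return nr_pages < BIO_MAX_PAGES ? nr_pages : BIO_MAX_PAGES;
}
```

    A device advertising, say, 300 segments would then get a 256-page bio,
    and the remaining pages go into a following bio.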
     

19 Jan, 2011

1 commit


27 Oct, 2010

1 commit


10 Sep, 2010

1 commit

  • commit c2c6ca4 (direct-io: do not merge logically non-contiguous requests)
    introduced a bug whereby all O_DIRECT I/Os were submitted a page at a time
    to the block layer. The problem is that the code expected
    dio->block_in_file to correspond to the current page in the dio. In fact,
    it corresponds to the previous page submitted via submit_page_section.
    This was purely an oversight, as the dio->cur_page_fs_offset field was
    introduced for just this purpose. This patch simply uses the correct
    variable when calculating whether there is a mismatch between contiguous
    logical blocks and contiguous physical blocks (as described in the
    comments).

    I also switched the if conditional following this check to an else if, to
    ensure that we never call dio_bio_submit twice for the same dio (in
    theory, this should not happen, anyway).

    I've tested this by running blktrace and verifying that a 64KB I/O was
    submitted as a single I/O. I also ran the patched kernel through
    xfstests' aio tests using xfs, ext4 (with 1k and 4k block sizes) and btrfs
    and verified that there were no regressions as compared to an unpatched
    kernel.

    Signed-off-by: Jeff Moyer
    Acked-by: Josef Bacik
    Cc: Christoph Hellwig
    Cc: Chris Mason
    Cc: [2.6.35.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

10 Aug, 2010

1 commit

  • Move the call to vmtruncate to get rid of excessive blocks to the callers
    in preparation for the new truncate calling sequence. This was only done
    for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
    was not needed anyway. Get rid of blockdev_direct_IO_no_locking and
    its _newtrunc variant while at it, as just open-coding the two additional
    parameters is shorter than the name suffix.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

27 Jul, 2010

1 commit

  • Filesystems with unwritten extent support must not complete an AIO request
    until the transaction to convert the extent has been committed. That means
    the aio_complete call needs to be moved into the ->end_io callback so
    that the filesystem can control exactly when to call it.

    This makes a bit of a mess out of dio_complete and makes the ->end_io
    callback prototype even more complicated.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

28 May, 2010

1 commit

  • Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
    setattr > vmtruncate > truncate, have filesystems call their truncate sequence
    from ->setattr if filesystem specific operations are required. vmtruncate is
    deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
    previously should be used.

    simple_setattr is introduced for simple in-ram filesystems to implement
    the new truncate sequence. Eventually all filesystems should be converted
    to implement a setattr, and the default code in notify_change should go
    away.

    simple_setsize is also introduced to perform just the ATTR_SIZE portion
    of simple_setattr (ie. changing i_size and trimming pagecache).

    To implement the new truncate sequence:
    - filesystem specific manipulations (eg freeing blocks) must be done in
    the setattr method rather than ->truncate.
    - vmtruncate can not be used by core code to trim blocks past i_size in
    the event of write failure after allocation, so this must be performed
    in the fs code.
    - convert usage of helpers block_write_begin, nobh_write_begin,
    cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
    variants. These avoid calling vmtruncate to trim blocks (see previous).
    - inode_setattr should not be used. generic_setattr is a new function
    to be used to copy simple attributes into the generic inode.
    - make use of the better opportunity to handle errors with the new sequence.

    Big problem with the previous calling sequence: the filesystem is not called
    until i_size has already changed. This means it is not allowed to fail the
    call, and also it does not know what the previous i_size was. Also, generic
    code calling vmtruncate to truncate allocated blocks in case of error had
    no good way to return a meaningful error (or, for example, atomically handle
    block deallocation).

    Cc: Christoph Hellwig
    Acked-by: Jan Kara
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de