25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

10 Mar, 2011

2 commits

  • With the plugging now being explicitly controlled by the
    submitter, callers need not pass down unplugging hints
    to the block layer. If they want to unplug, it's because they
    manually plugged on their own - in which case, they should just
    unplug at will.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So lets kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Feb, 2011

1 commit


21 Jan, 2011

1 commit

  • When using devices that support max_segments > BIO_MAX_PAGES (256), direct
    IO tries to allocate a bio with more pages than allowed, which leads to an
    oops in dio_bio_alloc(). Clamp the request to the supported maximum, and
    change dio_bio_alloc() to reflect that bio_alloc() will always return a
    bio when called with __GFP_WAIT and a valid number of vectors.

    [akpm@linux-foundation.org: remove redundant BUG_ON()]
    Signed-off-by: David Dillow
    Reviewed-by: Jeff Moyer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Dillow
     

19 Jan, 2011

1 commit


27 Oct, 2010

1 commit


10 Sep, 2010

1 commit

  • commit c2c6ca4 (direct-io: do not merge logically non-contiguous requests)
    introduced a bug whereby all O_DIRECT I/Os were submitted a page at a time
    to the block layer. The problem is that the code expected
    dio->block_in_file to correspond to the current page in the dio. In fact,
    it corresponds to the previous page submitted via submit_page_section.
    This was purely an oversight, as the dio->cur_page_fs_offset field was
    introduced for just this purpose. This patch simply uses the correct
    variable when calculating whether there is a mismatch between contiguous
    logical blocks and contiguous physical blocks (as described in the
    comments).

    I also switched the if conditional following this check to an else if, to
    ensure that we never call dio_bio_submit twice for the same dio (in
    theory, this should not happen, anyway).

    I've tested this by running blktrace and verifying that a 64KB I/O was
    submitted as a single I/O. I also ran the patched kernel through
    xfstests' aio tests using xfs, ext4 (with 1k and 4k block sizes) and btrfs
    and verified that there were no regressions as compared to an unpatched
    kernel.

    Signed-off-by: Jeff Moyer
    Acked-by: Josef Bacik
    Cc: Christoph Hellwig
    Cc: Chris Mason
    Cc: [2.6.35.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

10 Aug, 2010

1 commit

  • Move the call to vmtruncate to get rid of accessive blocks to the callers
    in prepearation of the new truncate calling sequence. This was only done
    for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
    was not needed anyway. Get rid of blockdev_direct_IO_no_locking and
    its _newtrunc variant while at it as just opencoding the two additional
    paramters is shorted than the name suffix.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

27 Jul, 2010

1 commit

  • Filesystems with unwritten extent support must not complete an AIO request
    until the transaction to convert the extent has been commited. That means
    the aio_complete calls needs to be moved into the ->end_io callback so
    that the filesystem can control when to call it exactly.

    This makes a bit of a mess out of dio_complete and the ->end_io callback
    prototype even more complicated.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

28 May, 2010

1 commit

  • Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
    setattr > vmtruncate > truncate, have filesystems call their truncate sequence
    from ->setattr if filesystem specific operations are required. vmtruncate is
    deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
    previously should be used.

    simple_setattr is introduced for simple in-ram filesystems to implement
    the new truncate sequence. Eventually all filesystems should be converted
    to implement a setattr, and the default code in notify_change should go
    away.

    simple_setsize is also introduced to perform just the ATTR_SIZE portion
    of simple_setattr (ie. changing i_size and trimming pagecache).

    To implement the new truncate sequence:
    - filesystem specific manipulations (eg freeing blocks) must be done in
    the setattr method rather than ->truncate.
    - vmtruncate can not be used by core code to trim blocks past i_size in
    the event of write failure after allocation, so this must be performed
    in the fs code.
    - convert usage of helpers block_write_begin, nobh_write_begin,
    cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
    variants. These avoid calling vmtruncate to trim blocks (see previous).
    - inode_setattr should not be used. generic_setattr is a new function
    to be used to copy simple attributes into the generic inode.
    - make use of the better opportunity to handle errors with the new sequence.

    Big problem with the previous calling sequence: the filesystem is not called
    until i_size has already changed. This means it is not allowed to fail the
    call, and also it does not know what the previous i_size was. Also, generic
    code calling vmtruncate to truncate allocated blocks in case of error had
    no good way to return a meaningful error (or, for example, atomically handle
    block deallocation).

    Cc: Christoph Hellwig
    Acked-by: Jan Kara
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     

25 May, 2010

2 commits

  • Btrfs cannot handle having logically non-contiguous requests submitted. For
    example if you have

    Logical: [0-4095][HOLE][8192-12287]
    Physical: [0-4095] [4096-8191]

    Normally the DIO code would put these into the same BIO's. The problem is we
    need to know exactly what offset is associated with what BIO so we can do our
    checksumming and unlocking properly, so putting them in the same BIO doesn't
    work. So add another check where we submit the current BIO if the physical
    blocks are not contigous OR the logical blocks are not contiguous.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Because BTRFS can do RAID and such, we need our own submit hook so we can setup
    the bio's in the correct fashion, and handle checksum errors properly. So there
    are a few changes here

    1) The submit_io hook. This is straightforward, just call this instead of
    submit_bio.

    2) Allow the fs to return -ENOTBLK for reads. Usually this has only worked for
    writes, since writes can fallback onto buffered IO. But BTRFS needs the option
    of falling back on buffered IO if it encounters a compressed extent, since we
    need to read the entire extent in and decompress it. So if we get -ENOTBLK back
    from get_block we'll return back and fallback on buffered just like the write
    case.

    I've tested these changes with fsx and everything seems to work. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

17 Dec, 2009

1 commit


16 Dec, 2009

2 commits

  • Currently the locking in blockdev_direct_IO is a mess, we have three
    different locking types and very confusing checks for some of them. The
    most complicated one is DIO_OWN_LOCKING for reads, which happens to not
    actually be used.

    This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read
    case is unused anyway, and the write side is almost identical to
    DIO_NO_LOCKING. The difference is that DIO_NO_LOCKING always sets the
    create argument for the get_blocks callback to zero, but we can easily
    move that to the actual get_blocks callbacks. There are four users of the
    DIO_NO_LOCKING mode: gfs already ignores the create argument and thus is
    fine with the new version, ocfs2 only errors out if create were ever set,
    and we can remove this dead code now, the block device code only ever uses
    create for an error message if we are fully beyond the device which can
    never happen, and last but not least XFS will need the new behavour for
    writes.

    Now we can replace the lock_type variable with a flags one, where no flag
    means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
    flag. Separate out the check for not allowing to fill holes into a
    separate flag, although for now both flags always get set at the same
    time.

    Also revamp the documentation of the locking scheme to actually make
    sense.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Badari Pulavarty
    Cc: Jeff Moyer
    Cc: Jens Axboe
    Cc: Zach Brown
    Cc: Al Viro
    Cc: Alex Elder
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Intel reported a performance regression caused by the following commit:

    commit 848c4dd5153c7a0de55470ce99a8e13a63b4703f
    Author: Zach Brown
    Date: Mon Aug 20 17:12:01 2007 -0700

    dio: zero struct dio with kzalloc instead of manually

    This patch uses kzalloc to zero all of struct dio rather than
    manually trying to track which fields we rely on being zero. It
    passed aio+dio stress testing and some bug regression testing on
    ext3.

    This patch was introduced by Linus in the conversation that lead up
    to Badari's minimal fix to manually zero .map_bh.b_state in commit:

    6a648fa72161d1f6468dabd96c5d3c0db04f598a

    It makes the code a bit smaller. Maybe a couple fewer cachelines to
    load, if we're lucky:

    text data bss dec hex filename
    3285925 568506 1304616 5159047 4eb887 vmlinux
    3285797 568506 1304616 5158919 4eb807 vmlinux.patched

    I was unable to measure a stable difference in the number of cpu
    cycles spent in blockdev_direct_IO() when pushing aio+dio 256K reads
    at ~340MB/s.

    So the resulting intent of the patch isn't a performance gain but to
    avoid exposing ourselves to the risk of finding another field like
    .map_bh.b_state where we rely on zeroing but don't enforce it in the
    code.

    Zach surmised that zeroing out the page array was what caused most of
    the problem, and suggested the approach taken in the attached patch for
    resolving the issue. Intel re-tested with this patch and saw a 0.6%
    performance gain (the original regression was 0.5%).

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Jeff Moyer
    Acked-by: Zach Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

26 Nov, 2009

1 commit

  • There seems to be a regression in direct write path due to following
    commit in for-2.6.33 branch of block tree.

    commit 1af60fbd759d31f565552fea315c2033947cfbe6
    Author: Jeff Moyer
    Date: Fri Oct 2 18:56:53 2009 -0400

    block: get rid of the WRITE_ODIRECT flag

    Marking direct writes as WRITE_SYNC_PLUG instead of WRITE_ODIRECT, sets
    the NOIDLE flag in bio and hence in request. This tells CFQ to not expect
    more request from the queue and not idle on it (despite the fact that
    queue's think time is less and it is not seeky).

    So direct writers lose big time when competing with sequential readers.

    Using fio, I have run one direct writer and two sequential readers and
    following are the results with 2.6.32-rc7 kernel and with for-2.6.33
    branch.

    Test
    ====
    1 direct writer and 2 sequential reader running simultaneously.

    [global]
    directory=/mnt/sdc/fio/
    runtime=10

    [seqwrite]
    rw=write
    size=4G
    direct=1

    [seqread]
    rw=read
    size=2G
    numjobs=2

    2.6.32-rc7
    ==========
    direct writes: aggrb=2,968KB/s
    readers : aggrb=101MB/s

    for-2.6.33 branch
    =================
    direct write: aggrb=19KB/s
    readers aggrb=137MB/s

    This patch brings back the WRITE_ODIRECT flag, with the difference that we
    don't set the BIO_RW_UNPLUG flag so that device is not unplugged after
    submission of request and an explicit unplug from submitter is required.

    That way we fix the jeff's issue of not enough merging taking place in aio
    path as well as make sure direct writes get their fair share.

    After the fix
    =============
    for-2.6.33 + fix
    ----------------
    direct writes: aggrb=2,728KB/s
    reads: aggrb=103MB/s

    Thanks
    Vivek

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

28 Oct, 2009

2 commits

  • Hi,

    Some workloads issue batches of small I/O, and the performance is poor
    due to the call to blk_run_address_space for every single iocb. Nathan
    Roberts pointed this out, and suggested that by deferring this call
    until all I/Os in the iocb array are submitted to the block layer, we
    can realize some impressive performance gains (up to 30% for sequential
    4k reads in batches of 16).

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     
  • Hi,

    The WRITE_ODIRECT flag is only used in one place, and that code path
    happens to also call blk_run_address_space. The introduction of this
    flag, then, could result in the device being unplugged twice for every
    I/O.

    Further, with the batching changes in the next patch, we don't want an
    O_DIRECT write to imply a queue unplug.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

23 May, 2009

1 commit

  • Until now we have had a 1:1 mapping between storage device physical
    block size and the logical block sized used when addressing the device.
    With SATA 4KB drives coming out that will no longer be the case. The
    sector size will be 4KB but the logical block size will remain
    512-bytes. Hence we need to distinguish between the physical block size
    and the logical ditto.

    This patch renames hardsect_size to logical_block_size.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

15 Apr, 2009

1 commit


06 Apr, 2009

1 commit

  • By default, CFQ will anticipate more IO from a given io context if the
    previously completed IO was sync. This used to be fine, since the only
    sync IO was reads and O_DIRECT writes. But with more "normal" sync writes
    being used now, we don't want to anticipate for those.

    Add a bio/request flag that informs the IO scheduler that this is a sync
    request that we should not idle for. Introduce WRITE_ODIRECT specifically
    for O_DIRECT writes, and make sure that the other sync writes set this
    flag.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

07 Jan, 2009

1 commit

  • In case of error extending write may have instantiated a few blocks
    outside i_size. We need to trim these blocks. We have to do it
    *regardless* to blocksize. At least ext2, ext3 and reiserfs interpret
    (i_size < biggest block) condition as error. Fsck will complain about
    wrong i_size. Then fsck will fix the error by changing i_size according
    to the biggest block. This is bad because this blocks contain garbage
    from previous write attempt. And result in data corruption.

    ####TESTCASE_BEGIN
    $touch /mnt/test/BIG_FILE
    ## at this moment /mnt/test/BIG_FILE size and blocks equal to zero
    open("/mnt/test/BIG_FILE", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
    write(3, "aaaaaaaaaaaa"..., 104857600) = -1 ENOSPC (No space left on device)
    ## size and block sould't be changed because write op failed.
    $stat /mnt/test/BIG_FILE
    File: `/mnt/test/BIG_FILE'
    Size: 0 Blocks: 110896 IO Block: 1024 regular empty file
    <<<<<<<? yes
    Pass 2: Checking directory structure
    ....
    #####TESTCASE_ENDdiff --git a/fs/direct-io.c b/fs/direct-io.c
    index af0558d..4e88bea 100644

    [akpm@linux-foundation.org: use i_size_read()]
    Signed-off-by: Dmitri Monakhov
    Cc: Zach Brown
    Cc: Nick Piggin
    Cc: Badari Pulavarty
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitri Monakhov
     

17 Oct, 2008

1 commit


27 Jul, 2008

1 commit

  • Use get_user_pages_fast in the common/generic block and fs direct IO paths.

    Signed-off-by: Nick Piggin
    Cc: Dave Kleikamp
    Cc: Andy Whitcroft
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Andi Kleen
    Cc: Dave Kleikamp
    Cc: Badari Pulavarty
    Cc: Zach Brown
    Cc: Jens Axboe
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

06 Feb, 2008

1 commit

  • Simplify page cache zeroing of segments of pages through 3 functions

    zero_user_segments(page, start1, end1, start2, end2)

    Zeros two segments of the page. It takes the position where to
    start and end the zeroing which avoids length calculations and
    makes code clearer.

    zero_user_segment(page, start, end)

    Same for a single segment.

    zero_user(page, start, length)

    Length variant for the case where we know the length.

    We remove the zero_user_page macro. Issues:

    1. Its a macro. Inline functions are preferable.

    2. The KM_USER0 macro is only defined for HIGHMEM.

    Having to treat this special case everywhere makes the
    code needlessly complex. The parameter for zeroing is always
    KM_USER0 except in one single case that we open code.

    Avoiding KM_USER0 makes a lot of code not having to be dealing
    with the special casing for HIGHMEM anymore. Dealing with
    kmap is only necessary for HIGHMEM configurations. In those
    configurations we use KM_USER0 like we do for a series of other
    functions defined in highmem.h.

    Since KM_USER0 is depends on HIGHMEM the existing zero_user_page
    function could not be a macro. zero_user_* functions introduced
    here can be be inline because that constant is not used when these
    functions are called.

    Also extract the flushing of the caches to be outside of the kmap.

    [akpm@linux-foundation.org: fix nfs and ntfs build]
    [akpm@linux-foundation.org: fix ntfs build some more]
    Signed-off-by: Christoph Lameter
    Cc: Steven French
    Cc: Michael Halcrow
    Cc:
    Cc: Steven Whitehouse
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Anton Altaparmakov
    Cc: Mark Fasheh
    Cc: David Chinner
    Cc: Michael Halcrow
    Cc: Steven French
    Cc: Steven Whitehouse
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

17 Oct, 2007

1 commit

  • The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note

    A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
    (and thus mapcounted and count towards shared rss). These writes to
    the struct page could cause excessive cacheline bouncing on big
    systems. There are a number of ways this could be addressed if it is
    an issue.

    And indeed this cacheline bouncing has shown up on large SGI systems.
    There was a situation where an Altix system was essentially livelocked
    tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
    This situation can be avoided in userspace, but it does highlight the
    potential scalability problem with refcounting ZERO_PAGE, and corner
    cases where it can really hurt (we don't want the system to livelock!).

    There are several broad ways to fix this problem:
    1. add back some special casing to avoid refcounting ZERO_PAGE
    2. per-node or per-cpu ZERO_PAGES
    3. remove the ZERO_PAGE completely

    I will argue for 3. The others should also fix the problem, but they
    result in more complex code than does 3, with little or no real benefit
    that I can see.

    Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
    false optimisation: if an application is performance critical, it would
    not be doing many read faults of new memory, or at least it could be
    expected to write to that memory soon afterwards. If cache or memory use
    is critical, it should not be working with a significant number of
    ZERO_PAGEs anyway (a more compact representation of zeroes should be
    used).

    As a sanity check -- mesuring on my desktop system, there are never many
    mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
    increase much without it.

    When running a make -j4 kernel compile on my dual core system, there are
    about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
    ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
    is torn down without being COWed). So removing ZERO_PAGE will save 1,000
    page faults per second when running kbuild, while keeping it only saves
    less than 1 page clearing operation per second. 1 page clear is cheaper
    than a thousand faults, presumably, so there isn't an obvious loss.

    Neither the logical argument nor these basic tests give a guarantee of no
    regressions. However, this is a reasonable opportunity to try to remove
    the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
    we can reintroduce it and just avoid refcounting it.

    The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked. I don't see
    much use to them except on benchmarks. All other users of ZERO_PAGE are
    converted just to use ZERO_PAGE(0) for simplicity. We can look at
    replacing them all and maybe ripping out ZERO_PAGE completely when we are
    more satisfied with this solution.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus "snif" Torvalds

    Nick Piggin
     

10 Oct, 2007

1 commit

  • As bi_end_io is only called once when the reqeust is complete,
    the 'size' argument is now redundant. Remove it.

    Now there is no need for bio_endio to subtract the size completed
    from bi_size. So don't do that either.

    While we are at it, change bi_end_io to return void.

    Signed-off-by: Neil Brown
    Signed-off-by: Jens Axboe

    NeilBrown
     

21 Aug, 2007

1 commit

  • This patch uses kzalloc to zero all of struct dio rather than manually
    trying to track which fields we rely on being zero. It passed aio+dio
    stress testing and some bug regression testing on ext3.

    This patch was introduced by Linus in the conversation that lead up to
    Badari's minimal fix to manually zero .map_bh.b_state in commit:

    6a648fa72161d1f6468dabd96c5d3c0db04f598a

    It makes the code a bit smaller. Maybe a couple fewer cachelines to
    load, if we're lucky:

    text data bss dec hex filename
    3285925 568506 1304616 5159047 4eb887 vmlinux
    3285797 568506 1304616 5158919 4eb807 vmlinux.patched

    I was unable to measure a stable difference in the number of cpu cycles
    spent in blockdev_direct_IO() when pushing aio+dio 256K reads at
    ~340MB/s.

    So the resulting intent of the patch isn't a performance gain but to
    avoid exposing ourselves to the risk of finding another field like
    .map_bh.b_state where we rely on zeroing but don't enforce it in the
    code.

    Signed-off-by: Zach Brown
    Signed-off-by: Linus Torvalds

    Zach Brown
     

12 Aug, 2007

1 commit

  • Need to initialize map_bh.b_state to zero. Otherwise, in case of a faulty
    user-buffer its possible to go into dio_zero_block() and submit a page by
    mistake - since it checks for buffer_new().

    http://marc.info/?l=linux-kernel&m=118551339032528&w=2

    akpm: Linus had a (better) patch to just do a kzalloc() in there, but it got
    lost. Probably this version is better for -stable anwyay.

    Signed-off-by: Badari Pulavarty
    Acked-by: Joe Jin
    Acked-by: Zach Brown
    Cc: gurudas pai
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     

04 Jul, 2007

1 commit

  • Badari Pulavarty reported a case of this BUG_ON is triggering during
    testing. It's completely bogus and should be removed.

    It's trying to notice if we left references to the dio hanging around in
    the sync case. They should have been dropped as IO completed while this
    path was in dio_await_completion(). This condition will also be
    checked, via some twisty logic, by the BUG_ON(ret != -EIOCBQUEUED) a few
    lines lower. So to start this BUG_ON() is redundant.

    More fatally, it's dereferencing dio-> after having dropped its
    reference. It's only safe to dereference the dio after releasing the
    lock if the final reference was just dropped. Another CPU might free
    the dio in bio completion and reuse the memory after this path drops the
    dio lock but before the BUG_ON() is evaluated.

    This patch passed aio+dio regression unit tests and aio-stress on ext3.

    Signed-off-by: Zach Brown
    Cc: Badari Pulavarty
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zach Brown
     

10 May, 2007

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
    sound: convert "sound" subdirectory to UTF-8
    MAINTAINERS: Add cxacru website/mailing list
    include files: convert "include" subdirectory to UTF-8
    general: convert "kernel" subdirectory to UTF-8
    documentation: convert the Documentation directory to UTF-8
    Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
    remove broken URLs from net drivers' output
    Magic number prefix consistency change to Documentation/magic-number.txt
    trivial: s/i_sem /i_mutex/
    fix file specification in comments
    drivers/base/platform.c: fix small typo in doc
    misc doc and kconfig typos
    Remove obsolete fat_cvf help text
    Fix occurrences of "the the "
    Fix minor typoes in kernel/module.c
    Kconfig: Remove reference to external mqueue library
    Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
    Correct comments in genrtc.c to refer to correct /proc file.
    Fix more "deprecated" spellos.
    Fix "deprecated" typoes.
    ...

    Fix trivial comment conflict in kernel/relay.c.

    Linus Torvalds
     
  • It's very common for file systems to need to zero part or all of a page,
    the simplist way is just to use kmap_atomic() and memset(). There's
    actually a library function in include/linux/highmem.h that does exactly
    that, but it's confusingly named memclear_highpage_flush(), which is
    descriptive of *how* it does the work rather than what the *purpose* is.
    So this patchset renames the function to zero_user_page(), and calls it
    from the various places that currently open code it.

    This first patch introduces the new function call, and converts all the
    core kernel callsites, both the open-coded ones and the old
    memclear_highpage_flush() ones. Following this patch is a series of
    conversions for each file system individually, per AKPM, and finally a
    patch deprecating the old call. The diffstat below shows the entire
    patchset.

    [akpm@linux-foundation.org: fix a few things]
    Signed-off-by: Nate Diller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nate Diller
     

09 May, 2007

1 commit


11 Dec, 2006

6 commits

  • The wait_for_more_bios() function name was poorly chosen. While looking to
    clean it up it I noticed that the dio struct refcounting between the bio
    completion and dio submission paths was racey.

    The bio submission path was simply freeing the dio struct if
    atomic_dec_and_test() indicated that it dropped the final reference.

    The aio bio completion path was dereferencing its dio struct pointer *after
    dropping its reference* based on the remaining number of references.

    These two paths could race and result in the aio bio completion path
    dereferencing a freed dio, though this was not observed in the wild.

    This moves the refcount under the bio lock so that bio completion can drop
    its reference and decide to wake all in one atomic step.

    Once testing and waking is locked dio_await_one() can test its sleeping
    condition and mark itself uninterruptible under the lock. It gets simpler
    and wait_for_more_bios() disappears.

    The addition of the interrupt masking spin lock acquiry in dio_bio_submit()
    looks alarming. This lock acquiry existed in that path before the recent
    dio completion patch set. We shouldn't expect significant performance
    regression from returning to the behaviour that existed before the
    completion clean up work.

    This passed 4k block ext3 O_DIRECT fsx and aio-stress on an SMP machine.

    Signed-off-by: Zach Brown
    Cc: Badari Pulavarty
    Cc: Suparna Bhattacharya
    Cc: Jeff Moyer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zach Brown
     
  • The only time it is safe to call aio_complete() is when the ->ki_retry
    function returns -EIOCBQUEUED to the AIO core. direct_io_worker() has
    historically done this by relying on its caller to translate positive return
    codes into -EIOCBQUEUED for the aio case. It did this by trying to keep
    conditionals in sync. direct_io_worker() knew when finished_one_bio() was
    going to call aio_complete(). It would reverse the test and wait and free the
    dio in the cases it thought that finished_one_bio() wasn't going to.

    Not surprisingly, it ended up getting it wrong. 'ret' could be a negative
    errno from the submission path but it failed to communicate this to
    finished_one_bio(). direct_io_worker() would return < 0, it's callers
    wouldn't raise -EIOCBQUEUED, and aio_complete() would be called. In the
    future finished_one_bio()'s tests wouldn't reflect this and aio_complete()
    would be called for a second time which can manifest as an oops.

    The previous cleanups have whittled the sync and async completion paths down
    to the point where we can collapse them and clearly reassert the invariant
    that we must only call aio_complete() after returning -EIOCBQUEUED.
    direct_io_worker() will only return -EIOCBQUEUED when it is not the last to
    drop the dio refcount and the aio bio completion path will only call
    aio_complete() when it is the last to drop the dio refcount.
    direct_io_worker() can ensure that it is the last to drop the reference count
    by waiting for bios to drain. It does this for sync ops, of course, and for
    partial dio writes that must fall back to buffered and for aio ops that saw
    errors during submission.

    This means that operations that end up waiting, even if they were issued as
    aio ops, will not call aio_complete() from dio. Instead we return the return
    code of the operation and let the aio core call aio_complete(). This is
    purposely done to fix a bug where AIO DIO file extensions would call
    aio_complete() before their callers have a chance to update i_size.

    Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers
    no longer have to translate for it. XFS needs to be careful not to free
    resources that will be used during AIO completion if -EIOCBQUEUED is returned.
    We maintain the previous behaviour of trying to write fs metadata for O_SYNC
    aio+dio writes.

    Signed-off-by: Zach Brown
    Cc: Badari Pulavarty
    Cc: Suparna Bhattacharya
    Acked-by: Jeff Moyer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zach Brown
     
  • Now that we have a single refcount and waiting path we can reuse it in the
    async 'should_wait' path. It continues to rely on the fragile link between
    the conditional in dio_complete_aio() which decides to complete the AIO and
    the conditional in direct_io_worker() which decides to wait and free.

    By waiting before dropping the reference we stop dio_bio_end_aio() from
    calling dio_complete_aio() which used to wake up the waiter after seeing the
    reference count drop to 0. We hoist this wake up into dio_bio_end_aio() which
    now notices when it's left a single remaining reference that is held by the
    waiter.

    Signed-off-by: Zach Brown
    Cc: Badari Pulavarty
    Cc: Suparna Bhattacharya
    Acked-by: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zach Brown
     
  • Previously we had two confusing counts of bio progress. 'bio_count' was
    decremented as bios were processed and freed by the dio core. It was used to
    indicate final completion of the dio operation. 'bios_in_flight' reflected
    how many bios were between submit_bio() and bio->end_io. It was used by the
    sync path to decide when to wake up and finish completing bios and was ignored
    by the async path.

    This patch collapses the two notions into one notion of a dio reference count.
    bios hold a dio reference when they're between submit_bio and bio->end_io.

    Since bios_in_flight was only used in the sync path it is now equivalent to
    dio->refcount - 1 which accounts for direct_io_worker() holding a reference
    for the duration of the operation.

    dio_bio_complete() -> finished_one_bio() was called from the sync path after
    finding bios on the list that the bio->end_io function had deposited.
    finished_one_bio() can not drop the dio reference on behalf of these bios now
    because bio->end_io already has. The is_async test in finished_one_bio()
    meant that it never actually did anything other than drop the bio_count for
    sync callers. So we remove its refcount decrement, don't call it from
    dio_bio_complete(), and hoist its call up into the async dio_bio_complete()
    caller after an explicit refcount decrement. It is renamed dio_complete_aio()
    to reflect the remaining work it actually does.

    Signed-off-by: Zach Brown
    Cc: Badari Pulavarty
    Cc: Suparna Bhattacharya
    Acked-by: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zach Brown
     
  • We only need to call blk_run_address_space() once after all the bios for the
    direct IO op have been submitted. This removes the chance of calling
    blk_run_address_space() after spurious wake ups as the sync path waits for
    bios to drain. It's also one less difference betwen the sync and async paths.

    In the process we remove a redundant dio_bio_submit() that its caller had
    already performed.

    Signed-off-by: Zach Brown
    Cc: Badari Pulavarty
    Cc: Suparna Bhattacharya
    Acked-by: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zach Brown
     
  • There have been a lot of bugs recently due to the way direct_io_worker() tries
    to decide how to finish direct IO operations. In the worst examples it has
    failed to call aio_complete() at all (hang) or called it too many times
    (oops).

    This set of patches cleans up the completion phase with the goal of removing
    the complexity that lead to these bugs. We end up with one path that
    calculates the result of the operation after all off the bios have completed.
    We decide when to generate a result of the operation using that path based on
    the final release of a refcount on the dio structure.

    I tried to progress towards the final state in steps that were relatively easy
    to understand. Each step should compile but I only tested the final result of
    having all the patches applied.

    I've tested these on low end PC drives with aio-stress, the direct IO tests I
    could manage to get running in LTP, orasim, and some home-brew functional
    tests.

    In http://lkml.org/lkml/2006/9/21/103 IBM reports success with ext2 and ext3
    running DIO LTP tests. They found that XFS bug which has since been addressed
    in the patch series.

    This patch:

    The mechanics which decide the result of a direct IO operation were duplicated
    in the sync and async paths.

    The async path didn't check page_errors which can manifest as silently
    returning success when the final pointer in an operation faults and its
    matching file region is filled with zeros.

    The sync path and async path differed in whether they passed errors to the
    caller's dio->end_io operation. The async path was passing errors to it which
    trips an assertion in XFS, though it is apparently harmless.

    This centralizes the completion phase of dio ops in one place. AIO will now
    return EFAULT consistently and all paths fall back to the previously sync
    behaviour of passing the number of bytes 'transferred' to the dio->end_io
    callback, regardless of errors.

    dio_await_completion() doesn't have to propogate EIO from non-uptodate bios
    now that it's being propogated through dio_complete() via dio->io_error. This
    lets it return void which simplifies its sole caller.

    Signed-off-by: Zach Brown
    Cc: Badari Pulavarty
    Cc: Suparna Bhattacharya
    Acked-by: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zach Brown