08 Mar, 2011

1 commit


10 Nov, 2010

1 commit

  • REQ_HARDBARRIER is dead now, so remove the leftovers. What's left
    at this point is:

    - various checks inside the block layer.
    - sanity checks in bio based drivers.
    - now unused bio_empty_barrier helper.
    - Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it has been dead for a
    while, but Xen really needs to sort out its barrier situation.
    - setting of ordered tags in uas - dead code copied from old scsi
    drivers.
    - scsi different retry for barriers - it's dead and should have been
    removed when flushes were converted to FS requests.
    - blktrace handling of barriers - removed. Someone who knows blktrace
    better should add support for REQ_FLUSH and REQ_FUA, though.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

21 Oct, 2010

1 commit


11 Sep, 2010

1 commit

  • Some controllers have a hardware limit on the number of protection
    information scatter-gather list segments they can handle.

    Introduce a max_integrity_segments limit in the block layer and provide
    a new scsi_host_template setting that allows HBA drivers to provide a
    value suitable for the hardware.

    Add support for honoring the integrity segment limit when merging both
    bios and requests.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

08 Aug, 2010

3 commits

  • linux/fs.h hard coded READ/WRITE constants which should match BIO_RW_*
    flags. This is fragile and caused breakage during BIO_RW_* flag
    rearrangement. The hardcoding is to avoid include dependency hell.

    Create linux/bio_types.h which contains definitions for bio data
    structures and flags and include it from bio.h and fs.h, and make fs.h
    define all READ/WRITE related constants in terms of BIO_RW_* flags.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • SCSI-ml needs a way to mark a request as flush request in
    q->prepare_flush_fn because it needs to identify them later (e.g. in
    q->request_fn or prep_rq_fn).

    queue_flush sets REQ_HARDBARRIER in rq->cmd_flags however the block
    layer also sends normal REQ_TYPE_FS requests with REQ_HARDBARRIER. So
    SCSI-ml can't use REQ_HARDBARRIER to identify flush requests.

    We could change the block layer to clear the REQ_HARDBARRIER bit before
    sending non-flush requests to the lower layers. However, introducing
    the new flag looks cleaner (and is surely easier).

    Signed-off-by: FUJITA Tomonori
    Cc: James Bottomley
    Cc: David S. Miller
    Cc: Rusty Russell
    Cc: Alasdair G Kergon
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    FUJITA Tomonori
     
  • Remove the current bio flags and reuse the request flags for the bio, too.
    This makes it easier to trace the type of I/O from the filesystem
    down to the block driver. There were two flags in the bio that were
    missing from the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
    renamed two request flags that had a superfluous RW in them.

    Note that the flags are in bio.h despite having the REQ_ name - as
    blkdev.h includes bio.h that is the only way to go for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

26 Nov, 2009

1 commit

  • The mtdblock driver doesn't call flush_dcache_page for pages in a
    request, which causes problems on architectures where the icache
    doesn't fill from the dcache or with dcache aliases. The patch fixes
    this.

    The ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE symbol was introduced to avoid
    pointless empty cache-thrashing loops on architectures for which
    flush_dcache_page() is a no-op. The new helpers flush pages on
    architectures where ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE is equal to 1
    and do nothing otherwise.

    See "fix mtd_blkdevs problem with caches on some architectures" discussion
    on LKML for more information.

    Signed-off-by: Ilya Loginov
    Cc: Ingo Molnar
    Cc: David Woodhouse
    Cc: Peter Horton
    Cc: "Ed L. Cashin"
    Signed-off-by: Jens Axboe

    Ilya Loginov
     

02 Nov, 2009

1 commit


11 Sep, 2009

3 commits

  • Get rid of any functions that test for these bits and make callers
    use bio_rw_flagged() directly. Then it is at least directly apparent
    what variable and flag they check.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Makes for a saner interface, instead of returning the bit position.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • bio and request use the same set of failfast bits. This patch makes
    the following changes to simplify things.

    * enumify BIO_RW* bits and reorder bits such that BIO_RW_FAILFAST_*
    bits coincide with __REQ_FAILFAST_* bits.

    * The above pushes BIO_RW_AHEAD out of sync with __REQ_FAILFAST_DEV
    but the matching is useless anyway. init_request_from_bio() is
    responsible for setting FAILFAST bits on FS requests and non-FS
    requests never use BIO_RW_AHEAD. Drop the code and comment from
    blk_rq_bio_prep().

    * Define REQ_FAILFAST_MASK which is OR of all FAILFAST bits and
    simplify FAILFAST flags handling in init_request_from_bio().

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

01 Jul, 2009

1 commit

  • This patch restores stacking ability to the block layer integrity
    infrastructure by creating a set of dedicated bip slabs. Each bip slab
    has an embedded bio_vec array at the end. This cuts down on memory
    allocations and also simplifies the code compared to the original bvec
    version. Only the largest bip slab is backed by a mempool. The pool is
    contained in the bio_set so stacking drivers can ensure forward
    progress.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

15 Jun, 2009

1 commit

  • Introduce bio_list_peek(), to obtain a pointer to the first bio on the bio_list
    without actually removing it from the list. This is needed when you want to
    serialize based on the list being empty or not.

    Signed-off-by: Geert Uytterhoeven
    Acked-by: Jens Axboe
    Signed-off-by: Benjamin Herrenschmidt

    Geert Uytterhoeven
     

23 May, 2009

1 commit


11 May, 2009

1 commit

  • struct request has had a few different ways to represent some
    properties of a request. ->hard_* represent block layer's view of the
    request progress (completion cursor) and the ones without the prefix
    are supposed to represent the issue cursor and allowed to be updated
    as necessary by the low level drivers. The thing is that as the block
    layer supports partial completion, the two cursors really aren't
    necessary and only cause confusion. In addition, manual management of
    request detail from low level drivers is cumbersome and error-prone at
    the very least.

    Another interesting set of duplicate fields is rq->[hard_]nr_sectors and
    rq->{hard_cur|current}_nr_sectors versus rq->data_len and
    rq->bio->bi_size. This is more convoluted than the hard_ case.

    rq->[hard_]nr_sectors are initialized for requests with bio but
    blk_rq_bytes() uses it only for !pc requests. rq->data_len is
    initialized for all requests but blk_rq_bytes() uses it only for pc
    requests. This causes a good amount of confusion throughout the block layer
    and its drivers and determining the request length has been a bit of
    black magic which may or may not work depending on circumstances and
    what the specific LLD is actually doing.

    rq->{hard_cur|current}_nr_sectors represent the number of sectors in
    the contiguous data area at the front. This is mainly used by drivers
    which transfer data by walking the request segment-by-segment. This
    value always equals rq->bio->bi_size >> 9. However, the data length for
    pc requests may not be a multiple of 512 bytes, and using this field
    becomes a bit confusing.

    In general, having multiple fields to represent the same property
    leads only to confusion and subtle bugs. With recent block low level
    driver cleanups, no driver is accessing or manipulating these
    duplicate fields directly. Drop all the duplicates. Now rq->sector
    means the current sector, rq->data_len the current total length and
    rq->bio->bi_size the current segment length. Everything else is
    defined in terms of these three and available only through accessors.

    * blk_recalc_rq_sectors() is collapsed into blk_update_request() and
    now handles pc and fs requests equally other than rq->sector update.
    This means that pc requests can now use partial completion too (no
    in-kernel user yet, though).

    * bio_cur_sectors() is replaced with bio_cur_bytes() as block layer
    now uses byte count as the primary data length.

    * blk_rq_pos() is now guaranteed to be always correct. In-block users
    converted.

    * blk_rq_bytes() is now guaranteed to be always valid as is
    blk_rq_sectors(). In-block users converted.

    * blk_rq_sectors() is now guaranteed to equal blk_rq_bytes() >> 9.
    Whichever is more convenient is used.

    * blk_rq_bytes() and blk_rq_cur_bytes() are now inlined and take const
    pointer to request.

    [ Impact: API cleanup, single way to represent one property of a request ]

    Signed-off-by: Tejun Heo
    Cc: Boaz Harrosh
    Signed-off-by: Jens Axboe

    Tejun Heo
     

28 Apr, 2009

1 commit


22 Apr, 2009

1 commit

  • Impact: fix bio_kmalloc() and its destruction path

    bio_kmalloc() was broken in two ways.

    * bvec_alloc_bs() first allocates bvec using kmalloc() and then
    ignores it and allocates again like non-kmalloc bvecs.

    * bio_kmalloc_destructor() didn't check for and free bio integrity
    data.

    This patch fixes the above problems. The kmalloc path is separated out
    from bio_alloc_bioset() and allocates the requested number of bvecs as
    inline bvecs.

    * bio_alloc_bioset() no longer takes NULL @bs. None other than
    bio_kmalloc() used it and outside users can't know how it was
    allocated anyway.

    * Define and use BIO_POOL_NONE so that pool index check in
    bvec_free_bs() triggers if inline or kmalloc allocated bvec gets
    there.

    * Relocate destructors on top of each allocation function so that how
    they're used is more clear.

    Jens Axboe suggested allocating bvecs inline.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Apr, 2009

1 commit


06 Apr, 2009

1 commit

  • By default, CFQ will anticipate more IO from a given io context if the
    previously completed IO was sync. This used to be fine, since the only
    sync IO was reads and O_DIRECT writes. But with more "normal" sync writes
    being used now, we don't want to anticipate for those.

    Add a bio/request flag that informs the IO scheduler that this is a sync
    request that we should not idle for. Introduce WRITE_ODIRECT specifically
    for O_DIRECT writes, and make sure that the other sync writes set this
    flag.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

24 Mar, 2009

1 commit


15 Mar, 2009

1 commit


18 Feb, 2009

1 commit


02 Feb, 2009

2 commits


30 Jan, 2009

4 commits


29 Dec, 2008

6 commits

  • When we go and allocate a bio for IO, we actually do two allocations.
    One for the bio itself, and one for the bi_io_vec that holds the
    actual pages we are interested in.

    This feature inlines a definable amount of io vecs inside the bio
    itself, so we eliminate the bio_vec array allocation for IO's up
    to a certain size. It defaults to 4 vecs, which is typically 16k
    of IO.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Instead of having a global bio slab cache, add a reference to one
    in each bio_set that is created. This allows for personalized slabs
    in each bio_set, so that they can have bios of different sizes.

    This means we can personalize the bios we return. File systems may
    want to embed the bio inside another structure, to avoid allocating
    more items (and stuffing them in ->bi_private) after they get a bio.
    Or we may want to embed a number of bio_vecs directly at the end
    of a bio, to avoid doing two allocations to return a bio. This is now
    possible.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • In preparation for adding differently sized bios.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
    We only very rarely need the mempool backing, so it makes sense to
    get rid of all but one of the mempools in a bio_set. So keep the
    largest bio_vec count mempool so we can always honor the largest
    allocation, and "upgrade" callers that fail.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Remove 8 bytes of padding from struct bio which also removes 16 bytes from
    struct bio_pair to make it 248 bytes. bio_pair then fits into one fewer
    cache lines & into a smaller slab.

    Signed-off-by: Richard Kennedy
    Signed-off-by: Jens Axboe

    Richard Kennedy
     
  • Allow the scsi request REQ_QUIET flag to be propagated to the buffer
    file system layer. The basic ideas is to pass the flag from the scsi
    request to the bio (block IO) and then to the buffer layer. The buffer
    layer can then suppress needless printks.

    This patch declutters the kernel log by removing the 40-50 (per LUN)
    buffer I/O error messages seen during a boot in my multipath setup.
    There is a good chance any real errors will be missed in the "noise"
    in the logs without this patch.

    During boot I see blocks of messages like
    "
    __ratelimit: 211 callbacks suppressed
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242847
    Buffer I/O error on device sdm, logical block 1
    Buffer I/O error on device sdm, logical block 5242878
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242872
    "
    in my logs.

    My disk environment is multipath fiber channel using the SCSI_DH_RDAC
    code and multipathd. This topology includes an "active" and "ghost"
    path for each LUN. I/Os to the "ghost" path will never complete, and the
    SCSI layer, via the scsi device handler rdac code, quickly returns the
    I/Os to these paths and sets the REQ_QUIET scsi flag to suppress the
    scsi layer messages.

    I want to extend the QUIET behavior to include the buffer file
    system layer to deal with these errors as well. I have been running this
    patch for a while now on several boxes without issue. A few runs of
    bonnie++ show no noticeable difference in performance in my setup.

    Thanks to John Stultz for the quiet_error finalization.

    Submitted-by: Keith Mannthey
    Signed-off-by: Jens Axboe

    Keith Mannthey
     

06 Nov, 2008

1 commit


18 Oct, 2008

1 commit

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: remove __generic_unplug_device() from exports
    block: move q->unplug_work initialization
    blktrace: pass zfcp driver data
    blktrace: add support for driver data
    block: fix current kernel-doc warnings
    block: only call ->request_fn when the queue is not stopped
    block: simplify string handling in elv_iosched_store()
    block: fix kernel-doc for blk_alloc_devt()
    block: fix nr_phys_segments miscalculation bug
    block: add partition attribute for partition number
    block: add BIG FAT WARNING to CONFIG_DEBUG_BLOCK_EXT_DEVT
    softirq: Add support for triggering softirq work on softirqs.

    Linus Torvalds
     

17 Oct, 2008

1 commit

  • This fixes the bug reported by Nikanth Karthikesan :

    http://lkml.org/lkml/2008/10/2/203

    The root cause of the bug is that blk_phys_contig_segment
    miscalculates q->max_segment_size.

    blk_phys_contig_segment checks:

    req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size

    But blk_recalc_rq_segments might expect that req->biotail and the
    previous bio in the req are supposed to be merged into one
    segment. blk_recalc_rq_segments might also expect that next_req->bio
    and the next bio in the next_req are supposed to be merged into one
    segment. In such a case, we merge two requests that can't be merged
    here. Later, blk_rq_map_sg gives more segments than it should.

    We need to keep track of segment size in blk_recalc_rq_segments and
    use it to see if two requests can be merged. This patch implements it
    in a similar way to what we used to do for hw merging (virtual
    merging).

    Signed-off-by: FUJITA Tomonori
    Signed-off-by: Jens Axboe

    FUJITA Tomonori
     

13 Oct, 2008

1 commit

  • Multipath is best at handling transport errors. If it gets a device
    error then there is not much the multipath layer can do. It will just
    access the same device but from a different path.

    This patch breaks up failfast into device, transport and driver errors.
    The multipath layers (md and dm multipath) only ask the lower levels to
    fast fail transport errors. The user of failfast, read ahead, will ask
    to fast fail on all errors.

    Note that blk_noretry_request will return true if any failfast bit
    is set. This allows drivers that do not support the multipath failfast
    bits to continue to fail on any failfast error like before. Drivers
    like scsi that are able to fail fast specific errors can check
    for the specific fail fast type. In the next patch I will convert
    scsi.

    Signed-off-by: Mike Christie
    Cc: Jens Axboe
    Signed-off-by: James Bottomley

    Mike Christie
     

09 Oct, 2008

1 commit