16 Apr, 2011

1 commit

  • Linus correctly observes that the most important dispatch cases
    are now done from kblockd; this isn't ideal for latency reasons.
    The original reason for switching dispatches out-of-line was to
    avoid too deep a stack, so by _only_ letting the "accidental"
    flush directly in schedule() be guarded by offload to kblockd,
    we should be able to get the best of both worlds.

    So add a blk_schedule_flush_plug() that offloads to kblockd,
    and only use that from the schedule() path.
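
    A minimal sketch of the new helper's shape, assuming the
    blk_flush_plug_list(plug, from_schedule) interface this series uses:

    static inline void blk_schedule_flush_plug(struct task_struct *tsk)
    {
            struct blk_plug *plug = tsk->plug;

            if (plug)
                    blk_flush_plug_list(plug, true); /* true: punt to kblockd */
    }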

    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Apr, 2011

2 commits

  • For the explicit unplugging, we'd prefer to kick things off
    immediately and not pay the penalty of the latency to switch
    to kblockd. So let blk_finish_plug() do the run inline, while
    the implicit-on-schedule-out unplug will punt to kblockd.
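
    A sketch of the resulting split, assuming the same
    blk_flush_plug_list(plug, from_schedule) interface:

    void blk_finish_plug(struct blk_plug *plug)
    {
            /* Explicit unplug: run the queue inline, no kblockd latency. */
            blk_flush_plug_list(plug, false);

            if (plug == current->plug)
                    current->plug = NULL;
    }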

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • It's a bit of a mess currently. task->plug is being cleared
    and reset in __blk_finish_plug(), and blk_finish_plug() is
    testing for a NULL plug which cannot happen even from schedule()
    anymore since it uses blk_needs_flush_plug() to determine
    whether to call into this function at all.

    So get rid of some of the cruft.
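
    For reference, a sketch of the check schedule() relies on, assuming the
    plug keeps its pending requests on an embedded list:

    static inline bool blk_needs_flush_plug(struct task_struct *tsk)
    {
            struct blk_plug *plug = tsk->plug;

            return plug && !list_empty(&plug->list);
    }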

    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

06 Apr, 2011

1 commit

  • The current block integrity (DIF/DIX) support in DM is verifying that
    all devices' integrity profiles match during DM device resume (which
    is past the point of no return). To some degree that is unavoidable
    (stacked DM devices force this late checking). But for most DM
    devices (which aren't stacking on other DM devices) the ideal time to
    verify all integrity profiles match is during table load.

    Introduce the notion of an "initialized" integrity profile: a profile
    that was blk_integrity_register()'d with a non-NULL 'blk_integrity'
    template. Add blk_integrity_is_initialized() to allow checking if a
    profile was initialized.
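
    A sketch of how the table-load check below might use the new helper,
    alongside the existing blk_integrity_compare() (the gendisk variable
    names are illustrative):

    /* Compare only devices whose profiles were registered with a
     * template; uninitialized ones are settled later, at resume. */
    if (!blk_integrity_is_initialized(dd_disk))
            continue;
    if (blk_integrity_compare(template_disk, dd_disk) < 0)
            return -EINVAL; /* mismatch caught before the point of no return */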

    Update DM integrity support to:
    - check all devices with _initialized_ integrity profiles match
    during table load; uninitialized profiles (e.g. for underlying DM
    device(s) of a stacked DM device) are ignored.
    - disallow a table load that would result in an integrity profile that
    conflicts with a DM device's existing (in-use) integrity profile
    - avoid clearing an existing integrity profile
    - validate all integrity profiles match during resume; but if they
    don't all we can do is report the mismatch (during resume we're past
    the point of no return)

    Signed-off-by: Mike Snitzer
    Cc: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

10 Mar, 2011

4 commits

  • Conflicts:
    block/blk-core.c
    block/blk-flush.c
    drivers/md/raid1.c
    drivers/md/raid10.c
    drivers/md/raid5.c
    fs/nilfs2/btnode.c
    fs/nilfs2/mdt.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So let's kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This patch adds support for creating a queuing context outside
    of the queue itself. This enables us to batch up pieces of IO
    before grabbing the block device queue lock and submitting them to
    the IO scheduler.

    The context is created on the stack of the process and assigned in
    the task structure, so that we can auto-unplug it if we hit a schedule
    event.

    The current queue plugging happens implicitly if IO is submitted to
    an empty device, yet callers have to remember to unplug that IO when
    they are going to wait for it. This is an ugly API and has caused bugs
    in the past. Additionally, it requires hacks in the vm (->sync_page()
    callback) to handle that logic. By switching to an explicit plugging
    scheme we make the API a lot nicer and can get rid of the ->sync_page()
    hack in the vm.
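
    A sketch of the explicit API from the submitter's side (bio setup
    elided; submit_bio() shown in its two-argument form of the day):

    struct blk_plug plug;

    blk_start_plug(&plug);   /* context lives on this stack frame */
    submit_bio(READ, bio_a); /* batched in the plug, not yet dispatched */
    submit_bio(READ, bio_b);
    blk_finish_plug(&plug);  /* explicit unplug pushes the batch down */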

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Currently we use plugging for that, but as plugging is going away,
    we need an alternative mechanism.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Mar, 2011

1 commit

  • This merge creates two sets of conflicts. One is simple context
    conflicts caused by removal of throtl_scheduled_delayed_work() in
    for-linus and removal of throtl_shutdown_timer_wq() in
    for-2.6.39/core.

    The other is caused by commit 255bb490c8 (block: blk-flush shouldn't
    call directly into q->request_fn() __blk_run_queue()) in for-linus
    clashing with the FLUSH reimplementation in for-2.6.39/core. The conflict
    isn't trivial but the resolution is straightforward.

    * __blk_run_queue() calls in flush_end_io() and flush_data_end_io()
    should be made with @force_kblockd set to %true.

    * elv_insert() in blk_kick_flush() should use
    %ELEVATOR_INSERT_REQUEUE.

    Both changes are to avoid invoking ->request_fn() directly from
    request completion path and closely match the changes in the commit
    255bb490c8.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

03 Mar, 2011

1 commit

  • Move blk_throtl_exit() in blk_cleanup_queue() as blk_throtl_exit() is
    written in such a way that it needs queue lock. In blk_release_queue()
    there is no guarantee that ->queue_lock is still around.

    Initially blk_throtl_exit() was in blk_cleanup_queue() but Ingo reported
    one problem.

    https://lkml.org/lkml/2010/10/23/86

    And a quick fix moved blk_throtl_exit() to blk_release_queue().

    commit 7ad58c028652753814054f4e3ac58f925e7343f4
    Author: Jens Axboe
    Date: Sat Oct 23 20:40:26 2010 +0200

    block: fix use-after-free bug in blk throttle code

    This patch reverts above change and does not try to shutdown the
    throtl work in blk_sync_queue(). By avoiding call to
    throtl_shutdown_timer_wq() from blk_sync_queue(), we should also avoid
    the problem reported by Ingo.

    blk_sync_queue() seems to be used only by the md driver, which seems
    to use it to make sure q->unplug_fn is not called, as md registers its
    own unplug functions and is about to free up the data structures
    used by unplug_fn(). Block throttle does not call back into unplug_fn()
    or into md, so there is no need to cancel blk throttle work.

    In fact I think cancelling block throttle work is harmful: some bios
    might be throttled and scheduled to be dispatched later by the pending
    work, and if that work is cancelled, these bios might never be
    dispatched.

    Block layer also uses blk_sync_queue() during blk_cleanup_queue() and
    blk_release_queue() time. That should be safe as we are also calling
    blk_throtl_exit() which should make sure all the throttling related
    data structures are cleaned up.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

02 Mar, 2011

3 commits

  • __blk_run_queue() automatically either calls q->request_fn() directly
    or schedules kblockd, depending on whether it is being called recursively.
    blk-flush implementation needs to be able to explicitly choose
    kblockd. Add @force_kblockd.

    All the current users are converted to specify %false for the
    parameter and this patch doesn't introduce any behavior change.

    stable: This is prerequisite for fixing ide oops caused by the new
    blk-flush implementation.
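
    The resulting calling convention, sketched:

    void __blk_run_queue(struct request_queue *q, bool force_kblockd);

    __blk_run_queue(q, false); /* existing callers: behavior unchanged */
    __blk_run_queue(q, true);  /* blk-flush completion: always via kblockd */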

    Signed-off-by: Tejun Heo
    Cc: Jan Beulich
    Cc: James Bottomley
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Conflicts:
    block/cfq-iosched.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • o Dominik Klein reported a system hang issue while doing some blkio
    throttling testing.

    https://lkml.org/lkml/2011/2/24/173

    o Some tracing revealed that CFQ was not dispatching any more jobs as
    queue unplug was not happening. And queue unplug was not happening
    because the unplug work was not being called, as there was one throttling
    work item on the same CPU which had not finished yet. And the throttling
    work had not finished because it was trying to dispatch a bio to CFQ, but
    all the request descriptors were consumed, so it was put to sleep.

    o So basically it is a cyclic dependency between the CFQ unplug work and
    the throtl dispatch work. Tejun suggested using a separate workqueue for
    such cases.

    o This patch uses a separate workqueue for throttle related work and
    does not rely on kblockd workqueue anymore.
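
    A sketch of the dedicated workqueue (the name is illustrative):

    static struct workqueue_struct *kthrotld_workqueue;

    static int __init throtl_init(void)
    {
            /* A separate queue, so throttle work can no longer be stuck
             * behind CFQ's unplug work on kblockd. */
            kthrotld_workqueue = alloc_workqueue("kthrotld", WQ_MEM_RECLAIM, 0);
            if (!kthrotld_workqueue)
                    panic("Failed to create kthrotld\n");
            return 0;
    }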

    Cc: stable@kernel.org
    Reported-by: Dominik Klein
    Signed-off-by: Vivek Goyal
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

11 Feb, 2011

1 commit

  • Flush requests are never put on the IO scheduler. Convert request
    structure's elevator_private* into an array and have the flush fields
    share a union with it.

    Reclaim the space lost in 'struct request' by moving 'completion_data'
    back in the union with 'rb_node'.
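
    An abridged sketch of the reworked fields in 'struct request':

    union {
            struct rb_node rb_node;    /* sort/lookup in the IO scheduler */
            void *completion_data;     /* reclaimed: back in this union */
    };

    union {
            void *elevator_private[3]; /* iosched data; unused by flushes */
            struct {
                    unsigned int seq;      /* flush sequence state */
                    struct list_head list; /* pending-flush linkage */
            } flush;
    };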

    Signed-off-by: Mike Snitzer
    Acked-by: Vivek Goyal
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

25 Jan, 2011

1 commit

  • The current FLUSH/FUA support has evolved from the implementation
    which had to perform queue draining. As such, sequencing is done
    queue-wide one flush request after another. However, with the
    draining requirement gone, there's no reason to keep the queue-wide
    sequential approach.

    This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
    request is sequenced individually. The actual FLUSH execution is
    double buffered and whenever a request wants to execute one for either
    PRE or POSTFLUSH, it queues on the pending queue. Once certain
    conditions are met, a flush request is issued and on its completion
    all pending requests proceed to the next sequence.
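
    A sketch of the double buffering described above, assuming two pending
    lists indexed by toggling bits in the request_queue:

    /* Requests pend on flush_queue[flush_pending_idx] while the flush
     * for flush_queue[flush_running_idx] is in flight; when it completes
     * the indices flip and the pending batch proceeds. */
    struct list_head        flush_queue[2];
    unsigned int            flush_pending_idx:1;
    unsigned int            flush_running_idx:1;
    struct request          flush_rq;       /* the proxy FLUSH request */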

    This allows arbitrary merging of different types of flushes. How they
    are merged can be primarily controlled and tuned by adjusting the
    aforementioned 'conditions' used to determine when to issue the next
    flush.

    This is inspired by Darrick's patches to merge multiple zero-data
    flushes which helps workloads with highly concurrent fsync requests.

    * As flush requests are never put on the IO scheduler, request fields
    used for flush share space with rq->rb_node. rq->completion_data is
    moved out of the union. This increases the request size by one
    pointer.

    As rq->elevator_private* are used only by the iosched too, it is
    possible to reduce the request size further. However, to do that,
    we need to modify request allocation path such that iosched data is
    not allocated for flush requests.

    * FLUSH/FUA processing happens on insertion now instead of dispatch.

    - Comments updated as per Vivek and Mike.

    Signed-off-by: Tejun Heo
    Cc: "Darrick J. Wong"
    Cc: Shaohua Li
    Cc: Christoph Hellwig
    Cc: Vivek Goyal
    Cc: Mike Snitzer
    Signed-off-by: Jens Axboe

    Tejun Heo
     

14 Jan, 2011

1 commit

  • * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
    block: ensure that completion error gets properly traced
    blktrace: add missing probe argument to block_bio_complete
    block cfq: don't use atomic_t for cfq_group
    block cfq: don't use atomic_t for cfq_queue
    block: trace event block fix unassigned field
    block: add internal hd part table references
    block: fix accounting bug on cross partition merges
    kref: add kref_test_and_get
    bio-integrity: mark kintegrityd_wq highpri and CPU intensive
    block: make kblockd_workqueue smarter
    Revert "sd: implement sd_check_events()"
    block: Clean up exit_io_context() source code.
    Fix compile warnings due to missing removal of a 'ret' variable
    fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
    block: convert !IS_ERR(p) && p to !IS_ERR_OR_NULL(p)
    cfq-iosched: don't check cfqg in choose_service_tree()
    fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
    cdrom: export cdrom_check_events()
    sd: implement sd_check_events()
    sr: implement sr_check_events()
    ...

    Linus Torvalds
     

05 Jan, 2011

1 commit

  • /proc/diskstats would display a strange output as follows.

    $ cat /proc/diskstats |grep sda
    8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
    8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
    ~~~~~~~~~~
    8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
    8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
    8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
    8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

    The reason is incorrect accounting of hd_struct->in_flight when a bio is
    merged into a request belonging to a different partition by
    ELEVATOR_FRONT_MERGE.

    The detailed root cause is as follows.

    Assuming that there are two partitions, sda1 and sda2.

    1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
    is 0 and sda2's one is 1.

    | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    2. A bio belonging to sda1 is issued and is merged into the request
    mentioned in step 1 by ELEVATOR_BACK_MERGE. The first sector of the
    request changes from the sda2 region to the sda1 region. However,
    neither partition's hd_struct->in_flight is changed.

    | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    3. The request is finished and blk_account_io_done() is called. In this
    case, sda2's hd_struct->in_flight, not sda1's, is decremented.

    | hd_struct->in_flight
    ---------------------------
    sda1 | -1
    sda2 | 1
    ---------------------------

    The patch fixes the problem by caching the partition lookup
    inside the request structure, hence making sure that the increment
    and decrement will always happen on the same partition struct. This
    also speeds up IO with accounting enabled, since it cuts down on
    the number of lookups we have to do.

    Also add a refcount to struct hd_struct to keep the partition in
    memory as long as users exist. We use kref_test_and_get() to ensure
    we don't add a reference to a partition which is going away.
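
    A sketch of the accounting-start path, assuming a cached rq->part field
    and a kref embedded in hd_struct (field names illustrative):

    part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
    if (!kref_test_and_get(&part->ref))
            part = &rq->rq_disk->part0; /* partition already going away */
    rq->part = part;                    /* completion decrements this one */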

    Signed-off-by: Jerome Marchand
    Signed-off-by: Yasuaki Ishimatsu
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jerome Marchand
     

17 Dec, 2010

4 commits

  • Implement blk_limits_max_hw_sectors() and make
    blk_queue_max_hw_sectors() a wrapper around it.

    DM needs this to avoid setting queue_limits' max_hw_sectors and
    max_sectors directly. dm_set_device_limits() now leverages
    blk_limits_max_hw_sectors() logic to establish the appropriate
    max_hw_sectors minimum (PAGE_SIZE). Fixes issue where DM was
    incorrectly setting max_sectors rather than max_hw_sectors (which
    caused dm_merge_bvec()'s max_hw_sectors check to be ineffective).
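
    A sketch of the split, with the PAGE_SIZE floor described above:

    void blk_limits_max_hw_sectors(struct queue_limits *limits,
                                   unsigned int max_hw_sectors)
    {
            if ((max_hw_sectors << 9) < PAGE_SIZE) {
                    max_hw_sectors = 1 << (PAGE_SHIFT - 9);
                    printk(KERN_INFO "%s: set to minimum %d\n",
                           __func__, max_hw_sectors);
            }
            limits->max_hw_sectors = limits->max_sectors = max_hw_sectors;
    }

    void blk_queue_max_hw_sectors(struct request_queue *q,
                                  unsigned int max_hw_sectors)
    {
            blk_limits_max_hw_sectors(&q->limits, max_hw_sectors);
    }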

    Signed-off-by: Mike Snitzer
    Cc: stable@kernel.org
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • When stacking devices, a request_queue is not always available. This
    forced us to have a no_cluster flag in the queue_limits that could be
    used as a carrier until the request_queue had been set up for a
    metadevice.

    There were several problems with that approach. First of all it was up
    to the stacking device to remember to set queue flag after stacking had
    completed. Also, the queue flag and the queue limits had to be kept in
    sync at all times. We got that wrong, which could lead to us issuing
    commands that went beyond the max scatterlist limit set by the driver.

    The proper fix is to avoid having two flags for tracking the same thing.
    We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
    block layer merging functions. The queue_limit 'no_cluster' is turned
    into 'cluster' to avoid double negatives and to ease stacking.
    Clustering defaults to being enabled as before. The queue flag logic is
    removed from the stacking function, and explicitly setting the cluster
    flag is no longer necessary in DM and MD.
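
    Sketched: the merging code now asks the limit directly instead of
    testing a queue flag:

    static inline unsigned int blk_queue_cluster(struct request_queue *q)
    {
            return q->limits.cluster;
    }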

    Reported-by: Ed Lin
    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Currently, media presence polling for removable block devices is done
    from userland. There are several issues with this.

    * Polling is done by periodically opening the device. For SCSI
    devices, the command sequence generated by such action involves a
    few different commands including TEST_UNIT_READY. This behavior,
    while perfectly legal, is different from Windows, which only issues a
    single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
    ATAPI devices lock up after being periodically queried with such
    command sequences.

    * There is no reliable and unintrusive way for a userland program to
    tell whether the target device is safe for media presence polling.
    For example, polling for media presence during an on-going burning
    session can make it fail. The polling program can avoid this by
    opening the device with O_EXCL but then it risks making a valid
    exclusive user of the device fail w/ -EBUSY.

    * Userland polling is unnecessarily heavy and in-kernel implementation
    is lighter and better coordinated (workqueue, timer slack).

    This patch implements framework for in-kernel disk event handling,
    which includes media presence polling.

    * bdops->check_events() is added, which supersedes ->media_changed().
    It should check whether there are any pending events and return them
    if so. Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
    DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
    called in parallel.

    * gendisk->events and ->async_events are added. These should be
    initialized by block driver before passing the device to add_disk().
    The former contains the mask of all supported events and the latter
    the mask of all events which the device can report without polling.
    /sys/block/*/events[_async] export these to userland.

    * Kernel parameter block.events_dfl_poll_msecs controls the system
    polling interval (default is 0 which means disable) and
    /sys/block/*/events_poll_msecs control polling intervals for
    individual devices (default is -1 meaning use system setting). Note
    that if a device can report all supported events asynchronously and
    its polling interval isn't explicitly set, the device won't be
    polled regardless of the system polling interval.

    * If a device is opened exclusively with write access, event checking
    is automatically disabled until all write exclusive accesses are
    released.

    * There are event 'clearing' events. For example, both of currently
    defined events are cleared after the device has been successfully
    opened. This information is passed to ->check_events() callback
    using @clearing argument as a hint.

    * Event checking is always performed from system_nrt_wq and timer
    slack is set to 25% for polling.

    * Nothing changes for drivers which implement ->media_changed() but
    not ->check_events(). Going forward, all drivers will be converted
    to ->check_events() and ->media_changed() will be dropped.
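
    A hypothetical driver conversion, sketched (the mydrv_* names are
    illustrative):

    static unsigned int mydrv_check_events(struct gendisk *disk,
                                           unsigned int clearing)
    {
            unsigned int events = 0;

            if (mydrv_media_changed(disk)) /* driver-specific probe */
                    events |= DISK_EVENT_MEDIA_CHANGE;
            return events;
    }

    static const struct block_device_operations mydrv_fops = {
            .check_events = mydrv_check_events,
    };

    /* before add_disk(): declare which events we can report */
    disk->events = DISK_EVENT_MEDIA_CHANGE | DISK_EVENT_EJECT_REQUEST;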

    Signed-off-by: Tejun Heo
    Cc: Kay Sievers
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • There's no reason for register_disk() and del_gendisk() to be in
    fs/partitions/check.c. Move both to genhd.c. While at it, collapse
    unlink_gendisk(), which was artificially in a separate function due to
    genhd.c / check.c split, into del_gendisk().

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

10 Nov, 2010

1 commit

  • REQ_HARDBARRIER is dead now, so remove the leftovers. What's left
    at this point is:

    - various checks inside the block layer.
    - sanity checks in bio based drivers.
    - now unused bio_empty_barrier helper.
    - Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it has been dead for a
    while, but Xen really needs to sort out its barrier situation.
    - setting of ordered tags in uas - dead code copied from old scsi
    drivers.
    - scsi different retry for barriers - it's dead and should have been
    removed when flushes were converted to FS requests.
    - blktrace handling of barriers - removed. Someone who knows blktrace
    better should add support for REQ_FLUSH and REQ_FUA, though.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

23 Oct, 2010

1 commit

  • * 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
    xen-blkfront: disable barrier/flush write support
    Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
    block: remove BLKDEV_IFL_WAIT
    aic7xxx_old: removed unused 'req' variable
    block: remove the BH_Eopnotsupp flag
    block: remove the BLKDEV_IFL_BARRIER flag
    block: remove the WRITE_BARRIER flag
    swap: do not send discards as barriers
    fat: do not send discards as barriers
    ext4: do not send discards as barriers
    jbd2: replace barriers with explicit flush / FUA usage
    jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
    jbd: replace barriers with explicit flush / FUA usage
    nilfs2: replace barriers with explicit flush / FUA usage
    reiserfs: replace barriers with explicit flush / FUA usage
    gfs2: replace barriers with explicit flush / FUA usage
    btrfs: replace barriers with explicit flush / FUA usage
    xfs: replace barriers with explicit flush / FUA usage
    block: pass gfp_mask and flags to sb_issue_discard
    dm: convey that all flushes are processed as empty
    ...

    Linus Torvalds
     

19 Oct, 2010

1 commit

  • /proc/diskstats would display a strange output as follows.

    $ cat /proc/diskstats |grep sda
    8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
    8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
    ~~~~~~~~~~
    8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
    8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
    8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
    8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

    The reason is incorrect accounting of hd_struct->in_flight when a bio is
    merged into a request belonging to a different partition by
    ELEVATOR_FRONT_MERGE.

    The detailed root cause is as follows.

    Assuming that there are two partitions, sda1 and sda2.

    1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
    is 0 and sda2's one is 1.

    | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    2. A bio belonging to sda1 is issued and is merged into the request
    mentioned in step 1 by ELEVATOR_BACK_MERGE. The first sector of the
    request changes from the sda2 region to the sda1 region. However,
    neither partition's hd_struct->in_flight is changed.

    | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    3. The request is finished and blk_account_io_done() is called. In this
    case, sda2's hd_struct->in_flight, not sda1's, is decremented.

    | hd_struct->in_flight
    ---------------------------
    sda1 | -1
    sda2 | 1
    ---------------------------

    The patch fixes the problem by caching the partition lookup
    inside the request structure, hence making sure that the increment
    and decrement will always happen on the same partition struct. This
    also speeds up IO with accounting enabled, since it cuts down on
    the number of lookups we have to do.

    When reloading partition tables, quiesce IO to ensure that no
    request references to the partition struct exists. When it is safe
    to free the partition table, the IO for that device is restarted
    again.

    Signed-off-by: Yasuaki Ishimatsu
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Yasuaki Ishimatsu
     

17 Sep, 2010

1 commit

  • All the blkdev_issue_* helpers can only sanely be used by synchronous
    callers. To issue cache flushes or barriers asynchronously, the caller
    needs to set up a bio by itself with a completion callback to move the
    asynchronous state machine ahead. So drop the BLKDEV_IFL_WAIT flag that is always
    specified when calling blkdev_issue_* and also remove the now unused flags
    argument to blkdev_issue_flush and blkdev_issue_zeroout. For
    blkdev_issue_discard we need to keep it for the secure discard flag, which
    gains a more descriptive name and loses the bitops vs flag confusion.
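
    The resulting prototypes, sketched:

    int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
                           sector_t *error_sector);
    int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
                             sector_t nr_sects, gfp_t gfp_mask);
    int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
                             sector_t nr_sects, gfp_t gfp_mask,
                             unsigned long flags); /* BLKDEV_DISCARD_SECURE */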

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

15 Sep, 2010

1 commit

  • Change the type of the 2nd parameter of blk_rq_aligned() to unsigned long
    and remove unnecessary casting. Now we can call it with 'uaddr'
    instead of 'ubuf' in __blk_rq_map_user(), which removes the
    following warnings from sparse:

    block/blk-map.c:57:31: warning: incorrect type in argument 2 (different address spaces)
    block/blk-map.c:57:31: expected void *addr
    block/blk-map.c:57:31: got void [noderef] *ubuf

    However blk_rq_map_kern() needs one more local variable to handle it.
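
    The changed helper, sketched:

    static inline int blk_rq_aligned(struct request_queue *q,
                                     unsigned long addr, unsigned int len)
    {
            unsigned int alignment = queue_dma_alignment(q) | q->dma_pad_mask;

            return !(addr & alignment) && !(len & alignment);
    }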

    Signed-off-by: Namhyung Kim
    Signed-off-by: Jens Axboe

    Namhyung Kim
     

11 Sep, 2010

1 commit

  • Some controllers have a hardware limit on the number of protection
    information scatter-gather list segments they can handle.

    Introduce a max_integrity_segments limit in the block layer and provide
    a new scsi_host_template setting that allows HBA drivers to provide a
    value suitable for the hardware.

    Add support for honoring the integrity segment limit when merging both
    bios and requests.
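
    A sketch of the new limit's setter, mirroring the existing data
    segment limits:

    void blk_queue_max_integrity_segments(struct request_queue *q,
                                          unsigned int segs)
    {
            q->limits.max_integrity_segments = segs;
    }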

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

10 Sep, 2010

4 commits

  • Remove support for barriers on discards, which is unused now. Also
    remove the DISCARD_NOBARRIER I/O type in favour of just setting the
    rw flags up locally in blkdev_issue_discard.

    tj: Also remove DISCARD_SECURE and use REQ_SECURE directly.

    Signed-off-by: Christoph Hellwig
    Acked-by: Mike Snitzer
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We'll need to get rid of the BLKDEV_IFL_BARRIER flag; to facilitate
    that and to make the interface less confusing, pass all flags explicitly.

    Signed-off-by: Christoph Hellwig
    Acked-by: Mike Snitzer
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Now that the backend conversion is complete, export sequenced
    FLUSH/FUA capability through REQ_FLUSH/FUA flags. REQ_FLUSH means the
    device cache should be flushed before executing the request. REQ_FUA
    means that the data in the request should be on non-volatile media on
    completion.

    The block layer will choose the correct way of implementing the semantics
    and execute it. The request may be passed to the device directly if
    the device can handle it; otherwise, it will be sequenced using one or
    more proxy requests. Devices will never see REQ_FLUSH and/or FUA
    flags which they don't support.

    Also, unlike the original REQ_HARDBARRIER, REQ_FLUSH/FUA requests are
    never failed with -EOPNOTSUPP. If the underlying device doesn't
    support FLUSH/FUA, the block layer simply makes them no-ops. IOW, it no
    longer distinguishes between writeback cache which doesn't support
    cache flush and writethrough/no cache. Devices which have WB cache
    w/o flush are very difficult to come by these days and there's nothing
    much we can do anyway, so it doesn't make sense to require everyone to
    implement -EOPNOTSUPP handling. This will simplify filesystems and
    block drivers as they can drop -EOPNOTSUPP retry logic for barriers.

    * QUEUE_ORDERED_* are removed and QUEUE_FSEQ_* are moved into
    blk-flush.c.

    * REQ_FLUSH w/o data can also be directly passed to drivers without
    sequencing, but some drivers assume that zero-length requests don't
    have rq->bio, which isn't true for these requests, requiring the use
    of proxy requests.

    * REQ_COMMON_MASK now includes REQ_FLUSH | REQ_FUA so that they are
    copied from bio to request.

    * WRITE_BARRIER is marked deprecated and WRITE_FLUSH, WRITE_FUA and
    WRITE_FLUSH_FUA are added.
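
    A sketch of filesystem-side use of the new helpers (bio setup elided):

    /* Preflush + FUA data write in one submission; the block layer adds
     * proxy flushes only if the device lacks native FLUSH/FUA. */
    submit_bio(WRITE_FLUSH_FUA, bio);

    /* A pure cache flush: an empty bio submitted with WRITE_FLUSH. */
    submit_bio(WRITE_FLUSH, flush_bio);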

    Signed-off-by: Tejun Heo
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With ordering requirements dropped, barrier and ordered are misnomers.
    Now all block layer does is sequencing FLUSH and FUA. Rename them to
    flush.

    Signed-off-by: Tejun Heo
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Tejun Heo