12 May, 2009

1 commit

  • The current bio_vec array index out-of-bounds test within
    __end_that_request_first() does not seem correct.
    It checks bio->bi_idx against bio->bi_vcnt, but the subsequent code
    uses idx (which is bio->bi_idx + next_idx) as the index into the
    bio_vec array. This means that the test really makes sense only on
    the first iteration of the !(nr_bytes >= bio->bi_size) case (when
    next_idx == 0). Fix this by replacing bio->bi_idx with idx.
    (This patch applies to 2.6.30-rc4.)

    Signed-off-by: Kazuhisa Ichikawa
    Signed-off-by: Jens Axboe

    Kazuhisa Ichikawa
     

24 Apr, 2009

1 commit

  • This simplifies the I/O stat accounting switching code and separates
    it completely from the I/O scheduler switching code.

    Requests are accounted according to the state of their request queue
    at the time of the request allocation. There is no longer any need to
    flush the request queue when switching the I/O accounting state.

    Signed-off-by: Jerome Marchand
    Signed-off-by: Jens Axboe

    Jerome Marchand
     

08 Apr, 2009

1 commit

  • …nel/git/tip/linux-2.6-tip

    * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    branch tracer, intel-iommu: fix build with CONFIG_BRANCH_TRACER=y
    branch tracer: Fix for enabling branch profiling makes sparse unusable
    ftrace: Correct a text align for event format output
    Update /debug/tracing/README
    tracing/ftrace: alloc the started cpumask for the trace file
    tracing, x86: remove duplicated #include
    ftrace: Add check of sched_stopped for probe_sched_wakeup
    function-graph: add proper initialization for init task
    tracing/ftrace: fix missing include string.h
    tracing: fix incorrect return type of ns2usecs()
    tracing: remove CALLER_ADDR2 from wakeup tracer
    blktrace: fix pdu_len when tracing packet command requests
    blktrace: small cleanup in blk_msg_write()
    blktrace: NUL-terminate user space messages
    tracing: move scripts/trace/power.pl to scripts/tracing/power.pl

    Linus Torvalds
     

07 Apr, 2009

2 commits


06 Apr, 2009

3 commits


03 Apr, 2009

1 commit

  • Impact: output all of packet commands - not just the first 4 / 8 bytes

    Since commit d7e3c3249ef23b4617393c69fe464765b4ff1645 ("block: add
    large command support"), struct request->cmd has been changed from
    unsigned char cmd[BLK_MAX_CDB] to unsigned char *cmd.

    v1 -> v2, suggested by FUJITA Tomonori:

    - make sure rq->cmd_len is always initialized, and then we can use
    rq->cmd_len instead of BLK_MAX_CDB.

    Signed-off-by: Li Zefan
    Acked-by: FUJITA Tomonori
    Cc: Arnaldo Carvalho de Melo
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Jens Axboe
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     

26 Mar, 2009

1 commit

  • Put a WARN_ON in __blk_put_request if it is about to
    leak bio(s). This is a serious bug that can happen in error
    handling code paths.

    For this to work I have fixed a couple of places in block/ where
    request->bio != NULL ownership was not honored. And, while at it,
    a small cleanup in sg_io().

    Signed-off-by: Boaz Harrosh
    Signed-off-by: Jens Axboe

    Boaz Harrosh
     

24 Mar, 2009

2 commits


02 Feb, 2009

1 commit


30 Jan, 2009

3 commits


29 Dec, 2008

5 commits

  • We just want to hand the first bits of IO to the device as fast
    as possible. Gains a few percent on the IOPS rate.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • * Because barrier mode can be changed dynamically, whether barrier is
    supported or not can be determined only when actually issuing the
    barrier and there is no point in checking it earlier. Drop barrier
    support check in generic_make_request() and __make_request(), and
    update comment around the support check in blk_do_ordered().

    * There is no reason to check discard support in both
    generic_make_request() and __make_request(). Drop the check in
    __make_request(). While at it, move the error action block to the
    end of the function and add unlikely() to the q existence test.

    * Barrier request, be it empty or not, is never passed to low level
    driver and thus it's meaningless to try to copy back req->sector to
    bio->bi_sector on error. In addition, the notion of failed sector
    doesn't make any sense for empty barrier to begin with. Drop the
    code block from __end_that_request_first().

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • After many improvements to kblockd_flush_work, it is now identical to
    cancel_work_sync, so a direct call to cancel_work_sync is suggested.

    The only difference is that cancel_work_sync is a GPL symbol,
    so non-GPL modules can no longer use it.

    Signed-off-by: Cheng Renquan
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Cheng Renquan
     
  • Allow the scsi request REQ_QUIET flag to be propagated to the buffer
    file system layer. The basic idea is to pass the flag from the scsi
    request to the bio (block IO) and then to the buffer layer. The buffer
    layer can then suppress needless printks.

    This patch declutters the kernel log by removing the 40-50 (per LUN)
    buffer io error messages seen during a boot in my multipath setup.
    Without this patch, there is a good chance any real errors will be
    missed in the "noise" in the logs.

    During boot I see blocks of messages like
    "
    __ratelimit: 211 callbacks suppressed
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242847
    Buffer I/O error on device sdm, logical block 1
    Buffer I/O error on device sdm, logical block 5242878
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242872
    "
    in my logs.

    My disk environment is multipath fibre channel using the SCSI_DH_RDAC
    code and multipathd. This topology includes an "active" and a "ghost"
    path for each LUN. IOs to the "ghost" path will never complete, so the
    SCSI layer, via the scsi device handler rdac code, quickly returns the
    IOs on these paths and sets the REQ_QUIET scsi flag to suppress the
    scsi layer messages.

    I want to extend the QUIET behavior to include the buffer file
    system layer to deal with these errors as well. I have been running
    this patch for a while now on several boxes without issue. A few runs
    of bonnie++ show no noticeable difference in performance in my setup.

    Thanks to John Stultz for the quiet_error finalization.

    Submitted-by: Keith Mannthey
    Signed-off-by: Jens Axboe

    Keith Mannthey
     
  • For sync IO, we'll often do it serialized. This means we'll be touching
    the queue timer for every IO, as opposed to only occasionally as we
    do for queued IO. Instead of deleting the timer when the last request
    is removed, just let it continue running. If a new request comes in
    soon, we then don't have to re-add the timer. If no new requests
    arrive, the timer will expire without side effects later.

    This improves high iops sync IO by ~1%.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Dec, 2008

1 commit


03 Dec, 2008

2 commits

  • Fix setting of max_segment_size and seg_boundary mask for stacked md/dm
    devices.

    When stacking devices (LVM over MD over SCSI), some of the request
    queue parameters are not set up correctly by default in some cases,
    namely max_segment_size and the seg_boundary mask.

    If you create an MD device over SCSI, these attributes are zeroed.

    The problem arises when another device-mapper mapping is created on
    top of this stack - the queue attributes are set in DM this way:

    request_queue   max_segment_size   seg_boundary_mask
    SCSI            65536              0xffffffff
    MD RAID1        0                  0
    LVM             65536              -1 (64bit)

    Unfortunately bio_add_page (resp. bio_phys_segments) calculates the
    number of physical segments according to these parameters.

    During generic_make_request() the segment count is recalculated and
    can increase bio->bi_phys_segments above the allowed limit (after
    bio_clone() in a stacking operation).

    This is especially a problem in the CCISS driver, where it produces
    an oops here:

    BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);

    (MAXSGENTRIES is 31 by default.)

    Sometimes even this command is enough to cause oops:

    dd iflag=direct if=/dev// of=/dev/null bs=128000 count=10

    This command generates bios with 250 sectors, allocated in 32 4k-pages
    (last page uses only 1024 bytes).

    For the LVM layer, it allocates a bio with 31 segments (still OK for
    CCISS), but unfortunately on the lower layer it is recalculated to 32
    segments; this violates the CCISS restriction and triggers the BUG_ON().

    The patch tries to fix it by:

    * initialize the attributes above in the queue request constructor
    blk_queue_make_request()

    * make sure that blk_queue_stack_limits() inherits these settings

    (DM uses its own function to set the limits because
    blk_queue_stack_limits() was introduced later. It should probably
    switch to the generic stack limit function too.)

    * set the default seg_boundary value in one place (blkdev.h)

    * use this mask as the default in DM (instead of -1, which differs
    on 64bit)

    Bugs related to this:
    https://bugzilla.redhat.com/show_bug.cgi?id=471639
    http://bugzilla.kernel.org/show_bug.cgi?id=8672

    Signed-off-by: Milan Broz
    Reviewed-by: Alasdair G Kergon
    Cc: Neil Brown
    Cc: FUJITA Tomonori
    Cc: Tejun Heo
    Cc: Mike Miller
    Signed-off-by: Jens Axboe

    Milan Broz
     
  • blkdev_dequeue_request() and elv_dequeue_request() are equivalent and
    both start the timeout timer. The barrier code dequeues the original
    barrier request but doesn't pass the request itself to the low level
    driver, only the broken-down proxy requests; however, the original
    barrier request goes through the same dequeue path and the timeout
    timer is started on it. If the barrier sequence takes long enough,
    this timer expires, but the low level driver has no idea about this
    request and an oops follows.

    The timeout timer shouldn't have been started on the original barrier
    request as it never goes through actual IO. This patch unexports
    elv_dequeue_request(), which has no external user anyway, makes it
    operate on the elevator proper without adding the timer, and makes
    blkdev_dequeue_request() call elv_dequeue_request() and add the timer.
    Internal users which don't pass the request to the driver - the
    barrier code and end_that_request_last() - are converted to use
    elv_dequeue_request().

    Signed-off-by: Tejun Heo
    Cc: Mike Anderson
    Signed-off-by: Jens Axboe

    Tejun Heo
     

26 Nov, 2008

2 commits

  • Port to the new tracepoints API: split DEFINE_TRACE() and DECLARE_TRACE()
    sites. Spread them out to the usage sites, as suggested by
    Mathieu Desnoyers.

    Signed-off-by: Ingo Molnar
    Acked-by: Mathieu Desnoyers

    Ingo Molnar
     
  • This was a forward port of work done by Mathieu Desnoyers; I changed
    it to encode the 'what' parameter in the tracepoint name, so that one
    can register interest in specific events rather than in classes of
    events that then require checking the 'what' parameter.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: Jens Axboe
    Signed-off-by: Ingo Molnar

    Arnaldo Carvalho de Melo
     

06 Nov, 2008

1 commit


18 Oct, 2008

1 commit

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: remove __generic_unplug_device() from exports
    block: move q->unplug_work initialization
    blktrace: pass zfcp driver data
    blktrace: add support for driver data
    block: fix current kernel-doc warnings
    block: only call ->request_fn when the queue is not stopped
    block: simplify string handling in elv_iosched_store()
    block: fix kernel-doc for blk_alloc_devt()
    block: fix nr_phys_segments miscalculation bug
    block: add partition attribute for partition number
    block: add BIG FAT WARNING to CONFIG_DEBUG_BLOCK_EXT_DEVT
    softirq: Add support for triggering softirq work on softirqs.

    Linus Torvalds
     

17 Oct, 2008

4 commits

  • The only out-of-core user is IDE, and that should be using
    blk_start_queueing() instead.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Running "modprobe loop; rmmod loop" effectively creates a blk_queue
    and destroys it, which results in q->unplug_work being canceled
    without ever having been initialized.

    Therefore, move the initialization of q->unplug_work from
    blk_queue_make_request() to blk_alloc_queue*().

    Reported-by: Alexey Dobriyan
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Jens Axboe

    Peter Zijlstra
     
  • Fix block kernel-doc warnings:

    Warning(linux-2.6.27-git4//fs/block_dev.c:1272): No description found for parameter 'path'
    Warning(linux-2.6.27-git4//block/blk-core.c:1021): No description found for parameter 'cpu'
    Warning(linux-2.6.27-git4//block/blk-core.c:1021): No description found for parameter 'part'
    Warning(/var/linsrc/linux-2.6.27-git4//block/genhd.c:544): No description found for parameter 'partno'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Jens Axboe

    Randy Dunlap
     
  • Callers should use either blk_run_queue/__blk_run_queue, or
    blk_start_queueing() to invoke request handling instead of calling
    ->request_fn() directly as that does not take the queue stopped
    flag into account.

    Also add appropriate comments on the above functions to detail
    their usage.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

13 Oct, 2008

1 commit

  • Multipath is best at handling transport errors. If it gets a device
    error then there is not much the multipath layer can do. It will just
    access the same device but from a different path.

    This patch breaks up failfast into device, transport and driver errors.
    The multipath layers (md and dm multipath) only ask the lower levels to
    fast fail transport errors. The user of failfast, read ahead, will ask
    to fast fail on all errors.

    Note that blk_noretry_request will return true if any failfast bit
    is set. This allows drivers that do not support the multipath failfast
    bits to continue to fail on any failfast error like before. Drivers
    like scsi that are able to fail fast specific errors can check
    for the specific fail fast type. In the next patch I will convert
    scsi.

    Signed-off-by: Mike Christie
    Cc: Jens Axboe
    Signed-off-by: James Bottomley

    Mike Christie
     

09 Oct, 2008

7 commits

  • This patch removes end_queued_request() and end_dequeued_request(),
    which are no longer used.

    As a result, the only user of __end_request() is end_request().
    So the actual code in __end_request() is moved into end_request()
    and __end_request() is removed.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Jens Axboe

    Kiyoshi Ueda
     
  • This patch adds a new interface, blk_lld_busy(), to check the LLD's
    busy state from the block layer.
    blk_lld_busy() calls down into the low-level driver for the check
    if the driver has set q->lld_busy_fn() using blk_queue_lld_busy().

    This resolves the performance problem on request stacking devices
    described below.

    Some drivers, like the scsi mid layer, stop dispatching requests when
    they detect a busy state on the low-level device (host/target/device).
    This allows other requests to stay in the I/O scheduler's queue
    for a chance of merging.

    Request stacking drivers like request-based dm should follow the
    same logic. However, there is no generic interface for the stacked
    device to check whether the underlying device(s) are busy.
    If the request stacking driver dispatches and submits requests to
    a busy underlying device, the requests will stay in the underlying
    device's queue without a chance of merging.
    This causes a performance problem under bursty I/O load.

    With this patch, the busy state of the underlying device is exported
    via q->lld_busy_fn(), so the request stacking driver can check it
    and stop dispatching requests when the device is busy.

    The underlying device driver must return the busy state appropriately:
    1: when the device driver can't process requests immediately.
    0: when the device driver can process requests immediately,
    including abnormal situations where the device driver needs
    to kill all requests.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: Andrew Morton
    Signed-off-by: Jens Axboe

    Kiyoshi Ueda
     
  • blk_start_queueing() should act like the generic queue unplugging
    and kicking and ignore a stopped queue. Such a queue may not be
    run until after a call to blk_start_queue().

    Signed-off-by: Elias Oltmanns
    Signed-off-by: Jens Axboe

    Elias Oltmanns
     
  • This patch adds a queue flag to indicate the block device can be
    used for request stacking.

    Request stacking drivers need to stack their devices only on top of
    devices whose q->request_fn is functional.
    Since bio stacking drivers (e.g. md, loop) basically initialize
    their queue using blk_alloc_queue() and don't set q->request_fn,
    checking for (q->request_fn == NULL) seems sufficient for that purpose.

    However, dm will become both types of stacking driver (bio-based and
    request-based), and dm will always set q->request_fn even if the dm
    device is bio-based, in which case q->request_fn is not actually
    functional. So we need something else to distinguish the type of the
    device. Adding a queue flag is a solution for that.

    The reason why dm always sets q->request_fn is to keep
    the compatibility of dm user-space tools.
    Currently, all dm user-space tools are using bio-based dm without
    specifying the type of the dm device they use.
    To use request-based dm without changing such tools, the kernel
    must decide the type of the dm device automatically.
    The automatic type decision can't be done at the device creation time
    and needs to be deferred until such tools load a mapping table,
    since the actual type is decided by dm target type included in
    the mapping table.

    So a dm device has to be initialized using blk_init_queue()
    so that we can load either type of table.
    Then all the queue machinery is set up (e.g. q->request_fn) and we
    have no way to distinguish whether it is bio-based or request-based,
    even after a table is loaded and the type of the device is decided.

    By the way, some parts of the queue (e.g. request_list, elevator)
    are not needed when the dm device is used as bio-based.
    But the memory footprint is not so large (about 20KB per queue on
    ia64), so I hope the memory loss is acceptable for bio-based dm users.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Jens Axboe

    Kiyoshi Ueda
     
  • This patch adds blk_insert_cloned_request(), a generic request
    submission interface for request stacking drivers.
    Request-based dm will use it to submit its clones to underlying
    devices.

    blk_rq_check_limits() is also added because the lower queue may have
    stricter limits than the upper queue when multiple drivers are
    stacked at the request level.
    Besides blk_insert_cloned_request()'s internal use, the function
    will also be used by request-based dm when the queue limits are
    modified (e.g. by replacing dm's table).

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Jens Axboe

    Kiyoshi Ueda
     
  • This patch adds blk_update_request(), which updates a struct request
    by completing its data part, but doesn't complete the struct
    request itself.
    Though it looks like end_that_request_first() of older kernels,
    blk_update_request() should be used only by request stacking drivers.

    Request-based dm will use it in the bio->bi_end_io callback to update
    the original request when a data part of a cloned request completes.
    The following is additional background on why request-based dm needs
    this interface.

    - Request stacking drivers can't use blk_end_request() directly from
    the lower driver's completion context (bio->bi_end_io or rq->end_io),
    because some device drivers (e.g. ide) may try to complete their
    requests with the queue lock held, which may cause a deadlock.
    See below for a detailed description of the possible deadlock:

    - To solve that, request-based dm offloads the completion of
    cloned struct request to softirq context (i.e. using
    blk_complete_request() from rq->end_io).

    - Though it is possible to use the same solution from bio->bi_end_io,
    it would delay the notification of bio completion to the original
    submitter. It would also cause inefficient partial completion,
    because the lower driver can't process the cloned request any
    further and request-based dm would need to requeue and redispatch
    it to the lower driver again later. That's not good.

    - So request-based dm needs blk_update_request() to perform the bio
    completion in the lower driver's completion context, which is more
    efficient.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Jens Axboe

    Kiyoshi Ueda
     
  • When a driver calls blk_cleanup_queue(), the device should be fully
    idle. However, the block layer may have pending plugging timers, and
    the IO schedulers may have pending work in the work queues. So quiesce
    the device by waiting for the timer and flushing the work queues.

    Signed-off-by: Jens Axboe

    Jens Axboe