19 Oct, 2011

4 commits

  • request_queue is refcounted but actually depdends on lifetime
    management from the queue owner - on blk_cleanup_queue(), block layer
    expects that there's no request passing through request_queue and no
    new one will.

    This is fundamentally broken. The queue owner (e.g. SCSI layer)
    doesn't have a way to know whether there are other active users before
    calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
    guarantee that the queue is and would stay valid while it's holding a
    reference.

    With delay added in blk_queue_bio() before queue_lock is grabbed, the
    following oops can be easily triggered when a device is removed with
    in-flight IOs.

    sd 0:0:1:0: [sdb] Stopping disk
    ata1.01: disabled
    general protection fault: 0000 [#1] PREEMPT SMP
    CPU 2
    Modules linked in:

    Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
    RIP: 0010:[] [] elv_rqhash_find+0x61/0x100
    ...
    Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
    ...
    Call Trace:
    [] elv_merge+0x84/0xe0
    [] blk_queue_bio+0xf4/0x400
    [] generic_make_request+0xca/0x100
    [] submit_bio+0x74/0x100
    [] dio_bio_submit+0xbc/0xc0
    [] __blockdev_direct_IO+0x92e/0xb40
    [] blkdev_direct_IO+0x57/0x60
    [] generic_file_aio_read+0x6d5/0x760
    [] do_sync_read+0xda/0x120
    [] vfs_read+0xc5/0x180
    [] sys_pread64+0x9a/0xb0
    [] system_call_fastpath+0x16/0x1b

    This happens because blk_queue_cleanup() destroys the queue and
    elevator whether IOs are in progress or not and DEAD tests are
    sprinkled in the request processing path without proper
    synchronization.

    Similar problem exists for blk-throtl. On queue cleanup, blk-throtl
    is shutdown whether it has requests in it or not. Depending on
    timing, it either oopses or throttled bios are lost putting tasks
    which are waiting for bio completion into eternal D state.

    The way it should work is having the usual clear distinction between
    shutdown and release. Shutdown drains all currently pending requests,
    marks the queue dead, and performs partial teardown of the now
    unnecessary part of the queue. Even after shutdown is complete,
    reference holders are still allowed to issue requests to the queue
    although they will be immmediately failed. The rest of teardown
    happens on release.

    This patch makes the following changes to make blk_queue_cleanup()
    behave as proper shutdown.

    * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
    queue_lock.

    * Unsynchronized DEAD check in generic_make_request_checks() removed.
    This couldn't make any meaningful difference as the queue could die
    after the check.

    * blk_drain_queue() updated such that it can drain all requests and is
    now called during cleanup.

    * blk_throtl updated such that it checks DEAD on grabbing queue_lock,
    drains all throttled bios during cleanup and free td when queue is
    released.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk_throtl_bio() and throtl_get_tg() have rather unusual interface.

    * throtl_get_tg() returns pointer to a valid tg or ERR_PTR(-ENODEV),
    and drops queue_lock in the latter case. Different locking context
    depending on return value is error-prone and DEAD state is scheduled
    to be protected by queue_lock anyway. Move DEAD check inside
    queue_lock and return valid tg or NULL.

    * blk_throtl_bio() indicates return status both with its return value
    and in/out param **@bio. The former is used to indicate whether
    queue is found to be dead during throtl processing. The latter
    whether the bio is throttled.

    There's no point in returning DEAD check result from
    blk_throtl_bio(). The queue can die after blk_throtl_bio() is
    finished but before make_request_fn() grabs queue lock.

    Make it take *@bio instead and return boolean result indicating
    whether the request is throttled or not.

    This patch doesn't cause any visible functional difference.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Reorganize queue draining related code in preparation of queue exit
    changes.

    * Factor out actual draining from elv_quiesce_start() to
    blk_drain_queue().

    * Make elv_quiesce_start/end() responsible for their own locking.

    * Replace open-coded ELVSWITCH clearing in elevator_switch() with
    elv_quiesce_end().

    This patch doesn't cause any visible functional difference.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk_throtl interface is block internal and there's no reason to have
    them in linux/blkdev.h. Move them to block/blk.h.

    This patch doesn't introduce any functional change.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

16 Aug, 2011

1 commit

  • Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae, block: reimplement
    FLUSH/FUA to support merge, introduced a performance regression when
    running any sort of fsyncing workload using dm-multipath and certain
    storage (in our case, an HP EVA). The test I ran was fs_mark, and it
    dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out
    that dm-multipath always advertised flush+fua support, and passed
    commands on down the stack, where those flags used to get stripped off.
    The above commit changed that behavior:

    static inline struct request *__elv_next_request(struct request_queue *q)
    {
    struct request *rq;

    while (1) {
    - while (!list_empty(&q->queue_head)) {
    + if (!list_empty(&q->queue_head)) {
    rq = list_entry_rq(q->queue_head.next);
    - if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
    - (rq->cmd_flags & REQ_FLUSH_SEQ))
    - return rq;
    - rq = blk_do_flush(q, rq);
    - if (rq)
    - return rq;
    + return rq;
    }

    Note that previously, a command would come in here, have
    REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:

    struct request *blk_do_flush(struct request_queue *q, struct request *rq)
    {
    unsigned int fflags = q->flush_flags; /* may change, cache it */
    bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
    bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
    bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
    REQ_FUA);
    unsigned skip = 0;
    ...
    if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
    rq->cmd_flags &= ~REQ_FLUSH;
    if (!has_fua)
    rq->cmd_flags &= ~REQ_FUA;
    return rq;
    }

    So, the flush machinery was bypassed in such cases (q->flush_flags == 0
    && rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).

    Now, however, we don't get into the flush machinery at all. Instead,
    __elv_next_request just hands a request with flush and fua bits set to
    the scsi_request_fn, even if the underlying request_queue does not
    support flush or fua.

    The agreed upon approach is to fix the flush machinery to allow
    stacking. While this isn't used in practice (since there is only one
    request-based dm target, and that target will now reflect the flush
    flags of the underlying device), it does future-proof the solution, and
    make it function as designed.

    In order to make this work, I had to add a field to the struct request,
    inside the flush structure (to store the original req->end_io). Shaohua
    had suggested overloading the union with rb_node and completion_data,
    but the completion data is used by device mapper and can also be used by
    other drivers. So, I didn't see a way around the additional field.

    I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
    the lost performance. Comments and other testers, as always, are
    appreciated.

    Cheers,
    Jeff

    Signed-off-by: Jeff Moyer
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

21 May, 2011

2 commits

  • This patch merges in a fix that missed 2.6.39 final.

    Conflicts:
    block/blk.h

    Jens Axboe
     
  • Since for-2.6.40/core was forked off the 2.6.39 devel tree, we've
    had churn in the core area that makes it difficult to handle
    patches for eg cfq or blk-throttle. Instead of requiring that they
    be based in older versions with bugs that have been fixed later
    in the rc cycle, merge in 2.6.39 final.

    Also fixes up conflicts in the below files.

    Conflicts:
    drivers/block/paride/pcd.c
    drivers/cdrom/viocd.c
    drivers/ide/ide-cd.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     

19 May, 2011

1 commit

  • blk_cleanup_queue() calls elevator_exit() and after this, we can't
    touch the elevator without oopsing. __elv_next_request() must check
    for this state because in the refcounted queue model, we can still
    call it after blk_cleanup_queue() has been called.

    This was reported as causing an oops attributable to scsi.

    Signed-off-by: James Bottomley
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    James Bottomley
     

07 May, 2011

1 commit

  • In some drives, flush requests are non-queueable. When flush request is
    running, normal read/write requests can't run. If block layer dispatches
    such request, driver can't handle it and requeue it. Tejun suggested we
    can hold the queue when flush is running. This can avoid unnecessary
    requeue. Also this can improve performance. For example, we have
    request flush1, write1, flush 2. flush1 is dispatched, then queue is
    hold, write1 isn't inserted to queue. After flush1 is finished, flush2
    will be dispatched. Since disk cache is already clean, flush2 will be
    finished very soon, so looks like flush2 is folded to flush1.

    In my test, the queue holding completely solves a regression introduced by
    commit 53d63e6b0dfb95882ec0219ba6bbd50cde423794:

    block: make the flush insertion use the tail of the dispatch list

    It's not a preempt type request, in fact we have to insert it
    behind requests that do specify INSERT_FRONT.

    which causes about 20% regression running a sysbench fileio
    workload.

    Stable: 2.6.39 only

    Cc: stable@kernel.org
    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    shaohua.li@intel.com
     

19 Apr, 2011

1 commit

  • We are currently using this flag to check whether it's safe
    to call into ->request_fn(). If it is set, we punt to kblockd.
    But we get a lot of false positives and excessive punts to
    kblockd, which hurts performance.

    The only real abuser of this infrastructure is SCSI. So export
    the async queue run and convert SCSI over to use that. There's
    room for improvement in that SCSI need not always use the async
    call, but this fixes our performance issue and they can fix that
    up in due time.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Apr, 2011

1 commit

  • Instead of overloading __blk_run_queue to force an offload to kblockd
    add a new blk_run_queue_async helper to do it explicitly. I've kept
    the blk_queue_stopped check for now, but I suspect it's not needed
    as the check we do when the workqueue items runs should be enough.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

31 Mar, 2011

1 commit


21 Mar, 2011

1 commit

  • One of the disadvantages of on-stack plugging is that we potentially
    lose out on merging since all pending IO isn't always visible to
    everybody. When we flush the on-stack plugs, right now we don't do
    any checks to see if potential merge candidates could be utilized.

    Correct this by adding a new insert variant, ELEVATOR_INSERT_SORT_MERGE.
    It works just ELEVATOR_INSERT_SORT, but first checks whether we can
    merge with an existing request before doing the insertion (if we fail
    merging).

    This fixes a regression with multiple processes issuing IO that
    can be merged.

    Thanks to Shaohua Li for testing and fixing
    an accounting bug.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Mar, 2011

1 commit

  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So lets kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     

25 Jan, 2011

2 commits

  • The current FLUSH/FUA support has evolved from the implementation
    which had to perform queue draining. As such, sequencing is done
    queue-wide one flush request after another. However, with the
    draining requirement gone, there's no reason to keep the queue-wide
    sequential approach.

    This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
    request is sequenced individually. The actual FLUSH execution is
    double buffered and whenever a request wants to execute one for either
    PRE or POSTFLUSH, it queues on the pending queue. Once certain
    conditions are met, a flush request is issued and on its completion
    all pending requests proceed to the next sequence.

    This allows arbitrary merging of different type of flushes. How they
    are merged can be primarily controlled and tuned by adjusting the
    above said 'conditions' used to determine when to issue the next
    flush.

    This is inspired by Darrick's patches to merge multiple zero-data
    flushes which helps workloads with highly concurrent fsync requests.

    * As flush requests are never put on the IO scheduler, request fields
    used for flush share space with rq->rb_node. rq->completion_data is
    moved out of the union. This increases the request size by one
    pointer.

    As rq->elevator_private* are used only by the iosched too, it is
    possible to reduce the request size further. However, to do that,
    we need to modify request allocation path such that iosched data is
    not allocated for flush requests.

    * FLUSH/FUA processing happens on insertion now instead of dispatch.

    - Comments updated as per Vivek and Mike.

    Signed-off-by: Tejun Heo
    Cc: "Darrick J. Wong"
    Cc: Shaohua Li
    Cc: Christoph Hellwig
    Cc: Vivek Goyal
    Cc: Mike Snitzer
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • rq == &q->flush_rq was used to determine whether a rq is part of a
    flush sequence, which worked because all requests in a flush sequence
    were sequenced using the single dedicated request. This is about to
    change, so introduce REQ_FLUSH_SEQ flag to distinguish flush sequence
    requests.

    This patch doesn't cause any behavior change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

25 Oct, 2010

1 commit


23 Oct, 2010

2 commits

  • * 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
    xen-blkfront: disable barrier/flush write support
    Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
    block: remove BLKDEV_IFL_WAIT
    aic7xxx_old: removed unused 'req' variable
    block: remove the BH_Eopnotsupp flag
    block: remove the BLKDEV_IFL_BARRIER flag
    block: remove the WRITE_BARRIER flag
    swap: do not send discards as barriers
    fat: do not send discards as barriers
    ext4: do not send discards as barriers
    jbd2: replace barriers with explicit flush / FUA usage
    jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
    jbd: replace barriers with explicit flush / FUA usage
    nilfs2: replace barriers with explicit flush / FUA usage
    reiserfs: replace barriers with explicit flush / FUA usage
    gfs2: replace barriers with explicit flush / FUA usage
    btrfs: replace barriers with explicit flush / FUA usage
    xfs: replace barriers with explicit flush / FUA usage
    block: pass gfp_mask and flags to sb_issue_discard
    dm: convey that all flushes are processed as empty
    ...

    Linus Torvalds
     
  • * 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block: (39 commits)
    cfq-iosched: Fix a gcc 4.5 warning and put some comments
    block: Turn bvec_k{un,}map_irq() into static inline functions
    block: fix accounting bug on cross partition merges
    block: Make the integrity mapped property a bio flag
    block: Fix double free in blk_integrity_unregister
    block: Ensure physical block size is unsigned int
    blkio-throttle: Fix possible multiplication overflow in iops calculations
    blkio-throttle: limit max iops value to UINT_MAX
    blkio-throttle: There is no need to convert jiffies to milli seconds
    blkio-throttle: Fix link failure failure on i386
    blkio: Recalculate the throttled bio dispatch time upon throttle limit change
    blkio: Add root group to td->tg_list
    blkio: deletion of a cgroup was causes oops
    blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n
    block: set the bounce_pfn to the actual DMA limit rather than to max memory
    block: revert bad fix for memory hotplug causing bounces
    Fix compile error in blk-exec.c for !CONFIG_DETECT_HUNG_TASK
    block: set the bounce_pfn to the actual DMA limit rather than to max memory
    block: Prevent hang_check firing during long I/O
    cfq: improve fsync performance for small files
    ...

    Fix up trivial conflicts due to __rcu sparse annotation in include/linux/genhd.h

    Linus Torvalds
     

19 Oct, 2010

2 commits

  • Conflicts:
    block/blk-core.c
    drivers/block/loop.c
    mm/swapfile.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • /proc/diskstats would display a strange output as follows.

    $ cat /proc/diskstats |grep sda
    8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
    8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
    ~~~~~~~~~~
    8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
    8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
    8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
    8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

    Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
    merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.

    The detailed root cause is as follows.

    Assuming that there are two partition, sda1 and sda2.

    1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
    is 0 and sda2's one is 1.

    | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    2. A bio belongs to sda1 is issued and is merged into the request mentioned on
    step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
    from sda2 region to sda1 region. However the two partition's
    hd_struct->in_flight are not changed.

    | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    3. The request is finished and blk_account_io_done() is called. In this case,
    sda2's hd_struct->in_flight, not a sda1's one, is decremented.

    | hd_struct->in_flight
    ---------------------------
    sda1 | -1
    sda2 | 1
    ---------------------------

    The patch fixes the problem by caching the partition lookup
    inside the request structure, hence making sure that the increment
    and decrement will always happen on the same partition struct. This
    also speeds up IO with accounting enabled, since it cuts down on
    the number of lookups we have to do.

    When reloading partition tables, quiesce IO to ensure that no
    request references to the partition struct exists. When it is safe
    to free the partition table, the IO for that device is restarted
    again.

    Signed-off-by: Yasuaki Ishimatsu
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Yasuaki Ishimatsu
     

11 Sep, 2010

1 commit

  • Some controllers have a hardware limit on the number of protection
    information scatter-gather list segments they can handle.

    Introduce a max_integrity_segments limit in the block layer and provide
    a new scsi_host_template setting that allows HBA drivers to provide a
    value suitable for the hardware.

    Add support for honoring the integrity segment limit when merging both
    bios and requests.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

10 Sep, 2010

5 commits

  • Now that the backend conversion is complete, export sequenced
    FLUSH/FUA capability through REQ_FLUSH/FUA flags. REQ_FLUSH means the
    device cache should be flushed before executing the request. REQ_FUA
    means that the data in the request should be on non-volatile media on
    completion.

    Block layer will choose the correct way of implementing the semantics
    and execute it. The request may be passed to the device directly if
    the device can handle it; otherwise, it will be sequenced using one or
    more proxy requests. Devices will never see REQ_FLUSH and/or FUA
    which it doesn't support.

    Also, unlike the original REQ_HARDBARRIER, REQ_FLUSH/FUA requests are
    never failed with -EOPNOTSUPP. If the underlying device doesn't
    support FLUSH/FUA, the block layer simply make those noop. IOW, it no
    longer distinguishes between writeback cache which doesn't support
    cache flush and writethrough/no cache. Devices which have WB cache
    w/o flush are very difficult to come by these days and there's nothing
    much we can do anyway, so it doesn't make sense to require everyone to
    implement -EOPNOTSUPP handling. This will simplify filesystems and
    block drivers as they can drop -EOPNOTSUPP retry logic for barriers.

    * QUEUE_ORDERED_* are removed and QUEUE_FSEQ_* are moved into
    blk-flush.c.

    * REQ_FLUSH w/o data can also be directly passed to drivers without
    sequencing but some drivers assume that zero length requests don't
    have rq->bio which isn't true for these requests requiring the use
    of proxy requests.

    * REQ_COMMON_MASK now includes REQ_FLUSH | REQ_FUA so that they are
    copied from bio to request.

    * WRITE_BARRIER is marked deprecated and WRITE_FLUSH, WRITE_FUA and
    WRITE_FLUSH_FUA are added.

    Signed-off-by: Tejun Heo
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With ordering requirements dropped, barrier and ordered are misnomers.
    Now all block layer does is sequencing FLUSH and FUA. Rename them to
    flush.

    Signed-off-by: Tejun Heo
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Filesystems will take all the responsibilities for ordering requests
    around commit writes and will only indicate how the commit writes
    themselves should be handled by block layers. This patch drops
    barrier ordering by queue draining from block layer. Ordering by
    draining implementation was somewhat invasive to request handling.
    List of notable changes follow.

    * Each queue has 1 bit color which is flipped on each barrier issue.
    This is used to track whether a given request is issued before the
    current barrier or not. REQ_ORDERED_COLOR flag and coloring
    implementation in __elv_add_request() are removed.

    * Requests which shouldn't be processed yet for draining were stalled
    by returning -EAGAIN from blk_do_ordered() according to the test
    result between blk_ordered_req_seq() and blk_blk_ordered_cur_seq().
    This logic is removed.

    * Draining completion logic in elv_completed_request() removed.

    * All barrier sequence requests were queued to request queue and then
    trckled to lower layer according to progress and thus maintaining
    request orders during requeue was necessary. This is replaced by
    queueing the next request in the barrier sequence only after the
    current one is complete from blk_ordered_complete_seq(), which
    removes the need for multiple proxy requests in struct request_queue
    and the request sorting logic in the ELEVATOR_INSERT_REQUEUE path of
    elv_insert().

    * As barriers no longer have ordering constraints, there's no need to
    dump the whole elevator onto the dispatch queue on each barrier.
    Insert barriers at the front instead.

    * If other barrier requests come to the front of the dispatch queue
    while one is already in progress, they are stored in
    q->pending_barriers and restored to dispatch queue one-by-one after
    each barrier completion from blk_ordered_complete_seq().

    Signed-off-by: Tejun Heo
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Make the following cleanups in preparation of barrier/flush update.

    * blk_do_ordered() declaration is moved from include/linux/blkdev.h to
    block/blk.h.

    * blk_do_ordered() now returns pointer to struct request, with %NULL
    meaning "try the next request" and ERR_PTR(-EAGAIN) "try again
    later". The third case will be dropped with further changes.

    * In the initialization of proxy barrier request, data direction is
    already set by init_request_from_bio(). Drop unnecessary explicit
    REQ_WRITE setting and move init_request_from_bio() above REQ_FUA
    flag setting.

    * add_request() is collapsed into __make_request().

    These changes don't make any functional difference.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • While testing CPU DLPAR, the following problem was discovered.
    We were DLPAR removing the first CPU, which in this case was
    logical CPUs 0-3. CPUs 0-2 were already marked offline and
    we were in the process of offlining CPU 3. After marking
    the CPU inactive and offline in cpu_disable, but before the
    cpu was completely idle (cpu_die), we ended up in __make_request
    on CPU 3. There we looked at the topology map to see which CPU
    to complete the I/O on and found no CPUs in the cpu_sibling_map.
    This resulted in the block layer setting the completion cpu
    to be NR_CPUS, which then caused an oops when we tried to
    complete the I/O.

    Fix this by sanity checking the value we return from blk_cpu_to_group
    to be a valid cpu value.

    Signed-off-by: Brian King
    Signed-off-by: Jens Axboe

    Brian King
     

08 Aug, 2010

1 commit


11 Sep, 2009

1 commit

  • Failfast has characteristics from other attributes. When issuing,
    executing and successuflly completing requests, failfast doesn't make
    any difference. It only affects how a request is handled on failure.
    Allowing requests with different failfast settings to be merged cause
    normal IOs to fail prematurely while not allowing has performance
    penalties as failfast is used for read aheads which are likely to be
    located near in-flight or to-be-issued normal IOs.

    This patch introduces the concept of 'mixed merge'. A request is a
    mixed merge if it is merge of segments which require different
    handling on failure. Currently the only mixable attributes are
    failfast ones (or lack thereof).

    When a bio with different failfast settings is added to an existing
    request or requests of different failfast settings are merged, the
    merged request is marked mixed. Each bio carries failfast settings
    and the request always tracks failfast state of the first bio. When
    the request fails, blk_rq_err_bytes() can be used to determine how
    many bytes can be safely failed without crossing into an area which
    requires further retrials.

    This allows request merging regardless of failfast settings while
    keeping the failure handling correct.

    This patch only implements mixed merge but doesn't enable it. The
    next one will update SCSI to make use of mixed merge.

    Signed-off-by: Tejun Heo
    Cc: Niel Lambrechts
    Signed-off-by: Jens Axboe

    Tejun Heo
     

27 May, 2009

1 commit

  • The commit below in 2.6-block/for-2.6.31 causes no diskstat problem
    because the blk_discard_rq() check was added with '&&'.
    It should be 'blk_fs_request() || blk_discard_rq()'.
    This patch does it and fixes the no diskstat problem.
    Please review and apply.

    ------ /proc/diskstat without this patch -------------------------------------
    8 0 sda 0 0 0 0 0 0 0 0 0 0 0
    ------------------------------------------------------------------------------

    ----- /proc/diskstat with this patch applied ---------------------------------
    8 0 sda 4186 303 373621 61600 9578 3859 107468 169479 2 89755 231059
    ------------------------------------------------------------------------------

    --------------------------------------------------------------------------
    commit c69d48540c201394d08cb4d48b905e001313d9b8
    Author: Jens Axboe
    Date: Fri Apr 24 08:12:19 2009 +0200

    block: include discard requests in IO accounting

    We currently don't do merging on discard requests, but we potentially
    could. If we do, then we need to include discard requests in the IO
    accounting, or merging would end up decrementing in_flight IO counters
    for an IO which never incremented them.

    So enable accounting for discard requests.

    static inline int blk_do_io_stat(struct request *rq)
    {
    - return rq->rq_disk && blk_rq_io_stat(rq) && blk_fs_request(rq);
    + return rq->rq_disk && blk_rq_io_stat(rq) && blk_fs_request(rq) &&
    + blk_discard_rq(rq);
    }
    --------------------------------------------------------------------------

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Jens Axboe

    Kiyoshi Ueda
     

19 May, 2009

1 commit


11 May, 2009

2 commits

  • Till now block layer allowed two separate modes of request execution.
    A request is always acquired from the request queue via
    elv_next_request(). After that, drivers are free to either dequeue it
    or process it without dequeueing. Dequeue allows elv_next_request()
    to return the next request so that multiple requests can be in flight.

    Executing requests without dequeueing has its merits mostly in
    allowing drivers for simpler devices which can't do sg to deal with
    segments only without considering request boundary. However, the
    benefit this brings is dubious and declining while the cost of the API
    ambiguity is increasing. Segment based drivers are usually for very
    old or limited devices and as converting to dequeueing model isn't
    difficult, it doesn't justify the API overhead it puts on block layer
    and its more modern users.

    Previous patches converted all block low level drivers to dequeueing
    model. This patch completes the API transition by...

    * renaming elv_next_request() to blk_peek_request()

    * renaming blkdev_dequeue_request() to blk_start_request()

    * adding blk_fetch_request() which is combination of peek and start

    * disallowing completion of queued (not started) requests

    * applying new API to all LLDs

    Renamings are for consistency and to break out of tree code so that
    it's apparent that out of tree drivers need updating.

    [ Impact: block request issue API cleanup, no functional change ]

    Signed-off-by: Tejun Heo
    Cc: Rusty Russell
    Cc: James Bottomley
    Cc: Mike Miller
    Cc: unsik Kim
    Cc: Paul Clements
    Cc: Tim Waugh
    Cc: Geert Uytterhoeven
    Cc: David S. Miller
    Cc: Laurent Vivier
    Cc: Jeff Garzik
    Cc: Jeremy Fitzhardinge
    Cc: Grant Likely
    Cc: Adrian McMenamin
    Cc: Stephen Rothwell
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Borislav Petkov
    Cc: Sergei Shtylyov
    Cc: Alex Dubov
    Cc: Pierre Ossman
    Cc: David Woodhouse
    Cc: Markus Lidel
    Cc: Stefan Weinhuber
    Cc: Martin Schwidefsky
    Cc: Pete Zaitcev
    Cc: FUJITA Tomonori
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • struct request has had a few different ways to represent some
    properties of a request. ->hard_* represent block layer's view of the
    request progress (completion cursor) and the ones without the prefix
    are supposed to represent the issue cursor and allowed to be updated
    as necessary by the low level drivers. The thing is that as block
    layer supports partial completion, the two cursors really aren't
    necessary and only cause confusion. In addition, manual management of
    request detail from low level drivers is cumbersome and error-prone at
    the very least.

    Another interesting duplicate fields are rq->[hard_]nr_sectors and
    rq->{hard_cur|current}_nr_sectors against rq->data_len and
    rq->bio->bi_size. This is more convoluted than the hard_ case.

    rq->[hard_]nr_sectors are initialized for requests with bio but
    blk_rq_bytes() uses it only for !pc requests. rq->data_len is
    initialized for all request but blk_rq_bytes() uses it only for pc
    requests. This causes good amount of confusion throughout block layer
    and its drivers and determining the request length has been a bit of
    black magic which may or may not work depending on circumstances and
    what the specific LLD is actually doing.

    rq->{hard_cur|current}_nr_sectors represent the number of sectors in
    the contiguous data area at the front. This is mainly used by drivers
    which transfers data by walking request segment-by-segment. This
    value always equals rq->bio->bi_size >> 9. However, data length for
    pc requests may not be multiple of 512 bytes and using this field
    becomes a bit confusing.

    In general, having multiple fields to represent the same property
    leads only to confusion and subtle bugs. With recent block low level
    driver cleanups, no driver is accessing or manipulating these
    duplicate fields directly. Drop all the duplicates. Now rq->sector
    means the current sector, rq->data_len the current total length and
    rq->bio->bi_size the current segment length. Everything else is
    defined in terms of these three and available only through accessors.

    * blk_recalc_rq_sectors() is collapsed into blk_update_request() and
    now handles pc and fs requests equally other than rq->sector update.
    This means that now pc requests can use partial completion too (no
    in-kernel user yet tho).

    * bio_cur_sectors() is replaced with bio_cur_bytes() as block layer
    now uses byte count as the primary data length.

    * blk_rq_pos() is now guranteed to be always correct. In-block users
    converted.

    * blk_rq_bytes() is now guaranteed to be always valid as is
    blk_rq_sectors(). In-block users converted.

    * blk_rq_sectors() is now guaranteed to equal blk_rq_bytes() >> 9.
    More convenient one is used.

    * blk_rq_bytes() and blk_rq_cur_bytes() are now inlined and take const
    pointer to request.

    [ Impact: API cleanup, single way to represent one property of a request ]

    Signed-off-by: Tejun Heo
    Cc: Boaz Harrosh
    Signed-off-by: Jens Axboe

    Tejun Heo
     

28 Apr, 2009

3 commits

  • We currently don't do merging on discard requests, but we potentially
    could. If we do, then we need to include discard requests in the IO
    accounting, or merging would end up decrementing in_flight IO counters
    for an IO which never incremented them.

    So enable accounting for discard requests.

    Problem found by Nikanth Karthikesan

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We currently check for file system requests outside of blk_do_io_stat(rq),
    but we may as well just include it.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Impact: code reorganization

    elv_next_request() and elv_dequeue_request() are public block layer
    interface than actual elevator implementation. They mostly deal with
    how requests interact with block layer and low level drivers at the
    beginning of rqeuest processing whereas __elv_next_request() is the
    actual eleveator request fetching interface.

    Move the two functions to blk-core.c. This prepares for further
    interface cleanup.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

24 Apr, 2009

1 commit

  • This simplifies I/O stat accounting switching code and separates it
    completely from I/O scheduler switch code.

    Requests are accounted according to the state of their request queue
    at the time of the request allocation. There is no need anymore to
    flush the request queue when switching I/O accounting state.

    Signed-off-by: Jerome Marchand
    Signed-off-by: Jens Axboe

    Jerome Marchand
     

15 Apr, 2009

1 commit


07 Apr, 2009

2 commits