23 Dec, 2015

2 commits

  • This was added for the 'magic' AEN requests in the NVMe driver that never
    return. We now handle them purely inside the driver and don't need this
    core hack any more.

    Signed-off-by: Christoph Hellwig
    Acked-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Timer context is not very useful for drivers to perform any meaningful abort
    action from. So instead of calling the driver from this useless context
    defer it to a workqueue as soon as possible.

    Note that while a delayed_work item would seem the right thing here I didn't
    dare to use it due to the magic in blk_add_timer that pokes deep into timer
    internals. But maybe this encourages Tejun to add a sensible API for that to
    the workqueue API and we'll all be fine in the end :)

    Contains a major update from Keith Busch:

    "This patch removes synchronizing the timeout work so that the timer can
    start a freeze on its own queue. The timer enters the queue, so timer
    context can only start a freeze, but not wait for frozen."

    Signed-off-by: Christoph Hellwig
    Acked-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

25 Nov, 2015

2 commits


08 Jan, 2015

1 commit


23 Sep, 2014

3 commits

  • Signed-off-by: Christoph Hellwig

    Moved blk_mq_rq_timed_out() definition to the private blk-mq.h header.

    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Commit 8cb34819cdd5d (blk-mq: unshared timeout handler) introduces
    blk-mq's own timeout handler, and removes the following line:

    blk_queue_rq_timed_out(q, blk_mq_rq_timed_out);

    which then causes blk_add_timer() to bypass adding the timer,
    since blk-mq no longer has q->rq_timed_out_fn defined.

    This patch fixes the problem by bypassing the check for blk-mq,
    so that both request deadlines are still set and the rolling
    timer updated.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Duplicate the (small) timeout handler in blk-mq so that we can pass
    arguments more easily to the driver timeout handler. This enables
    the next patch.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

31 May, 2014

1 commit


14 May, 2014

1 commit

  • This adds support for active queue tracking, meaning that the
    blk-mq tagging maintains a count of active users of a tag set.
    This allows us to maintain a notion of fairness between users,
    so that we can distribute the tag depth evenly without starving
    some users while allowing others to try unfair deep queues.

    If sharing of a tag set is detected, each hardware queue will
    track the depth of its own queue. And if this exceeds the total
    depth divided by the number of active queues, the user is actively
    throttled down.

    The active queue count is done lazily to avoid bouncing that data
    between submitter and completer. Each hardware queue gets marked
    active when it allocates its first tag, and gets marked inactive
    when 1) the last tag is cleared, and 2) the queue timeout grace
    period has passed.

    Signed-off-by: Jens Axboe

    Jens Axboe
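
The fairness rule described in the commit above (throttle a hardware queue once it exceeds the total depth divided by the number of active queues) can be sketched in user space as follows. This is a minimal model, not the kernel implementation: `tag_set_model`, `fair_share()` and `may_queue()` are hypothetical names, and rounding the share up is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical user-space model of a shared tag set. The struct and
 * function names are illustrative and not taken from the kernel.
 */
struct tag_set_model {
    unsigned int total_depth;   /* tags available in the shared set */
    unsigned int active_queues; /* hardware queues currently marked active */
};

/* Fair share of the tag depth per active queue (rounded up: an assumption). */
static unsigned int fair_share(const struct tag_set_model *set)
{
    assert(set->total_depth > 0);
    if (set->active_queues == 0)
        return set->total_depth; /* nobody is sharing, so no limit applies */
    return (set->total_depth + set->active_queues - 1) / set->active_queues;
}

/* A queue may take another tag only while it is below its fair share. */
static bool may_queue(const struct tag_set_model *set,
                      unsigned int queue_in_flight)
{
    return queue_in_flight < fair_share(set);
}
```

With a depth of 64 shared by 4 active queues, each queue is throttled once it holds 16 tags; when the active count drops, the remaining queues get a larger share again.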
     

25 Apr, 2014

1 commit


24 Apr, 2014

1 commit

  • If a requeue event races with a timeout, we can get into the
    situation where we attempt to complete a request from the
    timeout handler when it's not started anymore. This causes a crash.
    So have the timeout handler check that REQ_ATOM_STARTED is still
    set on the request - if not, we ignore the event. If this happens,
    the request has now been marked as complete. As a consequence, we
    need to make sure to clear REQ_ATOM_COMPLETE in blk_mq_start_request(),
    so as to maintain proper request state.

    Signed-off-by: Jens Axboe

    Jens Axboe
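
The state handling from the commit above can be sketched as a single-threaded model. The flag names come from the commit message; the bit values, the `request_model` struct and the helper functions are assumptions made for illustration, and real code would need atomic bit operations.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag names follow the commit message; the bit numbers are assumed. */
enum { REQ_ATOM_COMPLETE = 0, REQ_ATOM_STARTED = 1 };

/* Hypothetical stand-in for the relevant part of a request. */
struct request_model {
    unsigned long atomic_flags;
};

static bool test_flag(const struct request_model *rq, int bit)
{
    return (rq->atomic_flags >> bit) & 1UL;
}

/* Starting a request sets STARTED and clears a stale COMPLETE bit. */
static void start_request(struct request_model *rq)
{
    rq->atomic_flags |= 1UL << REQ_ATOM_STARTED;
    rq->atomic_flags &= ~(1UL << REQ_ATOM_COMPLETE);
}

/* Requeueing means the request is no longer started. */
static void requeue_request(struct request_model *rq)
{
    rq->atomic_flags &= ~(1UL << REQ_ATOM_STARTED);
}

/*
 * Timeout path: the request is marked complete first, then the event is
 * ignored if the request is not started anymore. Returns true if the
 * timeout should be handled, false if it was ignored.
 */
static bool timeout_request(struct request_model *rq)
{
    rq->atomic_flags |= 1UL << REQ_ATOM_COMPLETE;
    return test_flag(rq, REQ_ATOM_STARTED);
}
```

If a requeue wins the race, `timeout_request()` returns false but leaves the COMPLETE bit set, which is why `start_request()` has to clear it before the request is issued again.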
     

17 Apr, 2014

1 commit

  • Since we are now, by default, applying timer slack to expiry times,
    the logic for when to modify a timer in the block code is suboptimal.
    The block layer keeps a forward rolling timer per queue for all
    requests, and modifies this timer if a request has a shorter timeout
    than what the current expiry time is. However, this breaks down
    when our rounded timer values get applied slack. Then each new
    request ends up modifying the timer, since we're still a little
    in front of the timer + slack.

    Fix this by allowing a tolerance of HZ / 2, since the timeout handling
    doesn't need to be very precise. This drastically cuts down
    the number of timer modifications we have to make.

    Signed-off-by: Jens Axboe

    Jens Axboe
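
The tolerance check from the commit above can be sketched as a small predicate. This is a user-space illustration under stated assumptions: `HZ_MODEL`, `time_before_model()` and `should_modify_timer()` are hypothetical names, and the tick rate of 1000 is chosen only for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define HZ_MODEL 1000UL /* assumed tick rate, for illustration only */

/* Wrap-safe "a is before b" comparison on tick counters. */
static bool time_before_model(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/*
 * Decide whether the per-queue rolling timer needs to be modified for a
 * request whose deadline is new_expiry. The timer is only touched when it
 * is not pending, or when the new deadline is earlier than the current
 * expiry by at least HZ / 2.
 */
static bool should_modify_timer(bool timer_pending,
                                unsigned long timer_expires,
                                unsigned long new_expiry)
{
    if (!timer_pending)
        return true;
    if (!time_before_model(new_expiry, timer_expires))
        return false;
    return timer_expires - new_expiry >= HZ_MODEL / 2;
}
```

A request that would expire only slightly before the current timer no longer causes a modification; only a deadline at least half a second earlier does.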
     

11 Feb, 2014

1 commit

  • Rework I/O completions to work more like the old code path. blk_mq_end_io
    now stays out of the business of deferring completions to other CPUs
    and calling blk_mark_rq_complete. The latter is very important to allow
    completing requests that have timed out and thus are already marked completed,
    the former allows using the IPI callout even for driver specific completions
    instead of having to reimplement them.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

09 Nov, 2013

2 commits


08 Nov, 2013

1 commit

  • crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

    Pid: 491, comm: scsi_eh_0 Tainted: G W ---------------- 2.6.32-220.13.1.el6.x86_64 #1 IBM -[8722PAX]-/00D1461
    RIP: 0010:[] [] blk_requeue_request+0x94/0xa0
    RSP: 0018:ffff881057eefd60 EFLAGS: 00010012
    RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8
    RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780
    RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338
    R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000
    FS: 0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540)
    Stack:
    0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000
    ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393
    ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90
    Call Trace:
    [] __scsi_queue_insert+0xa3/0x150
    [] ? scsi_eh_ready_devs+0x5e3/0x850
    [] scsi_queue_insert+0x13/0x20
    [] scsi_eh_flush_done_q+0x104/0x160
    [] scsi_error_handler+0x35b/0x660
    [] ? scsi_error_handler+0x0/0x660
    [] kthread+0x96/0xa0
    [] child_rip+0xa/0x20
    [] ? kthread+0x0/0xa0
    [] ? child_rip+0x0/0x20
    Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
    RIP [] blk_requeue_request+0x94/0xa0
    RSP

    The RIP is this line:
    BUG_ON(blk_queued_rq(rq));

    After digging through the code, I think there may be a race between the
    request completion and the timer handler running.

    A timer is started for each request put on the device's queue (see
    blk_start_request->blk_add_timer). If the request does not complete
    before the timer expires, the timer handler (blk_rq_timed_out_timer)
    will mark the request complete atomically:

    static inline int blk_mark_rq_complete(struct request *rq)
    {
    return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
    }

    and then call blk_rq_timed_out. The latter function will call
    scsi_times_out, which will return one of BLK_EH_HANDLED,
    BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED. If BLK_EH_RESET_TIMER is
    returned, blk_clear_rq_complete is called, and blk_add_timer is again
    called to simply wait longer for the request to complete.

    Now, if the request happens to complete while this is going on, what
    happens? Given that we know the completion handler will bail if it
    finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion
    handler running after that bit is cleared. So, from the above
    paragraph, after the call to blk_clear_rq_complete. If the completion
    sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom
    there (I haven't seen this in the cores). Next, if we get the
    completion before the call to list_add_tail, then the timer will
    eventually fire for an old req, which may either be freed or reallocated
    (there is evidence that this might be the case). Finally, if the
    completion comes in *after* the addition to the timeout list, I think
    it's harmless. The request will be removed from the timeout list,
    REQ_ATOM_COMPLETE will be set, and all will be well.

    This will only actually explain the coredumps *IF* the request
    structure was freed, reallocated *and* queued before the error handler
    thread had a chance to process it. That is possible, but it may make
    sense to keep digging for another race. I think that if this is what
    was happening, we would see other instances of this problem showing up
    as null pointer or garbage pointer dereferences, for example when the
    request structure was not re-used. It looks like we actually do run
    into that situation in other reports.

    This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE,
    &req->atomic_flags)); from blk_add_timer to the only caller that could
    trip over it (blk_start_request). It then inverts the calls to
    blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address
    the race. I've boot tested this patch, but nothing more.

    Signed-off-by: Jeff Moyer
    Acked-by: Hannes Reinecke
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

25 Oct, 2013

1 commit

  • Linux currently has two models for block devices:

    - The classic request_fn based approach, where drivers use struct
    request units for IO. The block layer provides various helper
    functionalities to let drivers share code, things like tag
    management, timeout handling, queueing, etc.

    - The "stacked" approach, where a driver squeezes in between the
    block layer and IO submitter. Since this bypasses the IO stack,
    drivers generally have to manage everything themselves.

    With drivers being written for new high IOPS devices, the classic
    request_fn based driver doesn't work well enough. The design dates
    back to when both SMP and high IOPS were rare. It has problems with
    scaling to bigger machines, and runs into scaling issues even on
    smaller machines when you have IOPS in the hundreds of thousands
    per device.

    The stacked approach is then most often selected as the model
    for the driver. But this means that everybody has to re-invent
    everything, and along with that we get all the problems again
    that the shared approach solved.

    This commit introduces blk-mq, block multi queue support. The
    design is centered around per-cpu queues for queueing IO, which
    then funnel down into x number of hardware submission queues.
    We might have a 1:1 mapping between the two, or it might be
    an N:M mapping. That all depends on what the hardware supports.

    blk-mq provides various helper functions, which include:

    - Scalable support for request tagging. Most devices need to
    be able to uniquely identify a request both in the driver and
    to the hardware. The tagging uses per-cpu caches for freed
    tags, to enable cache hot reuse.

    - Timeout handling without tracking request on a per-device
    basis. Basically the driver should be able to get a notification,
    if a request happens to fail.

    - Optional support for non 1:1 mappings between issue and
    submission queues. blk-mq can redirect IO completions to the
    desired location.

    - Support for per-request payloads. Drivers almost always need
    to associate a request structure with some driver private
    command structure. Drivers can tell blk-mq this at init time,
    and then any request handed to the driver will have the
    required size of memory associated with it.

    - Support for merging of IO, and plugging. The stacked model
    gets neither of these. Even for high IOPS devices, merging
    sequential IO reduces per-command overhead and thus
    increases bandwidth.

    For now, this is provided as a potential 3rd queueing model, with
    the hope being that, as it matures, it can replace both the classic
    and stacked model. That would get us back to having just 1 real
    model for block devices, leaving the stacked approach to dm/md
    devices (as it was originally intended).

    Contributions in this patch from the following people:

    Shaohua Li
    Alexander Gordeev
    Christoph Hellwig
    Mike Christie
    Matias Bjorling
    Jeff Moyer

    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
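
The mapping between per-cpu queues and hardware submission queues described in the commit above can be sketched as follows. This is only one possible policy, chosen for illustration: `map_cpu_to_hw_queue()` is a hypothetical helper, and the commit message does not say how blk-mq actually distributes CPUs.

```c
#include <assert.h>

/*
 * Hypothetical helper: map the software queue of a CPU to one of
 * nr_hw_queues hardware submission queues. With at least as many hardware
 * queues as CPUs the mapping is 1:1; otherwise contiguous groups of CPUs
 * share a hardware queue (an N:M mapping).
 */
static unsigned int map_cpu_to_hw_queue(unsigned int cpu,
                                        unsigned int nr_cpus,
                                        unsigned int nr_hw_queues)
{
    assert(nr_cpus > 0 && nr_hw_queues > 0 && cpu < nr_cpus);
    if (nr_hw_queues >= nr_cpus)
        return cpu;
    return cpu * nr_hw_queues / nr_cpus;
}
```

On an 8-CPU machine with 2 hardware queues, CPUs 0 to 3 funnel into queue 0 and CPUs 4 to 7 into queue 1; with 8 hardware queues every CPU gets its own.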
     

01 Jul, 2013

1 commit


15 Jun, 2012

1 commit

  • This function was only used by btrfs code in btrfs_abort_devices()
    (seems in a wrong way).

    It was removed in commit d07eb9117050c9ed3f78296ebcc06128b52693be,
    so let's remove the dead code to avoid any confusion.

    Changes in v2: update commit log, btrfs_abort_devices() was removed
    already.

    Cc: Jens Axboe
    Cc: linux-kernel@vger.kernel.org
    Cc: Chris Mason
    Cc: linux-btrfs@vger.kernel.org
    Cc: David Sterba
    Signed-off-by: Asias He
    Signed-off-by: Jens Axboe

    Asias He
     

04 Aug, 2011

1 commit

  • init_fault_attr_dentries() is used to export fault_attr via debugfs.
    But it can only export it in debugfs root directory.

    Per Forlin is working on mmc_fail_request which adds support to inject
    data errors after a completed host transfer in MMC subsystem.

    The fault_attr for mmc_fail_request should be defined per mmc host and
    exported in a per-host debugfs directory like
    /sys/kernel/debug/mmc0/mmc_fail_request.

    init_fault_attr_dentries() doesn't help for mmc_fail_request. So this
    introduces fault_create_debugfs_attr(), which is able to create a
    directory in an arbitrary directory and replaces
    init_fault_attr_dentries().

    [akpm@linux-foundation.org: extraneous semicolon, per Randy]
    Signed-off-by: Akinobu Mita
    Tested-by: Per Forlin
    Cc: Jens Axboe
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Randy Dunlap
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

21 Apr, 2010

1 commit

  • blk_rq_timed_out_timer() relied on blk_add_timer() never returning a
    timer value of zero, but commit 7838c15b8dd18e78a523513749e5b54bda07b0cb
    removed the code that bumped this value when it was zero.
    Therefore when jiffies is near wrap we could get unlucky & not set the
    timeout value correctly.

    This patch uses a flag to indicate that the timeout value was set and so
    handles jiffies wrap correctly, and it keeps all the logic in one
    function so should be easier to maintain in the future.

    Signed-off-by: Richard Kennedy
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Richard Kennedy
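
The idea in the commit above, tracking "a timeout value was set" with a flag instead of treating zero as "unset", can be sketched in user space. The names `ticks_before()` and `earliest_deadline()` are hypothetical, and the array of deadlines stands in for the queue's timeout list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Wrap-safe "a is before b" comparison on tick counters. */
static bool ticks_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/*
 * Find the earliest of n pending deadlines. Returns true and stores the
 * result in *next if at least one deadline exists. A separate flag records
 * whether *next has been set, because 0 is a perfectly valid tick value
 * when the counter is near wrap and cannot double as "unset".
 */
static bool earliest_deadline(const unsigned long *deadlines, size_t n,
                              unsigned long *next)
{
    bool next_set = false;

    for (size_t i = 0; i < n; i++) {
        if (!next_set || ticks_before(deadlines[i], *next)) {
            *next = deadlines[i];
            next_set = true;
        }
    }
    return next_set;
}
```

A caller would rearm the timer only when the function returns true, so a deadline of exactly 0 still gets a timer.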
     

28 Apr, 2009

1 commit


24 Apr, 2009

1 commit

  • Very rarely under stress testing of dm, oopses are occurring as
    something tampers with an old stack frame. This has been traced back
    to blk_abort_queue() leaving a timeout_list pointing to the stack.
    The reason is that sometimes blk_abort_request() won't delete the
    timer (if the request is marked as complete but before the timer has
    been removed, a small race window). Fix this by splicing back from
    the usually empty list to the q->timeout_list.

    Signed-off-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     

22 Apr, 2009

1 commit


18 Feb, 2009

1 commit


29 Dec, 2008

3 commits


06 Nov, 2008

1 commit

  • This patch (as1159b) changes the timeout routines in the block core to
    use round_jiffies_up(). There's no point in rounding the timer
    deadline down, since if it expires too early we will have to restart
    it.

    The patch also removes some unnecessary tests when a request is
    removed from the queue's timer list.

    Signed-off-by: Alan Stern
    Signed-off-by: Jens Axboe

    Alan Stern
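
Why rounding up is the right direction for a timeout deadline can be seen in a simplified sketch. `round_ticks_up()` below is a hypothetical stand-in that rounds to the next whole second at an assumed tick rate; the kernel's round_jiffies_up() may apply further adjustments that this model ignores.

```c
#include <assert.h>

#define HZ_MODEL 1000UL /* assumed tick rate, for illustration only */

/*
 * Round a tick count up to the next whole second. The result is never
 * earlier than the input, so a timer armed with it cannot fire before the
 * request deadline and then need restarting.
 */
static unsigned long round_ticks_up(unsigned long ticks)
{
    unsigned long rem = ticks % HZ_MODEL;

    if (rem == 0)
        return ticks;
    return ticks + (HZ_MODEL - rem);
}
```

Rounding down would instead produce an expiry before the real deadline, forcing the timer to be restarted once it fires early.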
     

09 Oct, 2008

4 commits