21 Jul, 2016

1 commit


06 May, 2015

1 commit


23 Sep, 2014

1 commit

  • We should not insert requests into the flush state machine from
    blk_mq_insert_request. All incoming flush requests come through
    blk_{m,s}q_make_request and are handled there, while blk_execute_rq_nowait
    should only be called for BLOCK_PC requests. All other callers
    deal with requests that already went through the flush state machine
    and shouldn't be reinserted into it.

    Reported-by: Robert Elliott
    Debugged-by: Ming Lei
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

09 Jun, 2014

1 commit


22 Feb, 2014

1 commit


08 Feb, 2014

1 commit


01 Jan, 2014

1 commit


25 Oct, 2013

2 commits

  • Linux currently has two models for block devices:

    - The classic request_fn based approach, where drivers use struct
    request units for IO. The block layer provides various helper
    functionalities to let drivers share code, things like tag
    management, timeout handling, queueing, etc.

    - The "stacked" approach, where a driver squeezes in between the
    block layer and IO submitter. Since this bypasses the IO stack,
    drivers generally have to manage everything themselves.

    With drivers being written for new high IOPS devices, the classic
    request_fn based driver doesn't work well enough. The design dates
    back to when both SMP and high IOPS were rare. It has problems with
    scaling to bigger machines, and runs into scaling issues even on
    smaller machines when you have IOPS in the hundreds of thousands
    per device.

    The stacked approach is then most often selected as the model
    for the driver. But this means that everybody has to re-invent
    everything, and along with that we get all the problems again
    that the shared approach solved.

    This commit introduces blk-mq, block multi queue support. The
    design is centered around per-cpu queues for queueing IO, which
    then funnel down into a number of hardware submission queues.
    We might have a 1:1 mapping between the two, or it might be
    an N:M mapping. That all depends on what the hardware supports.

    blk-mq provides various helper functions, which include:

    - Scalable support for request tagging. Most devices need to
    be able to uniquely identify a request both in the driver and
    to the hardware. The tagging uses per-cpu caches for freed
    tags, to enable cache hot reuse.

    - Timeout handling without tracking requests on a per-device
    basis. Basically, the driver should be able to get a notification
    if a request happens to fail.

    - Optional support for non 1:1 mappings between issue and
    submission queues. blk-mq can redirect IO completions to the
    desired location.

    - Support for per-request payloads. Drivers almost always need
    to associate a request structure with some driver private
    command structure. Drivers can tell blk-mq this at init time,
    and then any request handed to the driver will have the
    required size of memory associated with it.

    - Support for merging of IO, and plugging. The stacked model
    gets neither of these. Even for high IOPS devices, merging
    sequential IO reduces per-command overhead and thus
    increases bandwidth.

    For now, this is provided as a potential 3rd queueing model, with
    the hope being that, as it matures, it can replace both the classic
    and stacked model. That would get us back to having just 1 real
    model for block devices, leaving the stacked approach to dm/md
    devices (as it was originally intended).
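
    As a rough sketch of the driver-facing setup this describes (an
    illustration only: the mydrv_* names are invented, and this uses the
    tag-set form of the interface that blk-mq later stabilized on, not
    the registration structure in the original patch), a driver might
    set up its queue like this:

    #include <linux/blk-mq.h>
    #include <linux/blkdev.h>

    /* Hypothetical per-request payload; with .cmd_size set below, blk-mq
     * allocates one behind every request handed to the driver, reachable
     * via blk_mq_rq_to_pdu(rq) in the stabilized API. */
    struct mydrv_cmd {
            int status;
    };

    static struct blk_mq_ops mydrv_mq_ops;      /* .queue_rq etc. elided */
    static struct blk_mq_tag_set mydrv_tag_set;

    static struct request_queue *mydrv_create_queue(void)
    {
            struct request_queue *q;

            mydrv_tag_set.ops = &mydrv_mq_ops;
            mydrv_tag_set.nr_hw_queues = 4;   /* hardware submission queues */
            mydrv_tag_set.queue_depth = 64;   /* tag space managed by blk-mq */
            mydrv_tag_set.cmd_size = sizeof(struct mydrv_cmd);
            mydrv_tag_set.numa_node = NUMA_NO_NODE;

            if (blk_mq_alloc_tag_set(&mydrv_tag_set))
                    return NULL;

            q = blk_mq_init_queue(&mydrv_tag_set);
            if (IS_ERR(q)) {
                    blk_mq_free_tag_set(&mydrv_tag_set);
                    return NULL;
            }
            return q;
    }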

    Contributions in this patch from the following people:

    Shaohua Li
    Alexander Gordeev
    Christoph Hellwig
    Mike Christie
    Matias Bjorling
    Jeff Moyer

    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This reference count has been around since before git history, but the only
    place where it's used is in blk_execute_rq, and there it is entirely useless
    as it is incremented before submitting the request and decremented in the
    end_io handler before waking up the submitter thread.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

18 Sep, 2013

1 commit


01 Mar, 2013

1 commit

  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but turns out it is a bug in the new pstate code (divide
    by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the should sort (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
     

15 Feb, 2013

1 commit

  • Using wait_for_completion() while waiting for an IO request to be
    executed results in wrong iowait time accounting. For example, a system
    whose only task is doing write() and fdatasync() on a block device can
    be reported as idle instead of iowaiting, as it should be, because
    blkdev_issue_flush() calls wait_for_completion(), which in turn calls
    schedule(), which does not increment the iowait proc counter and thus
    does not turn on iowait time accounting.

    The patch makes block layer use wait_for_completion_io() instead of
    wait_for_completion() where appropriate to account iowait time
    correctly.
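
    A minimal sketch of the substitution (the wrapper function here is
    invented for illustration):

    #include <linux/completion.h>

    /* Waiting for an in-flight block request: the _io variant sleeps in
     * io_schedule(), so the sleep time is charged to iowait. */
    static void wait_for_rq(struct completion *wait)
    {
            /* was: wait_for_completion(wait); -- plain schedule() */
            wait_for_completion_io(wait);
    }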

    Signed-off-by: Vladimir Davydov
    Signed-off-by: Jens Axboe

    Vladimir Davydov
     

08 Feb, 2013

1 commit

  • Move the sysctl-related bits from include/linux/sched.h into
    a new file: include/linux/sched/sysctl.h. Then update source
    files requiring access to those bits by including the new
    header file.
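
    For example, a source file that previously picked up these knobs
    indirectly now includes the new header explicitly:

    #include <linux/sched/sysctl.h>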

    Signed-off-by: Clark Williams
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan
    Signed-off-by: Ingo Molnar

    Clark Williams
     

18 Dec, 2012

1 commit

  • Pull block layer core updates from Jens Axboe:
    "Here are the core block IO bits for 3.8. The branch contains:

    - The final version of the surprise device removal fixups from Bart.

    - Don't hide EFI partitions under advanced partition types. It's
    fairly widespread these days. This is especially dangerous for
    systems that have both msdos and efi partition tables, where you
    want to keep them in sync.

    - Cleanup of using -1 instead of the proper NUMA_NO_NODE

    - Export control of bdi flusher thread CPU mask and default to using
    the home node (if known) from Jeff.

    - Export unplug tracepoint for MD.

    - Core improvements from Shaohua. Reinstate the recursive merge, as
    the original bug has been fixed. Add plugging for discard and also
    fix a problem handling non pow-of-2 discard limits.

    There's a trivial merge in block/blk-exec.c due to a fix that went
    into 3.7-rc at a later point than -rc4 where this is based."

    * 'for-3.8/core' of git://git.kernel.dk/linux-block:
    block: export block_unplug tracepoint
    block: add plug for blkdev_issue_discard
    block: discard granularity might not be power of 2
    deadline: Allow 0ms deadline latency, increase the read speed
    partitions: enable EFI/GPT support by default
    bsg: Remove unused function bsg_goose_queue()
    block: Make blk_cleanup_queue() wait until request_fn finished
    block: Avoid scheduling delayed work on a dead queue
    block: Avoid that request_fn is invoked on a dead queue
    block: Let blk_drain_queue() caller obtain the queue lock
    block: Rename queue dead flag
    bdi: add a user-tunable cpu_list for the bdi flusher threads
    block: use NUMA_NO_NODE instead of -1
    block: recursive merge requests
    block CFQ: avoid moving request to different queue

    Linus Torvalds
     

06 Dec, 2012

2 commits

  • A block driver may start cleaning up resources needed by its
    request_fn as soon as blk_cleanup_queue() has finished, so request_fn
    must not be invoked after draining has finished. This is important
    when blk_run_queue() is invoked without any requests in progress.
    As an example, if blk_drain_queue() and scsi_run_queue() run in
    parallel, blk_drain_queue() may have finished all requests after
    scsi_run_queue() has taken a SCSI device off the starved list but
    before that last function has had a chance to run the queue.
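
    The guard this implies looks roughly like the following (a simplified
    sketch, not the literal patch):

    #include <linux/blkdev.h>

    /* Refuse to call into the driver once the queue has been cleaned up. */
    static void run_queue_sketch(struct request_queue *q)
    {
            if (unlikely(blk_queue_dead(q)))
                    return;         /* request_fn resources may be gone */
            q->request_fn(q);
    }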

    Signed-off-by: Bart Van Assche
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Chanho Min
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • QUEUE_FLAG_DEAD is used to indicate that queuing new requests must
    stop. After this flag has been set queue draining starts. However,
    during the queue draining phase it is still safe to invoke the
    queue's request_fn, so QUEUE_FLAG_DYING is a better name for this
    flag.

    This patch has been generated by running the following command
    over the kernel source tree:

    git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' |
    xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g' \
    -e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g'; \
    sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" \
    include/linux/blkdev.h; \
    sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' \
    -e 's/Dead queue/A dying queue/' block/blk-core.c

    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Jens Axboe
    Cc: Chanho Min
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

23 Nov, 2012

1 commit

  • After we've done __elv_add_request() and __blk_run_queue() in
    blk_execute_rq_nowait(), the request might finish and be freed
    immediately. Therefore checking if the type is REQ_TYPE_PM_RESUME
    isn't safe afterwards, because if it isn't, rq might be gone.
    Instead, check beforehand and stash the result in a temporary.

    This fixes crashes in blk_execute_rq_nowait() I get occasionally when
    running with lots of memory debugging options enabled -- I think this
    race is usually harmless because the window for rq to be reallocated
    is so small.
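
    A simplified sketch of the fixed pattern (blk_execute_rq_nowait() is
    paraphrased here, not quoted):

    #include <linux/blkdev.h>
    #include <linux/elevator.h>

    static void execute_nowait_sketch(struct request_queue *q,
                                      struct request *rq)
    {
            /* rq may complete and be freed the moment the queue runs,
             * so read everything needed from it before queueing it */
            bool is_pm_resume = rq->cmd_type == REQ_TYPE_PM_RESUME;

            spin_lock_irq(q->queue_lock);
            __elv_add_request(q, rq, ELEVATOR_INSERT_BACK);
            __blk_run_queue(q);
            if (is_pm_resume)               /* stashed flag, never rq */
                    q->request_fn(q);       /* stopped queue: run by hand */
            spin_unlock_irq(q->queue_lock);
    }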

    Signed-off-by: Roland Dreier
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Roland Dreier
     

20 Jul, 2012

1 commit

  • If the queue is dead, blk_execute_rq_nowait() doesn't invoke the done()
    callback function. That will result in blk_execute_rq() being stuck
    in wait_for_completion(). Avoid this by initializing rq->end_io to the
    done() callback before we check the queue state. Also, make sure the
    queue lock is held around the invocation of the done() callback. Found
    this through source code review.
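
    Roughly, the reordering has this shape (a sketch with details
    simplified):

    #include <linux/blkdev.h>
    #include <linux/elevator.h>

    static void execute_nowait_fixed(struct request_queue *q,
                                     struct request *rq, rq_end_io_fn *done)
    {
            rq->end_io = done;      /* assign before checking queue state */

            spin_lock_irq(q->queue_lock);
            if (unlikely(blk_queue_dead(q))) {
                    rq->errors = -ENXIO;
                    if (rq->end_io)
                            /* done() is invoked with queue_lock held */
                            rq->end_io(rq, rq->errors);
                    spin_unlock_irq(q->queue_lock);
                    return;
            }
            __elv_add_request(q, rq, ELEVATOR_INSERT_LAST);
            __blk_run_queue(q);
            spin_unlock_irq(q->queue_lock);
    }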

    Signed-off-by: Muthukumar Ratty
    Signed-off-by: Bart Van Assche
    Reviewed-by: Tejun Heo
    Acked-by: Jens Axboe
    Signed-off-by: James Bottomley

    Muthukumar Ratty
     

14 Dec, 2011

2 commits

  • blk_insert_cloned_request(), blk_execute_rq_nowait() and
    blk_flush_plug_list() either didn't check whether the queue was dead
    or did it without holding queue_lock. Update them so that dead state
    is checked while holding queue_lock.

    AFAICS, this plugs all holes (requeue doesn't matter as the request is
    transitioning atomically from in_flight to queued).

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • There are a number of QUEUE_FLAG_DEAD tests. Add blk_queue_dead()
    macro and use it.

    This patch doesn't introduce any functional difference.
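
    The helper is presumably a thin wrapper over the open-coded flag test,
    along these lines:

    #define blk_queue_dead(q)   test_bit(QUEUE_FLAG_DEAD, &(q)->queue_flags)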

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

22 Jul, 2011

1 commit

  • USB surprise removal of sr is triggering an oops in
    scsi_dispatch_command(). What seems to be happening is that USB is
    hanging on to a queue reference until the last close of the upper
    device, so the crash is caused by surprise remove of a mounted CD
    followed by attempted unmount.

    The problem is that USB doesn't issue its final commands as part of
    the SCSI teardown path, but on last close when the block queue is long
    gone. The long term fix is probably to make sr do the teardown in the
    same way as sd (so remove all the lower bits on ejection, but keep the
    upper disk alive until last close of user space). However, the
    current oops can be simply fixed by not allowing any commands to be
    sent to a dead queue.

    Cc: stable@kernel.org
    Signed-off-by: James Bottomley

    James Bottomley
     

06 May, 2011

1 commit


18 Apr, 2011

1 commit

  • Instead of overloading __blk_run_queue to force an offload to kblockd
    add a new blk_run_queue_async helper to do it explicitly. I've kept
    the blk_queue_stopped check for now, but I suspect it's not needed
    as the check we do when the workqueue item runs should be enough.
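
    Roughly, the new helper (it lives in block/blk-core.c, so it can use
    the kblockd workqueue directly):

    void blk_run_queue_async(struct request_queue *q)
    {
            if (likely(!blk_queue_stopped(q)))
                    queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
    }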

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

10 Mar, 2011

2 commits


24 Sep, 2010

1 commit

  • During long I/O operations, the hang_check timer may fire and
    trigger stack dumps that unnecessarily alarm the user.

    Eg. hdparm --security-erase NULL /dev/sdb ## can take *hours* to complete

    So, if hang_check is armed, we should wake up periodically
    to prevent it from triggering. This patch uses a wake-up interval
    equal to half the hang_check timer period, which keeps overhead low enough.
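
    The waiting loop then has roughly this shape (a simplified sketch of
    the pattern in blk_execute_rq):

    #include <linux/completion.h>
    #include <linux/sched.h>        /* sysctl_hung_task_timeout_secs */

    static void wait_with_hang_check(struct completion *wait)
    {
            unsigned long hang_check = sysctl_hung_task_timeout_secs;

            if (hang_check)
                    /* wake at half the watchdog period until IO completes */
                    while (!wait_for_completion_timeout(wait,
                                            hang_check * (HZ / 2)))
                            ;
            else
                    wait_for_completion(wait);
    }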

    Signed-off-by: Mark Lord
    Signed-off-by: Jens Axboe

    Mark Lord
     

08 Aug, 2010

1 commit


28 Apr, 2009

1 commit

  • RQ_NOMERGE_FLAGS already defines which REQ flags aren't
    mergeable. There is no reason to specify it superfluously; it only
    adds to confusion. Don't set REQ_NOMERGE for barriers and requests
    with a specific queueing directive. REQ_NOMERGE is now exclusively used
    by the merging code.

    [ Impact: cleanup ]

    Signed-off-by: Tejun Heo

    Tejun Heo
     

09 Oct, 2008

1 commit


16 Jul, 2008

2 commits

  • All the users of blk_end_sync_rq have gone (they have been converted
    to use blk_execute_rq). This unexports blk_end_sync_rq.

    Signed-off-by: FUJITA Tomonori
    Cc: Borislav Petkov
    Signed-off-by: Jens Axboe
    Signed-off-by: Bartlomiej Zolnierkiewicz

    FUJITA Tomonori
     
  • For blk_pm_resume_request() requests (which are currently used only by
    the IDE subsystem) the queue is stopped, so we need to call ->request_fn
    explicitly.

    Thanks to:
    - Rafael for reporting/bisecting the bug
    - Borislav/Rafael for testing the fix

    This is a preparation for converting IDE to use blk_execute_rq().
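
    In the plugging-era API of the time, the fix has roughly this shape
    (a paraphrase with an invented function name, not the literal diff;
    the caller holds queue_lock):

    static void start_queued_rq_sketch(struct request_queue *q,
                                       struct request *rq)
    {
            __generic_unplug_device(q);
            /* a stopped queue is never plugged and unplugged, so a PM
             * resume request must kick ->request_fn by hand */
            if (blk_pm_resume_request(rq))
                    q->request_fn(q);
    }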

    Cc: FUJITA Tomonori
    Cc: Borislav Petkov
    Cc: Jens Axboe
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Bartlomiej Zolnierkiewicz

    Bartlomiej Zolnierkiewicz
     

01 Feb, 2008

1 commit


30 Jan, 2008

1 commit