26 Aug, 2017

1 commit


24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different life time rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

21 Jun, 2017

1 commit

  • Instead of documenting the locking assumptions of most block layer
    functions as a comment, use lockdep_assert_held() to verify locking
    assumptions at runtime.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

09 Jun, 2017

1 commit

  • Currently we use nornal Linux errno values in the block layer, and while
    we accept any error a few have overloaded magic meanings. This patch
    instead introduces a new blk_status_t value that holds block layer specific
    status codes and explicitly explains their meaning. Helpers to convert from
    and to the previous special meanings are provided for now, but I suspect
    we want to get rid of them in the long run - those drivers that have a
    errno input (e.g. networking) usually get errnos that don't know about
    the special block layer overloads, and similarly returning them to userspace
    will usually return somethings that strictly speaking isn't correct
    for file system operations, but that's left as an exercise for later.

    For now the set of errors is a very limited set that closely corresponds
    to the previous overloaded errno values, but there is some low hanging
    fruite to improve it.

    blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
    typechecking, so that we can easily catch places passing the wrong values.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

20 Apr, 2017

1 commit


25 Mar, 2017

1 commit


18 Feb, 2017

1 commit

  • For blk-mq with scheduling, we can potentially end up with ALL
    driver tags assigned and sitting on the flush queues. If we
    defer because of an inlfight data request, then we can deadlock
    if that data request doesn't already have a tag assigned.

    This fixes a deadlock with running the xfs/297 xfstest, where
    thousands of syncs can cause the drive queue to stall.

    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

01 Feb, 2017

1 commit

  • Instead of keeping two levels of indirection for requests types, fold it
    all into the operations. The little caveat here is that previously
    cmd_type only applied to struct request, while the request and bio op
    fields were set to plain REQ_OP_READ/WRITE even for passthrough
    operations.

    Instead this patch adds new REQ_OP_* for SCSI passthrough and driver
    private requests, althought it has to add two for each so that we
    can communicate the data in/out nature of the request.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

28 Jan, 2017

2 commits


18 Jan, 2017

1 commit

  • This adds a set of hooks that intercepts the blk-mq path of
    allocating/inserting/issuing/completing requests, allowing
    us to develop a scheduler within that framework.

    We reuse the existing elevator scheduler API on the registration
    side, but augment that with the scheduler flagging support for
    the blk-mq interfce, and with a separate set of ops hooks for MQ
    devices.

    We split driver and scheduler tags, so we can run the scheduling
    independently of device queue depth.

    Signed-off-by: Jens Axboe
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

14 Dec, 2016

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the main block pull request this series. Contrary to previous
    release, I've kept the core and driver changes in the same branch. We
    always ended up having dependencies between the two for obvious
    reasons, so makes more sense to keep them together. That said, I'll
    probably try and keep more topical branches going forward, especially
    for cycles that end up being as busy as this one.

    The major parts of this pull request is:

    - Improved support for O_DIRECT on block devices, with a small
    private implementation instead of using the pig that is
    fs/direct-io.c. From Christoph.

    - Request completion tracking in a scalable fashion. This is utilized
    by two components in this pull, the new hybrid polling and the
    writeback queue throttling code.

    - Improved support for polling with O_DIRECT, adding a hybrid mode
    that combines pure polling with an initial sleep. From me.

    - Support for automatic throttling of writeback queues on the block
    side. This uses feedback from the device completion latencies to
    scale the queue on the block side up or down. From me.

    - Support from SMR drives in the block layer and for SD. From Hannes
    and Shaun.

    - Multi-connection support for nbd. From Josef.

    - Cleanup of request and bio flags, so we have a clear split between
    which are bio (or rq) private, and which ones are shared. From
    Christoph.

    - A set of patches from Bart, that improve how we handle queue
    stopping and starting in blk-mq.

    - Support for WRITE_ZEROES from Chaitanya.

    - Lightnvm updates from Javier/Matias.

    - Supoort for FC for the nvme-over-fabrics code. From James Smart.

    - A bunch of fixes from a whole slew of people, too many to name
    here"

    * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
    blk-stat: fix a few cases of missing batch flushing
    blk-flush: run the queue when inserting blk-mq flush
    elevator: make the rqhash helpers exported
    blk-mq: abstract out blk_mq_dispatch_rq_list() helper
    blk-mq: add blk_mq_start_stopped_hw_queue()
    block: improve handling of the magic discard payload
    blk-wbt: don't throttle discard or write zeroes
    nbd: use dev_err_ratelimited in io path
    nbd: reset the setup task for NBD_CLEAR_SOCK
    nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
    nvme-fabrics: Add target support for FC transport
    nvme-fabrics: Add host support for FC transport
    nvme-fabrics: Add FC transport LLDD api definitions
    nvme-fabrics: Add FC transport FC-NVME definitions
    nvme-fabrics: Add FC transport error codes to nvme.h
    Add type 0x28 NVME type code to scsi fc headers
    nvme-fabrics: patch target code in prep for FC transport support
    nvme-fabrics: set sqe.command_id in core not transports
    parser: add u64 number parser
    nvme-rdma: align to generic ib_event logging helper
    ...

    Linus Torvalds
     

10 Dec, 2016

1 commit


09 Nov, 2016

1 commit

  • If we insert a flush request, we clear REQ_PREFLUSH and/or REQ_FUA,
    depending on flush settings. Since op_is_sync() factors those flags
    in for deciding whether this request is sync or not, we should
    set REQ_SYNC to avoid screwing up this accounting.

    This should be less fragile.

    Reported-by: Logan Gunthorpe
    Fixes: b685d3d65ac ("block: treat REQ_FUA and REQ_PREFLUSH as synchronous")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 Nov, 2016

1 commit

  • Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
    are followed by kicking the requeue list. Hence add an argument to
    these two functions that allows to kick the requeue list. This was
    proposed by Christoph Hellwig.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

01 Nov, 2016

1 commit


28 Oct, 2016

2 commits

  • Now that we don't need the common flags to overflow outside the range
    of a 32-bit type we can encode them the same way for both the bio and
    request fields. This in addition allows us to place the operation
    first (and make some room for more ops while we're at it) and to
    stop having to shift around the operation values.

    In addition this allows passing around only one value in the block layer
    instead of two (and eventuall also in the file systems, but we can do
    that later) and thus clean up a lot of code.

    Last but not least this allows decreasing the size of the cmd_flags
    field in struct request to 32-bits. Various functions passing this
    value could also be updated, but I'd like to avoid the churn for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • A lot of the REQ_* flags are only used on struct requests, and only of
    use to the block layer and a few drivers that dig into struct request
    internals.

    This patch adds a new req_flags_t rq_flags field to struct request for
    them, and thus dramatically shrinks the number of common requests. It
    also removes the unfortunate situation where we have to fit the fields
    from the same enum into 32 bits for struct bio and 64 bits for
    struct request.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Shaun Tancheff
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

26 Oct, 2016

1 commit


15 Sep, 2016

1 commit

  • All drivers use the default, so provide an inline version of it. If we
    ever need other queue mapping we can add an optional method back,
    although supporting will also require major changes to the queue setup
    code.

    This provides better code generation, and better debugability as well.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

08 Jun, 2016

4 commits

  • To avoid confusion between REQ_OP_FLUSH, which is handled by
    request_fn drivers, and upper layers requesting the block layer
    perform a flush sequence along with possibly a WRITE, this patch
    renames REQ_FLUSH to REQ_PREFLUSH.

    Signed-off-by: Mike Christie
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     
  • This adds a REQ_OP_FLUSH operation that is sent to request_fn
    based drivers by the block layer's flush code, instead of
    sending requests with the request->cmd_flags REQ_FLUSH bit set.

    Signed-off-by: Mike Christie
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     
  • This patch converts the simple bi_rw use cases in the block,
    drivers, mm and fs code to set/get the bio operation using
    bio_set_op_attrs/bio_op

    These should be simple one or two liner cases, so I just did them
    in one patch. The next patches handle the more complicated
    cases in a module per patch.

    Signed-off-by: Mike Christie
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     
  • This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
    instead of passing it in. This makes that use the same as
    generic_make_request and how we set the other bio fields.

    Signed-off-by: Mike Christie

    Fixed up fs/ext4/crypto.c

    Signed-off-by: Jens Axboe

    Mike Christie
     

14 Apr, 2016

1 commit


26 Nov, 2015

1 commit

  • This reverts commit 1b2ff19e6a957b1ef0f365ad331b608af80e932e.

    Jan writes:

    --

    Thanks for report! After some investigation I found out we allocate
    elevator specific data in __get_request() only for non-flush requests. And
    this is actually required since the flush machinery uses the space in
    struct request for something else. Doh. So my patch is just wrong and not
    easy to fix since at the time __get_request() is called we are not sure
    whether the flush machinery will be used in the end. Jens, please revert
    1b2ff19e6a957b1ef0f365ad331b608af80e932e. Thanks!

    I'm somewhat surprised that you can reliably hit the race where flushing
    gets disabled for the device just while the request is in flight. But I
    guess during boot it makes some sense.

    --

    So let's just revert it, we can fix the queue run manually after the
    fact. This race is rare enough that it didn't trigger in testing, it
    requires the specific disable-while-in-flight scenario to trigger.

    Jens Axboe
     

17 Nov, 2015

1 commit

  • Currently blk_insert_flush() just adds flush request to q->queue_head
    when flush is not required. That completely bypasses IO scheduler so
    e.g. CFQ can be idling waiting for new request to arrive and will idle
    through the whole window unnecessarily. Luckily this only happens in
    rare cases as usually checks in generic_make_request_checks() clear
    FLUSH and FUA flags early if they are not needed.

    When no flushing is actually required, we can easily fix the problem by
    properly queueing the request through the IO scheduler. Ideally IO
    scheduler should be also made aware of requests queued via
    blk_flush_queue_rq(). However inserting flush request through IO
    scheduler can have unwanted side-effects since due to flush batching
    delaying the flush request in IO scheduler will delay all flush requests
    possibly coming from other processes. So we keep adding the request
    directly to q->queue_head.

    Signed-off-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jan Kara
     

15 Aug, 2015

1 commit

  • Inside timeout handler, blk_mq_tag_to_rq() is called
    to retrieve the request from one tag. This way is obviously
    wrong because the request can be freed any time and some
    fiedds of the request can't be trusted, then kernel oops
    might be triggered[1].

    Currently wrt. blk_mq_tag_to_rq(), the only special case is
    that the flush request can share same tag with the request
    cloned from, and the two requests can't be active at the same
    time, so this patch fixes the above issue by updating tags->rqs[tag]
    with the active request(either flush rq or the request cloned
    from) of the tag.

    Also blk_mq_tag_to_rq() gets much simplified with this patch.

    Given blk_mq_tag_to_rq() is mainly for drivers and the caller must
    make sure the request can't be freed, so in bt_for_each() this
    helper is replaced with tags->rqs[tag].

    [1] kernel oops log
    [ 439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158^M
    [ 439.697162] IP: [] blk_mq_tag_to_rq+0x21/0x6e^M
    [ 439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0 ^M
    [ 439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M
    [ 439.700653] Dumping ftrace buffer:^M
    [ 439.700653] (ftrace buffer empty)^M
    [ 439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M
    [ 439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265^M
    [ 439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M
    [ 439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000^M
    [ 439.730500] RIP: 0010:[] [] blk_mq_tag_to_rq+0x21/0x6e^M
    [ 439.730500] RSP: 0018:ffff880819203da0 EFLAGS: 00010283^M
    [ 439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002^M
    [ 439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000^M
    [ 439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000^M
    [ 439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202^M
    [ 439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00^M
    [ 439.730500] FS: 00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000^M
    [ 439.730500] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M
    [ 439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0^M
    [ 439.730500] Stack:^M
    [ 439.730500] 0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104^M
    [ 439.755663] ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80^M
    [ 439.755663] Call Trace:^M
    [ 439.755663] ^M
    [ 439.755663] [] bt_for_each+0x6e/0xc8^M
    [ 439.755663] [] ? blk_mq_rq_timed_out+0x6a/0x6a^M
    [ 439.755663] [] ? blk_mq_rq_timed_out+0x6a/0x6a^M
    [ 439.755663] [] blk_mq_tag_busy_iter+0x55/0x5e^M
    [ 439.755663] [] ? blk_mq_bio_to_request+0x38/0x38^M
    [ 439.755663] [] blk_mq_rq_timer+0x5d/0xd4^M
    [ 439.755663] [] call_timer_fn+0xf7/0x284^M
    [ 439.755663] [] ? call_timer_fn+0x5/0x284^M
    [ 439.755663] [] ? blk_mq_bio_to_request+0x38/0x38^M
    [ 439.755663] [] run_timer_softirq+0x1ce/0x1f8^M
    [ 439.755663] [] __do_softirq+0x181/0x3a4^M
    [ 439.755663] [] irq_exit+0x40/0x94^M
    [ 439.755663] [] smp_apic_timer_interrupt+0x33/0x3e^M
    [ 439.755663] [] apic_timer_interrupt+0x84/0x90^M
    [ 439.755663] ^M
    [ 439.755663] [] ? _raw_spin_unlock_irq+0x32/0x4a^M
    [ 439.755663] [] finish_task_switch+0xe0/0x163^M
    [ 439.755663] [] ? finish_task_switch+0xa2/0x163^M
    [ 439.755663] [] __schedule+0x469/0x6cd^M
    [ 439.755663] [] schedule+0x82/0x9a^M
    [ 439.789267] [] signalfd_read+0x186/0x49a^M
    [ 439.790911] [] ? wake_up_q+0x47/0x47^M
    [ 439.790911] [] __vfs_read+0x28/0x9f^M
    [ 439.790911] [] ? __fget_light+0x4d/0x74^M
    [ 439.790911] [] vfs_read+0x7a/0xc6^M
    [ 439.790911] [] SyS_read+0x49/0x7f^M
    [ 439.790911] [] entry_SYSCALL_64_fastpath+0x12/0x6f^M
    [ 439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
    f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
    53 38 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
    ^M
    [ 439.790911] RIP [] blk_mq_tag_to_rq+0x21/0x6e^M
    [ 439.790911] RSP ^M
    [ 439.790911] CR2: 0000000000000158^M
    [ 439.790911] ---[ end trace d40af58949325661 ]---^M

    Cc:
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

26 Sep, 2014

9 commits

  • This patch supports to run one single flush machinery for
    each blk-mq dispatch queue, so that:

    - current init_request and exit_request callbacks can
    cover flush request too, then the buggy copying way of
    initializing flush request's pdu can be fixed

    - flushing performance gets improved in case of multi hw-queue

    In fio sync write test over virtio-blk(4 hw queues, ioengine=sync,
    iodepth=64, numjobs=4, bs=4K), it is observed that througput gets
    increased a lot over my test environment:
    - throughput: +70% in case of virtio-blk over null_blk
    - throughput: +30% in case of virtio-blk over SSD image

    The multi virtqueue feature isn't merged to QEMU yet, and patches for
    the feature can be found in below tree:

    git://kernel.ubuntu.com/ming/qemu.git v2.1.0-mq.4

    And simply passing 'num_queues=4 vectors=5' should be enough to
    enable multi queue(quad queue) feature for QEMU virtio-blk.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This patch adds 'blk_mq_ctx' parameter to blk_get_flush_queue(),
    so that this function can find the corresponding blk_flush_queue
    bound with current mq context since the flush queue will become
    per hw-queue.

    For legacy queue, the parameter can be simply 'NULL'.

    For multiqueue case, the parameter should be set as the context
    from which the related request is originated. With this context
    info, the hw queue and related flush queue can be found easily.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Just figuring out flush queue at the entry of kicking off flush
    machinery and request's completion handler, then pass it through.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Now mission of the two helpers is over, and just call
    blk_alloc_flush_queue() and blk_free_flush_queue() directly.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This patch introduces 'struct blk_flush_queue' and puts all
    flush machinery related fields into this structure, so that

    - flush implementation details aren't exposed to driver
    - it is easy to convert to per dispatch-queue flush machinery

    This patch is basically a mechanical replacement.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This patch trys to use local variable to access flush request,
    so that we can convert to per-queue flush machinery a bit easier.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • These fields are always used with the flush request, so
    initialize them together.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • These two temporary functions are introduced for holding flush
    initialization and de-initialization, so that we can
    introduce 'flush queue' easier in the following patch. And
    once 'flush queue' and its allocation/free functions are ready,
    they will be removed for sake of code readability.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • It is reasonable to allocate flush req in blk_mq_init_flush().

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

23 Sep, 2014

2 commits

  • This patch removes two unnecessary blk_clear_rq_complete(),
    the REQ_ATOM_COMPLETE flag is cleared inside blk_mq_start_request(),
    so:

    - The blk_clear_rq_complete() in blk_flush_restore_request()
    needn't because the request will be freed later, and clearing
    it here may open a small race window with timeout.

    - The blk_clear_rq_complete() in blk_mq_requeue_request() isn't
    necessary too, even though REQ_ATOM_STARTED is cleared in
    __blk_mq_requeue_request(), in theory it still may cause a small
    race window with timeout since the two clear_bit() may be
    reordered.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Now that we've changed the driver API on the submission side use the
    opportunity to fix up the name on the completion side to fit into the
    general scheme.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

12 Jun, 2014

1 commit