24 Mar, 2019

1 commit

  • [ Upstream commit aef1897cd36dcf5e296f1d2bae7e0d268561b685 ]

    When requeueing, if RQF_DONTPREP is set, the request still contains
    driver-specific data, so insert it into the hctx dispatch list to
    avoid any merge. Take SCSI as an example; here is the trace event
    log (no IO scheduler, because RQF_STARTED would prevent merging):

    kworker/0:1H-339 [000] ...1 2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H]
    scsi_inert_test-1987 [000] .... 2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test]
    scsi_inert_test-1987 [000] ...2 2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test]
    kworker/0:1H-339 [000] .... 2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H]
    scsi_inert_test-1996 [000] ..s1 2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0]
    scsi_inert_test-1996 [000] .Ns1 2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0]
    kworker/0:1H-339 [000] ...1 2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
    kworker/0:1H-339 [000] ...1 2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
    scsi_inert_test-1986 [000] ..s1 2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0]

    (32768 + 8) was requeued by scsi_queue_insert() and had RQF_DONTPREP
    set. It was then merged with (32776 + 8) and issued. Due to
    RQF_DONTPREP, the sdb (SCSI data buffer) only covered the
    (32768 + 8) part, so only that part was completed. The lucky thing
    was that scsi_io_completion() detected this and requeued the
    remaining part, so we didn't get corrupted data. However, the
    requeue of (32776 + 8) is not expected.
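
    The fix routes such requests past the scheduler in
    blk_mq_requeue_work(); roughly the following (a sketch of the
    upstream change, not the verbatim diff):

        if (rq->rq_flags & RQF_DONTPREP)
                /* rq still carries prepped driver data; bypass the
                 * scheduler so nothing can merge with it */
                blk_mq_request_bypass_insert(rq, false);
        else
                blk_mq_sched_insert_request(rq, true, false, false);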

    Suggested-by: Jens Axboe
    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jianchao Wang
     

08 Dec, 2018

2 commits

  • commit c616cbee97aed4bc6178f148a7240206dcdb85a6 upstream.

    After the direct dispatch corruption fix, we permanently disallow direct
    dispatch of non read/write requests. This works fine off the normal IO
    path, as they will be retried like any other failed direct dispatch
    request. But for the blk_insert_cloned_request() that only DM uses to
    bypass the bottom level scheduler, we always first attempt direct
    dispatch. For some types of requests, that's now a permanent failure,
    and no amount of retrying will make that succeed. This results in a
    livelock.

    Instead of making special cases for what we can direct issue, and now
    having to deal with DM solving the livelock while still retaining a BUSY
    condition feedback loop, always just add a request that has been through
    ->queue_rq() to the hardware queue dispatch list. These are safe to use
    as no merging can take place there. Additionally, if requests do have
    prepped data from drivers, we aren't dependent on them not sharing space
    in the request structure to safely add them to the IO scheduler lists.

    This basically reverts ffe81d45322c and is based on a patch from Ming,
    but with the list insert case covered as well.

    Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
    Cc: stable@vger.kernel.org
    Suggested-by: Ming Lei
    Reported-by: Bart Van Assche
    Tested-by: Ming Lei
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     
  • commit ffe81d45322cc3cb140f0db080a4727ea284661e upstream.

    If we attempt a direct issue to a SCSI device, and it returns BUSY, then
    we queue the request up normally. However, the SCSI layer may have
    already set up SG tables etc. for this particular command. If we later
    merge with this request, then the old tables are no longer valid. Once
    we issue the IO, we only read/write the original part of the request,
    not the new state of it.

    This causes data corruption, and is most often noticed with the file
    system complaining about the just read data being invalid:

    [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)

    because most of it is garbage...

    This doesn't happen from the normal issue path, as we will simply defer
    the request to the hardware queue dispatch list if we fail. Once it's on
    the dispatch list, we never merge with it.

    Fix this from the direct issue path by flagging the request as
    REQ_NOMERGE so we don't change the size of it before issue.
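
    The essence of the fix is a single flag set in the direct-issue
    path before the request reaches ->queue_rq(); a sketch:

        /*
         * Mark the request non-mergeable before issuing it. If the
         * driver returns BUSY and the request is requeued, later bios
         * can no longer merge into it and invalidate the SG tables the
         * driver may already have built.
         */
        rq->cmd_flags |= REQ_NOMERGE;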

    See also:
    https://bugzilla.kernel.org/show_bug.cgi?id=201685

    Tested-by: Guenter Roeck
    Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     

28 Sep, 2018

1 commit

  • trace_block_unplug() takes true for explicit unplugs and false for
    implicit unplugs. schedule() unplugs are implicit and should be
    reported as timer unplugs. While correct in the legacy code, this has
    been inverted in blk-mq since 4.11.
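
    The fix amounts to negating the flag passed from the blk-mq plug
    flush path, so schedule()-driven unplugs are traced as implicit
    (timer) unplugs; a sketch:

        /* from_schedule == true means an implicit unplug; the
         * tracepoint takes "explicit", hence the negation */
        trace_block_unplug(q, depth, !from_schedule);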

    Cc: stable@vger.kernel.org
    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe

    Ilya Dryomov
     

23 Aug, 2018

1 commit

  • Pull more block updates from Jens Axboe:

    - Set of bcache fixes and changes (Coly)

    - The flush warn fix (me)

    - Small series of BFQ fixes (Paolo)

    - wbt hang fix (Ming)

    - blktrace fix (Steven)

    - blk-mq hardware queue count update fix (Jianchao)

    - Various little fixes

    * tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits)
    block/DAC960.c: make some arrays static const, shrinks object size
    blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
    blk-mq: init hctx sched after update ctx and hctx mapping
    block: remove duplicate initialization
    tracing/blktrace: Fix to allow setting same value
    pktcdvd: fix setting of 'ret' error return for a few cases
    block: change return type to bool
    block, bfq: return nbytes and not zero from struct cftype .write() method
    block, bfq: improve code of bfq_bfqq_charge_time
    block, bfq: reduce write overcharge
    block, bfq: always update the budget of an entity when needed
    block, bfq: readd missing reset of parent-entity service
    blk-wbt: fix IO hang in wbt_wait()
    block: don't warn for flush on read-only device
    bcache: add the missing comments for smp_mb()/smp_wmb()
    bcache: remove unnecessary space before ioctl function pointer arguments
    bcache: add missing SPDX header
    bcache: move open brace at end of function definitions to next line
    bcache: add static const prefix to char * array declarations
    bcache: fix code comments style
    ...

    Linus Torvalds
     

21 Aug, 2018

2 commits

    For blk-mq, part_in_flight/rw will invoke blk_mq_in_flight/rw to
    account for the in-flight requests. They access queue_hw_ctx and
    nr_hw_queues without any protection, so when an update of
    nr_hw_queues and blk_mq_in_flight/rw run concurrently, a panic
    comes up.

    Before nr_hw_queues is updated, the queue is frozen, so we can use
    q_usage_counter to avoid the race. percpu_ref_is_zero is used here
    so that we do not miss any in-flight request. The accesses to
    nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are
    under an RCU critical section, and __blk_mq_update_nr_hw_queues can
    use synchronize_rcu to ensure the zeroed q_usage_counter is
    globally visible.
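
    A sketch of the resulting check at the top of
    blk_mq_queue_tag_busy_iter(), modeled on the upstream change:

        rcu_read_lock();
        /*
         * __blk_mq_update_nr_hw_queues() freezes the queue first, so a
         * zeroed q_usage_counter means an update is in progress and
         * nr_hw_queues/queue_hw_ctx may be unstable: bail out.
         */
        if (percpu_ref_is_zero(&q->q_usage_counter)) {
                rcu_read_unlock();
                return;
        }

        queue_for_each_hw_ctx(q, hctx, i) {
                /* ... account in-flight requests per hctx ... */
        }
        rcu_read_unlock();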

    Signed-off-by: Jianchao Wang
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jianchao Wang
     
    Currently, when updating nr_hw_queues, the IO scheduler's init_hctx
    will be invoked before the mapping between ctx and hctx has been
    adapted by blk_mq_map_swqueue. An IO scheduler's init_hctx (kyber)
    may depend on this mapping, get a wrong result, and finally panic.
    A simple way to fix this is to switch the IO scheduler to 'none'
    before updating nr_hw_queues, and switch it back afterwards.
    blk_mq_sched_init_/exit_hctx are removed because nobody uses them
    any more.

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

15 Aug, 2018

1 commit

  • Pull block updates from Jens Axboe:
    "First pull request for this merge window, there will also be a
    followup request with some stragglers.

    This pull request contains:

    - Fix for a thundering herd issue in the wbt block code (Anchal
    Agarwal)

    - A few NVMe pull requests:
    * Improved tracepoints (Keith)
    * Larger inline data support for RDMA (Steve Wise)
    * RDMA setup/teardown fixes (Sagi)
    * Effects log support for NVMe target (Chaitanya Kulkarni)
    * Buffered IO support for NVMe target (Chaitanya Kulkarni)
    * TP4004 (ANA) support (Christoph)
    * Various NVMe fixes

    - Block io-latency controller support. Much needed support for
    properly containing block devices. (Josef)

    - Series improving how we handle sense information on the stack
    (Kees)

    - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

    - Zoned device support for null_blk (Matias)

    - AIX partition fixes (Mauricio Faria de Oliveira)

    - DIF checksum code made generic (Max Gurtovoy)

    - Add support for discard in iostats (Michael Callahan / Tejun)

    - Set of updates for BFQ (Paolo)

    - Removal of async write support for bsg (Christoph)

    - Bio page dirtying and clone fixups (Christoph)

    - Set of bcache fix/changes (via Coly)

    - Series improving blk-mq queue setup/teardown speed (Ming)

    - Series improving merging performance on blk-mq (Ming)

    - Lots of other fixes and cleanups from a slew of folks"

    * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
    blkcg: Make blkg_root_lookup() work for queues in bypass mode
    bcache: fix error setting writeback_rate through sysfs interface
    null_blk: add lock drop/acquire annotation
    Blk-throttle: reduce tail io latency when iops limit is enforced
    block: paride: pd: mark expected switch fall-throughs
    block: Ensure that a request queue is dissociated from the cgroup controller
    block: Introduce blk_exit_queue()
    blkcg: Introduce blkg_root_lookup()
    block: Remove two superfluous #include directives
    blk-mq: count the hctx as active before allocating tag
    block: bvec_nr_vecs() returns value for wrong slab
    bcache: trivial - remove tailing backslash in macro BTREE_FLAG
    bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
    bcache: set max writeback rate when I/O request is idle
    bcache: add code comments for bset.c
    bcache: fix mistaken comments in request.c
    bcache: fix mistaken code comments in bcache.h
    bcache: add a comment in super.c
    bcache: avoid unncessary cache prefetch bch_btree_node_get()
    bcache: display rate debug parameters to 0 when writeback is not running
    ...

    Linus Torvalds
     

09 Aug, 2018

1 commit

    Currently, we count the hctx as active after allocating a driver
    tag successfully. If a previously inactive hctx tries to get a tag
    for the first time, it may fail and need to wait. However, due to
    the stale tag ->active_queues, the other shared-tags users are
    still able to occupy all driver tags while someone is waiting for
    a tag. Consequently, even when the previously inactive hctx is
    woken up, it still may not be able to get a tag and could be
    starved.

    To fix it, count the hctx as active before trying to allocate a
    driver tag; then, while it is waiting for a tag, the other
    shared-tag users will reserve budget for it.
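
    Conceptually the fix reorders two steps in the tag allocation path
    (a sketch, not the exact diff):

        /* mark this hctx active *before* trying to get a driver tag,
         * so hctx_may_queue() on the other shared-tag queues already
         * accounts for this waiter and reserves budget for it */
        blk_mq_tag_busy(data->hctx);
        tag = __blk_mq_get_tag(data, bt);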

    Reviewed-by: Ming Lei
    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

28 Jul, 2018

1 commit

  • Pull block fixes from Jens Axboe:
    "Bigger than usual at this time, mostly due to the O_DIRECT corruption
    issue and the fact that I was on vacation last week. This contains:

    - NVMe pull request with two fixes for the FC code, and two target
    fixes (Christoph)

    - a DIF bio reset iteration fix (Greg Edwards)

    - two nbd reply and requeue fixes (Josef)

    - SCSI timeout fixup (Keith)

    - a small series that fixes an issue with bio_iov_iter_get_pages(),
    which ended up causing corruption for larger sized O_DIRECT writes
    that ended up racing with buffered writes (Martin Wilck)"

    * tag 'for-linus-20180727' of git://git.kernel.dk/linux-block:
    block: reset bi_iter.bi_done after splitting bio
    block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
    blkdev: __blkdev_direct_IO_simple: fix leak in error case
    block: bio_iov_iter_get_pages: fix size of last iovec
    nvmet: only check for filebacking on -ENOTBLK
    nvmet: fixup crash on NULL device path
    scsi: set timed out out mq requests to complete
    blk-mq: export setting request completion state
    nvme: if_ready checks to fail io to deleting controller
    nvmet-fc: fix target sgl list on large transfers
    nbd: handle unexpected replies better
    nbd: don't requeue the same request twice.

    Linus Torvalds
     

23 Jul, 2018

1 commit

    Inside blk_mq_try_issue_list_directly(), if issuing a request
    fails, we shouldn't try to issue it again; otherwise the warning
    in blk_mq_start_request() will be triggered. This change aligns
    with the behaviour of the other request issue & dispatch paths.

    Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
    Cc: Kashyap Desai
    Cc: Laurence Oberman
    Cc: Omar Sandoval
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Kashyap Desai
    Cc: kernel test robot
    Cc: LKP
    Reported-by: kernel test robot
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

18 Jul, 2018

1 commit

    In case of the 'none' IO scheduler, when the hw queue isn't busy it
    isn't necessary to enqueue a request to the sw queue and dequeue it
    again: the request can be submitted to the hw queue directly
    without extra cost. Meanwhile there shouldn't be many requests in
    the sw queue, so we don't need to worry about the effect on IO
    merging.

    There are still some single hw queue SCSI HBAs (HPSA, megaraid_sas,
    ...) which may be connected to high performance devices, so 'none'
    is often required for obtaining good performance.

    This patch improves IOPS and decreases CPU utilization on
    megaraid_sas, per Kashyap's test.
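
    The resulting gating in blk_mq_sched_insert_requests() looks
    roughly like this (a sketch based on this series;
    hctx->dispatch_busy comes from the companion EWMA patch):

        if (!hctx->dispatch_busy && !e && !run_queue_async) {
                /* 'none' scheduler and idle hw queue: try to issue
                 * directly, skipping the sw queue round trip */
                blk_mq_try_issue_list_directly(hctx, list);
                if (list_empty(list))
                        return;
        }
        /* fall back to the sw queue for whatever wasn't issued */
        blk_mq_insert_requests(hctx, ctx, list);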

    Cc: Kashyap Desai
    Cc: Laurence Oberman
    Cc: Omar Sandoval
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Reported-by: Kashyap Desai
    Tested-by: Kashyap Desai
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

09 Jul, 2018

10 commits

    We don't really need to save this stuff in the core block code; we
    can just pass the bio back into the helpers later on to derive the
    same flags and update rq->wbt_flags appropriately.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • blkcg-qos is going to do essentially what wbt does, only on a cgroup
    basis. Break out the common code that will be shared between blkcg-qos
    and wbt into blk-rq-qos.* so they can both utilize the same
    infrastructure.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
    It isn't efficient to dequeue requests one by one from the sw
    queue, but we have to do that when the queue is busy for better
    merge performance.

    This patch uses an Exponentially Weighted Moving Average (EWMA) to
    figure out whether the queue is busy, and only then dequeues
    requests one by one from the sw queue.
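
    The EWMA update is a small integer-only helper; a sketch modeled on
    the upstream code (the weight of 8 and factor of 4 are the upstream
    defaults):

        #define BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT 8
        #define BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR 4

        static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx,
                                                bool busy)
        {
                unsigned int ewma = hctx->dispatch_busy;

                if (!ewma && !busy)
                        return;

                /* decay the old value, then mix in the new sample */
                ewma *= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT - 1;
                if (busy)
                        ewma += 1 << BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR;
                ewma /= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT;

                hctx->dispatch_busy = ewma;
        }

    A non-zero hctx->dispatch_busy then steers the dispatch path back
    to one-by-one dequeue from the sw queue.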

    Fixes: b347689ffbca ("blk-mq-sched: improve dispatching from sw queue")
    Cc: Kashyap Desai
    Cc: Laurence Oberman
    Cc: Omar Sandoval
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Reported-by: Kashyap Desai
    Tested-by: Kashyap Desai
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
    list_splice_tail_init() is much faster than inserting each request
    one by one, given that all requests in 'list' belong to the same
    sw queue and ctx->lock is required to insert requests.
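
    A sketch of the insert path after the change: one lock round trip
    splices the whole plug list into the sw queue:

        spin_lock(&ctx->lock);
        /* move every request from 'list' onto the sw queue in one go
         * (leaving 'list' empty) instead of one list_add per request */
        list_splice_tail_init(list, &ctx->rq_list);
        blk_mq_hctx_mark_pending(hctx, ctx);
        spin_unlock(&ctx->lock);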

    Cc: Laurence Oberman
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Tested-by: Kashyap Desai
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
    Fix a typo in a comment on blk_mq_alloc_tag_set():
    "if if it too large" -> "if it's too large".

    Signed-off-by: Minwoo Im
    Signed-off-by: Jens Axboe

    Minwoo Im
     
    set->mq_map is currently cleared if something goes wrong when
    establishing a queue map in blk-mq-pci.c. It's also cleared before
    updating a queue map in blk_mq_update_queue_map().

    This patch provides a helper API to clear set->mq_map, to make the
    intent explicit.
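
    The helper is tiny; a sketch matching the upstream addition:

        static inline void blk_mq_clear_mq_map(struct blk_mq_tag_set *set)
        {
                unsigned int cpu;

                for_each_possible_cpu(cpu)
                        set->mq_map[cpu] = 0;
        }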

    Signed-off-by: Minwoo Im
    Signed-off-by: Jens Axboe

    Minwoo Im
     
    We have to remove synchronize_rcu() from blk_cleanup_queue();
    otherwise a long delay can be caused during LUN probe. To remove
    it, we have to stop iterating set->tag_list in the IO path, e.g.
    in blk_mq_sched_restart().

    This patch reverts 5b79413946d (Revert "blk-mq: don't handle
    TAG_SHARED in restart"). Given that we have fixed enough IO hang
    issues, there isn't any reason to restart all queues sharing one
    tag set any more, for the following reasons:

    1) The blk-mq core can deal with the shared-tags case well via
    blk_mq_get_driver_tag(), which can wake up queues waiting for a
    driver tag.

    2) SCSI is a bit special because it may return BLK_STS_RESOURCE if
    the queue, target or host isn't ready, but SCSI's built-in restart
    covers all these cases well; see scsi_end_request(): the queue will
    be rerun after any request initiated from this host/target
    completes.

    In my test on scsi_debug (8 LUNs), this patch may improve IOPS by
    20% ~ 30% when running I/O on these 8 LUNs concurrently.

    Fixes: 705cda97ee3a ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Martin K. Petersen
    Cc: linux-scsi@vger.kernel.org
    Reported-by: Andrew Jones
    Tested-by: Andrew Jones
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
    Now hctx->lock is only acquired when adding hctx->dispatch_wait to
    a wait queue, but it is not held when removing it from the wait
    queue.

    An IO hang can be observed easily if SCHED RESTART is disabled;
    that means RESTART now exists just to work around the issue in
    blk_mq_mark_tag_wait().

    This patch fixes the issue by introducing hctx->dispatch_wait_lock
    and holding it when removing hctx->dispatch_wait in
    blk_mq_dispatch_wake(), since we need to avoid acquiring hctx->lock
    in irq context.
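
    A sketch of blk_mq_dispatch_wake() with the new lock, modeled on
    the upstream change:

        static int blk_mq_dispatch_wake(wait_queue_entry_t *wait,
                                        unsigned mode, int flags, void *key)
        {
                struct blk_mq_hw_ctx *hctx;

                hctx = container_of(wait, struct blk_mq_hw_ctx,
                                    dispatch_wait);

                /* dispatch_wait_lock, not hctx->lock: this path runs
                 * in irq context */
                spin_lock(&hctx->dispatch_wait_lock);
                list_del_init(&wait->entry);
                spin_unlock(&hctx->dispatch_wait_lock);

                blk_mq_run_hw_queue(hctx, true);
                return 1;
        }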

    Fixes: eb619fdb2d4cb8b3d3419 ("blk-mq: fix issue with shared tag queue re-running")
    Cc: Christoph Hellwig
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Tested-by: Andrew Jones
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
    'hctx' won't be changed at all, so it isn't necessary to pass
    '**hctx' to blk_mq_mark_tag_wait().

    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Tested-by: Andrew Jones
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
    We never pass 'wait' as true to blk_mq_get_driver_tag(), and hence
    we never change '**hctx' either. The last use of these went away
    with the flush cleanup, commit 0c2a6fe4dc3e.

    So clean up the usage and remove the two extra parameters.

    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Tested-by: Andrew Jones
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

01 Jul, 2018

1 commit

  • Pull block fixes from Jens Axboe:
    "Small set of fixes for this series. Mostly just minor fixes, the only
    oddball in here is the sg change.

    The sg change came out of the stall fix for NVMe, where we added a
    mempool and limited us to a single page allocation. CONFIG_DEBUG_SG
    sort-of ruins that, since we'd need to account for that. That's
    actually a generic problem, since lots of drivers need to allocate SG
    lists. So this just removes support for CONFIG_DEBUG_SG, which I added
    back in 2007 and to my knowledge it was never useful.

    Anyway, outside of that, this pull contains:

    - clone of request with special payload fix (Bart)

    - drbd discard handling fix (Bart)

    - SATA blk-mq stall fix (me)

    - chunk size fix (Keith)

    - double free nvme rdma fix (Sagi)"

    * tag 'for-linus-20180629' of git://git.kernel.dk/linux-block:
    sg: remove ->sg_magic member
    drbd: Fix drbd_request_prepare() discard handling
    blk-mq: don't queue more if we get a busy return
    block: Fix cloning of requests with a special payload
    nvme-rdma: fix possible double free of controller async event buffer
    block: Fix transfer when chunk sectors exceeds max

    Linus Torvalds
     

29 Jun, 2018

1 commit

  • Some devices have different queue limits depending on the type of IO. A
    classic case is SATA NCQ, where some commands can queue, but others
    cannot. If we have NCQ commands inflight and encounter a non-queueable
    command, the driver returns busy. Currently we attempt to dispatch more
    from the scheduler, if we were able to queue some commands. But for the
    case where we ended up stopping due to BUSY, we should not attempt to
    retrieve more from the scheduler. If we do, we can get into a situation
    where we attempt to queue a non-queueable command, get BUSY, then
    successfully retrieve more commands from the scheduler and queue
    those. This can repeat forever, starving the non-queueable command
    indefinitely.

    Fix this by NOT attempting to pull more commands from the scheduler, if
    we get a BUSY return. This should also be more optimal in terms of
    letting requests stay in the scheduler for as long as possible, if we
    get a BUSY due to the regular out-of-tags condition.
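
    The key change is in the return value of blk_mq_dispatch_rq_list();
    a sketch:

        /*
         * If the driver reported BUSY (out of resources), tell the
         * caller not to pull more requests from the scheduler.
         */
        if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
                return false;

        return (queued + errors) != 0;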

    Reviewed-by: Omar Sandoval
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Jun, 2018

2 commits

  • Pull block fixes from Jens Axboe:

    - Further timeout fixes. We aren't quite there yet, so expect another
    round of fixes for that to completely close some of the IRQ vs
    completion races. (Christoph/Bart)

    - Set of NVMe fixes from the usual suspects, mostly error handling

    - Two off-by-one fixes (Dan)

    - Another bdi race fix (Jan)

    - Fix nbd reconfigure with NBD_DISCONNECT_ON_CLOSE (Doron)

    * tag 'for-linus-20180623' of git://git.kernel.dk/linux-block:
    blk-mq: Fix timeout handling in case the timeout handler returns BLK_EH_DONE
    bdi: Fix another oops in wb_workfn()
    lightnvm: Remove depends on HAS_DMA in case of platform dependency
    nvme-pci: limit max IO size and segments to avoid high order allocations
    nvme-pci: move nvme_kill_queues to nvme_remove_dead_ctrl
    nvme-fc: release io queues to allow fast fail
    nbd: Add the nbd NBD_DISCONNECT_ON_CLOSE config flag.
    block: sed-opal: Fix a couple off by one bugs
    blk-mq-debugfs: Off by one in blk_mq_rq_state_name()
    nvmet: reset keep alive timer in controller enable
    nvme-rdma: don't override opts->queue_size
    nvme-rdma: Fix command completion race at error recovery
    nvme-rdma: fix possible free of a non-allocated async event buffer
    nvme-rdma: fix possible double free condition when failing to create a controller
    Revert "block: Add warning for bi_next not NULL in bio_endio()"
    block: fix timeout changes for legacy request drivers

    Linus Torvalds
     
  • Make sure that RQF_TIMED_OUT is cleared when a request is reused
    after a block driver timeout handler has returned BLK_EH_DONE.

    Fixes: da6612673988 ("blk-mq: don't time out requests again that are in the timeout handler")
    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Jianchao Wang
    Cc: Andrew Randrianasulu
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

17 Jun, 2018

1 commit

  • Pull block fixes from Jens Axboe:
    "A collection of fixes that should go into -rc1. This contains:

    - bsg_open vs bsg_unregister race fix (Anatoliy)

    - NVMe pull request from Christoph, with fixes for regressions in
    this window, FC connect/reconnect path code unification, and a
    trace point addition.

    - timeout fix (Christoph)

    - remove a few unused functions (Christoph)

    - blk-mq tag_set reinit fix (Roman)"

    * tag 'for-linus-20180616' of git://git.kernel.dk/linux-block:
    bsg: fix race of bsg_open and bsg_unregister
    block: remov blk_queue_invalidate_tags
    nvme-fabrics: fix and refine state checks in __nvmf_check_ready
    nvme-fabrics: handle the admin-only case properly in nvmf_check_ready
    nvme-fabrics: refactor queue ready check
    blk-mq: remove blk_mq_tagset_iter
    nvme: remove nvme_reinit_tagset
    nvme-fc: fix nulling of queue data on reconnect
    nvme-fc: remove reinit_request routine
    blk-mq: don't time out requests again that are in the timeout handler
    nvme-fc: change controllers first connect to use reconnect path
    nvme: don't rely on the changed namespace list log
    nvmet: free smart-log buffer after use
    nvme-rdma: fix error flow during mapping request data
    nvme: add bio remapping tracepoint
    nvme: fix NULL pointer dereference in nvme_init_subsystem
    blk-mq: reinit q->tag_set_list entry only after grace period

    Linus Torvalds
     

13 Jun, 2018

1 commit

  • The kzalloc_node() function has a 2-factor argument form, kcalloc_node(). This
    patch replaces cases of:

    kzalloc_node(a * b, gfp, node)

    with:
    kcalloc_node(a * b, gfp, node)

    as well as handling cases of:

    kzalloc_node(a * b * c, gfp, node)

    with:

    kzalloc_node(array3_size(a, b, c), gfp, node)

    as it's slightly less ugly than:

    kcalloc_node(array_size(a, b), c, gfp, node)

    This does, however, attempt to ignore constant size factors like:

    kzalloc_node(4 * 1024, gfp, node)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kzalloc_node(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kzalloc_node(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kzalloc_node(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc_node(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc_node(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc_node(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc_node(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kzalloc_node(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kzalloc_node(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kzalloc_node(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kzalloc_node
    + kcalloc_node
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kzalloc_node
    + kcalloc_node
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kzalloc_node
    + kcalloc_node
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kzalloc_node
    + kcalloc_node
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kzalloc_node
    + kcalloc_node
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kzalloc_node
    + kcalloc_node
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kzalloc_node
    + kcalloc_node
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kzalloc_node
    + kcalloc_node
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kzalloc_node
    + kcalloc_node
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kzalloc_node(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc_node(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc_node(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc_node(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc_node(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc_node(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc_node(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc_node(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kzalloc_node(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kzalloc_node(
    - sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kzalloc_node(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kzalloc_node(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kzalloc_node(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kzalloc_node(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kzalloc_node(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc_node(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc_node(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc_node(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc_node(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc_node(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc_node(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc_node(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kzalloc_node(C1 * C2 * C3, ...)
    |
    kzalloc_node(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc_node(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc_node(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc_node(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kzalloc_node(sizeof(THING) * C2, ...)
    |
    kzalloc_node(sizeof(TYPE) * C2, ...)
    |
    kzalloc_node(C1 * C2 * C3, ...)
    |
    kzalloc_node(C1 * C2, ...)
    |
    - kzalloc_node
    + kcalloc_node
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kzalloc_node
    + kcalloc_node
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kzalloc_node
    + kcalloc_node
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kzalloc_node
    + kcalloc_node
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kzalloc_node
    + kcalloc_node
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kzalloc_node
    + kcalloc_node
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kzalloc_node
    + kcalloc_node
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

11 Jun, 2018

1 commit

    It is not allowed to reinit the q->tag_set_list list entry while an
    RCU grace period has not completed yet; otherwise the following
    soft lockup in blk_mq_sched_restart() happens:

    [ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270]
    [ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000
    [ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150
    [ 1064.256510] Call Trace:
    [ 1064.256664] <IRQ>
    [ 1064.256824] blk_mq_free_request+0xea/0x100
    [ 1064.256987] msg_io_conf+0x59/0xd0 [ibnbd_client]
    [ 1064.257175] complete_rdma_req+0xf2/0x230 [ibtrs_client]
    [ 1064.257340] ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core]
    [ 1064.257502] ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client]
    [ 1064.257669] ib_create_qp+0x321/0x380 [ib_core]
    [ 1064.257841] ib_process_cq_direct+0xbd/0x120 [ib_core]
    [ 1064.258007] irq_poll_softirq+0xb7/0xe0
    [ 1064.258165] __do_softirq+0x106/0x2a2
    [ 1064.258328] irq_exit+0x92/0xa0
    [ 1064.258509] do_IRQ+0x4a/0xd0
    [ 1064.258660] common_interrupt+0x7a/0x7a
    [ 1064.258818] </IRQ>

    Meanwhile another context frees other queue but with the same set of
    shared tags:

    [ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds.
    [ 1288.201833] bash D 0 5910 5820 0x00000000
    [ 1288.202016] Call Trace:
    [ 1288.202315] schedule+0x32/0x80
    [ 1288.202462] schedule_timeout+0x1e5/0x380
    [ 1288.203838] wait_for_completion+0xb0/0x120
    [ 1288.204137] __wait_rcu_gp+0x125/0x160
    [ 1288.204287] synchronize_sched+0x6e/0x80
    [ 1288.204770] blk_mq_free_queue+0x74/0xe0
    [ 1288.204922] blk_cleanup_queue+0xc7/0x110
    [ 1288.205073] ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client]
    [ 1288.205389] ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client]
    [ 1288.205548] kernfs_fop_write+0x109/0x180
    [ 1288.206328] vfs_write+0xb3/0x1a0
    [ 1288.206476] SyS_write+0x52/0xc0
    [ 1288.206624] do_syscall_64+0x68/0x1d0
    [ 1288.206774] entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    What happened is the following:

    1. There are several MQ queues with shared tags.
    2. One queue is about to be freed and the task is now in
    blk_mq_del_queue_tag_set().
    3. Another CPU is in blk_mq_sched_restart() and loops over all
    queues in the tag list in order to find a hctx to restart.

    Because the linked list entry was modified in
    blk_mq_del_queue_tag_set() without properly waiting for a grace
    period, blk_mq_sched_restart() never ends, spinning in
    list_for_each_entry_rcu_rr(), thus the soft lockup.

    The fix is simple: reinit the list entry only after an RCU grace
    period has elapsed.
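
    A sketch of blk_mq_del_queue_tag_set() after the fix (the
    tag-sharing bookkeeping is omitted for brevity):

        mutex_lock(&set->tag_list_lock);
        list_del_rcu(&q->tag_set_list);
        mutex_unlock(&set->tag_list_lock);

        /* let every list_for_each_entry_rcu_rr() walker leave the
         * list before the entry is reinitialized */
        synchronize_rcu();
        INIT_LIST_HEAD(&q->tag_set_list);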

    Fixes: 705cda97ee3a ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
    Cc: stable@vger.kernel.org
    Cc: Sagi Grimberg
    Cc: linux-block@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: Roman Pen
    Signed-off-by: Jens Axboe

    Roman Pen
     

05 Jun, 2018

1 commit

    If a hardware queue is stopped, it should not be run again before
    being explicitly started. Ignore stopped queues in
    blk_mq_run_work_fn(), fixing a regression recently introduced when
    the START_ON_RUN bit was removed.
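
    A sketch of the check added to blk_mq_run_work_fn():

        static void blk_mq_run_work_fn(struct work_struct *work)
        {
                struct blk_mq_hw_ctx *hctx;

                hctx = container_of(work, struct blk_mq_hw_ctx,
                                    run_work.work);

                /* a stopped queue must stay idle until it is
                 * explicitly started again */
                if (test_bit(BLK_MQ_S_STOPPED, &hctx->state))
                        return;

                __blk_mq_run_hw_queue(hctx);
        }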

    Fixes: 15fe8a90bb45 ("blk-mq: remove blk_mq_delay_queue()")
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

01 Jun, 2018

2 commits


29 May, 2018

5 commits

  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
    BLK_EH_NOT_HANDLED implies that nothing happened, but very often
    that is not what is happening: instead, the driver has already
    completed the command. Fix the symbolic name to reflect that a
    little better.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
    This patch simplifies the timeout handling by relying on the
    request reference counting to ensure the iterator is operating on
    an inflight and truly timed out request. Since the reference
    counting prevents the tag from being reallocated, the block layer
    no longer needs to prevent drivers from completing their requests
    while the timeout handler is operating on them: a driver completing
    a request is allowed to proceed to the next state without
    additional synchronization with the block layer.

    This also removes any need for generation sequence numbers, since
    a request is prevented from being reallocated as a new sequence
    while timeout handling is operating on it.

    To enable this, a refcount is added to struct request so that
    request users can be sure they're operating on the same request
    without it changing while they're processing it. The request's tag
    won't be released for reuse until both the timeout handler and the
    completion are done with it.
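
    The timeout iterator then pins the request before touching it; a
    sketch modeled on the upstream blk_mq_check_expired()/free pairing:

        /* timeout side: only operate on a request we managed to pin */
        if (!refcount_inc_not_zero(&rq->ref))
                return;         /* already freed; tag may be reused */

        if (blk_mq_req_expired(rq, next))
                blk_mq_rq_timed_out(rq, reserved);

        /* drop the pin; the last reference releases the tag */
        if (refcount_dec_and_test(&rq->ref))
                __blk_mq_free_request(rq);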

    Signed-off-by: Keith Busch
    [hch: slight cleanups, added back submission side hctx lock, use cmpxchg
    for completions]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • The block layer had been setting the state to in-flight prior to updating
    the timer. This is the wrong order since the timeout handler could observe
    the in-flight state with the older timeout, believing the request had
    expired when in fact it is just getting started.
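
    After the fix, blk_mq_start_request() arms the timer before
    publishing the state; a sketch:

        /* arm the timeout first ... */
        blk_add_timer(rq);
        /* ... only then let the timeout handler see the request as
         * started, so it never pairs the new state with a stale timer */
        WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT);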

    Signed-off-by: Keith Busch
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Keith Busch
     

22 May, 2018

1 commit