04 May, 2019

4 commits

  • In the normal queue cleanup path, the hctx is released after the
    request queue is freed; see blk_mq_release().

    However, in __blk_mq_update_nr_hw_queues(), an hctx may be freed
    because the hw queues are shrinking. This can easily cause a
    use-after-free: one implicit rule is that it is safe to call almost
    all block layer APIs while the request queue is alive, so an hctx
    may be retrieved by one API and then freed by
    blk_mq_update_nr_hw_queues(), triggering the use-after-free.

    Fix this issue by always freeing the hctx after the request queue
    is released. If some hctxs are removed in
    blk_mq_update_nr_hw_queues(), introduce a per-queue list to hold
    them, then try to reuse these hctxs if the NUMA node matches.
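
    A minimal sketch of the reuse path, assuming a per-queue
    unused_hctx_list protected by a spinlock (the field names here are
    illustrative):

        /* on hw queue shrink: park the hctx instead of freeing it */
        spin_lock(&q->unused_hctx_lock);
        list_add(&hctx->hctx_list, &q->unused_hctx_list);
        spin_unlock(&q->unused_hctx_lock);

        /* on allocation: prefer a parked hctx on the same NUMA node */
        struct blk_mq_hw_ctx *hctx = NULL, *tmp;

        spin_lock(&q->unused_hctx_lock);
        list_for_each_entry(tmp, &q->unused_hctx_list, hctx_list) {
            if (tmp->numa_node == node) {
                hctx = tmp;
                break;
            }
        }
        if (hctx)
            list_del_init(&hctx->hctx_list);
        spin_unlock(&q->unused_hctx_lock);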

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Hannes Reinecke
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Split blk_mq_alloc_and_init_hctx into two parts:
    blk_mq_alloc_hctx(), which allocates all hctx resources, and
    blk_mq_init_hctx(), which initializes the hctx and serves as the
    counterpart of blk_mq_exit_hctx().
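
    Roughly, the resulting split looks like this (the signatures are
    paraphrased from the description, not verbatim):

        /* allocate all hctx resources; no queue state touched yet */
        static struct blk_mq_hw_ctx *blk_mq_alloc_hctx(struct request_queue *q,
                struct blk_mq_tag_set *set, int node);

        /* initialize the hctx; undone by blk_mq_exit_hctx() */
        static int blk_mq_init_hctx(struct request_queue *q,
                struct blk_mq_tag_set *set,
                struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx);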

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Once blk_cleanup_queue() returns, the tags shouldn't be used any
    more, because blk_mq_free_tag_set() may be called. Commit
    45a9c9d909b2 ("blk-mq: Fix a use-after-free") fixes exactly that
    issue.

    However, that commit introduces another issue. Before 45a9c9d909b2,
    it was allowed to run the queue during queue cleanup as long as the
    queue's kobject refcount was held. After that commit, the queue
    can't be run during cleanup; otherwise an oops is easily triggered,
    because some fields of the hctx are freed by blk_mq_free_queue() in
    blk_cleanup_queue().

    Ways of addressing this kind of issue have been invented before,
    such as:

    8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
    c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

    But they still can't cover all cases; recently James reported
    another issue of this kind:

    https://marc.info/?l=linux-scsi&m=155389088124782&w=2

    This issue is quite hard to address in the previous way, given that
    scsi_run_queue() may run requeues for other LUNs.

    Fix the above issue by freeing the hctx's resources in its release
    handler; this is safe because the tags aren't needed for freeing
    those hctx resources.

    This approach follows the typical design pattern for a kobject's
    release handler.
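
    A sketch of that pattern, assuming the hctx sysfs kobject's release
    handler is the last-reference point (the exact resources freed are
    illustrative):

        static void blk_mq_hw_sysfs_release(struct kobject *kobj)
        {
            struct blk_mq_hw_ctx *hctx = container_of(kobj,
                    struct blk_mq_hw_ctx, kobj);

            /* safe here: no tags are needed to free these */
            free_cpumask_var(hctx->cpumask);
            kfree(hctx->ctxs);
            kfree(hctx);
        }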

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reported-by: James Smart
    Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
    Cc: stable@vger.kernel.org
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • While the queue's kobject refcount is held, it is safe for a driver
    to schedule a requeue. However, blk_mq_kick_requeue_list() may be
    called after blk_sync_queue() is done, because of concurrent
    requeue activity, so the requeue work may not be completed when the
    queue is freed, and a kernel oops is triggered.

    So move the cancellation of requeue_work into blk_mq_release() to
    avoid the race between requeueing and freeing the queue.
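
    A sketch of the move, assuming blk_mq_release() is the final
    teardown point for mq resources:

        void blk_mq_release(struct request_queue *q)
        {
            /* requeue_work must be idle before the hctxs go away */
            cancel_delayed_work_sync(&q->requeue_work);

            /* ... free hctxs and other mq resources ... */
        }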

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Bart Van Assche
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     


14 Apr, 2019

1 commit

  • A previous commit moved the shallow depth and BFQ depth map calculations
    to be done at init time, moving it outside of the hotter IO path. This
    potentially causes hangs if the user changes the depth of the scheduler
    map, by writing to the 'nr_requests' sysfs file for that device.

    Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if
    the depth changes, so that the scheduler can update its internal state.
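
    A sketch of such a hook and its call site (names are paraphrased
    from the description):

        /* in the elevator ops: let the scheduler react to depth changes */
        void (*depth_updated)(struct blk_mq_hw_ctx *hctx);

        /* in blk_mq_update_nr_requests(), after resizing the tags */
        if (q->elevator && q->elevator->type->ops.depth_updated)
            q->elevator->type->ops.depth_updated(hctx);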

    Tested-by: Kai Krakow
    Reported-by: Paolo Valente
    Fixes: f0635b8a416e ("bfq: calculate shallow depths at init time")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Apr, 2019

1 commit

  • NVMe's error handler follows the typical steps of tearing down the
    hardware to recover the controller:

    1) stop blk_mq hw queues
    2) stop the real hw queues
    3) cancel in-flight requests via
    blk_mq_tagset_busy_iter(tags, cancel_request, ...)
    cancel_request():
    mark the request as abort
    blk_mq_complete_request(req);
    4) destroy real hw queues

    However, there may be a race between #3 and #4, because
    blk_mq_complete_request() may run q->mq_ops->complete(rq) remotely
    and asynchronously, and ->complete(rq) may run after #4.

    This patch introduces blk_mq_complete_request_sync() for fixing the
    above race.
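
    A minimal sketch of the synchronous variant (simplified; the real
    completion path carries more state handling):

        void blk_mq_complete_request_sync(struct request *rq)
        {
            /* run ->complete() in the caller's context so that it is
             * guaranteed to finish before the hw queues are destroyed */
            rq->q->mq_ops->complete(rq);
        }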

    Cc: Sagi Grimberg
    Cc: Bart Van Assche
    Cc: James Smart
    Cc: linux-nvme@lists.infradead.org
    Reviewed-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

05 Apr, 2019

1 commit

  • blk_mq_try_issue_directly() can return BLK_STS*_RESOURCE for requests that
    have been queued. If that happens when blk_mq_try_issue_directly() is called
    by the dm-mpath driver then dm-mpath will try to resubmit a request that is
    already queued and a kernel crash follows. Since it is nontrivial to fix
    blk_mq_request_issue_directly(), revert the blk_mq_request_issue_directly()
    changes that went into kernel v5.0.

    This patch reverts the following commits:
    * d6a51a97c0b2 ("blk-mq: replace and kill blk_mq_request_issue_directly") # v5.0.
    * 5b7a6f128aad ("blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests") # v5.0.
    * 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: James Smart
    Cc: Dongli Zhang
    Cc: Laurence Oberman
    Cc:
    Reported-by: Laurence Oberman
    Tested-by: Laurence Oberman
    Fixes: 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     


26 Mar, 2019

1 commit

  • We now wrap sbitmap waitqueues in an active counter, so we can avoid
    iterating wakeups unless we have waiters there. This works as long as
    everyone that's manipulating the waitqueues uses the proper helpers. For
    the tag wait case for shared tags, however, we add ourselves to the
    waitqueue without incrementing/decrementing the ->ws_active count. This
    means that wakeups can take a long time to happen.

    Fix this by manually doing the inc/dec as needed for the wait queue
    handling.
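
    A sketch of the manual accounting around the shared-tag wait (field
    names follow the sbitmap queue API):

        struct sbitmap_queue *sbq = &hctx->tags->bitmap_tags;

        atomic_inc(&sbq->ws_active);    /* before adding to the waitqueue */
        /* ... wait for a tag to become free ... */
        atomic_dec(&sbq->ws_active);    /* after leaving the waitqueue */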

    Reported-by: Michael Leun
    Tested-by: Michael Leun
    Cc: stable@vger.kernel.org
    Reviewed-by: Omar Sandoval
    Fixes: 5d2ee7122c73 ("sbitmap: optimize wakeup check")
    Signed-off-by: Jens Axboe

    Jens Axboe
     


10 Mar, 2019

1 commit

  • Pull SCSI updates from James Bottomley:
    "This is mostly update of the usual drivers: arcmsr, qla2xxx, lpfc,
    hisi_sas, target/iscsi and target/core.

    Additionally Christoph refactored gdth as part of the dma changes. The
    major mid-layer change this time is the removal of bidi commands and
    with them the whole of the osd/exofs driver and filesystem. This is a
    major simplification for block and mq in particular"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (240 commits)
    scsi: cxgb4i: validate tcp sequence number only if chip version pf
    scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c
    scsi: mpt3sas: Add missing breaks in switch statements
    scsi: aacraid: Fix missing break in switch statement
    scsi: kill command serial number
    scsi: csiostor: drop serial_number usage
    scsi: mvumi: use request tag instead of serial_number
    scsi: dpt_i2o: remove serial number usage
    scsi: st: osst: Remove negative constant left-shifts
    scsi: ufs-bsg: Allow reading descriptors
    scsi: ufs: Allow reading descriptor via raw upiu
    scsi: ufs-bsg: Change the calling convention for write descriptor
    scsi: ufs: Remove unused device quirks
    Revert "scsi: ufs: disable vccq if it's not needed by UFS device"
    scsi: megaraid_sas: Remove a bunch of set but not used variables
    scsi: clean obsolete return values of eh_timed_out
    scsi: sd: Optimal I/O size should be a multiple of physical block size
    scsi: MAINTAINERS: SCSI initiator and target tweaks
    scsi: fcoe: make use of fip_mode enum complete
    ...

    Linus Torvalds
     

09 Mar, 2019

1 commit

  • Pull block layer updates from Jens Axboe:
    "Not a huge amount of changes in this round, the biggest one is that we
    finally have Ming's multi-page bvec support merged. Apart from that,
    this pull request contains:

    - Small series that avoids quiescing the queue for sysfs changes that
    match what we currently have (Aleksei)

    - Series of bcache fixes (via Coly)

    - Series of lightnvm fixes (via Mathias)

    - NVMe pull request from Christoph. Nothing major, just SPDX/license
    cleanups, RR mp policy (Hannes), and little fixes (Bart,
    Chaitanya).

    - BFQ series (Paolo)

    - Save blk-mq cpu -> hw queue mapping, removing a pointer indirection
    for the fast path (Jianchao)

    - fops->iopoll() added for async IO polling, this is a feature that
    the upcoming io_uring interface will use (Christoph, me)

    - Partition scan loop fixes (Dongli)

    - mtip32xx conversion from managed resource API (Christoph)

    - cdrom registration race fix (Guenter)

    - MD pull from Song, two minor fixes.

    - Various documentation fixes (Marcos)

    - Multi-page bvec feature. This brings a lot of nice improvements
    with it, like more efficient splitting, larger IOs can be supported
    without growing the bvec table size, and so on. (Ming)

    - Various little fixes to core and drivers"

    * tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits)
    block: fix updating bio's front segment size
    block: Replace function name in string with __func__
    nbd: propagate genlmsg_reply return code
    floppy: remove set but not used variable 'q'
    null_blk: fix checking for REQ_FUA
    block: fix NULL pointer dereference in register_disk
    fs: fix guard_bio_eod to check for real EOD errors
    blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map
    block: optimize bvec iteration in bvec_iter_advance
    block: introduce mp_bvec_for_each_page() for iterating over page
    block: optimize blk_bio_segment_split for single-page bvec
    block: optimize __blk_segment_map_sg() for single-page bvec
    block: introduce bvec_nth_page()
    iomap: wire up the iopoll method
    block: add bio_set_polled() helper
    block: wire up block device iopoll method
    fs: add an iopoll method to struct file_operations
    loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
    loop: do not print warn message if partition scan is successful
    block: bounce: make sure that bvec table is updated
    ...

    Linus Torvalds
     


15 Feb, 2019

1 commit

  • Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"),
    the physical segment number is mainly figured out in blk_queue_split()
    for the fast path, and the BIO_SEG_VALID flag is set there too.

    Now only blk_recount_segments() and blk_recalc_rq_segments() use this
    flag.

    Basically blk_recount_segments() is bypassed in the fast path, given
    that BIO_SEG_VALID is set in blk_queue_split().

    As for the other user, blk_recalc_rq_segments():

    - it runs in the partial completion branch of blk_update_request(),
    which is an unusual case

    - it runs in blk_cloned_rq_check_limits(), still not a big problem
    if the flag is killed, since dm-rq is the only user

    Multi-page bvec is enabled now; not doing S/G merging is rather
    pointless with the current setup of the I/O path, as it isn't going
    to save a significant number of cycles.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

12 Feb, 2019

1 commit

  • When requeuing, if RQF_DONTPREP is set, the rq already contains
    driver-specific data, so insert it into the hctx dispatch list to
    avoid any merge. Take SCSI as an example; here is the trace event
    log (no io scheduler, because RQF_STARTED would prevent merging):
    kworker/0:1H-339 [000] ...1 2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H]
    scsi_inert_test-1987 [000] .... 2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test]
    scsi_inert_test-1987 [000] ...2 2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test]
    kworker/0:1H-339 [000] .... 2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H]
    scsi_inert_test-1996 [000] ..s1 2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0]
    scsi_inert_test-1996 [000] .Ns1 2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0]
    kworker/0:1H-339 [000] ...1 2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
    kworker/0:1H-339 [000] ...1 2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
    scsi_inert_test-1986 [000] ..s1 2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0]

    (32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP.
    Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP,
    the sdb only contained the (32768 + 8) part, so only that part was
    completed. The lucky thing was that scsi_io_completion detected this
    and requeued the remaining part, so we didn't get corrupted data.
    However, the requeue of (32776 + 8) is not expected.
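
    A sketch of the fix in the requeue work, bypassing the scheduler
    for already-prepared requests:

        /* in blk_mq_requeue_work(), for each requeued rq */
        if (rq->rq_flags & RQF_DONTPREP)
            /* driver data attached: no merging allowed */
            blk_mq_request_bypass_insert(rq, false);
        else
            blk_mq_sched_insert_request(rq, true, false, false);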

    Suggested-by: Jens Axboe
    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     


01 Feb, 2019

2 commits

  • Currently, we check whether the hctx type is supported every time
    in the hot path. Actually, this is not necessary: we can save the
    default hctx into ctx->hctxs when mapping the sw queues if the type
    is not supported, and then use it directly via ctx->hctxs[type].

    We also needn't check whether poll is enabled or not, because the
    caller will have cleared REQ_HIPRI in that case.
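
    A sketch of the fallback at map time, assuming HCTX_TYPE_DEFAULT is
    the catch-all type:

        /* in blk_mq_map_swqueue(), for each ctx and each map type */
        if (!set->map[type].nr_queues) {
            ctx->hctxs[type] = blk_mq_map_queue_type(q,
                    HCTX_TYPE_DEFAULT, cpu);
            continue;
        }
        ctx->hctxs[type] = blk_mq_map_queue_type(q, type, cpu);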

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     
  • Currently, the queue mapping result is saved in a two-dimensional
    array. In the hot path, to get an hctx, we need to do the following:

    q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]]

    This isn't very efficient. We can instead save the queue mapping
    result directly in the ctx, per hctx type:

    ctx->hctxs[type]
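
    With that, the hot-path lookup becomes a single indexed load,
    roughly:

        static inline struct blk_mq_hw_ctx *blk_mq_map_queue(
                struct request_queue *q, unsigned int flags,
                struct blk_mq_ctx *ctx)
        {
            enum hctx_type type = HCTX_TYPE_DEFAULT;

            if (flags & REQ_HIPRI)
                type = HCTX_TYPE_POLL;
            else if ((flags & REQ_OP_MASK) == REQ_OP_READ)
                type = HCTX_TYPE_READ;

            return ctx->hctxs[type];
        }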

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

16 Jan, 2019

1 commit

  • We need to pass bio->bi_opf after bio integrity preparation;
    otherwise the REQ_INTEGRITY flag may not be set on the allocated
    request, which breaks block integrity.
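
    A sketch of the corrected ordering in the make_request path:

        /* bio_integrity_prep() may set REQ_INTEGRITY in bio->bi_opf,
         * so the flags must only be sampled afterwards */
        if (!bio_integrity_prep(bio))
            return BLK_QC_T_NONE;

        data.cmd_flags = bio->bi_opf;    /* now includes REQ_INTEGRITY */
        rq = blk_mq_get_request(q, bio, &data);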

    Fixes: f9afca4d367b ("blk-mq: pass in request/bio flags to queue mapping")
    Cc: Hannes Reinecke
    Cc: Keith Busch
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

19 Dec, 2018

1 commit

  • block consumers will need it for polling requests that
    are sent with blk_execute_rq_nowait. Also, get rid of
    blk_tag_to_qc_t and open-code it instead.

    Reviewed-by: Jens Axboe
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig

    Sagi Grimberg
     

18 Dec, 2018

4 commits

  • The queue mapping of type poll only exists when
    set->map[HCTX_TYPE_POLL].nr_queues is bigger than zero, so tighten
    the constraint by checking .nr_queues of the poll type before
    enabling IO poll.

    Otherwise an IO race & timeout can be observed when running block/007.
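
    A sketch of the tightened check:

        if (set->nr_maps > HCTX_TYPE_POLL &&
            set->map[HCTX_TYPE_POLL].nr_queues)
            blk_queue_flag_set(QUEUE_FLAG_POLL, q);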

    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • There's a single user of this function, dm, and dm just wants
    to check if IO is inflight, not that it's just allocated.

    This fixes a hang with srp/002 in blktests with dm, where it tries
    to suspend but waits for inflight IO to finish first. As it checks
    for just allocated requests, this fails.
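
    A sketch of the stricter inflight test, counting only started
    requests (the helper name is illustrative):

        static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx,
                struct request *rq, void *priv, bool reserved)
        {
            bool *busy = priv;

            /* allocated-but-not-started requests don't count */
            if (rq->q == hctx->queue &&
                blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
                *busy = true;
            return true;
        }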

    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Since 7e849dd9cf37 ("nvme-pci: don't share queue maps"), the mapping
    table isn't actually initialized if map->nr_queues is zero, so we
    can't use blk_mq_map_queue_type() to retrieve the hctx any more.

    This can still cause a broken mapping; fix it by skipping zero-queue
    maps in blk_mq_map_swqueue().
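
    A sketch of the skip in blk_mq_map_swqueue():

        for (j = 0; j < set->nr_maps; j++) {
            /* the mapping table was never built for empty maps */
            if (!set->map[j].nr_queues)
                continue;

            hctx = blk_mq_map_queue_type(q, j, i);
            /* ... wire up ctx <-> hctx for this type ... */
        }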

    Cc: Jeff Moyer
    Cc: Mike Snitzer
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • When a request is added to the rq list of a sw queue (ctx), the rq
    may be from a different type of hctx, especially after multi-queue
    mapping is introduced.

    So when dispatching requests from the sw queue via
    blk_mq_flush_busy_ctxs() or blk_mq_dequeue_from_ctx(), a request
    belonging to a different hctx queue type can be dispatched to the
    current hctx when the read queue or poll queue is enabled.

    This patch fixes the issue by introducing a per-queue-type list.
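
    A sketch of the per-type lists in the sw queue, kept in the same
    cacheline as the lock per the note below:

        struct blk_mq_ctx {
            struct {
                spinlock_t          lock;
                struct list_head    rq_lists[HCTX_MAX_TYPES];
            } ____cacheline_aligned_in_smp;
            /* ... */
        };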

    Cc: Christoph Hellwig
    Signed-off-by: Ming Lei

    Changed by me to not use separately cacheline aligned lists, just
    place them all in the same cacheline where we had just the one list
    and lock before.

    Signed-off-by: Jens Axboe

    Ming Lei
     


16 Dec, 2018

3 commits

  • Replace blk_mq_request_issue_directly with blk_mq_try_issue_directly
    in blk_insert_cloned_request and kill it as nobody uses it any more.

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     
  • It is not necessary to issue requests directly with bypass 'true'
    in blk_mq_sched_insert_requests and handle the non-issued requests
    ourselves. Just set bypass to 'false' and let
    blk_mq_try_issue_directly handle them completely. Remove the
    blk_rq_can_direct_dispatch check, because blk_mq_try_issue_directly
    handles it well. If a request fails direct issue, insert the rest.

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     
  • Merge blk_mq_try_issue_directly and __blk_mq_try_issue_directly
    into one interface to unify the paths that issue requests directly.
    The merged interface takes over the request completely: it can
    insert, end, or do nothing based on the return value of .queue_rq
    and the 'bypass' parameter, so the caller needs no further handling
    and the code can be cleaned up.

    Also, commit c616cbee ("blk-mq: punt failed direct issue to
    dispatch list") always inserts requests into the hctx dispatch list
    whenever it gets a BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE; this
    is overkill and harms merging. We only need to do that for requests
    that have been through .queue_rq. This patch also fixes that.
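
    A rough sketch of the unified entry point (the parameter set is
    paraphrased from the description):

        /* issue rq directly; based on .queue_rq's return value and
         * 'bypass', either finish it, insert it, or hand the status
         * back to the caller */
        static blk_status_t blk_mq_try_issue_directly(
                struct blk_mq_hw_ctx *hctx, struct request *rq,
                blk_qc_t *cookie, bool bypass, bool last);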

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

10 Dec, 2018

2 commits

  • The previous patches deleted all the code that needed the second value
    returned from part_in_flight - now the kernel only uses the first value.

    Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that
    it only returns one value.

    This patch just refactors the code, there's no functional change.
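
    The resulting helper then returns a single value, roughly:

        /* before: filled an inflight[2] array; now one count suffices */
        unsigned int part_in_flight(struct request_queue *q,
                                    struct hd_struct *part);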

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • Pull in v4.20-rc6 to resolve the conflict in NVMe, but also to get the
    two corruption fixes. We're going to be overhauling the direct dispatch
    path, and we need to do that on top of the changes we made for that
    in mainline.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Dec, 2018

1 commit

  • Now almost all .map_queues() implementations based on managed irq
    affinity don't update the queue mapping; they just retrieve the
    previously built mapping, so if nr_hw_queues is changed, the
    mapping table contains stale entries. Only blk_mq_map_queues() may
    rebuild the mapping table.

    One such case is that we limit .nr_hw_queues to 1 for a kdump
    kernel. However, drivers often build the queue mapping before
    allocating the tagset via pci_alloc_irq_vectors_affinity(), while
    set->nr_hw_queues may be set to 1 for the kdump kernel, so a wrong
    queue mapping is used, and a kernel panic [1] is observed during
    boot.

    This patch fixes the kernel panic triggered on nvme by rebuilding
    the mapping table via blk_mq_map_queues(); a sketch of the fix
    follows the panic log below.

    [1] kernel panic log
    [ 4.438371] nvme nvme0: 16/0/0 default/read/poll queues
    [ 4.443277] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
    [ 4.444681] PGD 0 P4D 0
    [ 4.445367] Oops: 0000 [#1] SMP NOPTI
    [ 4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
    [ 4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
    [ 4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
    [ 4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
    [ 4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
    [ 4.453132] RSP: 0018:ffffc900023b3cd8 EFLAGS: 00010286
    [ 4.454061] RAX: 0000000000000000 RBX: ffff888174448000 RCX: 0000000000000001
    [ 4.456480] RDX: 0000000000000001 RSI: ffffe8feffc506c0 RDI: 0000000000000001
    [ 4.458750] RBP: ffff88810722d008 R08: ffff88817647a880 R09: 0000000000000002
    [ 4.464580] R10: ffffc900023b3c10 R11: 0000000000000004 R12: ffff888174448538
    [ 4.467803] R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000001
    [ 4.469220] FS: 0000000000000000(0000) GS:ffff88817bac0000(0000) knlGS:0000000000000000
    [ 4.471554] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 4.472464] CR2: 0000000000000098 CR3: 0000000174e4e001 CR4: 0000000000760ee0
    [ 4.474264] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 4.476007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 4.477061] PKRU: 55555554
    [ 4.477464] Call Trace:
    [ 4.478731] blk_mq_init_allocated_queue+0x36a/0x3ad
    [ 4.479595] blk_mq_init_queue+0x32/0x4e
    [ 4.480178] nvme_validate_ns+0x98/0x623 [nvme_core]
    [ 4.480963] ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
    [ 4.481685] ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
    [ 4.482601] nvme_scan_work+0x23a/0x29b [nvme_core]
    [ 4.483269] ? _raw_spin_unlock_irqrestore+0x25/0x38
    [ 4.483930] ? try_to_wake_up+0x38d/0x3b3
    [ 4.484478] ? process_one_work+0x179/0x2fc
    [ 4.485118] process_one_work+0x1d3/0x2fc
    [ 4.485655] ? rescuer_thread+0x2ae/0x2ae
    [ 4.486196] worker_thread+0x1e9/0x2be
    [ 4.486841] kthread+0x115/0x11d
    [ 4.487294] ? kthread_park+0x76/0x76
    [ 4.487784] ret_from_fork+0x3a/0x50
    [ 4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
    [ 4.489428] Dumping ftrace buffer:
    [ 4.489939] (ftrace buffer empty)
    [ 4.490492] CR2: 0000000000000098
    [ 4.491052] ---[ end trace 03cd268ad5a86ff7 ]---
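
    A sketch of the fix, forcing the generic rebuild for kdump kernels:

        static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
        {
            /* a managed-affinity ->map_queues() would return a stale
             * table when nr_hw_queues was clamped for kdump */
            if (set->ops->map_queues && !is_kdump_kernel())
                return set->ops->map_queues(set);

            return blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
        }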

    Cc: Christoph Hellwig
    Cc: linux-nvme@lists.infradead.org
    Cc: David Milburn
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei