07 Jul, 2019

1 commit

  • When the blk-mq debugfs file creation logic was "cleaned up", it was
    cleaned up too much, causing the queue file to not be created in the
    correct location. It turns out the check for the directory being present
    is needed: if the directory has not been created yet, the files should
    not be created either, and the function will be called again later in
    the initialization code so that the files can be created in the correct
    location.
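
    A rough sketch of the guard pattern described above (hedged: it assumes
    the queue's debugfs directory handle is stored in q->debugfs_dir, as in
    the blk-mq-debugfs code of this period; this is not the exact patch):
    ------
    void blk_mq_debugfs_register_sched(struct request_queue *q)
    {
        struct elevator_type *e = q->elevator->type;

        /*
         * If the parent directory has not been created yet, bail out;
         * this function will be called again later in the init path,
         * once the directory exists.
         */
        if (!q->debugfs_dir)
            return;

        if (!e->queue_debugfs_attrs)
            return;

        q->sched_debugfs_dir = debugfs_create_dir("sched", q->debugfs_dir);
    }
    ------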

    Fixes: 6cfc0081b046 ("blk-mq: no need to check return value of debugfs_create functions")
    Reported-by: Stephen Rothwell
    Cc: linux-block@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Jens Axboe

    Greg Kroah-Hartman
     

26 Jun, 2019

1 commit

  • By mistake, there is a '&' instead of a '==' in the definition of the
    macro BFQQ_TOTALLY_SEEKY. This commit replaces the wrong operator with
    the correct one.
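
    A sketch of the one-character fix (the field name seek_history is taken
    from the BFQ code of this period; treat the exact layout as illustrative):
    ------
    /* before: bitwise AND, which is true for any non-zero history */
    #define BFQQ_TOTALLY_SEEKY(bfqq)    (bfqq->seek_history & -1)

    /* after: only a history of all 1s marks the queue as totally seeky */
    #define BFQQ_TOTALLY_SEEKY(bfqq)    (bfqq->seek_history == -1)
    ------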

    Fixes: 7074f076ff15 ("block, bfq: do not tag totally seeky queues as soft rt")
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

17 Jun, 2019

2 commits

  • When multiple iovecs reference the same page, each get_user_page call
    will add a reference to the page. But once we've created the bio that
    information gets lost and only a single reference will be dropped after
    I/O completion. Use the same_page information returned from
    __bio_try_merge_page to drop additional references to pages that were
    already present in the bio.

    Based on a patch from Ming Lei.
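
    A rough sketch of the resulting pattern in bio_iov_iter_get_pages()
    (hedged: modeled on the 5.2-era code, not the exact diff; page, len and
    offset come from the surrounding iov iteration loop):
    ------
    bool same_page = false;

    if (__bio_try_merge_page(bio, page, len, offset, &same_page)) {
        /*
         * The data was merged into a page already present in the bio,
         * so drop the extra reference taken by get_user_pages().
         */
        if (same_page)
            put_page(page);
    } else {
        if (WARN_ON_ONCE(bio_full(bio)))
            return -EINVAL;
        __bio_add_page(bio, page, len, offset);
    }
    ------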

    Link: https://lkml.org/lkml/2019/4/23/64
    Fixes: 576ed913 ("block: use bio_add_page in bio_iov_iter_get_pages")
    Reported-by: David Gibson
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We currently have an input same_page parameter to __bio_try_merge_page
    to prohibit merging in the same page. The rationale for that is that
    some callers need to account for every page added to a bio. Instead of
    letting these callers call twice into the merge code to cover the new
    vs existing page cases, just turn the parameter into an output one that
    reports whether a merge in the same page occurred, and let them act
    accordingly.
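
    A minimal sketch of the signature change (illustrative; the real
    prototype lives in include/linux/bio.h):
    ------
    /* before: callers pass same_page to forbid same-page merges */
    bool __bio_try_merge_page(struct bio *bio, struct page *page,
            unsigned int len, unsigned int off, bool same_page);

    /* after: callers are told whether a same-page merge happened */
    bool __bio_try_merge_page(struct bio *bio, struct page *page,
            unsigned int len, unsigned int off, bool *same_page);
    ------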

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

15 Jun, 2019

1 commit

  • Convert the cgroup-v1 files to ReST format, in order to
    allow a later addition to the admin-guide.

    The conversion is actually:
    - add blank lines and indentation in order to identify paragraphs;
    - fix table markups;
    - add some list markups;
    - mark literal blocks;
    - adjust title markups.

    At its new index.rst, let's add a :orphan: while this is not linked to
    the main index.rst file, in order to avoid build warnings.

    Signed-off-by: Mauro Carvalho Chehab
    Acked-by: Tejun Heo
    Signed-off-by: Tejun Heo

    Mauro Carvalho Chehab
     

13 Jun, 2019

3 commits

  • blk_mq_sched_free_requests() may be called in a failure path in which
    q->elevator may not be set up yet, so remove WARN_ON(!q->elevator) from
    blk_mq_sched_free_requests() to avoid the false positive.

    This function is actually safe to call in case of !q->elevator because
    hctx->sched_tags is checked.
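
    A sketch of why the !q->elevator case is safe (roughly the 5.2-era body
    of the function; treat as illustrative):
    ------
    void blk_mq_sched_free_requests(struct request_queue *q)
    {
        struct blk_mq_hw_ctx *hctx;
        int i;

        queue_for_each_hw_ctx(q, hctx, i) {
            /* nothing was allocated for this hctx: skip it */
            if (hctx->sched_tags)
                blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i);
        }
    }
    ------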

    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Yi Zhang
    Fixes: c3e2219216c9 ("block: free sched's request pool in blk_cleanup_queue")
    Reported-by: syzbot+b9d0d56867048c7bcfde@syzkaller.appspotmail.com
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • When calling debugfs functions, there is no need to ever check the
    return value. The function can work or not, but the code logic should
    never do something different based on this.

    When all of these checks are cleaned up, lots of the functions used in
    the blk-mq-debugfs code can be made to return void, as there is no need
    to check their return value either.

    Overall, this ends up cleaning up the code and making it smaller, always
    a nice win.
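
    A small sketch of the before/after pattern (hedged; the file name and
    fops are placeholders, not the actual blk-mq-debugfs entries):
    ------
    /* before: check a return value the caller should not act on */
    if (!debugfs_create_file("state", 0600, parent, data, &fops))
        return -ENOMEM;
    return 0;

    /* after: create the file and carry on; debugfs failure is not fatal */
    debugfs_create_file("state", 0600, parent, data, &fops);
    ------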

    Cc: Jens Axboe
    Cc: linux-block@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Jens Axboe

    Greg Kroah-Hartman
     
  • In most use cases of zoned block devices (aka SMR disks), the
    mq-deadline scheduler is mandatory as it implements sequential write
    command processing guarantees with zone write locking. So make sure that
    this scheduler is always enabled if CONFIG_BLK_DEV_ZONED is selected.
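
    A hedged sketch of the kind of Kconfig change this describes (the exact
    help text is illustrative):
    ------
    config BLK_DEV_ZONED
        bool "Zoned block device support"
        select MQ_IOSCHED_DEADLINE
        help
          Block layer zoned block device support. Selecting this option
          also enables mq-deadline, which provides the zone write locking
          needed for sequential write guarantees.
    ------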

    Tested-by: Chaitanya Kulkarni
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Damien Le Moal
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

10 Jun, 2019

1 commit

  • There's been some discussion on how best to do this, and Tejun prefers
    that BFQ just create the file itself instead of having cgroups support
    a symlink feature.

    Hence revert commits 54b7b868e826 and 19e9da9e86c4 for 5.2; this
    can be done properly for 5.3.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

07 Jun, 2019

2 commits

  • Many userspace tools and services use the proportional-share policy of
    the blkio/io cgroups controller. The CFQ I/O scheduler implemented
    this policy for the legacy block layer. To modify the weight of a
    group in case CFQ was in charge, the 'weight' parameter of the group
    must be modified. On the other hand, the BFQ I/O scheduler implements
    the same policy in blk-mq, but, with BFQ, the parameter to modify has
    a different name: bfq.weight (a forced choice while legacy block was
    still present, because two different policies cannot share a common
    parameter in cgroups).

    Because of this CFQ legacy, most if not all userspace configurations
    still use the parameter 'weight', and for the moment do not seem likely
    to change. But when CFQ went away with legacy block, that parameter
    ceased to exist.

    So, a simple workaround has been proposed [1] to make all
    configurations work: add a symlink, named weight, to bfq.weight. This
    commit adds such a symlink.

    [1] https://lkml.org/lkml/2019/4/8/555

    Suggested-by: Johannes Thumshirn
    Signed-off-by: Angelo Ruocco
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Angelo Ruocco
     
  • In theory, the IO scheduler belongs to the request queue, and the
    request pool of sched tags belongs to the request queue too.

    However, the current tags allocation interfaces are re-used for both
    driver tags and sched tags, and driver tags are definitely host wide
    and don't belong to any request queue, same with their request pool.
    So we need the tagset instance for freeing requests of sched tags.

    Meantime, blk_mq_free_tag_set() often follows blk_cleanup_queue() in
    the non-BLK_MQ_F_TAG_SHARED case, which requires that the request pool
    of sched tags be freed before calling blk_mq_free_tag_set().

    Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
    moved blk_exit_queue into __blk_release_queue to simplify the fast
    path in generic_make_request(), which then caused an oops while freeing
    requests of sched tags in __blk_release_queue().

    Fix the above issue by moving the freeing of the sched tags request
    pool into blk_cleanup_queue(); this is safe because the queue has been
    frozen and there are no in-queue requests at that time. Freeing the
    sched tags themselves has to be kept in the queue's release handler
    because there might be un-completed dispatch activity that might refer
    to the sched tags.
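
    A rough sketch of the call-site change in blk_cleanup_queue() (hedged;
    modeled on the 5.2-era code, with unrelated steps elided):
    ------
    void blk_cleanup_queue(struct request_queue *q)
    {
        /* ... mark the queue dying ... */

        blk_freeze_queue(q);

        /*
         * The queue is frozen and no request is in flight, so the sched
         * tags request pool can be freed safely here, while the tagset
         * is still around.
         */
        mutex_lock(&q->sysfs_lock);
        if (q->elevator)
            blk_mq_sched_free_requests(q);
        mutex_unlock(&q->sysfs_lock);

        /* ... drop the final queue reference ... */
    }
    ------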

    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
    Tested-by: Yi Zhang
    Reported-by: kernel test robot
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

29 May, 2019

2 commits

  • Now that a063057d7c73 ("block: Fix a race between request queue removal
    and the block cgroup controller") has been reverted, blkcg_exit_queue()
    won't be called in blk_cleanup_queue() any more.

    So generic_make_request_checks() no longer needs to be protected with
    blk_queue_enter(), and the whole mess can be cleaned up.

    37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with device
    removal triggers a crash") is reverted.

    Cc: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Commit 498f6650aec8 ("block: Fix a race between the cgroup code and
    request queue initialization") moved what blk_exit_queue does into
    blk_cleanup_queue() to fix an issue caused by switching back the
    queue lock.

    However, after the legacy request IO path was killed, the driver queue
    lock isn't used at all, and there is no switching back of the queue
    lock any more. So the issue addressed by commit 498f6650aec8 no longer
    exists.

    So move blk_exit_queue into __blk_release_queue.

    This patch basically reverts the following two commits:

    498f6650aec8 ("block: Fix a race between the cgroup code and request queue initialization")
    24ecc3585348 ("block: Ensure that a request queue is dissociated from the cgroup controller")

    Cc: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

24 May, 2019

5 commits

  • The following is a description of a hang in blk_mq_freeze_queue_wait().
    The hang happens on an attempt to freeze a queue while another task is
    doing a queue unfreeze.

    The root cause is an incorrectly ordered sequence of
    percpu_ref_resurrect() and percpu_ref_kill(); as a result the two can
    be swapped:

    CPU#0                             CPU#1
    ----------------                  -----------------
    q1 = blk_mq_init_queue(shared_tags)

                                      q2 = blk_mq_init_queue(shared_tags):
                                        blk_mq_add_queue_tag_set(shared_tags):
                                          blk_mq_update_tag_set_depth(shared_tags):
                                            list_for_each_entry()
                                              blk_mq_freeze_queue(q1)
                                               > percpu_ref_kill()
                                               > blk_mq_freeze_queue_wait()

    blk_cleanup_queue(q1)
     blk_mq_freeze_queue(q1)
      > percpu_ref_kill()
        ^^^^^^ freeze_depth can't guarantee the order

                                              blk_mq_unfreeze_queue()
                                               > percpu_ref_resurrect()

     > blk_mq_freeze_queue_wait()
        ^^^^^^ Hang here!!!!

    This wrong sequence raises kernel warning:
    percpu_ref_kill_and_confirm called more than once on blk_queue_usage_counter_release!
    WARNING: CPU: 0 PID: 11854 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x99/0xb0

    But the most unpleasant effect is a hang in blk_mq_freeze_queue_wait(),
    which waits for q_usage_counter to reach zero, which never happens
    because the percpu-ref was reinitialized (instead of being killed) and
    stays in PERCPU state forever.

    How to reproduce:
    - "insmod null_blk.ko shared_tags=1 nr_devices=0 queue_mode=2"
    - cpu0: run "python Script.py 0", using taskset to pin the process to cpu0
    - cpu1: run "python Script.py 1", using taskset to pin the process to cpu1

    Script.py:
    ------
    #!/usr/bin/python3

    import os
    import sys

    while True:
        on = "echo 1 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
        off = "echo 0 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
        os.system(on)
        os.system(off)
    ------

    This bug was first reported and fixed by Roman, previous discussion:
    [1] Message id: 1443287365-4244-7-git-send-email-akinobu.mita@gmail.com
    [2] Message id: 1443563240-29306-6-git-send-email-tj@kernel.org
    [3] https://patchwork.kernel.org/patch/9268199/

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Roman Pen
    Signed-off-by: Bob Liu
    Signed-off-by: Jens Axboe

    Bob Liu
     
  • At this point these fields aren't used for anything, so we can remove
    them.

    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We fundamentally do not have a maximum segment size for devices with a
    virt boundary. So don't bother checking it, especially given that the
    existing checks didn't properly work to start with, as we never fully
    update the front/back segment size and miss the bi_seg_front_size that
    would have been required for some cases.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We currently fail to update the front/back segment size in the bio when
    deciding to allow an otherwise gappy segment to a device with a
    virt boundary. The reason why this did not cause problems is that
    devices with a virt boundary fundamentally don't use segments as we
    know them and thus don't care. Make that assumption formal by forcing
    an unlimited segment size in this case.
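
    A hedged sketch of what "forcing an unlimited segment size" means in
    code (modeled on blk_queue_virt_boundary() of this period; the comment
    wording is ours):
    ------
    void blk_queue_virt_boundary(struct request_queue *q, unsigned long mask)
    {
        q->limits.virt_boundary_mask = mask;

        /*
         * Devices with a virt boundary use a descriptor list entry per
         * page rather than scatter/gather segments, so the notion of a
         * maximum segment size does not apply to them.
         */
        if (mask)
            q->limits.max_segment_size = UINT_MAX;
    }
    ------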

    Fixes: f6970f83ef79 ("block: don't check if adjacent bvecs in one bio can be mergeable")
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Currently ll_merge_requests_fn, unlike all other merge functions,
    reduces nr_phys_segments by one if the last segment of the previous
    request and the first segment of the next request are contiguous. While
    this seems like a nice solution to avoid building smaller than possible
    requests, it causes a mismatch between the segments actually present
    in the request and those iterated over by the bvec iterators, including
    __rq_for_each_bio. This can for example mis-trigger the single segment
    optimization in the nvme-pci driver, and might lead to a mismatching
    nr_phys_segments count when recalculating the number of segments
    while inserting a cloned request.

    We could possibly work around this by making the bvec iterators take
    the front and back segment size into account, but that would require
    moving them from the bio to the bio_iter and spreading this mess
    over all users of bvecs. Or we could simply remove this optimization
    under the assumption that most users already build good enough bvecs,
    and that the bio merge path never cared about this optimization
    either. The latter is what this patch does.
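
    A hedged sketch of the optimization being removed (a fragment
    paraphrasing the old ll_merge_requests_fn(), not the exact diff):
    ------
    /* merging req and next: count segments of the combined request */
    unsigned int total_phys_segments =
            req->nr_phys_segments + next->nr_phys_segments;

    /*
     * The removed special case: if the tail segment of req and the head
     * segment of next are physically contiguous, count them as one.
     */
    if (blk_phys_contig_segment(q, req->biotail, next->bio))
        total_phys_segments--;
    ------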

    dff824b2aadb ("nvme-pci: optimize mapping of small single segment requests").
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

17 May, 2019

1 commit

  • Pull more block updates from Jens Axboe:
    "This is mainly some late lightnvm changes that came in just before the
    merge window, as well as fixes that have been queued up since the
    initial pull request was frozen.

    This contains:

    - lightnvm changes, fixing race conditions, improving memory
    utilization, and improving pblk compatibility (Chansol, Igor,
    Marcin)

    - NVMe pull request with minor fixes all over the map (via Christoph)

    - remove redundant error print in sata_rcar (Geert)

    - struct_size() cleanup (Jackie)

    - dasd CONFIG_LBDAF warning fix (Ming)

    - brd cond_resched() improvement (Mikulas)"

    * tag 'for-5.2/block-post-20190516' of git://git.kernel.dk/linux-block: (41 commits)
    block/bio-integrity: use struct_size() in kmalloc()
    nvme: validate cntlid during controller initialisation
    nvme: change locking for the per-subsystem controller list
    nvme: trace all async notice events
    nvme: fix typos in nvme status code values
    nvme-fabrics: remove unused argument
    nvme-multipath: avoid crash on invalid subsystem cntlid enumeration
    nvme-fc: use separate work queue to avoid warning
    nvme-rdma: remove redundant reference between ib_device and tagset
    nvme-pci: mark expected switch fall-through
    nvme-pci: add known admin effects to augument admin effects log page
    nvme-pci: init shadow doorbell after each reset
    brd: add cond_resched to brd_free_pages
    sata_rcar: Remove ata_host_alloc() error printing
    s390/dasd: fix build warning in dasd_eckd_build_cp_raw
    lightnvm: pblk: use nvm_rq_to_ppa_list()
    lightnvm: pblk: simplify partial read path
    lightnvm: do not remove instance under global lock
    lightnvm: track inflight target creations
    lightnvm: pblk: recover only written metadata
    ...

    Linus Torvalds
     

08 May, 2019

2 commits

  • Pull block updates from Jens Axboe:
    "Nothing major in this series, just fixes and improvements all over the
    map. This contains:

    - Series of fixes for sed-opal (David, Jonas)

    - Fixes and performance tweaks for BFQ (via Paolo)

    - Set of fixes for bcache (via Coly)

    - Set of fixes for md (via Song)

    - Enabling multi-page for passthrough requests (Ming)

    - Queue release fix series (Ming)

    - Device notification improvements (Martin)

    - Propagate underlying device rotational status in loop (Holger)

    - Removal of mtip32xx trim support, which has been disabled for years
    (Christoph)

    - Improvement and cleanup of nvme command handling (Christoph)

    - Add block SPDX tags (Christoph)

    - Cleanup/hardening of bio/bvec iteration (Christoph)

    - A few NVMe pull requests (Christoph)

    - Removal of CONFIG_LBDAF (Christoph)

    - Various little fixes here and there"

    * tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
    block: fix mismerge in bvec_advance
    block: don't drain in-progress dispatch in blk_cleanup_queue()
    blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
    blk-mq: always free hctx after request queue is freed
    blk-mq: split blk_mq_alloc_and_init_hctx into two parts
    blk-mq: free hw queue's resource in hctx's release handler
    blk-mq: move cancel of requeue_work into blk_mq_release
    blk-mq: grab .q_usage_counter when queuing request from plug code path
    block: fix function name in comment
    nvmet: protect discovery change log event list iteration
    nvme: mark nvme_core_init and nvme_core_exit static
    nvme: move command size checks to the core
    nvme-fabrics: check more command sizes
    nvme-pci: check more command sizes
    nvme-pci: remove an unneeded variable initialization
    nvme-pci: unquiesce admin queue on shutdown
    nvme-pci: shutdown on timeout during deletion
    nvme-pci: fix psdt field for single segment sgls
    nvme-multipath: don't print ANA group state by default
    nvme-multipath: split bios with the ns_head bio_set before submitting
    ...

    Linus Torvalds
     
  • Pull driver core/kobject updates from Greg KH:
    "Here is the "big" set of driver core patches for 5.2-rc1

    There are a number of ACPI patches in here as well, as Rafael said
    they should go through this tree due to the driver core changes they
    required. They have all been acked by the ACPI developers.

    There are also a number of small subsystem-specific changes in here,
    due to some changes to the kobject core code. Those too have all been
    acked by the various subsystem maintainers.

    As for content, it's pretty boring outside of the ACPI changes:
    - spdx cleanups
    - kobject documentation updates
    - default attribute groups for kobjects
    - other minor kobject/driver core fixes

    All have been in linux-next for a while with no reported issues"

    * tag 'driver-core-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (47 commits)
    kobject: clean up the kobject add documentation a bit more
    kobject: Fix kernel-doc comment first line
    kobject: Remove docstring reference to kset
    firmware_loader: Fix a typo ("syfs" -> "sysfs")
    kobject: fix dereference before null check on kobj
    Revert "driver core: platform: Fix the usage of platform device name(pdev->name)"
    init/config: Do not select BUILD_BIN2C for IKCONFIG
    Provide in-kernel headers to make extending kernel easier
    kobject: Improve doc clarity kobject_init_and_add()
    kobject: Improve docs for kobject_add/del
    driver core: platform: Fix the usage of platform device name(pdev->name)
    livepatch: Replace klp_ktype_patch's default_attrs with groups
    cpufreq: schedutil: Replace default_attrs field with groups
    padata: Replace padata_attr_type default_attrs field with groups
    irqdesc: Replace irq_kobj_type's default_attrs field with groups
    net-sysfs: Replace ktype default_attrs field with groups
    block: Replace all ktype default_attrs with groups
    samples/kobject: Replace foo_ktype's default_attrs field with groups
    kobject: Add support for default attribute groups to kobj_type
    driver core: Postpone DMA tear-down until after devres release for probe failure
    ...

    Linus Torvalds
     

04 May, 2019

6 commits

  • Now that freeing the hw queue resource has been moved to hctx's release
    handler, we don't need to worry about the race between blk_cleanup_queue
    and run queue any more.

    So don't drain in-progress dispatch in blk_cleanup_queue().

    This is basically a revert of c2856ae2f315 ("blk-mq: quiesce queue before
    freeing queue").

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • hctx is always released after requeue is freed.

    With the queue's kobject refcount held, it is safe for the driver to
    run the queue, so one run-queue instance might be scheduled after
    blk_sync_queue() is done.

    So move the cancel of hctx->run_work into blk_mq_hw_sysfs_release()
    to avoid running a released queue.

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • In the normal queue cleanup path, hctx is released after the request
    queue is freed, see blk_mq_release().

    However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
    of hw queue shrinking. This can easily cause use-after-free, because
    one implicit rule is that it is safe to call almost all block layer
    APIs as long as the request queue is alive; so one hctx may be
    retrieved by one API, then that hctx can be freed by
    blk_mq_update_nr_hw_queues(), and finally a use-after-free is
    triggered.

    Fix this issue by always freeing hctx after releasing the request
    queue. If some hctxs are removed in blk_mq_update_nr_hw_queues(),
    introduce a per-queue list to hold them, then try to reuse these hctxs
    if the NUMA node matches.

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Hannes Reinecke
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Split blk_mq_alloc_and_init_hctx into two parts: blk_mq_alloc_hctx(),
    which allocates all hctx resources, and blk_mq_init_hctx(), which
    initializes the hctx and serves as the counterpart of
    blk_mq_exit_hctx().
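
    A rough sketch of the resulting split (hedged; parameter lists are
    abbreviated relative to the real code):
    ------
    static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
            struct blk_mq_tag_set *set, struct request_queue *q,
            int hctx_idx, int node)
    {
        struct blk_mq_hw_ctx *hctx;

        /* pure allocation: undone by releasing the hctx */
        hctx = blk_mq_alloc_hctx(q, set, node);
        if (!hctx)
            return NULL;

        /* pure initialization: undone by blk_mq_exit_hctx() */
        if (blk_mq_init_hctx(q, set, hctx, hctx_idx)) {
            kobject_put(&hctx->kobj);
            return NULL;
        }

        return hctx;
    }
    ------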

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Once blk_cleanup_queue() returns, tags shouldn't be used any more,
    because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
    ("blk-mq: Fix a use-after-free") fixes this issue exactly.

    However, that commit introduces another issue. Before 45a9c9d909b2,
    we were allowed to run the queue during queue cleanup if the queue's
    kobj refcount was held. After that commit, the queue can't be run
    during queue cleanup, otherwise an oops can be triggered easily,
    because some fields of hctx are freed by blk_mq_free_queue() in
    blk_cleanup_queue().

    We have invented ways of addressing this kind of issue before, such as:

    8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
    c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

    But they still can't cover all cases; recently James reported another
    such issue:

    https://marc.info/?l=linux-scsi&m=155389088124782&w=2

    This issue can be quite hard to address in the previous way, given that
    scsi_run_queue() may run requeues for other LUNs.

    Fix the above issue by freeing hctx's resources in its release handler;
    this is safe because tags aren't needed for freeing such hctx
    resources.

    This approach follows the typical design pattern for a kobject's
    release handler.
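
    A hedged sketch of the release-handler approach (the freed fields are
    representative of the 5.2-era struct blk_mq_hw_ctx, not an exact diff):
    ------
    static void blk_mq_hw_sysfs_release(struct kobject *kobj)
    {
        struct blk_mq_hw_ctx *hctx = container_of(kobj,
                struct blk_mq_hw_ctx, kobj);

        /* none of these resources need the tagset to still be alive */
        free_cpumask_var(hctx->cpumask);
        kfree(hctx->ctxs);
        kfree(hctx);
    }
    ------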

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reported-by: James Smart
    Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
    Cc: stable@vger.kernel.org
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • With the queue's kobject refcount held, it is safe for the driver
    to schedule a requeue. However, blk_mq_kick_requeue_list() may
    be called after blk_sync_queue() is done because of concurrent
    requeue activity, so the requeue work may not be completed when
    the queue is freed, and a kernel oops is triggered.

    So move the cancel of requeue_work into blk_mq_release() to
    avoid the race between requeue and freeing the queue.

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Bart Van Assche
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei