07 Jun, 2019

2 commits

  • Many userspace tools and services use the proportional-share policy of
    the blkio/io cgroups controller. The CFQ I/O scheduler implemented
    this policy for the legacy block layer. To modify the weight of a
    group in case CFQ was in charge, the 'weight' parameter of the group
    must be modified. On the other hand, the BFQ I/O scheduler implements
    the same policy in blk-mq, but, with BFQ, the parameter to modify has
    a different name: bfq.weight (forced choice until legacy block was
    present, because two different policies cannot share a common parameter
    in cgroups).

    Due to CFQ legacy, most if not all userspace configurations still use
    the parameter 'weight', and for the moment do not seem likely to be
    changed. But, when CFQ went away with legacy block, such a parameter
    ceased to exist.

    So, a simple workaround has been proposed [1] to make all
    configurations work: add a symlink, named weight, to bfq.weight. This
    commit adds such a symlink.

    [1] https://lkml.org/lkml/2019/4/8/555

    Suggested-by: Johannes Thumshirn
    Signed-off-by: Angelo Ruocco
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Angelo Ruocco
     
  • In theory, IO scheduler belongs to request queue, and the request pool
    of sched tags belongs to the request queue too.

    However, the current tags allocation interfaces are re-used for both
    driver tags and sched tags, and driver tags is definitely host wide,
    and doesn't belong to any request queue, same with its request pool.
    So we need tagset instance for freeing request of sched tags.

    Meantime, blk_mq_free_tag_set() often follows blk_cleanup_queue() in case
    of non-BLK_MQ_F_TAG_SHARED, this way requires that request pool of sched
    tags to be freed before calling blk_mq_free_tag_set().

    Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
    moves blk_exit_queue into __blk_release_queue for simplying the fast
    path in generic_make_request(), then causes oops during freeing requests
    of sched tags in __blk_release_queue().

    Fix the above issue by move freeing request pool of sched tags into
    blk_cleanup_queue(), this way is safe becasue queue has been frozen and no any
    in-queue requests at that time. Freeing sched tags has to be kept in queue's
    release handler becasue there might be un-completed dispatch activity
    which might refer to sched tags.

    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
    Tested-by: Yi Zhang
    Reported-by: kernel test robot
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

05 Jun, 2019

1 commit


01 Jun, 2019

9 commits


30 May, 2019

1 commit


29 May, 2019

2 commits

  • Now a063057d7c73 ("block: Fix a race between request queue removal and
    the block cgroup controller") has been reverted, and blkcg_exit_queue()
    won't be called in blk_cleanup_queue() any more.

    So don't need to protect generic_make_request_checks() with
    blk_queue_enter(), then the total mess can be cleaned.

    37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with device
    removal triggers a crash") is reverted.

    Cc: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Commit 498f6650aec8 ("block: Fix a race between the cgroup code and
    request queue initialization") moves what blk_exit_queue does into
    blk_cleanup_queue() for fixing issue caused by changing back
    queue lock.

    However, after legacy request IO path is killed, driver queue lock
    won't be used at all, and there isn't story for changing back
    queue lock. Then the issue addressed by Commit 498f6650aec8 doesn't
    exist any more.

    So move move blk_exit_queue into __blk_release_queue.

    This patch basically reverts the following two commits:

    498f6650aec8 block: Fix a race between the cgroup code and request queue initialization
    24ecc3585348 block: Ensure that a request queue is dissociated from the cgroup controller

    Cc: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

24 May, 2019

5 commits

  • The following is a description of a hang in blk_mq_freeze_queue_wait().
    The hang happens on attempt to freeze a queue while another task does
    queue unfreeze.

    The root cause is an incorrect sequence of percpu_ref_resurrect() and
    percpu_ref_kill() and as a result those two can be swapped:

    CPU#0 CPU#1
    ---------------- -----------------
    q1 = blk_mq_init_queue(shared_tags)

    q2 = blk_mq_init_queue(shared_tags):
    blk_mq_add_queue_tag_set(shared_tags):
    blk_mq_update_tag_set_depth(shared_tags):
    list_for_each_entry()
    blk_mq_freeze_queue(q1)
    > percpu_ref_kill()
    > blk_mq_freeze_queue_wait()

    blk_cleanup_queue(q1)
    blk_mq_freeze_queue(q1)
    > percpu_ref_kill()
    ^^^^^^ freeze_depth can't guarantee the order

    blk_mq_unfreeze_queue()
    > percpu_ref_resurrect()

    > blk_mq_freeze_queue_wait()
    ^^^^^^ Hang here!!!!

    This wrong sequence raises kernel warning:
    percpu_ref_kill_and_confirm called more than once on blk_queue_usage_counter_release!
    WARNING: CPU: 0 PID: 11854 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x99/0xb0

    But the most unpleasant effect is a hang of a blk_mq_freeze_queue_wait(),
    which waits for a zero of a q_usage_counter, which never happens
    because percpu-ref was reinited (instead of being killed) and stays in
    PERCPU state forever.

    How to reproduce:
    - "insmod null_blk.ko shared_tags=1 nr_devices=0 queue_mode=2"
    - cpu0: python Script.py 0; taskset the corresponding process running on cpu0
    - cpu1: python Script.py 1; taskset the corresponding process running on cpu1

    Script.py:
    ------
    #!/usr/bin/python3

    import os
    import sys

    while True:
    on = "echo 1 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
    off = "echo 0 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
    os.system(on)
    os.system(off)
    ------

    This bug was first reported and fixed by Roman, previous discussion:
    [1] Message id: 1443287365-4244-7-git-send-email-akinobu.mita@gmail.com
    [2] Message id: 1443563240-29306-6-git-send-email-tj@kernel.org
    [3] https://patchwork.kernel.org/patch/9268199/

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Roman Pen
    Signed-off-by: Bob Liu
    Signed-off-by: Jens Axboe

    Bob Liu
     
  • At this point these fields aren't used for anything, so we can remove
    them.

    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We fundamentally do not have a maximum segement size for devices with a
    virt boundary. So don't bother checking it, especially given that the
    existing checks didn't properly work to start with as we never fully
    update the front/back segment size and miss the bi_seg_front_size that
    wuld have been required for some cases.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We currently fail to update the front/back segment size in the bio when
    deciding to allow an otherwise gappy segement to a device with a
    virt boundary. The reason why this did not cause problems is that
    devices with a virt boundary fundamentally don't use segments as we
    know it and thus don't care. Make that assumption formal by forcing
    an unlimited segement size in this case.

    Fixes: f6970f83ef79 ("block: don't check if adjacent bvecs in one bio can be mergeable")
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Currently ll_merge_requests_fn, unlike all other merge functions,
    reduces nr_phys_segments by one if the last segment of the previous,
    and the first segment of the next segement are contigous. While this
    seems like a nice solution to avoid building smaller than possible
    requests it causes a mismatch between the segments actually present
    in the request and those iterated over by the bvec iterators, including
    __rq_for_each_bio. This can for example mistrigger the single segment
    optimization in the nvme-pci driver, and might lead to mismatching
    nr_phys_segments number when recalculating the number of request
    when inserting a cloned request.

    We could possibly work around this by making the bvec iterators take
    the front and back segment size into account, but that would require
    moving them from the bio to the bio_iter and spreading this mess
    over all users of bvecs. Or we could simply remove this optimization
    under the assumption that most users already build good enough bvecs,
    and that the bio merge patch never cared about this optimization
    either. The latter is what this patch does.

    dff824b2aadb ("nvme-pci: optimize mapping of small single segment requests").
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

17 May, 2019

1 commit

  • Pull more block updates from Jens Axboe:
    "This is mainly some late lightnvm changes that came in just before the
    merge window, as well as fixes that have been queued up since the
    initial pull request was frozen.

    This contains:

    - lightnvm changes, fixing race conditions, improving memory
    utilization, and improving pblk compatability (Chansol, Igor,
    Marcin)

    - NVMe pull request with minor fixes all over the map (via Christoph)

    - remove redundant error print in sata_rcar (Geert)

    - struct_size() cleanup (Jackie)

    - dasd CONFIG_LBADF warning fix (Ming)

    - brd cond_resched() improvement (Mikulas)"

    * tag 'for-5.2/block-post-20190516' of git://git.kernel.dk/linux-block: (41 commits)
    block/bio-integrity: use struct_size() in kmalloc()
    nvme: validate cntlid during controller initialisation
    nvme: change locking for the per-subsystem controller list
    nvme: trace all async notice events
    nvme: fix typos in nvme status code values
    nvme-fabrics: remove unused argument
    nvme-multipath: avoid crash on invalid subsystem cntlid enumeration
    nvme-fc: use separate work queue to avoid warning
    nvme-rdma: remove redundant reference between ib_device and tagset
    nvme-pci: mark expected switch fall-through
    nvme-pci: add known admin effects to augument admin effects log page
    nvme-pci: init shadow doorbell after each reset
    brd: add cond_resched to brd_free_pages
    sata_rcar: Remove ata_host_alloc() error printing
    s390/dasd: fix build warning in dasd_eckd_build_cp_raw
    lightnvm: pblk: use nvm_rq_to_ppa_list()
    lightnvm: pblk: simplify partial read path
    lightnvm: do not remove instance under global lock
    lightnvm: track inflight target creations
    lightnvm: pblk: recover only written metadata
    ...

    Linus Torvalds
     

16 May, 2019

1 commit


08 May, 2019

2 commits

  • Pull block updates from Jens Axboe:
    "Nothing major in this series, just fixes and improvements all over the
    map. This contains:

    - Series of fixes for sed-opal (David, Jonas)

    - Fixes and performance tweaks for BFQ (via Paolo)

    - Set of fixes for bcache (via Coly)

    - Set of fixes for md (via Song)

    - Enabling multi-page for passthrough requests (Ming)

    - Queue release fix series (Ming)

    - Device notification improvements (Martin)

    - Propagate underlying device rotational status in loop (Holger)

    - Removal of mtip32xx trim support, which has been disabled for years
    (Christoph)

    - Improvement and cleanup of nvme command handling (Christoph)

    - Add block SPDX tags (Christoph)

    - Cleanup/hardening of bio/bvec iteration (Christoph)

    - A few NVMe pull requests (Christoph)

    - Removal of CONFIG_LBDAF (Christoph)

    - Various little fixes here and there"

    * tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
    block: fix mismerge in bvec_advance
    block: don't drain in-progress dispatch in blk_cleanup_queue()
    blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
    blk-mq: always free hctx after request queue is freed
    blk-mq: split blk_mq_alloc_and_init_hctx into two parts
    blk-mq: free hw queue's resource in hctx's release handler
    blk-mq: move cancel of requeue_work into blk_mq_release
    blk-mq: grab .q_usage_counter when queuing request from plug code path
    block: fix function name in comment
    nvmet: protect discovery change log event list iteration
    nvme: mark nvme_core_init and nvme_core_exit static
    nvme: move command size checks to the core
    nvme-fabrics: check more command sizes
    nvme-pci: check more command sizes
    nvme-pci: remove an unneeded variable initialization
    nvme-pci: unquiesce admin queue on shutdown
    nvme-pci: shutdown on timeout during deletion
    nvme-pci: fix psdt field for single segment sgls
    nvme-multipath: don't print ANA group state by default
    nvme-multipath: split bios with the ns_head bio_set before submitting
    ...

    Linus Torvalds
     
  • Pull driver core/kobject updates from Greg KH:
    "Here is the "big" set of driver core patches for 5.2-rc1

    There are a number of ACPI patches in here as well, as Rafael said
    they should go through this tree due to the driver core changes they
    required. They have all been acked by the ACPI developers.

    There are also a number of small subsystem-specific changes in here,
    due to some changes to the kobject core code. Those too have all been
    acked by the various subsystem maintainers.

    As for content, it's pretty boring outside of the ACPI changes:
    - spdx cleanups
    - kobject documentation updates
    - default attribute groups for kobjects
    - other minor kobject/driver core fixes

    All have been in linux-next for a while with no reported issues"

    * tag 'driver-core-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (47 commits)
    kobject: clean up the kobject add documentation a bit more
    kobject: Fix kernel-doc comment first line
    kobject: Remove docstring reference to kset
    firmware_loader: Fix a typo ("syfs" -> "sysfs")
    kobject: fix dereference before null check on kobj
    Revert "driver core: platform: Fix the usage of platform device name(pdev->name)"
    init/config: Do not select BUILD_BIN2C for IKCONFIG
    Provide in-kernel headers to make extending kernel easier
    kobject: Improve doc clarity kobject_init_and_add()
    kobject: Improve docs for kobject_add/del
    driver core: platform: Fix the usage of platform device name(pdev->name)
    livepatch: Replace klp_ktype_patch's default_attrs with groups
    cpufreq: schedutil: Replace default_attrs field with groups
    padata: Replace padata_attr_type default_attrs field with groups
    irqdesc: Replace irq_kobj_type's default_attrs field with groups
    net-sysfs: Replace ktype default_attrs field with groups
    block: Replace all ktype default_attrs with groups
    samples/kobject: Replace foo_ktype's default_attrs field with groups
    kobject: Add support for default attribute groups to kobj_type
    driver core: Postpone DMA tear-down until after devres release for probe failure
    ...

    Linus Torvalds
     

04 May, 2019

7 commits

  • Now freeing hw queue resource is moved to hctx's release handler,
    we don't need to worry about the race between blk_cleanup_queue and
    run queue any more.

    So don't drain in-progress dispatch in blk_cleanup_queue().

    This is basically revert of c2856ae2f315 ("blk-mq: quiesce queue before
    freeing queue").

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org,
    Cc: Martin K . Petersen ,
    Cc: Christoph Hellwig ,
    Cc: James E . J . Bottomley ,
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • hctx is always released after requeue is freed.

    With holding queue's kobject refcount, it is safe for driver to run queue,
    so one run queue might be scheduled after blk_sync_queue() is done.

    So moving the cancel of hctx->run_work into blk_mq_hw_sysfs_release()
    for avoiding run released queue.

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org,
    Cc: Martin K . Petersen ,
    Cc: Christoph Hellwig ,
    Cc: James E . J . Bottomley ,
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • In normal queue cleanup path, hctx is released after request queue
    is freed, see blk_mq_release().

    However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
    of hw queues shrinking. This way is easy to cause use-after-free,
    because: one implicit rule is that it is safe to call almost all block
    layer APIs if the request queue is alive; and one hctx may be retrieved
    by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
    finally use-after-free is triggered.

    Fixes this issue by always freeing hctx after releasing request queue.
    If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
    a per-queue list to hold them, then try to resuse these hctxs if numa
    node is matched.

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org,
    Cc: Martin K . Petersen ,
    Cc: Christoph Hellwig ,
    Cc: James E . J . Bottomley ,
    Reviewed-by: Hannes Reinecke
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Split blk_mq_alloc_and_init_hctx into two parts, and one is
    blk_mq_alloc_hctx() for allocating all hctx resources, another
    is blk_mq_init_hctx() for initializing hctx, which serves as
    counter-part of blk_mq_exit_hctx().

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K . Petersen
    Cc: Christoph Hellwig
    Cc: James E . J . Bottomley
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Once blk_cleanup_queue() returns, tags shouldn't be used any more,
    because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
    ("blk-mq: Fix a use-after-free") fixes this issue exactly.

    However, that commit introduces another issue. Before 45a9c9d909b2,
    we are allowed to run queue during cleaning up queue if the queue's
    kobj refcount is held. After that commit, queue can't be run during
    queue cleaning up, otherwise oops can be triggered easily because
    some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

    We have invented ways for addressing this kind of issue before, such as:

    8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
    c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

    But still can't cover all cases, recently James reports another such
    kind of issue:

    https://marc.info/?l=linux-scsi&m=155389088124782&w=2

    This issue can be quite hard to address by previous way, given
    scsi_run_queue() may run requeues for other LUNs.

    Fixes the above issue by freeing hctx's resources in its release handler, and this
    way is safe becasue tags isn't needed for freeing such hctx resource.

    This approach follows typical design pattern wrt. kobject's release handler.

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org,
    Cc: Martin K . Petersen ,
    Cc: Christoph Hellwig ,
    Cc: James E . J . Bottomley ,
    Reported-by: James Smart
    Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
    Cc: stable@vger.kernel.org
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • With holding queue's kobject refcount, it is safe for driver
    to schedule requeue. However, blk_mq_kick_requeue_list() may
    be called after blk_sync_queue() is done because of concurrent
    requeue activities, then requeue work may not be completed when
    freeing queue, and kernel oops is triggered.

    So moving the cancel of requeue_work into blk_mq_release() for
    avoiding race between requeue and freeing queue.

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org,
    Cc: Martin K . Petersen ,
    Cc: Christoph Hellwig ,
    Cc: James E . J . Bottomley ,
    Reviewed-by: Bart Van Assche
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Just like aio/io_uring, we need to grab 2 refcount for queuing one
    request, one is for submission, another is for completion.

    If the request isn't queued from plug code path, the refcount grabbed
    in generic_make_request() serves for submission. In theroy, this
    refcount should have been released after the sumission(async run queue)
    is done. blk_freeze_queue() works with blk_sync_queue() together
    for avoiding race between cleanup queue and IO submission, given async
    run queue activities are canceled because hctx->run_work is scheduled with
    the refcount held, so it is fine to not hold the refcount when
    running the run queue work function for dispatch IO.

    However, if request is staggered into plug list, and finally queued
    from plug code path, the refcount in submission side is actually missed.
    And we may start to run queue after queue is removed because the queue's
    kobject refcount isn't guaranteed to be grabbed in flushing plug list
    context, then kernel oops is triggered, see the following race:

    blk_mq_flush_plug_list():
    blk_mq_sched_insert_requests()
    insert requests to sw queue or scheduler queue
    blk_mq_run_hw_queue

    Because of concurrent run queue, all requests inserted above may be
    completed before calling the above blk_mq_run_hw_queue. Then queue can
    be freed during the above blk_mq_run_hw_queue().

    Fixes the issue by grab .q_usage_counter before calling
    blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This way is
    safe because the queue is absolutely alive before inserting request.

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: linux-scsi@vger.kernel.org,
    Cc: Martin K . Petersen ,
    Cc: Christoph Hellwig ,
    Cc: James E . J . Bottomley ,
    Reviewed-by: Bart Van Assche
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

03 May, 2019

1 commit


01 May, 2019

4 commits


30 Apr, 2019

4 commits