29 Oct, 2018

16 commits

  • bio_check_eod() should check partition size not the whole disk if
    bio->bi_partno is non-zero. Do this by moving the call
    to bio_check_eod() into blk_partition_remap().

    Based on an earlier patch from Jiufei Xue.

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Reported-by: Jiufei Xue
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    (cherry picked from commit 52c5e62d4c4beecddc6e1b8045ce1d695fca1ba7)

    Christoph Hellwig
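    The check described above can be sketched as a toy Python model (not the
    kernel C code; all function names and the partition geometry below are
    illustrative): bounds-check the bio against the partition size first,
    then remap its sector onto the whole disk.

```python
# Toy model of the fix: check end-of-device against the *partition*
# size (not the whole-disk capacity) before remapping the bio.

def bio_check_eod(sector, nr_sectors, maxsector):
    """Reject I/O that runs past `maxsector` of the target device."""
    return sector + nr_sectors <= maxsector

def blk_partition_remap(sector, nr_sectors, part_start, part_size):
    """Check against the partition size first, then remap onto the disk."""
    if not bio_check_eod(sector, nr_sectors, part_size):
        return None                  # I/O beyond the end of the partition
    return sector + part_start       # absolute sector on the whole disk
```

    With the old ordering, a bio that fit inside the disk but ran past the
    end of its partition would pass the check; doing the check inside the
    remap closes that hole.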
     
  • Similar to blkdev_write_iter(), return -EPERM if the partition is
    read-only. This covers ioctl(), fallocate() and most in-kernel users
    but isn't meant to be exhaustive -- everything else will be caught in
    generic_make_request_checks(), fail with -EIO and can be fixed later.

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    (cherry picked from commit a13553c777375009584741e7d9982e775c4b0744)

    Ilya Dryomov
     
  • Regular block device writes go through blkdev_write_iter(), which does
    bdev_read_only(), while zeroout/discard/etc requests are never checked,
    both userspace- and kernel-triggered. Add a generic catch-all check to
    generic_make_request_checks() to actually enforce ioctl(BLKROSET) and
    set_disk_ro(), which is used by quite a few drivers for things like
    snapshots, read-only backing files/images, etc.

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    (cherry picked from commit 721c7fc701c71f693307d274d2b346a1ecd4a534)

    Ilya Dryomov
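    The two read-only patches above can be modeled in a few lines of toy
    Python (illustrative names and error values, not the kernel API): the
    syscall-level paths fail early with -EPERM, while the catch-all in
    generic_make_request_checks() fails anything that slips through with
    -EIO.

```python
# Toy model of the two layers of read-only enforcement described above.

EPERM, EIO = 1, 5

def blkdev_level_check(op_is_write, part_ro):
    """Early check in ioctl()/fallocate()-style entry points."""
    return -EPERM if op_is_write and part_ro else 0

def generic_make_request_checks(op_is_write, part_ro):
    """Catch-all for everything else, including kernel-internal writers."""
    return -EIO if op_is_write and part_ro else 0
```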
     
  • Export these two interfaces for cgroup-v1.

    Acked-by: Tejun Heo
    Signed-off-by: weiping zhang
    Signed-off-by: Jens Axboe
    (cherry picked from commit 17534c6f2c065ad8e34ff6f013e5afaa90428512)

    weiping zhang
     
  • The __blk_mq_register_dev(), blk_mq_unregister_dev(),
    elv_register_queue() and elv_unregister_queue() calls need to be
    protected with sysfs_lock, but the other code in these functions does
    not. Hence protect only this code with sysfs_lock. This patch fixes a
    locking inversion issue in blk_unregister_queue() and also in an
    error path of blk_register_queue(): it is not allowed to hold
    sysfs_lock around the kobject_del(&q->kobj) call.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    (cherry picked from commit 2c2086afc2b8b974fac32cb028e73dc27bfae442)

    Bart Van Assche
     
  • For as long as I can remember, DM has forced the block layer to allow
    the allocation and initialization of the request_queue to be distinct
    operations. The reason is that block/genhd.c:add_disk() requires
    that the request_queue (and associated bdi) be tied to the gendisk
    before add_disk() is called -- because add_disk() also deals with
    exposing the request_queue via blk_register_queue().

    DM's dynamic creation of arbitrary device types (and associated
    request_queue types) requires the DM device's gendisk be available so
    that DM table loads can establish a master/slave relationship with
    subordinate devices that are referenced by loaded DM tables -- using
    bd_link_disk_holder(). But until these DM tables, and their associated
    subordinate devices, are known DM cannot know what type of request_queue
    it needs -- nor what its queue_limits should be.

    This chicken and egg scenario has created all manner of problems for DM
    and, at times, the block layer.

    Summary of changes:

    - Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
    that drivers may use to add a disk without also calling
    blk_register_queue(). Driver must call blk_register_queue() once its
    request_queue is fully initialized.

    - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
    is not set. It won't be set if driver used add_disk_no_queue_reg()
    but driver encounters an error and must del_gendisk() before calling
    blk_register_queue().

    - Export blk_register_queue().

    These changes allow DM to use add_disk_no_queue_reg() to anchor its
    gendisk as the "master" for master/slave relationships DM must establish
    with subordinate devices referenced in DM tables that get loaded. Once
    all "slave" devices for a DM device are known its request_queue can be
    properly initialized and then advertised via sysfs -- important
    improvement being that no request_queue resource initialization
    performed by blk_register_queue() is missed for DM devices anymore.

    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe
    (cherry picked from commit fa70d2e2c4a0a54ced98260c6a176cc94c876d27)

    Mike Snitzer
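    The registration flow above can be sketched as a toy Python state
    machine (a simplified illustration, not the kernel implementation;
    names mirror the changelog): the driver adds the disk without
    registering the queue, registers it once limits are known, and the
    early return protects the error path.

```python
# Toy state machine for the add_disk_no_queue_reg() flow described above.

class Queue:
    def __init__(self):
        self.registered = False      # models QUEUE_FLAG_REGISTERED

def add_disk_no_queue_reg(queue):
    """Add the gendisk without calling blk_register_queue()."""
    pass                             # disk is visible, queue is not

def blk_register_queue(queue):
    """Called by the driver once the request_queue is fully initialized."""
    queue.registered = True

def blk_unregister_queue(queue):
    if not queue.registered:         # driver erred out before registering
        return "early-return"
    queue.registered = False
    return "unregistered"
```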
     
  • The original commit e9a823fb34a8b (block: fix warning when I/O elevator
    is changed as request_queue is being removed) is pretty conflated.
    "conflated" because the resource being protected by q->sysfs_lock isn't
    the queue_flags (it is the 'queue' kobj).

    q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
    from racing with blk_unregister_queue():
    1) By holding q->sysfs_lock first, __elevator_change() can complete
    before a racing blk_unregister_queue().
    2) Conversely, __elevator_change() tests for QUEUE_FLAG_REGISTERED
    because, if elv_iosched_store() loses the race with
    blk_unregister_queue(), it needs a way to know the 'queue' kobj isn't
    there.

    Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
    held until after the 'queue' kobj is removed.

    To do so blk_mq_unregister_dev() must not also take q->sysfs_lock. So
    rename __blk_mq_unregister_dev() to blk_mq_unregister_dev().

    Also, blk_unregister_queue() should use q->queue_lock to protect against
    any concurrent writes to q->queue_flags -- even though chances are the
    queue is being cleaned up so no concurrent writes are likely.

    Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as request_queue is being removed")
    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe
    (cherry picked from commit 667257e8b2988c0183ba23e2bcd6900e87961606)

    Mike Snitzer
     
  • device_add_disk() will only call bdi_register_owner() if
    !GENHD_FL_HIDDEN, so it follows that del_gendisk() should only call
    bdi_unregister() if !GENHD_FL_HIDDEN.

    Found with code inspection. bdi_unregister() won't do any harm if
    bdi_register_owner() wasn't used, but it is best to avoid the
    unnecessary call to bdi_unregister().

    Fixes: 8ddcd65325 ("block: introduce GENHD_FL_HIDDEN")
    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit bc8d062c36e3525e81ea8237ff0ab3264c2317b6)

    Mike Snitzer
     
  • So that we can also poll non-blk-mq queues. Mostly needed for
    the NVMe multipath code, but could also be useful elsewhere.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit ea435e1b9392a33deceaea2a16ebaa3397bead93)

    Christoph Hellwig
     
  • With this flag a driver can create a gendisk that can be used for I/O
    submission inside the kernel, but which is not registered as user
    facing block device. This will be useful for the NVMe multipath
    implementation.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    (cherry picked from commit 8ddcd653257c18a669fcb75ee42c37054908e0d6)

    Christoph Hellwig
     
  • The hidden gendisks introduced in the next patch need to keep the dev
    field in their struct device empty so that udev won't try to create
    block device nodes for them. To support that rewrite disk_devt to
    look at the major and first_minor fields in the gendisk itself instead
    of looking into the struct device.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit 517bf3c306bad4fe0da631f90ae2ec40924dee2b)

    Christoph Hellwig
     
  • This helper allows stealing the uncompleted bios from a request so
    that they can be reissued on another path.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit ef71de8b15d891b27b8c983a9a8972b11cb4576a)

    Christoph Hellwig
     
  • This helper allows reinserting a bio into a new queue without much
    overhead, but requires all queue limits to be the same for the upper
    and lower queues, and it does not provide any recursion preventions.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Javier González
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit f421e1d9ade4e1b88183e54425cf50e390d16a7f)

    Christoph Hellwig
     
  • This patch does not change any functionality.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    (cherry picked from commit 14a23498ba97683c6790b1bcd8b2cdfe9ad99797)

    Bart Van Assche
     
  • These two functions are only called from inside the block layer so
    unexport them.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    (cherry picked from commit 83d016ac86428dbca8a62d3e4fdc29e3ea39e535)

    Bart Van Assche
     
  • errata:
    When a read command returns less data than specified in the PRDs (for
    example, there are two PRDs for this command, but the device returns a
    number of bytes which is less than in the first PRD), the second PRD of
    this command is not read out of the PRD FIFO, causing the next command
    to use this PRD erroneously.

    workaround:
    - force sg_tablesize = 1
    - modify the sg_io function in block/scsi_ioctl.c to use a 64k buffer
    allocated with dma_alloc_coherent during the probe in ahci_imx
    - to fix the SCSI/SATA hang when a CD-ROM and HDD are accessed
    simultaneously after the workaround is applied, do not go to sleep in
    scsi_eh_handler when there is a host failure.

    Signed-off-by: Richard Zhu

    Richard Zhu
     

13 Oct, 2018

1 commit

  • commit 587562d0c7cd6861f4f90a2eb811cccb1a376f5f upstream.

    trace_block_unplug() takes true for explicit unplugs and false for
    implicit unplugs. schedule() unplugs are implicit and should be
    reported as timer unplugs. While correct in the legacy code, this has
    been inverted in blk-mq since 4.11.

    Cc: stable@vger.kernel.org
    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     

26 Sep, 2018

3 commits

  • [ Upstream commit 1311326cf4755c7ffefd20f576144ecf46d9906b ]

    SCSI probing may synchronously create and destroy a lot of request_queues
    for non-existent devices. Any synchronize_rcu() in the queue creation
    or destruction path may introduce long latency during booting; see the
    detailed description in the comment of blk_register_queue().

    This patch removes one synchronize_rcu() inside blk_cleanup_queue()
    for this case. Commit c2856ae2f315d75 ("blk-mq: quiesce queue before
    freeing queue") needs synchronize_rcu() for implementing
    blk_mq_quiesce_queue(), but when the queue isn't initialized it isn't
    necessary to do that, since only pass-through requests are involved
    and there is no such issue in scsi_execute() at all.

    Without this patch and the previous one, it may take 20+ seconds for
    virtio-scsi to complete disk probing. With the two patches, the time
    drops to less than 100ms.

    Fixes: c2856ae2f315d75 ("blk-mq: quiesce queue before freeing queue")
    Reported-by: Andrew Jones
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: "Martin K. Petersen"
    Cc: Christoph Hellwig
    Tested-by: Andrew Jones
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • [ Upstream commit b04f50ab8a74129b3041a2836c33c916be3c6667 ]

    Only attempt to merge a bio if ctx->rq_list isn't empty, because:

    1) for a high-performance SSD, dispatch usually succeeds, so there may
    be nothing left in ctx->rq_list; by not trying to merge over an empty
    sw queue we save one acquisition of ctx->lock

    2) we can't expect good merge performance on the per-cpu sw queue
    anyway, and missing one merge on the sw queue is no big deal since
    tasks can be scheduled from one CPU to another.

    Cc: Laurence Oberman
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Tested-by: Kashyap Desai
    Reported-by: Kashyap Desai
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
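    The optimization reads naturally as a toy Python model (illustrative
    class and counter, not the kernel code): peek at the software queue
    without its lock, and only pay for the lock when there is something to
    merge against.

```python
# Toy model of the optimization: skip the merge attempt (and the lock
# acquisition) when the per-cpu sw queue is empty.

class Ctx:
    def __init__(self):
        self.rq_list = []            # pending requests in the sw queue
        self.lock_acquisitions = 0   # counts ctx->lock round trips

def blk_mq_attempt_merge(ctx, bio):
    if not ctx.rq_list:              # lockless emptiness check
        return False
    ctx.lock_acquisitions += 1       # ctx->lock taken only when worth it
    # ... scan ctx.rq_list for a mergeable request (elided) ...
    return True
```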
     
  • [ Upstream commit 42c9cdfe1e11e083dceb0f0c4977b758cf7403b9 ]

    Set max_discard_segments to USHRT_MAX in blk_set_stacking_limits() so
    that blk_stack_limits() can stack up this limit for stacked devices.

    before:

    $ cat /sys/block/nvme0n1/queue/max_discard_segments
    256
    $ cat /sys/block/dm-0/queue/max_discard_segments
    1

    after:

    $ cat /sys/block/nvme0n1/queue/max_discard_segments
    256
    $ cat /sys/block/dm-0/queue/max_discard_segments
    256

    Fixes: 1e739730c5b9e ("block: optionally merge discontiguous discard bios into a single request")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
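    The stacking rule can be sketched as a toy Python model (dict-based,
    illustrative, not the kernel structs): limits stack via min(), so the
    stacking device must start from the most permissive value or it will
    pin every stacked limit at 1.

```python
# Toy model of limit stacking for the max_discard_segments fix above.

USHRT_MAX = 0xFFFF

def blk_set_stacking_limits(limits):
    limits["max_discard_segments"] = USHRT_MAX   # the fix; was 1 before

def blk_stack_limits(top, bottom):
    """Stacked limit is the min over all underlying devices."""
    top["max_discard_segments"] = min(top["max_discard_segments"],
                                      bottom["max_discard_segments"])
```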
     

20 Sep, 2018

4 commits

  • [ Upstream commit 14cb2c8a6c5dae57ee3e2da10fa3db2b9087e39e ]

    The if-block that sets a successful return value in aix_partition()
    uses 'lvip[].pps_per_lv' and 'n[].name' potentially uninitialized.

    For example, if 'numlvs' is zero or alloc_lvn() fails, neither is
    initialized, but are used anyway if alloc_pvd() succeeds after it.

    So, make the alloc_pvd() call conditional on their initialization.

    This has been hit when attaching an apparently corrupted/stressed
    AIX LUN, leading the kernel to pr_warn() invalid data and hang.

    [...] partition (null) (11 pp's found) is not contiguous
    [...] partition (null) (2 pp's found) is not contiguous
    [...] partition (null) (3 pp's found) is not contiguous
    [...] partition (null) (64 pp's found) is not contiguous

    Fixes: 6ceea22bbbc8 ("partitions: add aix lvm partition support files")
    Signed-off-by: Mauricio Faria de Oliveira
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mauricio Faria de Oliveira
     
  • [ Upstream commit d43fdae7bac2def8c4314b5a49822cb7f08a45f1 ]

    Even if properly initialized, the lvname array (i.e., strings)
    is read from disk, and might contain corrupt data (e.g., lack
    the null terminating character for strings).

    So, make sure the partition name string used in pr_warn() has
    the null terminating character.

    Fixes: 6ceea22bbbc8 ("partitions: add aix lvm partition support files")
    Suggested-by: Daniel J. Axtens
    Signed-off-by: Mauricio Faria de Oliveira
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mauricio Faria de Oliveira
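    The defensive pattern can be modeled in toy Python (the field width and
    helper name below are illustrative, not the AIX partition code): treat
    the on-disk bytes as untrusted, bound the read at the field width, and
    stop at the first NUL if one exists.

```python
# Toy model: an on-disk name field may lack a terminating NUL, so bound
# the string at the field width before printing it.

LVNAME_LEN = 64                          # illustrative field width

def safe_lvname(raw: bytes) -> str:
    field = raw[:LVNAME_LEN]             # never read past the field
    nul = field.find(b"\x00")
    s = field if nul < 0 else field[:nul]
    return s.decode("ascii", "replace")
```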
     
  • [ Upstream commit 75d6e175fc511e95ae3eb8f708680133bc211ed3 ]

  • The passed 'nr' from userspace represents the total depth, while
    inside 'struct blk_mq_tags', 'nr_tags' stores the total tag depth
    and 'nr_reserved_tags' stores the reserved part.

    There are two issues in blk_mq_tag_update_depth() now:

    1) for growing tags, we should have used the passed 'nr' and kept
    the number of reserved tags unchanged.

    2) the passed 'nr' should have been used for checking against
    'tags->nr_tags', instead of number of the normal part.

    This patch fixes the above two cases, and avoids kernel crash caused
    by wrong resizing sbitmap queue.

    Cc: "Ewan D. Milne"
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Omar Sandoval
    Tested-by: Marco Patalano
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
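    The corrected resize logic can be sketched as a toy Python model
    (illustrative class, not 'struct blk_mq_tags' itself): 'nr' is the
    total depth, growth is detected by comparing totals, and the reserved
    count never changes.

```python
# Toy model of the two fixes in blk_mq_tag_update_depth() above.

class Tags:
    def __init__(self, nr_tags, nr_reserved_tags):
        self.nr_tags = nr_tags                    # total depth
        self.nr_reserved_tags = nr_reserved_tags  # reserved part

def blk_mq_tag_update_depth(tags, nr):
    if nr > tags.nr_tags:                         # fix 2: compare totals
        # fix 1: grow to the passed total; reserved count unchanged
        return Tags(nr, tags.nr_reserved_tags)
    tags.nr_tags = nr                             # shrink in place
    return tags
```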
     
  • commit d5274b3cd6a814ccb2f56d81ee87cbbf51bd4cf7 upstream.

    Fix trivial use-after-free. This could be last reference to bfqg.

    Fixes: 8f9bebc33dd7 ("block, bfq: access and cache blkg data only when safe")
    Acked-by: Paolo Valente
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     

15 Sep, 2018

2 commits

  • [ Upstream commit f7ecb1b109da1006a08d5675debe60990e824432 ]

    This patch does not change any functionality but avoids that gcc
    reports the following warnings when building with W=1:

    block/cfq-iosched.c: In function 'cfq_back_seek_max_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4756:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_slice_idle_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4759:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_group_idle_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4760:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_low_latency_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4765:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_slice_idle_us_store':
    block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4782:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
    USEC_STORE_FUNCTION(cfq_slice_idle_us_store, &cfqd->cfq_slice_idle, 0, UINT_MAX);
    ^~~~~~~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_group_idle_us_store':
    block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4783:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
    USEC_STORE_FUNCTION(cfq_group_idle_us_store, &cfqd->cfq_group_idle, 0, UINT_MAX);
    ^~~~~~~~~~~~~~~~~~~

    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
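    Why gcc flags the comparison can be shown with a toy Python model of
    C's 32-bit unsigned conversion (illustrative; the kernel code is C):
    a negative value wraps to a huge positive one, so '__data < 0' on an
    unsigned type can never be true.

```python
# Toy model of why "comparison of unsigned expression < 0 is always
# false": conversion to unsigned wraps negatives around.

UINT_MAX = 0xFFFFFFFF

def as_unsigned(x):
    """Models C's conversion of x to a 32-bit unsigned integer."""
    return x & UINT_MAX

# -1 wraps to UINT_MAX, so "(unsigned)__data < 0" can never fire.
```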
     
  • [ Upstream commit d6c02a9beb67f13d5f14f23e72fa9981e8b84477 ]

    In commit ed996a52c868 ("block: simplify and cleanup bvec pool
    handling"), the value of the slab index is incremented by one in
    bvec_alloc() after the allocation is done to indicate an index value of
    0 does not need to be later freed.

    bvec_nr_vecs() was not updated accordingly, and thus returns the wrong
    value. Decrement idx before performing the lookup.

    Fixes: ed996a52c868 ("block: simplify and cleanup bvec pool handling")
    Signed-off-by: Greg Edwards
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Greg Edwards
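    The off-by-one can be sketched as a toy Python model (the slab-size
    table and helper names are illustrative): the allocator returns the
    slab index biased by one, so the lookup must subtract one again.

```python
# Toy model of the bvec_nr_vecs() fix: the stored index is slab + 1
# (so 0 can mean "nothing to free"); decrement before the lookup.

BVEC_SLABS = [1, 4, 16, 64, 128, 256]    # nr_vecs per slab, illustrative

def bvec_alloc(nr):
    for i, nvecs in enumerate(BVEC_SLABS):
        if nr <= nvecs:
            return i + 1                 # stored index is biased by one
    raise ValueError("too many vecs")

def bvec_nr_vecs(idx):
    return BVEC_SLABS[idx - 1]           # the fix: decrement idx first
```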
     

10 Sep, 2018

3 commits

  • commit fc8ebd01deeb12728c83381f6ec923e4a192ffd3 upstream.

    The value that struct cftype .write() method returns is then directly
    returned to userspace as the value returned by write() syscall, so it
    should be the number of bytes actually written (or consumed) and not zero.

    Returning zero from write() syscall makes programs like /bin/echo or bash
    spin.

    Signed-off-by: Maciej S. Szmigiero
    Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Maciej S. Szmigiero
     
  • commit b233f127042dba991229e3882c6217c80492f6ef upstream.

    Runtime PM isn't ready for blk-mq yet, and commit 765e40b675a9 ("block:
    disable runtime-pm for blk-mq") tried to disable it. Unfortunately,
    that approach can't take effect since user space can still switch
    it on via 'echo auto > /sys/block/sdN/device/power/control'.

    This patch really disables runtime PM for blk-mq, via
    pm_runtime_disable(), and fixes all kinds of PM-related kernel crashes.

    Cc: Tomas Janousek
    Cc: Przemek Socha
    Cc: Alan Stern
    Cc:
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Tested-by: Patrick Steinhardt
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • commit 54648cf1ec2d7f4b6a71767799c45676a138ca24 upstream.

    We found a memory use-after-free issue in __blk_drain_queue()
    on kernel 4.14. After reading the latest 4.18-rc6 we believe it
    has the same problem.

    Memory is allocated for q->fq in blk_init_allocated_queue().
    If the elevator init function returns an error, the failure path
    frees q->fq.

    __blk_drain_queue() then uses that memory after q->fq has been
    freed, which leads to unpredictable behavior.

    The patch sets q->fq to NULL in the failure path of
    blk_init_allocated_queue().

    Fixes: 7c94e1c157a2 ("block: introduce blk_flush_queue to drive flush machinery")
    Cc:
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: xiao jin
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    xiao jin
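    The fix follows the classic clear-the-pointer-on-free pattern, sketched
    here as a toy Python model (illustrative names; the kernel code is C):
    the failure path both frees and clears q->fq, and the drain path guards
    on the cleared pointer instead of touching freed memory.

```python
# Toy model of the fix: clear q.fq on the failure path so a later drain
# cannot touch freed memory.

class Queue:
    def __init__(self):
        self.fq = object()               # flush queue allocated at init

def blk_init_allocated_queue(q, elevator_init_ok):
    if not elevator_init_ok:
        q.fq = None                      # the fix: free AND clear pointer
        return -1
    return 0

def blk_drain_queue(q):
    if q.fq is None:                     # guard instead of use-after-free
        return "skipped"
    return "drained"
```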
     

24 Aug, 2018

1 commit

  • [ Upstream commit ce042c183bcb94eb2919e8036473a1fc203420f9 ]

    resp->num is the number of tokens in resp->tok[]. It gets set in
    response_parse(). So if n == resp->num then we're reading beyond the
    end of the data.

    Fixes: 455a7b238cd6 ("block: Add Sed-opal library")
    Reviewed-by: Scott Bauer
    Tested-by: Scott Bauer
    Signed-off-by: Dan Carpenter
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
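    The off-by-one bound is easy to model in toy Python (illustrative class
    and accessor, not the sed-opal code): with resp.num counting valid
    tokens, the valid indices are 0 .. num-1, so n == num is already past
    the end.

```python
# Toy model of the bounds check: resp.num counts valid tokens, so any
# index n with n >= resp.num is past the end of resp.tok.

class ParsedResp:
    def __init__(self, toks):
        self.tok = toks
        self.num = len(toks)

def response_get_token(resp, n):
    if n >= resp.num:                    # the fix: reject n == resp.num too
        return None
    return resp.tok[n]
```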
     

18 Aug, 2018

1 commit

  • commit 4baa8bb13f41307f3eb62fe91f93a1a798ebef53 upstream.

    This commit fixes a bug that causes bfq to fail to guarantee a high
    responsiveness on some drives, if there is heavy random read+write I/O
    in the background. More precisely, such a failure allowed this bug to
    be found [1], but the bug may well cause other yet unreported
    anomalies.

    BFQ raises the weight of the bfq_queues associated with soft real-time
    applications, to privilege the I/O, and thus reduce latency, for these
    applications. This mechanism is named soft-real-time weight raising in
    BFQ. A soft real-time period may happen to be nested into an
    interactive weight raising period, i.e., it may happen that, when a
    bfq_queue switches to a soft real-time weight-raised state, the
    bfq_queue is already being weight-raised because deemed interactive
    too. In this case, BFQ saves, in a special variable
    wr_start_at_switch_to_srt, the time instant when the interactive
    weight-raising period started for the bfq_queue, i.e., the time
    instant when BFQ started to deem the bfq_queue interactive. This value
    is then used to check whether the interactive weight-raising period
    would still be in progress when the soft real-time weight-raising
    period ends. If so, interactive weight raising is restored for the
    bfq_queue. This restore is useful, in particular, because it prevents
    bfq_queues from losing their interactive weight raising prematurely,
    as a consequence of spurious, short-lived soft real-time
    weight-raising periods caused by wrong detections as soft real-time.

    If, instead, a bfq_queue switches to soft-real-time weight raising
    while it *is not* already in an interactive weight-raising period,
    then the variable wr_start_at_switch_to_srt has no meaning during the
    following soft real-time weight-raising period. Unfortunately the
    handling of this case is wrong in BFQ: not only is the variable not
    flagged somehow as meaningless, but it is also set to the time when
    the switch to soft real-time weight-raising occurs. This may cause an
    interactive weight-raising period to be considered mistakenly as still
    in progress, and thus a spurious interactive weight-raising period to
    start for the bfq_queue, at the end of the soft-real-time
    weight-raising period. In particular the spurious interactive
    weight-raising period will be considered as still in progress, if the
    soft-real-time weight-raising period does not last very long. The
    bfq_queue will then be wrongly privileged and, if I/O bound, will
    unjustly steal bandwidth from truly interactive or soft real-time
    bfq_queues, harming responsiveness and low latency.

    This commit fixes this issue by just setting wr_start_at_switch_to_srt
    to minus infinity (farthest past time instant according to jiffies
    macros): when the soft-real-time weight-raising period ends, certainly
    no interactive weight-raising period will be considered as still in
    progress.

    [1] Background I/O Type: Random - Background I/O mix: Reads and writes
    - Application to start: LibreOffice Writer in
    http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-Laptop

    Signed-off-by: Paolo Valente
    Signed-off-by: Angelo Ruocco
    Tested-by: Oleksandr Natalenko
    Tested-by: Lee Tibbert
    Tested-by: Mirko Montanari
    Signed-off-by: Jens Axboe
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
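    The core of the fix can be sketched as a toy Python model (illustrative
    names and time units; the kernel uses jiffies arithmetic, not floats):
    recording "minus infinity" as the start instant guarantees the
    still-in-progress check can never pass.

```python
# Toy model of the BFQ fix: a soft-rt switch outside any interactive
# period records minus infinity, so no stale interactive period can
# ever be seen as "still in progress".

MINUS_INF = float("-inf")   # models the farthest past jiffies instant

def interactive_wr_still_in_progress(wr_start_at_switch_to_srt,
                                     wr_duration, now):
    return now < wr_start_at_switch_to_srt + wr_duration
```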
     

03 Aug, 2018

3 commits

  • commit 5151842b9d8732d4cbfa8400b40bff894f501b2f upstream.

    After the bio has been updated to represent the remaining sectors, reset
    bi_done so bio_rewind_iter() does not rewind further than it should.

    This resolves a bio_integrity_process() failure on reads where the
    original request was split.

    Fixes: 63573e359d05 ("bio-integrity: Restore original iterator on verify stage")
    Signed-off-by: Greg Edwards
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Greg Edwards
     
  • commit b403ea2404889e1227812fa9657667a1deb9c694 upstream.

    If the last page of the bio is not "full", the length of the last
    vector slot needs to be corrected. This slot has the index
    (bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
    array, which is shifted by the value of bio->bi_vcnt at function
    invocation, the correct index is (nr_pages - 1).

    v2: improved readability following suggestions from Ming Lei.
    v3: followed a formatting suggestion from Christoph Hellwig.

    Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()")
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ming Lei
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Martin Wilck
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Martin Wilck
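    The index arithmetic can be checked with a toy Python model (PAGE_SIZE
    and the helper are illustrative, not the kernel code): the last slot
    holds the tail of the buffer, and its index differs between the local
    bv[] helper array and bio->bi_io_vec, which is shifted by the bi_vcnt
    at function entry.

```python
# Toy model of the last-vector fix in bio_iov_iter_get_pages().

PAGE_SIZE = 4096

def last_vec(total_len, bi_vcnt_at_entry):
    nr_pages = -(-total_len // PAGE_SIZE)          # ceiling division
    last_len = total_len - (nr_pages - 1) * PAGE_SIZE
    return {
        "bv_index": nr_pages - 1,                  # in the shifted bv[]
        "io_vec_index": bi_vcnt_at_entry + nr_pages - 1,  # in bi_io_vec
        "len": last_len,
    }
```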
     
  • [ Upstream commit a12bffebc0c9d6a5851f062aaea3aa7c4adc6042 ]

  • In bfq_requests_merged(), there is a deadlock because
    bfqq->bfqd->lock is held by the calling function, but the code of
    this function tries to grab the lock again.

    This deadlock is currently hidden by another bug (fixed by next commit
    for this source file), which causes the body of bfq_requests_merged()
    to be never executed.

    This commit removes the deadlock by removing the lock/unlock pair.

    Signed-off-by: Filippo Muzzini
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Filippo Muzzini
     

22 Jul, 2018

1 commit

  • commit 1dc3039bc87ae7d19a990c3ee71cfd8a9068f428 upstream.

    When blk_queue_enter() waits for a queue to unfreeze, or unset the
    PREEMPT_ONLY flag, do not allow it to be interrupted by a signal.

    The PREEMPT_ONLY flag was introduced later in commit 3a0a529971ec
    ("block, scsi: Make SCSI quiesce and resume work reliably"). Note the SCSI
    device is resumed asynchronously, i.e. after un-freezing userspace tasks.

    So that commit exposed the bug as a regression in v4.15. A mysterious
    SIGBUS (or -EIO) sometimes happened during the time the device was being
    resumed. Most frequently, there was no kernel log message, and we saw Xorg
    or Xwayland killed by SIGBUS.[1]

    [1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979

    Without this fix, I get an IO error in this test:

    # dd if=/dev/sda of=/dev/null iflag=direct & \
    while killall -SIGUSR1 dd; do sleep 0.1; done & \
    echo mem > /sys/power/state ; \
    sleep 5; killall dd # stop after 5 seconds

    The interruptible wait was added to blk_queue_enter in
    commit 3ef28e83ab15 ("block: generic request_queue reference counting").
    Before then, the interruptible wait was only in blk-mq, but I don't think
    it could ever have been correct.

    Reviewed-by: Bart Van Assche
    Cc: stable@vger.kernel.org
    Signed-off-by: Alan Jenkins
    Signed-off-by: Jens Axboe
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Alan Jenkins
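    The behavioral change can be reduced to a toy Python model
    (illustrative names and error values; the kernel code waits on real
    wait queues): with an interruptible wait, a pending signal during
    resume aborts queue entry and surfaces as -EIO/SIGBUS, while the
    uninterruptible wait simply outlasts the freeze.

```python
# Toy model of the blk_queue_enter() change: stop letting signals
# interrupt the wait for the queue to unfreeze.

EIO = 5

def blk_queue_enter(signal_pending, interruptible):
    if interruptible and signal_pending:
        return -EIO          # old behavior: spurious I/O error / SIGBUS
    return 0                 # new behavior: sleep until the queue thaws
```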
     

11 Jul, 2018

2 commits

  • commit d5ce4c31d6df518dd8f63bbae20d7423c5018a6c upstream.

    sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to
    permit trying WRITE SAME on older SCSI devices, unless ->no_write_same
    is set. Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE
    SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO:

    $ fallocate -zn -l 1k /dev/sdg
    fallocate: fallocate failed: Remote I/O error
    $ fallocate -zn -l 1k /dev/sdg # OK
    $ fallocate -zn -l 1k /dev/sdg # OK

    The following calls succeed because sd_done() sets ->no_write_same in
    response to a sense that would become BLK_STS_TARGET/-EREMOTEIO, causing
    __blkdev_issue_zeroout() to fall back to generating ZERO_PAGE bios.

    This means blkdev_issue_zeroout() must cope with WRITE ZEROES failing
    and fall back to manually zeroing, unless BLKDEV_ZERO_NOFALLBACK is
    specified. For BLKDEV_ZERO_NOFALLBACK case, return -EOPNOTSUPP if
    sd_done() has just set ->no_write_same thus indicating lack of offload
    support.

    Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing")
    Cc: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    Cc: Janne Huttunen
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     
  • commit 425a4dba7953e35ffd096771973add6d2f40d2ed upstream.

    blkdev_issue_zeroout() will use this in !BLKDEV_ZERO_NOFALLBACK case.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    Cc: Janne Huttunen
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     

03 Jul, 2018

1 commit

  • commit 297ba57dcdec7ea37e702bcf1a577ac32a034e21 upstream.

    This patch avoids that removing a path controlled by the dm-mpath driver
    while mkfs is running triggers the following kernel bug:

    kernel BUG at block/blk-core.c:3347!
    invalid opcode: 0000 [#1] PREEMPT SMP KASAN
    CPU: 20 PID: 24369 Comm: mkfs.ext4 Not tainted 4.18.0-rc1-dbg+ #2
    RIP: 0010:blk_end_request_all+0x68/0x70
    Call Trace:

    dm_softirq_done+0x326/0x3d0 [dm_mod]
    blk_done_softirq+0x19b/0x1e0
    __do_softirq+0x128/0x60d
    irq_exit+0x100/0x110
    smp_call_function_single_interrupt+0x90/0x330
    call_function_single_interrupt+0xf/0x20

    Fixes: f9d03f96b988 ("block: improve handling of the magic discard payload")
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Acked-by: Mike Snitzer
    Signed-off-by: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc:
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

26 Jun, 2018

1 commit

  • commit a347c7ad8edf4c5685154f3fdc3c12fc1db800ba upstream.

    It is not allowed to reinit the q->tag_set_list list entry while an RCU
    grace period has not yet completed; otherwise the following soft lockup
    in blk_mq_sched_restart() happens:

    [ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270]
    [ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000
    [ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150
    [ 1064.256510] Call Trace:
    [ 1064.256664]
    [ 1064.256824] blk_mq_free_request+0xea/0x100
    [ 1064.256987] msg_io_conf+0x59/0xd0 [ibnbd_client]
    [ 1064.257175] complete_rdma_req+0xf2/0x230 [ibtrs_client]
    [ 1064.257340] ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core]
    [ 1064.257502] ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client]
    [ 1064.257669] ib_create_qp+0x321/0x380 [ib_core]
    [ 1064.257841] ib_process_cq_direct+0xbd/0x120 [ib_core]
    [ 1064.258007] irq_poll_softirq+0xb7/0xe0
    [ 1064.258165] __do_softirq+0x106/0x2a2
    [ 1064.258328] irq_exit+0x92/0xa0
    [ 1064.258509] do_IRQ+0x4a/0xd0
    [ 1064.258660] common_interrupt+0x7a/0x7a
    [ 1064.258818]

    Meanwhile another context frees other queue but with the same set of
    shared tags:

    [ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds.
    [ 1288.201833] bash D 0 5910 5820 0x00000000
    [ 1288.202016] Call Trace:
    [ 1288.202315] schedule+0x32/0x80
    [ 1288.202462] schedule_timeout+0x1e5/0x380
    [ 1288.203838] wait_for_completion+0xb0/0x120
    [ 1288.204137] __wait_rcu_gp+0x125/0x160
    [ 1288.204287] synchronize_sched+0x6e/0x80
    [ 1288.204770] blk_mq_free_queue+0x74/0xe0
    [ 1288.204922] blk_cleanup_queue+0xc7/0x110
    [ 1288.205073] ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client]
    [ 1288.205389] ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client]
    [ 1288.205548] kernfs_fop_write+0x109/0x180
    [ 1288.206328] vfs_write+0xb3/0x1a0
    [ 1288.206476] SyS_write+0x52/0xc0
    [ 1288.206624] do_syscall_64+0x68/0x1d0
    [ 1288.206774] entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    What happened is the following:

    1. There are several MQ queues with shared tags.
    2. One queue is about to be freed, and the task is now in
    blk_mq_del_queue_tag_set().
    3. Another CPU is in blk_mq_sched_restart(), looping over all queues in
    the tag list in order to find an hctx to restart.

    Because the linked list entry was modified in blk_mq_del_queue_tag_set()
    without properly waiting for a grace period, blk_mq_sched_restart()
    never ends, spinning in list_for_each_entry_rcu_rr(), resulting in a
    soft lockup.

    The fix is simple: reinit the list entry after an RCU grace period has
    elapsed.
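    The required ordering can be sketched with a minimal userspace model of
    the kernel's list.h. The list helpers below are hypothetical userspace
    reimplementations, and synchronize_rcu_stub() merely stands in for
    synchronize_rcu(), which in the kernel blocks until every pre-existing
    RCU reader has finished; this is a sketch of the ordering, not the
    actual blk-mq code.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Minimal doubly-linked list modeled after the kernel's list.h. */
    struct list_head { struct list_head *next, *prev; };

    static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

    static void list_add(struct list_head *n, struct list_head *h)
    {
        n->next = h->next; n->prev = h;
        h->next->prev = n; h->next = n;
    }

    static void list_del_rcu(struct list_head *n)
    {
        /* Unlink, but deliberately leave n->next intact so concurrent
         * RCU readers currently standing on n can still advance. */
        n->next->prev = n->prev;
        n->prev->next = n->next;
    }

    static void synchronize_rcu_stub(void)
    {
        /* Stand-in for synchronize_rcu(): wait out all readers. */
    }

    int main(void)
    {
        struct list_head set_list, q1, q2;
        INIT_LIST_HEAD(&set_list);
        list_add(&q1, &set_list);
        list_add(&q2, &set_list);   /* list: set_list -> q2 -> q1 */

        /* Buggy order would be INIT_LIST_HEAD(&q2) immediately after
         * list_del_rcu(&q2): q2 then points to itself, and a reader
         * iterating through q2 spins forever. Correct order (the fix): */
        list_del_rcu(&q2);
        assert(q2.next == &q1);     /* in-flight readers can still escape */
        synchronize_rcu_stub();     /* grace period elapses */
        INIT_LIST_HEAD(&q2);        /* only now safe to reinit */
        assert(q2.next == &q2);
        return 0;
    }
    ```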

    Fixes: 705cda97ee3a ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
    Cc: stable@vger.kernel.org
    Cc: Sagi Grimberg
    Cc: linux-block@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: Roman Pen
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Roman Pen
     

21 Jun, 2018

1 commit

  • [ Upstream commit bf0ddaba65ddbb2715af97041da8e7a45b2d8628 ]

    When the blk-mq inflight implementation was added, /proc/diskstats was
    converted to use it, but /sys/block/$dev/inflight was not. Fix it by
    adding another helper to count in-flight requests by data direction.
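    The per-direction counting can be illustrated with a small userspace C
    model. The names here (struct req, inflight_rw, DIR_READ/DIR_WRITE) are
    hypothetical and only mirror the idea: walk the in-flight requests and
    bucket them by data direction, the two numbers that
    /sys/block/$dev/inflight reports as "reads writes".

    ```c
    #include <assert.h>
    #include <stddef.h>

    enum dir { DIR_READ = 0, DIR_WRITE = 1 };

    struct req {
        enum dir dir;    /* data direction of the request */
        int in_flight;   /* nonzero while the request is outstanding */
    };

    /* Count outstanding requests per direction: counts[0] = reads,
     * counts[1] = writes. */
    static void inflight_rw(const struct req *reqs, size_t n,
                            unsigned counts[2])
    {
        counts[DIR_READ] = counts[DIR_WRITE] = 0;
        for (size_t i = 0; i < n; i++)
            if (reqs[i].in_flight)
                counts[reqs[i].dir]++;   /* bucket by direction */
    }

    int main(void)
    {
        struct req reqs[] = {
            { DIR_READ,  1 },
            { DIR_WRITE, 1 },
            { DIR_WRITE, 0 },   /* completed: not counted */
            { DIR_WRITE, 1 },
        };
        unsigned c[2];

        inflight_rw(reqs, sizeof(reqs) / sizeof(reqs[0]), c);
        assert(c[DIR_READ] == 1 && c[DIR_WRITE] == 2);
        return 0;
    }
    ```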

    Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval