12 Feb, 2019

16 commits

  • bio_check_eod() should check the partition size, not the whole disk,
    when bio->bi_partno is non-zero. Do this by moving the call to
    bio_check_eod() into blk_partition_remap().
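
    A minimal sketch of the resulting logic, assuming 4.14-era block-layer
    helpers (the wrapper name remap_to_partition is illustrative, not the
    upstream function):

        static blk_status_t remap_to_partition(struct bio *bio, struct hd_struct *p)
        {
                /* validate against the partition's own size first */
                if (bio_check_eod(bio, part_nr_sects_read(p)))
                        return BLK_STS_IOERR;
                bio->bi_iter.bi_sector += p->start_sect;  /* now disk-relative */
                bio->bi_partno = 0;
                return BLK_STS_OK;
        }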

    Based on an earlier patch from Jiufei Xue.

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Reported-by: Jiufei Xue
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    (cherry picked from commit 52c5e62d4c4beecddc6e1b8045ce1d695fca1ba7)

    Christoph Hellwig
     
  • Similar to blkdev_write_iter(), return -EPERM if the partition is
    read-only. This covers ioctl(), fallocate() and most in-kernel users
    but isn't meant to be exhaustive -- everything else will be caught in
    generic_make_request_checks(), fail with -EIO and can be fixed later.
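
    The shape of the added check is roughly the following (a minimal
    sketch, not the exact hunk; bdev_read_only() is the existing helper):

        /* e.g. at the top of blkdev_fallocate() or a discard/zeroout ioctl */
        if (bdev_read_only(bdev))
                return -EPERM;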

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    (cherry picked from commit a13553c777375009584741e7d9982e775c4b0744)

    Ilya Dryomov
     
  • Regular block device writes go through blkdev_write_iter(), which does
    bdev_read_only(), while zeroout/discard/etc requests are never checked,
    both userspace- and kernel-triggered. Add a generic catch-all check to
    generic_make_request_checks() to actually enforce ioctl(BLKROSET) and
    set_disk_ro(), which is used by quite a few drivers for things like
    snapshots, read-only backing files/images, etc.
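
    Conceptually the catch-all amounts to a test like the one below
    (illustrative sketch; the helper name is hypothetical, the real patch
    adds its own):

        static inline bool bio_targets_ro_part(struct bio *bio, struct hd_struct *part)
        {
                /* part->policy is what BLKROSET / set_disk_ro() flip */
                return op_is_write(bio_op(bio)) && part->policy;
        }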

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    (cherry picked from commit 721c7fc701c71f693307d274d2b346a1ecd4a534)

    Ilya Dryomov
     
  • Export these two interfaces for cgroup-v1.

    Acked-by: Tejun Heo
    Signed-off-by: weiping zhang
    Signed-off-by: Jens Axboe
    (cherry picked from commit 17534c6f2c065ad8e34ff6f013e5afaa90428512)

    weiping zhang
     
  • The __blk_mq_register_dev(), blk_mq_unregister_dev(),
    elv_register_queue() and elv_unregister_queue() calls need to be
    protected with sysfs_lock, but the other code in these functions does
    not. Hence protect only that code with sysfs_lock. This patch fixes a
    locking inversion issue in blk_unregister_queue() and also in an
    error path of blk_register_queue(): it is not allowed to hold
    sysfs_lock around the kobject_del(&q->kobj) call.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    (cherry picked from commit 2c2086afc2b8b974fac32cb028e73dc27bfae442)

    Bart Van Assche
     
  • For as long as I can remember DM has forced the block layer to allow
    the allocation and initialization of the request_queue to be distinct
    operations. The reason is that block/genhd.c:add_disk() requires that
    the request_queue (and associated bdi) be tied to the gendisk before
    add_disk() is called -- because add_disk() also deals with exposing
    the request_queue via blk_register_queue().

    DM's dynamic creation of arbitrary device types (and associated
    request_queue types) requires the DM device's gendisk be available so
    that DM table loads can establish a master/slave relationship with
    subordinate devices that are referenced by loaded DM tables -- using
    bd_link_disk_holder(). But until these DM tables, and their associated
    subordinate devices, are known DM cannot know what type of request_queue
    it needs -- nor what its queue_limits should be.

    This chicken and egg scenario has created all manner of problems for DM
    and, at times, the block layer.

    Summary of changes:

    - Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
    that drivers may use to add a disk without also calling
    blk_register_queue(). Driver must call blk_register_queue() once its
    request_queue is fully initialized.

    - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
    is not set. It won't be set if driver used add_disk_no_queue_reg()
    but driver encounters an error and must del_gendisk() before calling
    blk_register_queue().

    - Export blk_register_queue().

    These changes allow DM to use add_disk_no_queue_reg() to anchor its
    gendisk as the "master" for master/slave relationships DM must establish
    with subordinate devices referenced in DM tables that get loaded. Once
    all "slave" devices for a DM device are known, its request_queue can be
    properly initialized and then advertised via sysfs -- the important
    improvement being that no request_queue resource initialization
    performed by blk_register_queue() is missed for DM devices anymore.
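
    A sketch of the driver-side ordering this enables (names as in the
    summary above; "md->disk" is an illustrative DM-style handle and error
    handling is omitted):

        add_disk_no_queue_reg(md->disk);  /* gendisk exists, queue not in sysfs yet */

        /* ... DM table loads determine the queue type and queue_limits ... */

        blk_register_queue(md->disk);     /* expose the fully initialized queue */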

    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe
    (cherry picked from commit fa70d2e2c4a0a54ced98260c6a176cc94c876d27)

    Mike Snitzer
     
  • The original commit e9a823fb34a8b (block: fix warning when I/O elevator
    is changed as request_queue is being removed) is pretty conflated:
    the resource being protected by q->sysfs_lock isn't the queue_flags,
    it is the 'queue' kobj.

    q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
    from racing with blk_unregister_queue():
    1) By holding q->sysfs_lock first, __elevator_change() can complete
    before a racing blk_unregister_queue().
    2) Conversely, __elevator_change() tests for QUEUE_FLAG_REGISTERED so
    that, if elv_iosched_store() loses the race with blk_unregister_queue(),
    it has a way to know the 'queue' kobj isn't there.

    Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
    held until after the 'queue' kobj is removed.

    To do so blk_mq_unregister_dev() must not also take q->sysfs_lock. So
    rename __blk_mq_unregister_dev() to blk_mq_unregister_dev().

    Also, blk_unregister_queue() should use q->queue_lock to protect against
    any concurrent writes to q->queue_flags -- even though chances are the
    queue is being cleaned up so no concurrent writes are likely.

    Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as request_queue is being removed")
    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe
    (cherry picked from commit 667257e8b2988c0183ba23e2bcd6900e87961606)

    Mike Snitzer
     
  • device_add_disk() will only call bdi_register_owner() if
    !GENHD_FL_HIDDEN, so it follows that del_gendisk() should only call
    bdi_unregister() if !GENHD_FL_HIDDEN.

    Found by code inspection. bdi_unregister() won't do any harm if
    bdi_register_owner() wasn't used, but it is best to avoid the
    unnecessary call to bdi_unregister().
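
    The resulting symmetry in del_gendisk() looks roughly like this
    (sketch):

        if (!(disk->flags & GENHD_FL_HIDDEN))
                bdi_unregister(disk->queue->backing_dev_info);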

    Fixes: 8ddcd65325 ("block: introduce GENHD_FL_HIDDEN")
    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit bc8d062c36e3525e81ea8237ff0ab3264c2317b6)

    Mike Snitzer
     
  • So that we can also poll non-blk-mq queues. Mostly needed for
    the NVMe multipath code, but could also be useful elsewhere.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit ea435e1b9392a33deceaea2a16ebaa3397bead93)

    Christoph Hellwig
     
  • With this flag a driver can create a gendisk that can be used for I/O
    submission inside the kernel, but which is not registered as user
    facing block device. This will be useful for the NVMe multipath
    implementation.
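
    A minimal usage sketch, assuming the 4.14-era device_add_disk()
    signature:

        disk->flags |= GENHD_FL_HIDDEN;   /* keep the disk internal to the kernel */
        device_add_disk(parent, disk);    /* still usable for in-kernel I/O */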

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    (cherry picked from commit 8ddcd653257c18a669fcb75ee42c37054908e0d6)

    Christoph Hellwig
     
  • The hidden gendisks introduced in the next patch need to keep the dev
    field in their struct device empty so that udev won't try to create
    block device nodes for them. To support that, rewrite disk_devt() to
    look at the major and first_minor fields in the gendisk itself instead
    of looking into the struct device.
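
    The resulting helper is essentially (as described above):

        static inline dev_t disk_devt(struct gendisk *disk)
        {
                return MKDEV(disk->major, disk->first_minor);
        }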

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit 517bf3c306bad4fe0da631f90ae2ec40924dee2b)

    Christoph Hellwig
     
  • This helper allows stealing the uncompleted bios from a request so
    that they can be reissued on another path.
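
    A minimal usage sketch (the list and request variable names are
    illustrative):

        struct bio_list requeue_list;

        bio_list_init(&requeue_list);
        blk_steal_bios(&requeue_list, rq);  /* rq holds no bios after this */
        /* resubmit everything on requeue_list via the other path */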

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit ef71de8b15d891b27b8c983a9a8972b11cb4576a)

    Christoph Hellwig
     
  • This helper allows reinserting a bio into a new queue without much
    overhead, but requires all queue limits to be the same for the upper
    and lower queues, and it does not provide any recursion prevention.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Javier González
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit f421e1d9ade4e1b88183e54425cf50e390d16a7f)

    Christoph Hellwig
     
  • This patch does not change any functionality.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    (cherry picked from commit 14a23498ba97683c6790b1bcd8b2cdfe9ad99797)

    Bart Van Assche
     
  • These two functions are only called from inside the block layer so
    unexport them.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    (cherry picked from commit 83d016ac86428dbca8a62d3e4fdc29e3ea39e535)

    Bart Van Assche
     
  • Errata:
    When a read command returns less data than specified in the PRDs (for
    example, there are two PRDs for this command, but the device returns a
    number of bytes which is less than in the first PRD), the second PRD of
    this command is not read out of the PRD FIFO, causing the next command
    to use this PRD erroneously.

    Workaround:
    - force sg_tablesize = 1
    - modify the sg_io function in block/scsi_ioctl.c to use a 64k buffer
      allocated with dma_alloc_coherent during the probe in ahci_imx
    - do not go to sleep in scsi_eh_handler when there is a host failure,
      in order to fix the scsi/sata hang seen when CD_ROM and HDD are
      accessed simultaneously after the workaround is applied

    Signed-off-by: Richard Zhu

    Richard Zhu
     

29 Dec, 2018

2 commits

  • [ Upstream commit b88aef36b87c9787a4db724923ec4f57dfd513f3 ]

    If __blkdev_issue_discard is in progress and a device mapper device is
    reloaded with a table that doesn't support discard,
    q->limits.max_discard_sectors is set to zero. This results in an
    infinite loop in __blkdev_issue_discard().

    This patch checks if max_discard_sectors is zero and aborts with
    -EOPNOTSUPP.
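
    The essence of the fix is an early bail-out instead of looping
    (sketch; the upstream hunk differs in detail):

        if (!q->limits.max_discard_sectors)
                return -EOPNOTSUPP;   /* device (re)configured without discard */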

    Signed-off-by: Mikulas Patocka
    Tested-by: Zdenek Kabelac
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Mikulas Patocka
     
  • [ Upstream commit af097f5d199e2aa3ab3ef777f0716e487b8f7b08 ]

    Don't build discards bigger than what the user asked for, if the
    user decided to limit the size by writing to 'discard_max_bytes'.
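
    In other words, each discard bio gets clamped to the configured limit,
    along the lines of (sketch):

        sector_t req_sects = min_t(sector_t, nr_sects,
                                   q->limits.max_discard_sectors);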

    Reviewed-by: Darrick J. Wong
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
     

21 Dec, 2018

1 commit

  • [ Upstream commit 2527d99789e248576ac8081530cd4fd88730f8c7 ]

    If an IO scheduler is selected via elevator= and it doesn't match
    the driver in question wrt blk-mq support, then we fail to boot.

    The elevator= parameter is deprecated and only supported for
    non-mq devices. Augment the elevator lookup API so that we
    pass in if we're looking for an mq capable scheduler or not,
    so that we only ever return a valid type for the queue in
    question.
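
    The lookup-time filter boils down to something like this (sketch;
    treat the field name as approximate for this kernel era):

        /* only return a scheduler whose blk-mq support matches the queue */
        if (e && e->uses_mq != !!q->mq_ops)
                e = NULL;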

    Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=196695
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
     

21 Nov, 2018

1 commit

  • commit 8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a upstream.

    c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has
    already fixed this race, however the implied synchronize_rcu()
    in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused
    performance regression.

    Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
    tried to quiesce queue for avoiding unnecessary synchronize_rcu()
    only when queue initialization is done, because it is usual to see
    lots of inexistent LUNs which need to be probed.

    However, turns out it isn't safe to quiesce queue only when queue
    initialization is done. Because when one SCSI command is completed,
    the user of sending command can be waken up immediately, then the
    scsi device may be removed, meantime the run queue in scsi_end_request()
    is still in-progress, so kernel panic can be caused.

    In Red Hat QE lab, there are several reports about this kind of kernel
    panic triggered during kernel booting.

    This patch tries to address the issue by grabing one queue usage
    counter during freeing one request and the following run queue.
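
    The pattern is to pin the queue across the request free and the
    subsequent run-queue, roughly (sketch based on the description above):

        percpu_ref_get(&q->q_usage_counter);  /* keep request_queue alive */
        blk_mq_end_request(req, error);
        blk_mq_run_hw_queues(q, true);
        percpu_ref_put(&q->q_usage_counter);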

    Fixes: 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
    Cc: Andrew Jones
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E.J. Bottomley
    Cc: stable
    Cc: jianchao.wang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

14 Nov, 2018

1 commit

  • [ Upstream commit cbeb869a3d1110450186b738199963c5e68c2a71 ]

    BFQ schedules entities (which represent either per-process queues or
    groups of queues) as a function of their timestamps. In particular, as
    a function of their (virtual) finish times. The finish time of an
    entity is computed as a function of the budget assigned to the entity,
    assuming, tentatively, that the entity, once in service, will receive
    an amount of service equal to its budget. Then, when the entity is
    expired because it finishes being served, this finish time is updated
    as a function of the actual service received by the entity. This
    allows the entity to be correctly charged with only the service it
    actually received, and then to be correctly re-scheduled.

    Yet an entity may receive service also while not being the entity in
    service (in the scheduling environment of its parent entity), for
    several reasons. If the entity remains with no backlog while receiving
    this 'unofficial' service, then it is expired. Also on such an
    expiration, the finish time of the entity should be updated to account
    for only the service actually received by the entity. Unfortunately,
    such an update is not performed for an entity expiring without being
    the entity in service.

    In a similar vein, the service counter of the entity in service is
    reset when the entity is expired, to be ready for the next service
    cycle. This reset should also be performed when an entity is expired
    because it remains empty after receiving service while not being the
    entity in service. But in this case the reset is not performed.

    This commit performs the above update of the finish time and reset of
    the service received, also for an entity expiring while not being the
    entity in service.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
     

13 Oct, 2018

1 commit

  • commit 587562d0c7cd6861f4f90a2eb811cccb1a376f5f upstream.

    trace_block_unplug() takes true for explicit unplugs and false for
    implicit unplugs. schedule() unplugs are implicit and should be
    reported as timer unplugs. While correct in the legacy code, this has
    been inverted in blk-mq since 4.11.

    Cc: stable@vger.kernel.org
    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     

26 Sep, 2018

3 commits

  • [ Upstream commit 1311326cf4755c7ffefd20f576144ecf46d9906b ]

    SCSI probing may synchronously create and destroy a lot of request_queues
    for non-existent devices. Any synchronize_rcu() in queue creation or
    destroy path may introduce long latency during booting, see detailed
    description in comment of blk_register_queue().

    This patch removes one synchronize_rcu() inside blk_cleanup_queue()
    for this case; commit c2856ae2f315d75 ("blk-mq: quiesce queue before
    freeing queue") needs synchronize_rcu() for implementing
    blk_mq_quiesce_queue(), but when the queue isn't initialized it isn't
    necessary to do that, since only pass-through requests are involved
    and there is no such issue in scsi_execute() at all.

    Without this patch and the previous one, it may take 20+ seconds for
    virtio-scsi to complete disk probing. With the two patches, the time
    becomes less than 100ms.

    Fixes: c2856ae2f315d75 ("blk-mq: quiesce queue before freeing queue")
    Reported-by: Andrew Jones
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: "Martin K. Petersen"
    Cc: Christoph Hellwig
    Tested-by: Andrew Jones
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • [ Upstream commit b04f50ab8a74129b3041a2836c33c916be3c6667 ]

    Only attempt to merge a bio if ctx->rq_list isn't empty, because:

    1) for high-performance SSDs, most of the time dispatch may succeed and
    there may then be nothing left in ctx->rq_list, so don't try to merge
    over the sw queue if it is empty; that way we can save one acquisition
    of ctx->lock

    2) we can't expect good merge performance on the per-cpu sw queue, and
    missing one merge on the sw queue won't be a big deal since tasks can be
    scheduled from one CPU to another.
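
    The change is essentially a cheap emptiness test before taking
    ctx->lock (sketch):

        bool merged = false;

        if (!list_empty_careful(&ctx->rq_list)) {
                spin_lock(&ctx->lock);
                merged = blk_mq_attempt_merge(q, ctx, bio);
                spin_unlock(&ctx->lock);
        }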

    Cc: Laurence Oberman
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Tested-by: Kashyap Desai
    Reported-by: Kashyap Desai
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • [ Upstream commit 42c9cdfe1e11e083dceb0f0c4977b758cf7403b9 ]

    Set max_discard_segments to USHRT_MAX in blk_set_stacking_limits() so
    that blk_stack_limits() can stack up this limit for stacked devices.

    before:

    $ cat /sys/block/nvme0n1/queue/max_discard_segments
    256
    $ cat /sys/block/dm-0/queue/max_discard_segments
    1

    after:

    $ cat /sys/block/nvme0n1/queue/max_discard_segments
    256
    $ cat /sys/block/dm-0/queue/max_discard_segments
    256
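
    The change itself is essentially a one-liner in
    blk_set_stacking_limits() (as described above):

        lim->max_discard_segments = USHRT_MAX;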

    Fixes: 1e739730c5b9e ("block: optionally merge discontiguous discard bios into a single request")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

20 Sep, 2018

4 commits

  • [ Upstream commit 14cb2c8a6c5dae57ee3e2da10fa3db2b9087e39e ]

    The if-block that sets a successful return value in aix_partition()
    uses 'lvip[].pps_per_lv' and 'n[].name' potentially uninitialized.

    For example, if 'numlvs' is zero or alloc_lvn() fails, neither is
    initialized, but are used anyway if alloc_pvd() succeeds after it.

    So, make the alloc_pvd() call conditional on their initialization.

    This has been hit when attaching an apparently corrupted/stressed
    AIX LUN, misleading the kernel to pr_warn() invalid data and hang.

    [...] partition (null) (11 pp's found) is not contiguous
    [...] partition (null) (2 pp's found) is not contiguous
    [...] partition (null) (3 pp's found) is not contiguous
    [...] partition (null) (64 pp's found) is not contiguous

    Fixes: 6ceea22bbbc8 ("partitions: add aix lvm partition support files")
    Signed-off-by: Mauricio Faria de Oliveira
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mauricio Faria de Oliveira
     
  • [ Upstream commit d43fdae7bac2def8c4314b5a49822cb7f08a45f1 ]

    Even if properly initialized, the lvname array (i.e., strings)
    is read from disk, and might contain corrupt data (e.g., lack
    the null terminating character for strings).

    So, make sure the partition name string used in pr_warn() has
    the null terminating character.
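
    A sketch of the idea: copy the on-disk name into a bounded,
    NUL-terminated buffer before printing (the index variable and the
    simplified message are illustrative):

        char tmp[sizeof(n[i].name) + 1];

        snprintf(tmp, sizeof(tmp), "%s", n[i].name);
        pr_warn("partition %s is not contiguous\n", tmp);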

    Fixes: 6ceea22bbbc8 ("partitions: add aix lvm partition support files")
    Suggested-by: Daniel J. Axtens
    Signed-off-by: Mauricio Faria de Oliveira
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mauricio Faria de Oliveira
     
  • [ Upstream commit 75d6e175fc511e95ae3eb8f708680133bc211ed3 ]

    The 'nr' passed from userspace represents the total depth, while
    inside 'struct blk_mq_tags', 'nr_tags' stores the total tag depth
    and 'nr_reserved_tags' stores the reserved part.

    There are two issues in blk_mq_tag_update_depth() now:

    1) for growing tags, we should have used the passed 'nr' and kept the
    number of reserved tags unchanged.

    2) the passed 'nr' should have been used for checking against
    'tags->nr_tags', instead of the number of the normal part.

    This patch fixes the above two cases, and avoids a kernel crash caused
    by wrongly resizing the sbitmap queue.

    Cc: "Ewan D. Milne"
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Omar Sandoval
    Tested by: Marco Patalano
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • commit d5274b3cd6a814ccb2f56d81ee87cbbf51bd4cf7 upstream.

    Fix a trivial use-after-free. This could be the last reference to the bfqg.

    Fixes: 8f9bebc33dd7 ("block, bfq: access and cache blkg data only when safe")
    Acked-by: Paolo Valente
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     

15 Sep, 2018

2 commits

  • [ Upstream commit f7ecb1b109da1006a08d5675debe60990e824432 ]

    This patch does not change any functionality but avoids gcc reporting
    the following warnings when building with W=1:

    block/cfq-iosched.c: In function 'cfq_back_seek_max_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4756:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_slice_idle_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4759:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_group_idle_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4760:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_low_latency_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4765:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_slice_idle_us_store':
    block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4782:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
    USEC_STORE_FUNCTION(cfq_slice_idle_us_store, &cfqd->cfq_slice_idle, 0, UINT_MAX);
    ^~~~~~~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_group_idle_us_store':
    block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4783:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
    USEC_STORE_FUNCTION(cfq_group_idle_us_store, &cfqd->cfq_group_idle, 0, UINT_MAX);
    ^~~~~~~~~~~~~~~~~~~

    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     
  • [ Upstream commit d6c02a9beb67f13d5f14f23e72fa9981e8b84477 ]

    In commit ed996a52c868 ("block: simplify and cleanup bvec pool
    handling"), the value of the slab index is incremented by one in
    bvec_alloc() after the allocation is done to indicate an index value of
    0 does not need to be later freed.

    bvec_nr_vecs() was not updated accordingly, and thus returns the wrong
    value. Decrement idx before performing the lookup.
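
    The resulting helper is roughly:

        unsigned int bvec_nr_vecs(unsigned short idx)
        {
                return bvec_slabs[--idx].nr_vecs;
        }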

    Fixes: ed996a52c868 ("block: simplify and cleanup bvec pool handling")
    Signed-off-by: Greg Edwards
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Greg Edwards
     

10 Sep, 2018

3 commits

  • commit fc8ebd01deeb12728c83381f6ec923e4a192ffd3 upstream.

    The value that the struct cftype .write() method returns is then
    directly returned to userspace as the value returned by the write()
    syscall, so it should be the number of bytes actually written (or
    consumed) and not zero.

    Returning zero from the write() syscall makes programs like /bin/echo
    or bash spin.
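
    The fix thus amounts to returning the consumed byte count on success,
    e.g. (sketch):

        return ret ?: nbytes;   /* on success (ret == 0) report all nbytes consumed */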

    Signed-off-by: Maciej S. Szmigiero
    Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Maciej S. Szmigiero
     
  • commit b233f127042dba991229e3882c6217c80492f6ef upstream.

    Runtime PM isn't ready for blk-mq yet, and commit 765e40b675a9 ("block:
    disable runtime-pm for blk-mq") tried to disable it. Unfortunately,
    it can't take effect that way since user space can still switch
    it on via 'echo auto > /sys/block/sdN/device/power/control'.

    This patch really disables runtime PM for blk-mq, by calling
    pm_runtime_disable(), and fixes all kinds of PM-related kernel crashes.

    Cc: Tomas Janousek
    Cc: Przemek Socha
    Cc: Alan Stern
    Cc:
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Tested-by: Patrick Steinhardt
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • commit 54648cf1ec2d7f4b6a71767799c45676a138ca24 upstream.

    We found a memory use-after-free issue in __blk_drain_queue()
    on kernel 4.14. After reading the latest kernel 4.18-rc6 we
    think it has the same problem.

    Memory is allocated for q->fq in blk_init_allocated_queue().
    If the elevator init function returns an error, we run into the
    failure case and free q->fq.

    Then __blk_drain_queue() uses that same memory after the free
    of q->fq, which leads to unpredictable behaviour.

    The patch sets q->fq to NULL in the failure case of
    blk_init_allocated_queue().
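
    The fix in the error path is essentially (as described above):

        blk_free_flush_queue(q->fq);
        q->fq = NULL;   /* so __blk_drain_queue() cannot touch freed memory */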

    Fixes: commit 7c94e1c157a2 ("block: introduce blk_flush_queue to drive flush machinery")
    Cc:
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: xiao jin
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    xiao jin
     

24 Aug, 2018

1 commit

  • [ Upstream commit ce042c183bcb94eb2919e8036473a1fc203420f9 ]

    resp->num is the number of tokens in resp->tok[]. It gets set in
    response_parse(). So if n == resp->num then we're reading beyond the
    end of the data.
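
    The corrected bounds test, written as a standalone predicate (sketch;
    the helper name is illustrative):

        static bool resp_tok_index_valid(const struct parsed_resp *resp, int n)
        {
                /* valid indexes are 0 .. resp->num - 1 */
                return n < resp->num;
        }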

    Fixes: 455a7b238cd6 ("block: Add Sed-opal library")
    Reviewed-by: Scott Bauer
    Tested-by: Scott Bauer
    Signed-off-by: Dan Carpenter
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     

18 Aug, 2018

1 commit

  • commit 4baa8bb13f41307f3eb62fe91f93a1a798ebef53 upstream.

    This commit fixes a bug that causes bfq to fail to guarantee a high
    responsiveness on some drives, if there is heavy random read+write I/O
    in the background. More precisely, such a failure allowed this bug to
    be found [1], but the bug may well cause other yet unreported
    anomalies.

    BFQ raises the weight of the bfq_queues associated with soft real-time
    applications, to privilege the I/O, and thus reduce latency, for these
    applications. This mechanism is named soft-real-time weight raising in
    BFQ. A soft real-time period may happen to be nested into an
    interactive weight raising period, i.e., it may happen that, when a
    bfq_queue switches to a soft real-time weight-raised state, the
    bfq_queue is already being weight-raised because deemed interactive
    too. In this case, BFQ saves, in a special variable
    wr_start_at_switch_to_srt, the time instant when the interactive
    weight-raising period started for the bfq_queue, i.e., the time
    instant when BFQ started to deem the bfq_queue interactive. This value
    is then used to check whether the interactive weight-raising period
    would still be in progress when the soft real-time weight-raising
    period ends. If so, interactive weight raising is restored for the
    bfq_queue. This restore is useful, in particular, because it prevents
    bfq_queues from losing their interactive weight raising prematurely,
    as a consequence of spurious, short-lived soft real-time
    weight-raising periods caused by wrong detections as soft real-time.

    If, instead, a bfq_queue switches to soft-real-time weight raising
    while it *is not* already in an interactive weight-raising period,
    then the variable wr_start_at_switch_to_srt has no meaning during the
    following soft real-time weight-raising period. Unfortunately the
    handling of this case is wrong in BFQ: not only is the variable not
    flagged somehow as meaningless, but it is also set to the time when
    the switch to soft real-time weight-raising occurs. This may cause an
    interactive weight-raising period to be considered mistakenly as still
    in progress, and thus a spurious interactive weight-raising period to
    start for the bfq_queue, at the end of the soft-real-time
    weight-raising period. In particular the spurious interactive
    weight-raising period will be considered as still in progress, if the
    soft-real-time weight-raising period does not last very long. The
    bfq_queue will then be wrongly privileged and, if I/O bound, will
    unjustly steal bandwidth from truly interactive or soft real-time
    bfq_queues, harming responsiveness and low latency.

    This commit fixes this issue by just setting wr_start_at_switch_to_srt
    to minus infinity (farthest past time instant according to jiffies
    macros): when the soft-real-time weight-raising period ends, certainly
    no interactive weight-raising period will be considered as still in
    progress.
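
    In jiffies terms, "minus infinity" is simply the farthest
    representable past instant, along the lines of (sketch; upstream wraps
    this in a small helper):

        bfqq->wr_start_at_switch_to_srt = jiffies - MAX_JIFFY_OFFSET;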

    [1] Background I/O Type: Random - Background I/O mix: Reads and writes
    - Application to start: LibreOffice Writer in
    http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-Laptop

    Signed-off-by: Paolo Valente
    Signed-off-by: Angelo Ruocco
    Tested-by: Oleksandr Natalenko
    Tested-by: Lee Tibbert
    Tested-by: Mirko Montanari
    Signed-off-by: Jens Axboe
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
     

03 Aug, 2018

3 commits

  • commit 5151842b9d8732d4cbfa8400b40bff894f501b2f upstream.

    After the bio has been updated to represent the remaining sectors, reset
    bi_done so bio_rewind_iter() does not rewind further than it should.

    This resolves a bio_integrity_process() failure on reads where the
    original request was split.
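
    The reset itself is a one-liner on the iterator (sketch; the field is
    as in the 4.17-era struct bvec_iter):

        bio->bi_iter.bi_done = 0;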

    Fixes: 63573e359d05 ("bio-integrity: Restore original iterator on verify stage")
    Signed-off-by: Greg Edwards
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Greg Edwards
     
  • commit b403ea2404889e1227812fa9657667a1deb9c694 upstream.

    If the last page of the bio is not "full", the length of the last
    vector slot needs to be corrected. This slot has the index
    (bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
    array, which is shifted by the value of bio->bi_vcnt at function
    invocation, the correct index is (nr_pages - 1).

    v2: improved readability following suggestions from Ming Lei.
    v3: followed a formatting suggestion from Christoph Hellwig.

    Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()")
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ming Lei
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Martin Wilck
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Martin Wilck
     
  • [ Upstream commit a12bffebc0c9d6a5851f062aaea3aa7c4adc6042 ]

    In bfq_requests_merged(), there is a deadlock because the lock on
    bfqq->bfqd->lock is held by the calling function, but the code of
    this function tries to grab the lock again.

    This deadlock is currently hidden by another bug (fixed by next commit
    for this source file), which causes the body of bfq_requests_merged()
    to be never executed.

    This commit removes the deadlock by removing the lock/unlock pair.

    Signed-off-by: Filippo Muzzini
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Filippo Muzzini
     

22 Jul, 2018

1 commit

  • commit 1dc3039bc87ae7d19a990c3ee71cfd8a9068f428 upstream.

    When blk_queue_enter() waits for a queue to unfreeze, or unset the
    PREEMPT_ONLY flag, do not allow it to be interrupted by a signal.
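
    Concretely, the wait in blk_queue_enter() stops using the
    _interruptible variant, roughly (sketch; the wake-up condition is
    simplified here):

        /* before: ret = wait_event_interruptible(...); -ERESTARTSYS could leak out */
        wait_event(q->mq_freeze_wq,
                   !atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q));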

    The PREEMPT_ONLY flag was introduced later in commit 3a0a529971ec
    ("block, scsi: Make SCSI quiesce and resume work reliably"). Note the SCSI
    device is resumed asynchronously, i.e. after un-freezing userspace tasks.

    So that commit exposed the bug as a regression in v4.15. A mysterious
    SIGBUS (or -EIO) sometimes happened during the time the device was being
    resumed. Most frequently, there was no kernel log message, and we saw Xorg
    or Xwayland killed by SIGBUS.[1]

    [1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979

    Without this fix, I get an IO error in this test:

    # dd if=/dev/sda of=/dev/null iflag=direct & \
    while killall -SIGUSR1 dd; do sleep 0.1; done & \
    echo mem > /sys/power/state ; \
    sleep 5; killall dd # stop after 5 seconds

    The interruptible wait was added to blk_queue_enter in
    commit 3ef28e83ab15 ("block: generic request_queue reference counting").
    Before then, the interruptible wait was only in blk-mq, but I don't think
    it could ever have been correct.

    Reviewed-by: Bart Van Assche
    Cc: stable@vger.kernel.org
    Signed-off-by: Alan Jenkins
    Signed-off-by: Jens Axboe
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Alan Jenkins