16 Aug, 2019

1 commit

  • We had a few issues with this code, and there's still a problem around
    how we deal with error handling for chained/split bios. For now, just
    revert the code and we'll try again with a thorough solution. This
    reverts commits:

    e15c2ffa1091 ("block: fix O_DIRECT error handling for bio fragments")
    0eb6ddfb865c ("block: Fix __blkdev_direct_IO() for bio fragments")
    6a43074e2f46 ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
    893a1c97205a ("blk-mq: allow REQ_NOWAIT to return an error inline")

    Signed-off-by: Jens Axboe

    Jens Axboe
     

12 Aug, 2019

1 commit

  • blk_exit_queue will free elevator_data, while blk_mq_requeue_work
    will access it. Move cancel of requeue_work to the front of
    blk_exit_queue to avoid use-after-free.

    blk_exit_queue                  blk_mq_requeue_work
      __elevator_exit                 blk_mq_run_hw_queues
        blk_mq_exit_sched               blk_mq_run_hw_queue
          dd_exit_queue                   blk_mq_hctx_has_pending
            kfree(elevator_data)            blk_mq_sched_has_work
                                              dd_has_work
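
    A minimal sketch of the resulting ordering (illustrative only; in
    this era the calls sit in __blk_release_queue()):

        /* cancel requeue_work before the elevator, and with it
         * elevator_data, is torn down, so the work can no longer
         * run against freed memory */
        if (queue_is_mq(q))
                cancel_delayed_work_sync(&q->requeue_work);

        blk_exit_queue(q);      /* kfree()s elevator_data */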

    Fixes: fbc2a15e3433 ("blk-mq: move cancel of requeue_work into blk_mq_release")
    Cc: stable@vger.kernel.org
    Reviewed-by: Ming Lei
    Signed-off-by: zhengbin
    Signed-off-by: Jens Axboe

    zhengbin
     

08 Aug, 2019

3 commits

    As reported in [1], the call bfq_init_rq(rq) may return NULL in case
    of OOM (in particular, if rq->elv.icq is NULL because memory
    allocation failed in ioc_create_icq()).

    This commit handles this circumstance.

    [1] https://lkml.org/lkml/2019/7/22/824
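
    A sketch of the added guard, assuming it sits in
    bfq_insert_request() (a reconstruction, not the verbatim patch):

        bfqq = bfq_init_rq(rq);
        if (!bfqq || at_head || blk_rq_is_passthrough(rq)) {
                /* no bfq_queue to account against (e.g. OOM):
                 * dispatch the request directly */
                list_add_tail(&rq->queuelist, &bfqd->dispatch);
        }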

    Cc: Hsin-Yi Wang
    Cc: Nicolas Boichat
    Cc: Doug Anderson
    Reported-by: Guenter Roeck
    Reported-by: Hsin-Yi Wang
    Reviewed-by: Guenter Roeck
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Since commit 13a857a4c4e8 ("block, bfq: detect wakers and
    unconditionally inject their I/O"), every bfq_queue has a pointer to a
    waker bfq_queue and a list of the bfq_queues it may wake. In this
    respect, when a bfq_queue, say Q, remains with no I/O source attached
    to it, Q cannot be woken by any other bfq_queue, and cannot wake any
    other bfq_queue. Then Q must be removed from the woken list of its
    possible waker bfq_queue, and all bfq_queues in the woken list of Q
    must stop having a waker bfq_queue.

    Q remains with no I/O source in two cases: when the last process
    associated with Q exits or when such a process gets associated with a
    different bfq_queue. Unfortunately, commit 13a857a4c4e8 ("block, bfq:
    detect wakers and unconditionally inject their I/O") performed the
    above updates only in the first case.

    This commit fixes this bug by moving these updates to when Q gets
    freed. This is a simple and safe way to handle all cases, as both the
    above events, process exit and re-association, lead to Q being freed
    soon, and because dangling references would come out only after Q gets
    freed (if no update were performed).

    Fixes: 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O")
    Reported-by: Douglas Anderson
    Tested-by: Douglas Anderson
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Since commit 13a857a4c4e8 ("block, bfq: detect wakers and
    unconditionally inject their I/O"), BFQ stores, in a per-device
    pointer last_completed_rq_bfqq, the last bfq_queue that had an I/O
    request completed. If some bfq_queue receives new I/O right after the
    last request of last_completed_rq_bfqq has been completed, then
    last_completed_rq_bfqq may be a waker bfq_queue.

    But if the bfq_queue last_completed_rq_bfqq points to is freed, then
    last_completed_rq_bfqq becomes a dangling reference. This commit
    resets last_completed_rq_bfqq if the pointed bfq_queue is freed.
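
    A sketch of the reset, assuming it runs as the queue is freed in
    bfq_put_queue():

        /* bfqq is about to be freed: drop the per-device shortcut
         * so it cannot dangle */
        if (bfqq->bfqd && bfqq->bfqd->last_completed_rq_bfqq == bfqq)
                bfqq->bfqd->last_completed_rq_bfqq = NULL;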

    Fixes: 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O")
    Reported-by: Douglas Anderson
    Tested-by: Douglas Anderson
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

27 Jul, 2019

3 commits

  • Pull block DMA segment fix from Jens Axboe:
    "Here's the virtual boundary segment size fix"

    * tag 'for-linus-20190726-2' of git://git.kernel.dk/linux-block:
    block: fix max segment size handling in blk_queue_virt_boundary

    Linus Torvalds
     
  • We should only set the max segment size to unlimited if we actually
    have a virt boundary. Otherwise we accidentally clear that limit
    when called from the SCSI midlayer, which always calls
    blk_queue_virt_boundary, even if that mask is 0.
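
    A sketch of the guarded limit update (assumed shape of the fixed
    helper):

        void blk_queue_virt_boundary(struct request_queue *q,
                                     unsigned long mask)
        {
                q->limits.virt_boundary_mask = mask;

                /* only a real boundary implies "no segment size
                 * limit"; a mask of 0 must leave the existing
                 * max_segment_size untouched */
                if (mask)
                        q->limits.max_segment_size = UINT_MAX;
        }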

    Fixes: 7ad388d8e4c7 ("scsi: core: add a host / host template field for the virt boundary")
    Reported-by: Guenter Roeck
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Pull block fixes from Jens Axboe:

    - Several io_uring fixes/improvements:
      - Blocking fix for O_DIRECT (me)
      - Latter page slowness for registered buffers (me)
      - Fix poll hang under certain conditions (me)
      - Defer sequence check fix for wrapped rings (Zhengyuan)
      - Mismatch in async inc/dec accounting (Zhengyuan)
      - Memory ordering issue that could cause stall (Zhengyuan)
      - Track sequential defer in bytes, not pages (Zhengyuan)

    - NVMe pull request from Christoph

    - Set of hang fixes for wbt (Josef)

    - Redundant error message kill for libahci (Ding)

    - Remove unused blk_mq_sched_started_request() and related ops (Marcos)

    - drbd dynamic alloc shash descriptor to reduce stack use (Arnd)

    - blkcg ->pd_stat() non-debug print (Tejun)

    - bcache memory leak fix (Wei)

    - Comment fix (Akinobu)

    - BFQ perf regression fix (Paolo)

    * tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits)
    io_uring: ensure ->list is initialized for poll commands
    Revert "nvme-pci: don't create a read hctx mapping without read queues"
    nvme: fix multipath crash when ANA is deactivated
    nvme: fix memory leak caused by incorrect subsystem free
    nvme: ignore subnqn for ADATA SX6000LNP
    drbd: dynamically allocate shash descriptor
    block: blk-mq: Remove blk_mq_sched_started_request and started_request
    bcache: fix possible memory leak in bch_cached_dev_run()
    io_uring: track io length in async_list based on bytes
    io_uring: don't use iov_iter_advance() for fixed buffers
    block: properly handle IOCB_NOWAIT for async O_DIRECT IO
    blk-mq: allow REQ_NOWAIT to return an error inline
    io_uring: add a memory barrier before atomic_read
    rq-qos: use a mb for got_token
    rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule
    rq-qos: don't reset has_sleepers on spurious wakeups
    rq-qos: fix missed wake-ups in rq_qos_throttle
    wait: add wq_has_single_sleeper helper
    block, bfq: check also in-flight I/O in dispatch plugging
    block: fix sysfs module parameters directory path in comment
    ...

    Linus Torvalds
     

22 Jul, 2019

1 commit

  • By default, if a caller sets REQ_NOWAIT and we need to block, we'll
    return -EAGAIN through the bio->bi_end_io() callback. For some use
    cases, this makes it hard to use.

    Allow a caller to ask for inline return of errors related to
    blocking by also setting REQ_NOWAIT_INLINE.
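
    A caller-side sketch; BLK_QC_T_EAGAIN is the inline cookie this
    change introduces for the would-block case:

        blk_qc_t qc;

        bio->bi_opf |= REQ_NOWAIT | REQ_NOWAIT_INLINE;
        qc = submit_bio(bio);
        if (qc == BLK_QC_T_EAGAIN) {
                /* error reported inline; ->bi_end_io() will not be
                 * called for this failure */
                return -EAGAIN;
        }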

    Signed-off-by: Jens Axboe

    Jens Axboe
     

19 Jul, 2019

4 commits

  • Oleg noticed that our checking of data.got_token is unsafe in the
    cleanup case, and should really use a memory barrier. Use a wmb() on the
    write side and an rmb() on the read side. We don't need one in the main
    loop since we're saved by set_current_state().
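
    A sketch of the pairing; rqw and data follow the rq-qos wait code,
    and cleanup_cb stands in for the actual cleanup callback:

        /* waker (write side) */
        data->got_token = true;
        smp_wmb();              /* publish got_token before waking */
        wake_up_process(data->task);

        /* cleanup path (read side), after leaving the wait loop */
        smp_rmb();              /* pairs with the waker's smp_wmb() */
        if (data.got_token)
                cleanup_cb(rqw, private_data);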

    Reviewed-by: Oleg Nesterov
    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • In case we get a spurious wakeup we need to make sure to re-set
    ourselves to TASK_UNINTERRUPTIBLE so we don't busy wait.
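
    The standard pattern, sketched (condition_met() is a placeholder):

        set_current_state(TASK_UNINTERRUPTIBLE);
        do {
                if (condition_met())
                        break;
                io_schedule();
                /* a spurious wakeup leaves us TASK_RUNNING; without
                 * re-setting the state here, the next io_schedule()
                 * returns immediately and we busy-loop */
                set_current_state(TASK_UNINTERRUPTIBLE);
        } while (1);
        __set_current_state(TASK_RUNNING);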

    Reviewed-by: Oleg Nesterov
    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
    If we race with somebody else getting an inflight counter, we can fail
    to get one even with no sleepers on the list, and thus need to go to
    sleep. In this case has_sleepers should be true because we are
    now relying on the waker to get our inflight counter for us. And in the
    case of spurious wakeups we'd still want this to be the case. So set
    has_sleepers to true if we went to sleep to make sure we're woken up the
    proper way.

    Reviewed-by: Oleg Nesterov
    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • We saw a hang in production with WBT where there was only one waiter in
    the throttle path and no outstanding IO. This is because of the
    has_sleepers optimization that is used to make sure we don't steal an
    inflight counter for new submitters when there are people already on the
    list.

    There is a window between our lockless check for waiters on the
    waitqueue and the moment we actually add ourselves to it. If we lose
    that race we'll go to sleep and never be woken up
    because nobody is doing IO to wake us up.

    Fix this by checking if the waitqueue has a single sleeper on the list
    after we add ourselves; that way we have an up-to-date view of the list.
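
    Sketched: the waitqueue is inspected only after we are on it, via
    the new helper added in this series:

        prepare_to_wait_exclusive(&rqw->wait, &data.wq,
                                  TASK_UNINTERRUPTIBLE);
        /* we are on the list now, so this check cannot race with a
         * concurrent submitter the way the lockless pre-check could */
        has_sleeper = !wq_has_single_sleeper(&rqw->wait);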

    Reviewed-by: Oleg Nesterov
    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     

18 Jul, 2019

1 commit

  • Consider a sync bfq_queue Q that remains empty while in service, and
    suppose that, when this happens, there is a fair amount of already
    in-flight I/O not belonging to Q. In such a situation, I/O dispatching
    may need to be plugged (until new I/O arrives for Q), for the
    following reason.

    The drive may decide to serve in-flight non-Q's I/O requests before
    Q's ones, thereby delaying the arrival of new I/O requests for Q
    (recall that Q is sync). If I/O-dispatching is not plugged, then,
    while Q remains empty, a basically uncontrolled amount of I/O from
    other queues may be dispatched too, possibly causing the service of
    Q's I/O to be delayed even longer in the drive. This problem gets more
    and more serious as the speed and the queue depth of the drive grow,
    because, as these two quantities grow, the probability of finding no
    busy queue but many requests in flight grows too.

    If Q has the same weight and priority as the other queues, then the
    above delay is unlikely to cause any issue, because all queues tend to
    undergo the same treatment. So, since not plugging I/O dispatching is
    convenient for throughput, it is better not to plug. Things change in
    case Q has a higher weight or priority than some other queue, because
    Q's service guarantees may simply be violated. For this reason,
    commit 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric
    scenarios") does plug I/O in such an asymmetric scenario. Plugging
    minimizes the delay induced by already in-flight I/O, and enables Q to
    recover the bandwidth it may lose because of this delay.

    Yet the above commit does not cover the case of weight-raised queues,
    for efficiency concerns. For weight-raised queues, I/O-dispatch
    plugging is activated simply if not all bfq_queues are
    weight-raised. But this check does not handle the case of in-flight
    requests, because a bfq_queue may become non-busy *before* all its
    in-flight requests are completed.

    This commit performs I/O-dispatch plugging for weight-raised queues if
    there are some in-flight requests.

    As a practical example of the resulting recovery of control, under
    write load on a Samsung SSD 970 PRO, gnome-terminal starts in 1.5
    seconds after this fix, against 15 seconds before the fix (as a
    reference, gnome-terminal takes about 35 seconds to start with any of
    the other I/O schedulers).

    Fixes: 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric scenarios")
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

17 Jul, 2019

3 commits

  • Pull rst conversion of docs from Mauro Carvalho Chehab:
    "As agreed with Jon, I'm sending this big series directly to you, c/c
    him, as this series required special care in order to avoid
    conflicts with other trees"

    * tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (77 commits)
    docs: kbuild: fix build with pdf and fix some minor issues
    docs: block: fix pdf output
    docs: arm: fix a breakage with pdf output
    docs: don't use nested tables
    docs: gpio: add sysfs interface to the admin-guide
    docs: locking: add it to the main index
    docs: add some directories to the main documentation index
    docs: add SPDX tags to new index files
    docs: add a memory-devices subdir to driver-api
    docs: phy: place documentation under driver-api
    docs: serial: move it to the driver-api
    docs: driver-api: add remaining converted dirs to it
    docs: driver-api: add xilinx driver API documentation
    docs: driver-api: add a series of orphaned documents
    docs: admin-guide: add a series of orphaned documents
    docs: cgroup-v1: add it to the admin-guide book
    docs: aoe: add it to the driver-api book
    docs: add some documentation dirs to the driver-api book
    docs: driver-model: move it to the driver-api book
    docs: lp855x-driver.rst: add it to the driver-api book
    ...

    Linus Torvalds
     
  • The runtime configurable module parameter files are located under
    /sys/module/MODULENAME/parameters, not /sys/module/MODULENAME.

    Cc: Jens Axboe
    Signed-off-by: Akinobu Mita
    Signed-off-by: Jens Axboe

    Akinobu Mita
     
    Currently, ->pd_stat() is called only when the blkcg_debug_stats
    module parameter is set, which prevents it from printing non-debug
    policy-specific statistics. Let's move the debug testing down so
    that ->pd_stat() can print non-debug stats too. This patch doesn't
    cause any visible behavior change.

    Signed-off-by: Tejun Heo
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Tejun Heo
     

12 Jul, 2019

4 commits

  • Limit the size of the struct blk_zone array used in
    blk_revalidate_disk_zones() to avoid memory allocation failures leading
    to disk revalidation failure. Also further reduce the likelihood of
    such failures by using kvcalloc() (which can fall back to vmalloc())
    instead of allocating contiguous pages with alloc_pages().
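
    A sketch of the allocation change (the cap variable is
    illustrative):

        /* bound the array size, then let kvcalloc() fall back to
         * vmalloc() when contiguous pages are not available */
        nr_zones = min(nr_zones, max_zones_per_report);
        zones = kvcalloc(nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
        if (!zones)
                return -ENOMEM;
        /* ... report and copy zones ... */
        kvfree(zones);  /* handles both kmalloc and vmalloc memory */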

    Fixes: 515ce6061312 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation")
    Fixes: e76239a3748c ("block: add a report_zones method")
    Cc: stable@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
    Only GFP_KERNEL and GFP_NOIO are used with blkdev_report_zones(). In
    preparation for using vmalloc() for large report buffer and zone array
    allocations used by this function, remove its "gfp_t gfp_mask" argument
    and rely on the caller context to use memalloc_noio_save/restore() where
    necessary (block layer zone revalidation and dm-zoned I/O error path).
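
    A caller-side sketch of the replacement for the GFP_NOIO argument,
    using the scoped NOIO API:

        unsigned int noio_flag;
        int ret;

        /* previously: blkdev_report_zones(..., GFP_NOIO) */
        noio_flag = memalloc_noio_save();
        ret = blkdev_report_zones(bdev, sector, zones, &nr_zones);
        memalloc_noio_restore(noio_flag);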

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • To allow the SCSI subsystem scsi_execute_req() function to issue
    requests using large buffers that are better allocated with vmalloc()
    rather than kmalloc(), modify bio_map_kern() to allow passing a buffer
    allocated with vmalloc().

    To do so, detect vmalloc-ed buffers using is_vmalloc_addr(). For
    vmalloc-ed buffers, flush the buffer using flush_kernel_vmap_range(),
    use vmalloc_to_page() instead of virt_to_page() to obtain the pages of
    the buffer, and invalidate the buffer addresses with
    invalidate_kernel_vmap_range() on completion of read BIOs. This last
    point is executed using the function bio_invalidate_vmalloc_pages()
    which is defined only if the architecture defines
    ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE, that is, if the architecture
    actually needs the invalidation done.
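
    A simplified sketch of the vmalloc handling in bio_map_kern()
    (error handling and the page loop elided):

        bool is_vmalloc = is_vmalloc_addr(data);
        struct page *page;

        if (is_vmalloc)
                flush_kernel_vmap_range(data, len);

        /* vmalloc memory is not physically contiguous, so
         * virt_to_page() would be wrong for it */
        page = is_vmalloc ? vmalloc_to_page(data) : virt_to_page(data);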

    Fixes: 515ce6061312 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation")
    Fixes: e76239a3748c ("block: add a report_zones method")
    Cc: stable@vger.kernel.org
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • In bio_integrity_prep(), a kernel buffer is allocated through kmalloc() to
    hold integrity metadata. Later on, the buffer will be attached to the bio
    structure through bio_integrity_add_page(), which returns the number of
    bytes of integrity metadata attached. Due to unexpected situations,
    bio_integrity_add_page() may return 0. As a result, bio_integrity_prep()
    needs to be terminated with 'false' returned to indicate this error.
    However, the allocated kernel buffer is not freed on this execution path,
    leading to a memory leak.

    To fix this issue, free the allocated buffer before returning from
    bio_integrity_prep().
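
    Roughly, the fixed error path:

        ret = bio_integrity_add_page(bio, virt_to_page(buf),
                                     bytes, offset);
        if (ret == 0) {
                kfree(buf);     /* was leaked before this fix */
                status = BLK_STS_RESOURCE;
                goto err_end_io;
        }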

    Reviewed-by: Ming Lei
    Acked-by: Martin K. Petersen
    Signed-off-by: Wenwen Wang
    Signed-off-by: Jens Axboe

    Wenwen Wang
     

11 Jul, 2019

1 commit

  • Simultaneously writing to a sequential zone of a zoned block device
    from multiple contexts requires mutual exclusion for BIO issuing to
    ensure that writes happen sequentially. However, even for a well
    behaved user correctly implementing such synchronization, BIO plugging
    may interfere and result in BIOs from the different contexts being
    reordered if plugging is done outside the mutual exclusion section,
    e.g. the plug was started by a function higher in the call chain than
    the function issuing BIOs.

    Context A                                   Context B

    | blk_start_plug()
    | ...
    | seq_write_zone()
    |   mutex_lock(zone)
    |   bio-0->bi_iter.bi_sector = zone->wp
    |   zone->wp += bio_sectors(bio-0)
    |   submit_bio(bio-0)
    |   bio-1->bi_iter.bi_sector = zone->wp
    |   zone->wp += bio_sectors(bio-1)
    |   submit_bio(bio-1)
    |   mutex_unlock(zone)
    |   return
    | -----------------------> | seq_write_zone()
                                |   mutex_lock(zone)
                                |   bio-2->bi_iter.bi_sector = zone->wp
                                |   zone->wp += bio_sectors(bio-2)
                                |   submit_bio(bio-2)
                                |   mutex_unlock(zone)

    If context A's plug is only flushed after this point, bio-0 and
    bio-1 reach the device after bio-2, breaking the sequential write
    order the zone mutex was meant to guarantee.

    Signed-off-by: Jens Axboe

    Damien Le Moal
     

10 Jul, 2019

7 commits

    After commit 991f61fe7e1d ("Blk-throttle: reduce tail io latency when
    iops limit is enforced"), the wait time could be zero even if the group
    is throttled and cannot issue requests right now. As a result,
    throtl_select_dispatch() turns into a busy-loop under the irq-safe queue
    spinlock.

    The fix is simple: always round up the target time to the next
    throttle slice.

    Fixes: 991f61fe7e1d ("Blk-throttle: reduce tail io latency when iops limit is enforced")
    Signed-off-by: Konstantin Khlebnikov
    Cc: stable@vger.kernel.org # v4.19+
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     
  • For large values of the number of zones reported and/or large zone
    sizes, the sector increment calculated with

    blk_queue_zone_sectors(q) * n

    in the blk_report_zones() loop can overflow the unsigned int type used for
    the calculation as both "n" and blk_queue_zone_sectors() value are
    unsigned int. E.g. for a device with 256 MB zones (524288 sectors),
    overflow happens with 8192 or more zones reported.

    Changing the return type of blk_queue_zone_sectors() to sector_t fixes
    this problem and avoids the overflow for all other callers of this
    helper too. The same change is also applied to the bdev_zone_sectors()
    helper.
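
    The arithmetic, sketched:

        /* before: both operands are unsigned int, so the product is
         * computed in 32 bits; 524288 * 8192 == 2^32 wraps to 0 even
         * though the result is stored in a 64-bit sector_t */
        sector += blk_queue_zone_sectors(q) * n;

        /* after: blk_queue_zone_sectors() returns sector_t, so the
         * multiplication itself is carried out in 64 bits */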

    Fixes: e76239a3748c ("block: add a report_zones method")
    Cc: stable@vger.kernel.org
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • When a shared kthread needs to issue a bio for a cgroup, doing so
    synchronously can lead to priority inversions as the kthread can be
    trapped waiting for that cgroup. This patch implements the
    REQ_CGROUP_PUNT flag, which makes submit_bio() punt the actual issuing
    to a dedicated per-blkcg work item to avoid such priority inversions.

    This will be used to fix priority inversions in btrfs compression and
    should be generally useful as we grow filesystem support for
    comprehensive IO control.
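
    A submitter-side sketch: the flag only marks the bio, and
    submit_bio() re-queues the work accordingly:

        /* issue from a per-blkcg worker instead of this kthread, so
         * a throttled cgroup cannot block the shared context */
        bio->bi_opf |= REQ_CGROUP_PUNT;
        submit_bio(bio);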

    Cc: Chris Mason
    Reviewed-by: Josef Bacik
    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • btrfs is going to use css_put() and wbc helpers to improve cgroup
    writeback support. Add dummy css_get() definition and export wbc
    helpers to prepare for module and !CONFIG_CGROUP builds.

    Reported-by: kbuild test robot
    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With the psi stuff in place we can use the memstall flag to indicate
    pressure that happens from throttling.
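
    Sketched with the psi memstall API around whichever sleep
    implements the throttle:

        unsigned long pflags;

        psi_memstall_enter(&pflags);
        /* ... throttled sleep / delay ... */
        psi_memstall_leave(&pflags);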

    Signed-off-by: Josef Bacik
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • We discovered a problem in newer kernels where a disconnect of a NBD
    device while the flush request was pending would result in a hang. This
    is because the blk mq timeout handler does

        if (!refcount_inc_not_zero(&rq->ref))
                return true;

    to determine if it's ok to run the timeout handler for the request.
    Flush_rq's don't have a ref count set, so we'd skip running the timeout
    handler for this request and it would just sit there in limbo forever.

    Fix this by always setting the refcount of any request going through
    blk_rq_init() to 1. I tested this with an nbd-server that dropped flush
    requests to verify that it hung, and then tested with this patch to
    verify I got the timeout as expected and the error handling kicked in.
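
    A sketch of the fix, assuming blk_rq_init() is the common spot
    every such request passes through:

        void blk_rq_init(struct request_queue *q, struct request *rq)
        {
                memset(rq, 0, sizeof(*rq));
                /* ... other field initialization ... */

                /* flush requests bypass the normal tag allocation,
                 * so give every initialized request a valid ref */
                refcount_set(&rq->ref, 1);
        }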

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • Pull block updates from Jens Axboe:
    "This is the main block updates for 5.3. Nothing earth shattering or
    major in here, just fixes, additions, and improvements all over the
    map. This contains:

    - Series of documentation fixes (Bart)

    - Optimization of the blk-mq ctx get/put (Bart)

    - null_blk removal race condition fix (Bob)

    - req/bio_op() cleanups (Chaitanya)

    - Series cleaning up the segment accounting, and request/bio mapping
    (Christoph)

    - Series cleaning up the page getting/putting for bios (Christoph)

    - block cgroup cleanups and moving it to where it is used (Christoph)

    - block cgroup fixes (Tejun)

    - Series of fixes and improvements to bcache, most notably a write
    deadlock fix (Coly)

    - blk-iolatency STS_AGAIN and accounting fixes (Dennis)

    - Series of improvements and fixes to BFQ (Douglas, Paolo)

    - debugfs_create() return value check removal for drbd (Greg)

    - Use struct_size(), where appropriate (Gustavo)

    - Two lightnvm fixes (Heiner, Geert)

    - MD fixes, including a read balance and corruption fix (Guoqing,
    Marcos, Xiao, Yufen)

    - block opal shadow mbr additions (Jonas, Revanth)

    - sbitmap compare-and-exchange improvements (Pavel)

    - Fix for potential bio->bi_size overflow (Ming)

    - NVMe pull requests:
      - improved PCIe suspend support (Keith Busch)
      - error injection support for the admin queue (Akinobu Mita)
      - Fibre Channel discovery improvements (James Smart)
      - tracing improvements including nvmet tracing support (Minwoo Im)
      - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
        Kulkarni)

    - Various little fixes and improvements to drivers and core"

    * tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits)
    blk-iolatency: fix STS_AGAIN handling
    block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES
    blk-mq: simplify blk_mq_make_request()
    blk-mq: remove blk_mq_put_ctx()
    sbitmap: Replace cmpxchg with xchg
    block: fix .bi_size overflow
    block: sed-opal: check size of shadow mbr
    block: sed-opal: ioctl for writing to shadow mbr
    block: sed-opal: add ioctl for done-mark of shadow mbr
    block: never take page references for ITER_BVEC
    direct-io: use bio_release_pages in dio_bio_complete
    block_dev: use bio_release_pages in bio_unmap_user
    block_dev: use bio_release_pages in blkdev_bio_end_io
    iomap: use bio_release_pages in iomap_dio_bio_end_io
    block: use bio_release_pages in bio_map_user_iov
    block: use bio_release_pages in bio_unmap_user
    block: optionally mark pages dirty in bio_release_pages
    block: move the BIO_NO_PAGE_REF check into bio_release_pages
    block: skd_main.c: Remove call to memset after dma_alloc_coherent
    block: mtip32xx: Remove call to memset after dma_alloc_coherent
    ...

    Linus Torvalds
     

07 Jul, 2019

1 commit

  • When the blk-mq debugfs file creation logic was "cleaned up" it was
    cleaned up too much, causing the queue file to not be created in the
    correct location. It turns out the check for the directory being
    present is needed: if the directory has not been created yet, the files
    must not be created either, and the function will be called again later
    in the initialization code so that the files end up in the correct
    location.
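
    The restored guard, sketched (assumed shape):

        /* parent directory not created yet: bail out, we will be
         * called again later in initialization */
        if (!q->debugfs_dir)
                return;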

    Fixes: 6cfc0081b046 ("blk-mq: no need to check return value of debugfs_create functions")
    Reported-by: Stephen Rothwell
    Cc: linux-block@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Jens Axboe

    Greg Kroah-Hartman
     

06 Jul, 2019

1 commit

  • The iolatency controller is based on rq_qos. It increments on
    rq_qos_throttle() and decrements on either rq_qos_cleanup() or
    rq_qos_done_bio(). Commit a3fb01ba5af0 fixes the double accounting
    issue where blk_mq_make_request() may call both rq_qos_cleanup() and
    rq_qos_done_bio() on REQ_NOWAIT. So checking STS_AGAIN prevents the
    double decrement.

    The above works upstream as the only way we can get STS_AGAIN is from
    blk_mq_get_request() failing. The STS_AGAIN handling isn't a real
    problem as bio_endio() skipping only happens on reserved tag allocation
    failures which can only be caused by driver bugs and already triggers
    WARN.

    However, the fix creates a not so great dependency on how STS_AGAIN can
    be propagated. Internally, we (Facebook) carry a patch that kills read
    ahead if a cgroup is IO congested or a fatal signal is pending. This,
    combined with chained bios propagating their bi_status to the parent
    if it is not already set, can cause the parent bio to not clean up
    properly even though it was successful. This consequently leaks the
    inflight counter and can hang all IOs under that blkg.

    To nip the adverse interaction early, this removes the rq_qos_cleanup()
    callback in iolatency in favor of cleaning up always on the
    rq_qos_done_bio() path.

    Fixes: a3fb01ba5af0 ("blk-iolatency: only account submitted bios")
    Debugged-by: Tejun Heo
    Debugged-by: Josef Bacik
    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

03 Jul, 2019

3 commits

  • Fix a regression introduced when removing bi_phys_segments for Write Zeroes
    requests, which need to have a segment count of zero, as they don't have a
    payload.

    Fixes: 14ccb66b3f58 ("block: remove the bi_phys_segments field in struct bio")
    Reported-by: Jens Axboe
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Move the blk_mq_bio_to_request() call in front of the if-statement.

    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Reviewed-by: Minwoo Im
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends
    on preemption being disabled for its correctness. Since removing the CPU
    preemption calls does not measurably affect performance, simplify the
    blk-mq code by removing the blk_mq_put_ctx() function and also by not
    disabling preemption in blk_mq_get_ctx().

    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

01 Jul, 2019

1 commit

    'bio->bi_iter.bi_size' is 'unsigned int', which can hold at most 4G - 1
    bytes.

    Before 07173c3ec276 ("block: enable multipage bvecs"), one bio could
    include only a limited number of pages, usually at most 256, so an fs
    bio was rarely bigger than 1M bytes.

    Since we support multi-page bvecs, in theory far more than 1M pages can
    be added to a single fs bio, especially with hugepages or big writeback
    with many dirty pages. Then there is a chance that .bi_size overflows.

    Fix this issue by using bio_full() to check whether the added segment
    may overflow .bi_size.
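
    Sketch of the guard when adding a page; this fix extends bio_full()
    to take the length being added:

        if (bio_full(bio, len)) {
                /* adding len would exhaust the bvec slots or push
                 * .bi_size past UINT_MAX: let the caller submit this
                 * bio and allocate a new one */
                return 0;
        }
        __bio_add_page(bio, page, len, offset);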

    Cc: Liu Yiding
    Cc: kernel test robot
    Cc: "Darrick J. Wong"
    Cc: linux-xfs@vger.kernel.org
    Cc: linux-fsdevel@vger.kernel.org
    Cc: stable@vger.kernel.org
    Fixes: 07173c3ec276 ("block: enable multipage bvecs")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei