13 Jan, 2021

3 commits

  • [ Upstream commit 52abca64fd9410ea6c9a3a74eab25663b403d7da ]

    blk_queue_enter() accepts BLK_MQ_REQ_PM requests independent of the runtime
    power management state. Now that SCSI domain validation no longer depends
    on this behavior, modify the behavior of blk_queue_enter() as follows:

    - Do not accept any requests while suspended.

    - Only process power management requests while suspending or resuming.

    Submitting BLK_MQ_REQ_PM requests to a device that is runtime suspended
    causes runtime-suspended devices not to resume as they should. The request
    which should cause a runtime resume instead gets issued directly, without
    resuming the device first. Of course the device can't handle it properly,
    the I/O fails, and the device remains suspended.

    The problem is fixed by checking that the queue's runtime-PM status isn't
    RPM_SUSPENDED before allowing a request to be issued, and queuing a
    runtime-resume request if it is. In particular, the inline
    blk_pm_request_resume() routine is renamed blk_pm_resume_queue() and the
    code is unified by merging the surrounding checks into the routine. If the
    queue isn't set up for runtime PM, or there currently is no restriction on
    allowed requests, the request is allowed. Likewise if the BLK_MQ_REQ_PM
    flag is set and the status isn't RPM_SUSPENDED. Otherwise a runtime resume
    is queued and the request is blocked until conditions are more suitable.
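
    A minimal sketch of the resulting helper, assuming the status is read
    straight from q->rpm_status (the note below mentions a queue_rpm_status()
    accessor was introduced for this); not the literal upstream diff:

        static int blk_pm_resume_queue(const bool pm, struct request_queue *q)
        {
                if (!q->dev || !blk_queue_pm_only(q))
                        return 1;       /* no runtime PM or no restriction: allow */
                if (pm && q->rpm_status != RPM_SUSPENDED)
                        return 1;       /* BLK_MQ_REQ_PM and not suspended: allow */
                pm_request_resume(q->dev);      /* kick off a runtime resume */
                return 0;               /* caller blocks until conditions improve */
        }

    blk_queue_enter() would then wait on this condition instead of open-coding
    the checks around the old blk_pm_request_resume().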

    [ bvanassche: modified commit message and removed Cc: stable because
    without the previous patches from this series this patch would break
    parallel SCSI domain validation + introduced queue_rpm_status() ]

    Link: https://lore.kernel.org/r/20201209052951.16136-9-bvanassche@acm.org
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Can Guo
    Cc: Stanley Chu
    Cc: Ming Lei
    Cc: Rafael J. Wysocki
    Reported-and-tested-by: Martin Kepplinger
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Can Guo
    Signed-off-by: Alan Stern
    Signed-off-by: Bart Van Assche
    Signed-off-by: Martin K. Petersen
    Signed-off-by: Sasha Levin

    Alan Stern
     
  • [ Upstream commit a4d34da715e3cb7e0741fe603dcd511bed067e00 ]

    Remove the RQF_PREEMPT flag and BLK_MQ_REQ_PREEMPT since these are no
    longer used by any kernel code.

    Link: https://lore.kernel.org/r/20201209052951.16136-8-bvanassche@acm.org
    Cc: Can Guo
    Cc: Stanley Chu
    Cc: Alan Stern
    Cc: Ming Lei
    Cc: Rafael J. Wysocki
    Cc: Martin Kepplinger
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Jens Axboe
    Reviewed-by: Can Guo
    Signed-off-by: Bart Van Assche
    Signed-off-by: Martin K. Petersen
    Signed-off-by: Sasha Levin

    Bart Van Assche
     
  • [ Upstream commit 0854bcdcdec26aecdc92c303816f349ee1fba2bc ]

    Introduce the BLK_MQ_REQ_PM flag. This flag makes the request allocation
    functions set RQF_PM. This is the first step towards removing
    BLK_MQ_REQ_PREEMPT.
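
    The effect on the allocation path is essentially a one-line translation;
    a sketch of where it lands (placement in blk_mq_rq_ctx_init() is an
    assumption, not quoted from the patch):

        /* sketch: translate the allocation flag into the request flag */
        if (data->flags & BLK_MQ_REQ_PM)
                rq->rq_flags |= RQF_PM;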

    Link: https://lore.kernel.org/r/20201209052951.16136-3-bvanassche@acm.org
    Cc: Alan Stern
    Cc: Stanley Chu
    Cc: Ming Lei
    Cc: Rafael J. Wysocki
    Cc: Can Guo
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Jens Axboe
    Reviewed-by: Can Guo
    Signed-off-by: Bart Van Assche
    Signed-off-by: Martin K. Petersen
    Signed-off-by: Sasha Levin

    Bart Van Assche
     

14 Oct, 2020

1 commit

  • A zoned device with limited resources to open or activate zones may
    return an error when the host exceeds those limits. The same command may
    be successful if retried later, but the host needs to wait for specific
    zone states before it should expect a retry to succeed. Have the block
    layer provide an appropriate status for these conditions so applications
    can distinguish this error for special handling.
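
    A sketch of how a zoned driver might report these conditions. The status
    names below are what this change is understood to introduce; treat them
    as assumptions rather than quotes from the message above:

        /* sketch: complete the request with a zone-resource status */
        if (too_many_open_zones)                /* hypothetical driver check */
                blk_mq_end_request(req, BLK_STS_ZONE_OPEN_RESOURCE);
        else if (too_many_active_zones)         /* hypothetical driver check */
                blk_mq_end_request(req, BLK_STS_ZONE_ACTIVE_RESOURCE);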

    Cc: linux-api@vger.kernel.org
    Cc: Niklas Cassel
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Damien Le Moal
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

09 Oct, 2020

1 commit

    syzbot is reporting an unkillable task [1]: the caller fails to handle a
    corrupted filesystem image which attempts to access beyond the end of the
    device. While we need to fix the caller, flooding the console with the
    handle_bad_sector() message is unlikely to be useful.

    [1] https://syzkaller.appspot.com/bug?id=f1f49fb971d7a3e01bd8ab8cff2ff4572ccf3092
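
    The approach taken is understood to be rate limiting the message rather
    than dropping it; a sketch along those lines (assuming pr_info_ratelimited(),
    not the literal diff):

        /* sketch: in handle_bad_sector(), avoid flooding the console */
        char b[BDEVNAME_SIZE];

        pr_info_ratelimited("attempt to access beyond end of device\n"
                            "%s: want=%llu, limit=%llu\n",
                            bio_devname(bio, b),
                            (unsigned long long)bio_end_sector(bio),
                            (unsigned long long)maxsector);  /* device capacity */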

    Signed-off-by: Tetsuo Handa
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Tetsuo Handa
     

06 Oct, 2020

1 commit

  • blk_crypto_rq_bio_prep() assumes its gfp_mask argument always includes
    __GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

    However, blk_crypto_rq_bio_prep() might be called with GFP_ATOMIC via
    setup_clone() in drivers/md/dm-rq.c.

    This case isn't currently reachable with a bio that actually has an
    encryption context. However, it's fragile to rely on this. Just make
    blk_crypto_rq_bio_prep() able to fail.
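
    A sketch of what "able to fail" means for the calling convention (shape
    only, names per the existing blk-crypto helpers; not the literal diff):

        static inline int blk_crypto_rq_bio_prep(struct request *rq, struct bio *bio,
                                                 gfp_t gfp_mask)
        {
                if (bio_has_crypt_ctx(bio))
                        return __blk_crypto_rq_bio_prep(rq, bio, gfp_mask);
                return 0;
        }

        /* callers now propagate the failure instead of assuming success */
        ret = blk_crypto_rq_bio_prep(rq, bio, gfp_mask);
        if (ret < 0)
                goto free_and_out;      /* e.g. in the request clone path */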

    Suggested-by: Satya Tangirala
    Signed-off-by: Eric Biggers
    Reviewed-by: Mike Snitzer
    Reviewed-by: Satya Tangirala
    Cc: Miaohe Lin
    Signed-off-by: Jens Axboe

    Eric Biggers
     

25 Sep, 2020

3 commits

  • Add QUEUE_FLAG_NOWAIT to allow a block device to advertise support for
    REQ_NOWAIT. Bio-based devices may set QUEUE_FLAG_NOWAIT where
    applicable.

    Update QUEUE_FLAG_MQ_DEFAULT to include QUEUE_FLAG_NOWAIT. Also
    update submit_bio_checks() to verify it is set for REQ_NOWAIT bios.
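
    A sketch of both sides, assuming a blk_queue_nowait() test helper in the
    usual QUEUE_FLAG_* style (illustrative, not the literal diff):

        /* a bio-based driver that can honor REQ_NOWAIT advertises it */
        blk_queue_flag_set(QUEUE_FLAG_NOWAIT, q);

        /* submit_bio_checks(): fail fast when the queue cannot honor it */
        if ((bio->bi_opf & REQ_NOWAIT) && !blk_queue_nowait(q))
                goto not_supported;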

    Reported-by: Konstantin Khlebnikov
    Suggested-by: Christoph Hellwig
    Signed-off-by: Mike Snitzer
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • Just checking SB_I_CGROUPWB for cgroup writeback support is enough.
    Either the file system allocates its own bdi (e.g. btrfs), in which case
    it is known to support cgroup writeback, or the bdi comes from the block
    layer, which always supports cgroup writeback.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
    Set up a readahead size by default, as very few users have a good
    reason to change it. This means coda, ecryptfs, and orangefs now get
    the values set up where they were previously missing, while ubifs,
    mtd and vboxsf manually set it to 0 to avoid readahead.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Acked-by: David Sterba [btrfs]
    Acked-by: Richard Weinberger [ubifs, mtd]
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Sep, 2020

1 commit

    The per-hctx nr_active value can no longer be used to fairly assign a share
    of tag depth per request queue when using a shared sbitmap, as it does
    not account for the tags being shared across all hctx's.

    For this case, record the nr_active_requests per request_queue, and make
    the judgement based on that value.
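
    A sketch of the idea; the helper and field names are illustrative of the
    approach rather than quoted from the patch:

        /* sketch: count in-flight requests per queue when tags are shared */
        static inline void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx)
        {
                if (blk_mq_is_sbitmap_shared(hctx->flags))
                        atomic_inc(&hctx->queue->nr_active_requests_shared_sbitmap);
                else
                        atomic_inc(&hctx->nr_active);
        }

    The fairness check in hctx_may_queue() then reads whichever counter
    applies instead of the per-hctx value alone.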

    Co-developed-with: Kashyap Desai
    Signed-off-by: John Garry
    Tested-by: Don Brace #SCSI resv cmds patches used
    Tested-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    John Garry
     

01 Sep, 2020

1 commit

  • If a driver leaves the limit settings as the defaults, then we don't
    initialize bdi->io_pages. This means that file systems may need to
    work around bdi->io_pages == 0, which is somewhat messy.

    Initialize the default value just like we do for ->ra_pages.
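
    A sketch of the fix, assuming the default sits next to the existing
    ->ra_pages default in bdi initialization:

        /* sketch: give ->io_pages a sane default alongside ->ra_pages */
        bdi->ra_pages = VM_READAHEAD_PAGES;
        bdi->io_pages = VM_READAHEAD_PAGES;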

    Cc: stable@vger.kernel.org
    Fixes: 9491ae4aade6 ("mm: don't cap request size based on read-ahead setting")
    Reported-by: OGAWA Hirofumi
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

04 Aug, 2020

1 commit

  • Pull io_uring updates from Jens Axboe:
    "Lots of cleanups in here, hardening the code and/or making it easier
    to read and fixing bugs, but a core feature/change too adding support
    for real async buffered reads. With the latter in place, we just need
    buffered write async support and we're done relying on kthreads for
    the fast path. In detail:

    - Cleanup how memory accounting is done on ring setup/free (Bijan)

    - sq array offset calculation fixup (Dmitry)

    - Consistently handle blocking off O_DIRECT submission path (me)

    - Support proper async buffered reads, instead of relying on kthread
    offload for that. This uses the page waitqueue to drive retries
    from task_work, like we handle poll based retry. (me)

    - IO completion optimizations (me)

    - Fix race with accounting and ring fd install (me)

    - Support EPOLLEXCLUSIVE (Jiufei)

    - Get rid of the io_kiocb unionizing, made possible by shrinking
    other bits (Pavel)

    - Completion side cleanups (Pavel)

    - Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)

    - Request environment grabbing cleanups (Pavel)

    - File and socket read/write cleanups (Pavel)

    - Improve kiocb_set_rw_flags() (Pavel)

    - Tons of fixes and cleanups (Pavel)

    - IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)"

    * tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits)
    io_uring: flip if handling after io_setup_async_rw
    fs: optimise kiocb_set_rw_flags()
    io_uring: don't touch 'ctx' after installing file descriptor
    io_uring: get rid of atomic FAA for cq_timeouts
    io_uring: consolidate *_check_overflow accounting
    io_uring: fix stalled deferred requests
    io_uring: fix racy overflow count reporting
    io_uring: deduplicate __io_complete_rw()
    io_uring: de-unionise io_kiocb
    io-wq: update hash bits
    io_uring: fix missing io_queue_linked_timeout()
    io_uring: mark ->work uninitialised after cleanup
    io_uring: deduplicate io_grab_files() calls
    io_uring: don't do opcode prep twice
    io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
    io_uring: batch put_task_struct()
    tasks: add put_task_struct_many()
    io_uring: return locked and pinned page accounting
    io_uring: don't miscount pinned memory
    io_uring: don't open-code recv kbuf managment
    ...

    Linus Torvalds
     

08 Jul, 2020

1 commit

    If blk_mq_submit_bio flushes the plug list, bios for other disks can
    show up on current->bio_list. As that doesn't involve any stacking of
    block devices, it is entirely harmless and we should not warn about
    this case.

    Fixes: ff93ea0ce763 ("block: shortcut __submit_bio_noacct for blk-mq drivers")
    Reported-by: kernel test robot
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

29 Jun, 2020

1 commit

    blkcg_bio_issue_check is a giant inline function that does three entirely
    different things. Factor out the blk-cgroup related bio initialization
    into a new helper, and open-code the sequence in the only caller,
    relying on the fact that all the actual functionality is stubbed out for
    non-cgroup builds.

    Acked-by: Tejun Heo
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Jun, 2020

4 commits

    We were creating the request_queue debugfs_dir only for make_request
    block drivers (multiqueue), but never for request-based block drivers.
    We did this because we were only creating additional, non-blktrace
    debugfs files in that directory for make_request drivers. However,
    since blktrace *always* creates that directory anyway, we special-case
    the use of that directory for blktrace. Besides being an eyesore, this
    exposes request-based block drivers to the same fragile debugfs race
    that used to exist with make_request block drivers: if we start adding
    files to that directory, we can later race with a double removal of
    dentries on the directory unless blktrace handles this carefully.

    Instead, just simplify things by always creating the request_queue
    debugfs_dir on request_queue registration. Rename the mutex also to
    reflect the fact that this is used outside of the blktrace context.
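
    A sketch of the unconditional creation, assuming the directory is keyed
    by the disk's kobject name under the block debugfs root:

        /* sketch: in blk_register_queue(), create the dir for every queue */
        q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
                                            blk_debugfs_root);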

    Signed-off-by: Luis Chamberlain
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Luis Chamberlain
     
    Commit dc9edc44de6c ("block: Fix a blk_exit_rl() regression"), merged in
    v4.12, moved the work behind blk_release_queue() into a workqueue after a
    splat floated around which indicated some work on blk_release_queue()
    could sleep in blk_exit_rl(). This splat would be possible when a driver
    called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
    as its final call) from an atomic context.

    blk_put_queue() decrements the refcount for the request_queue kobject, and
    upon reaching 0 blk_release_queue() is called. Although blk_exit_rl() was
    removed by commit db6d99523560 ("block: remove request_list code")
    in v5.0, we reserve the right to be able to sleep within the
    blk_release_queue() context.

    The last reference to the request_queue must not be dropped from atomic
    context. *When* the last reference to the request_queue reaches 0 varies,
    and so let's take the opportunity to document when that is expected to
    happen and also document the context of the related calls as best as
    possible so we can avoid future issues, and with the hopes that the
    synchronous request_queue removal sticks.

    We revert back to synchronous request_queue removal because asynchronous
    removal creates a regression with expected userspace interaction with
    several drivers. An example is when removing the loopback driver, one
    uses ioctls from userspace to do so, but upon return and if successful,
    one expects the device to be removed. Likewise, if one races to add another
    device, the new one may not be added as it is still being removed. This was
    expected behavior before, and it now fails as the device is still present
    and busy. Moving to asynchronous request_queue removal could have
    broken many scripts which relied on the removal to have been completed if
    there was no error. Document this expectation as well so that this
    doesn't regress userspace again.

    Using asynchronous request_queue removal however has helped us find
    other bugs. In the future we can test what could break with this
    arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.

    While at it, update the docs with the context expectations for the
    request_queue / gendisk refcount decrement, and make these
    expectations explicit by using might_sleep().

    Fixes: dc9edc44de6c ("block: Fix a blk_exit_rl() regression")
    Suggested-by: Nicolai Stange
    Signed-off-by: Luis Chamberlain
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Cc: Bart Van Assche
    Cc: Omar Sandoval
    Cc: Hannes Reinecke
    Cc: Nicolai Stange
    Cc: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: yu kuai
    Signed-off-by: Jens Axboe

    Luis Chamberlain
     
    Let us clarify the context in which the helpers that increment the
    refcount for the gendisk and request_queue can be called. We make
    this explicit in the places where we may sleep with might_sleep().

    We don't address the decrement context yet, as that needs some extra
    work and fixes, but will be addressed in the next patch.

    Signed-off-by: Luis Chamberlain
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Luis Chamberlain
     
  • This adds documentation for the gendisk / request_queue refcount
    helpers.

    Signed-off-by: Luis Chamberlain
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Luis Chamberlain
     

03 Jun, 2020

2 commits

  • Pull block updates from Jens Axboe:
    "Core block changes that have been queued up for this release:

    - Remove dead blk-throttle and blk-wbt code (Guoqing)

    - Include pid in blktrace note traces (Jan)

    - Don't spew I/O errors on wouldblock termination (me)

    - Zone append addition (Johannes, Keith, Damien)

    - IO accounting improvements (Konstantin, Christoph)

    - blk-mq hardware map update improvements (Ming)

    - Scheduler dispatch improvement (Salman)

    - Inline block encryption support (Satya)

    - Request map fixes and improvements (Weiping)

    - blk-iocost tweaks (Tejun)

    - Fix for timeout failing with error injection (Keith)

    - Queue re-run fixes (Douglas)

    - CPU hotplug improvements (Christoph)

    - Queue entry/exit improvements (Christoph)

    - Move DMA drain handling to the few drivers that use it (Christoph)

    - Partition handling cleanups (Christoph)"

    * tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits)
    block: mark bio_wouldblock_error() bio with BIO_QUIET
    blk-wbt: rename __wbt_update_limits to wbt_update_limits
    blk-wbt: remove wbt_update_limits
    blk-throttle: remove tg_drain_bios
    blk-throttle: remove blk_throtl_drain
    null_blk: force complete for timeout request
    blk-mq: drain I/O when all CPUs in a hctx are offline
    blk-mq: add blk_mq_all_tag_iter
    blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx
    blk-mq: use BLK_MQ_NO_TAG in more places
    blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
    blk-mq: move more request initialization to blk_mq_rq_ctx_init
    blk-mq: simplify the blk_mq_get_request calling convention
    blk-mq: remove the bio argument to ->prepare_request
    nvme: force complete cancelled requests
    blk-mq: blk-mq: provide forced completion method
    block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds
    block: blk-crypto-fallback: remove redundant initialization of variable err
    block: reduce part_stat_lock() scope
    block: use __this_cpu_add() instead of access by smp_processor_id()
    ...

    Linus Torvalds
     
  • Patch series "Change readahead API", v11.

    This series adds a readahead address_space operation to replace the
    readpages operation. The key difference is that pages are added to the
    page cache as they are allocated (and then looked up by the filesystem)
    instead of passing them on a list to the readpages operation and having
    the filesystem add them to the page cache. It's a net reduction in code
    for each implementation, more efficient than walking a list, and solves
    the direct-write vs buffered-read problem reported by yu kuai at
    http://lkml.kernel.org/r/20200116063601.39201-1-yukuai3@huawei.com

    The only unconverted filesystems are those which use fscache. Their
    conversion is pending Dave Howells' rewrite which will make the
    conversion substantially easier. This should be completed by the end of
    the year.

    I want to thank the reviewers/testers; Dave Chinner, John Hubbard, Eric
    Biggers, Johannes Thumshirn, Dave Sterba, Zi Yan, Christoph Hellwig and
    Miklos Szeredi have done a marvellous job of providing constructive
    criticism.

    These patches pass an xfstests run on ext4, xfs & btrfs with no
    regressions that I can tell (some of the tests seem a little flaky
    before and remain flaky afterwards).

    This patch (of 25):

    The readahead code is part of the page cache so should be found in the
    pagemap.h file. force_page_cache_readahead is only used within mm, so
    move it to mm/internal.h instead. Remove the parameter names where they
    add no value, and rename the ones which were actively misleading.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Reviewed-by: William Kucharski
    Reviewed-by: Johannes Thumshirn
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-1-willy@infradead.org
    Link: http://lkml.kernel.org/r/20200414150233.24495-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

27 May, 2020

4 commits

    We only need the stats lock (aka preempt_disable()) for updating the
    stats, not for looking up or dropping the hd_struct reference.
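
    A sketch of the resulting shape in the completion accounting path
    (illustrative, not the literal diff):

        const int sgrp = op_stat_group(req_op(req));
        u64 now = ktime_get_ns();
        struct hd_struct *part = req->part;     /* lookup outside the lock */

        part_stat_lock();
        update_io_ticks(part, jiffies, true);
        part_stat_inc(part, ios[sgrp]);
        part_stat_add(part, nsecs[sgrp], now - req->start_time_ns);
        part_stat_unlock();

        hd_struct_put(part);            /* drop the reference outside it too */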

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
    Move the non-"new_io" branch of blk_account_io_start() into a separate
    function. Fix merge accounting for discards (they were counted as write
    merges).

    The new blk_account_io_merge_bio() doesn't call update_io_ticks() unlike
    blk_account_io_start(), as there is no reason for that.

    [hch: rebased]
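
    A sketch of the split-out helper; the deliberate omission of
    update_io_ticks() is the point noted above (shape only, not the literal
    diff):

        static void blk_account_io_merge_bio(struct request *req)
        {
                if (!blk_do_io_stat(req))
                        return;

                part_stat_lock();
                part_stat_inc(req->part, merges[op_stat_group(req_op(req))]);
                part_stat_unlock();
        }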

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     
  • All callers are in blk-core.c, so move update_io_ticks over.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Add two new helpers to simplify I/O accounting for bio based drivers.
    Currently these drivers use the generic_start_io_acct and
    generic_end_io_acct helpers which have very cumbersome calling
    conventions, don't actually return the time they started accounting,
    and try to deal with accounting for partitions, which can't happen
    for bio based drivers. The new helpers will be used to subsequently
    replace uses of the old helpers.

    The main API is the bio-based wrappers in blkdev.h, but for zram,
    which wants to account rw_page based I/O, lower level routines are
    provided as well.
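
    A sketch of how a bio-based driver would use the new pair, assuming the
    bio_start_io_acct()/bio_end_io_acct() names from this series:

        /* sketch: account a bio across its lifetime in a bio-based driver */
        unsigned long start_time = bio_start_io_acct(bio);

        /* ... driver submits and later completes the I/O ... */

        bio_end_io_acct(bio, start_time);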

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Christoph Hellwig