28 Oct, 2020

1 commit

  • When the bio's size reaches max_append_sectors, bio_add_hw_page
    returns 0 and __bio_iov_append_get_pages returns -EINVAL. This is an
    expected result of building a bio small enough not to be split in the
    IO path. However, iov_iter is not advanced in this case, causing the
    same pages to be filled for the bio again and again.

    Fix the case by properly advancing the iov_iter for already processed
    pages.
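
    A minimal sketch of the fixed loop in __bio_iov_append_get_pages()
    (surrounding declarations and error handling assumed):

    for (left = size, i = 0; left > 0; left -= len, i++) {
            struct page *page = pages[i];
            bool same_page = false;

            len = min_t(size_t, PAGE_SIZE - offset, left);
            if (bio_add_hw_page(q, bio, page, len, offset,
                                max_append_sectors, &same_page) != len) {
                    ret = -EINVAL;  /* bio reached max_append_sectors */
                    break;
            }
            if (same_page)
                    put_page(page);
            offset = 0;
    }

    /* Advance past the bytes actually added, even on early exit,
     * so the same pages are not filled into the bio again. */
    iov_iter_advance(iter, size - left);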

    Fixes: 0512a75b98f8 ("block: Introduce REQ_OP_ZONE_APPEND")
    Cc: stable@vger.kernel.org # 5.8+
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Naohiro Aota
    Signed-off-by: Jens Axboe

    Naohiro Aota
     

15 Oct, 2020

2 commits

  • Fix this warning:

    ./block/bio.c:1098: WARNING: Inline emphasis start-string without end-string.

    The problem is that *iter is not valid markup.

    That seems to be a typo:
    *iter -> @iter
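
    For illustration, a kernel-doc comment refers to parameters with
    @name, while *text* is Sphinx emphasis and needs a closing '*'
    (hypothetical comment):

    /**
     * bio_iov_iter_get_pages - pin pages from an iterator into a bio
     * @bio:  bio to add pages to
     * @iter: iov iterator describing the pages (@iter, not *iter)
     */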

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     
  • Using "@bio's parent" causes the following warning:
    ./block/bio.c:10: WARNING: Inline emphasis start-string without end-string.

    The main problem here is that kernel-doc would convert this into:

    **bio**'s parent

    which is not valid notation. It would be possible to use this
    kernel-doc markup instead:

    ``bio's`` parent

    Yet here it is probably simpler to just use alternative wording:

    the parent of @bio

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     

14 Oct, 2020

1 commit

  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

06 Oct, 2020

1 commit

  • bio_crypt_clone() assumes its gfp_mask argument always includes
    __GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

    However, bio_crypt_clone() might be called with GFP_ATOMIC via
    setup_clone() in drivers/md/dm-rq.c, or with GFP_NOWAIT via
    kcryptd_io_read() in drivers/md/dm-crypt.c.

    Neither case is currently reachable with a bio that actually has an
    encryption context. However, it's fragile to rely on this. Just make
    bio_crypt_clone() able to fail, analogous to bio_integrity_clone().
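
    A sketch of the resulting fallible clone (pool and field names
    assumed from the description):

    int bio_crypt_clone(struct bio *dst, struct bio *src, gfp_t gfp_mask)
    {
            if (!src->bi_crypt_context)
                    return 0;

            /* May fail when gfp_mask lacks __GFP_DIRECT_RECLAIM,
             * e.g. for GFP_ATOMIC or GFP_NOWAIT callers. */
            dst->bi_crypt_context = mempool_alloc(bio_crypt_ctx_pool,
                                                  gfp_mask);
            if (!dst->bi_crypt_context)
                    return -ENOMEM;

            *dst->bi_crypt_context = *src->bi_crypt_context;
            return 0;
    }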

    Reported-by: Miaohe Lin
    Signed-off-by: Eric Biggers
    Reviewed-by: Mike Snitzer
    Reviewed-by: Satya Tangirala
    Cc: Satya Tangirala
    Signed-off-by: Jens Axboe

    Eric Biggers
     

09 Sep, 2020

1 commit

  • If we hit the UINT_MAX limit of bio->bi_iter.bi_size, and are
    therefore not going to merge this page into this bio anyway, then it
    makes sense to also set same_page to false before returning.

    Without this patch, we hit the WARNING below in iomap. This mostly
    happens on systems with very large memory and/or after tweaking the
    vm dirty threshold params to delay writeback of dirty data.

    WARNING: CPU: 18 PID: 5130 at fs/iomap/buffered-io.c:74 iomap_page_release+0x120/0x150
    CPU: 18 PID: 5130 Comm: fio Kdump: loaded Tainted: G W 5.8.0-rc3 #6
    Call Trace:
    __remove_mapping+0x154/0x320 (unreliable)
    iomap_releasepage+0x80/0x180
    try_to_release_page+0x94/0xe0
    invalidate_inode_page+0xc8/0x110
    invalidate_mapping_pages+0x1dc/0x540
    generic_fadvise+0x3c8/0x450
    xfs_file_fadvise+0x2c/0xe0 [xfs]
    vfs_fadvise+0x3c/0x60
    ksys_fadvise64_64+0x68/0xe0
    sys_fadvise64+0x28/0x40
    system_call_exception+0xf8/0x1c0
    system_call_common+0xf0/0x278
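
    A sketch of the fix in __bio_try_merge_page() (surrounding merge
    bookkeeping assumed):

    if (page_is_mergeable(bv, page, len, off, same_page)) {
            if (bio->bi_iter.bi_size > UINT_MAX - len) {
                    /* Not merging anyway, so don't report same_page. */
                    *same_page = false;
                    return false;
            }
            bv->bv_len += len;
            bio->bi_iter.bi_size += len;
            return true;
    }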

    Fixes: cc90bc68422 ("block: fix "check bi_size overflow before merge"")
    Reported-by: Shivaprasad G Bhat
    Suggested-by: Christoph Hellwig
    Signed-off-by: Anju T Sudhakar
    Signed-off-by: Ritesh Harjani
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ritesh Harjani
     

18 Aug, 2020

1 commit

  • If we pass in an offset which is larger than PAGE_SIZE, then
    page_is_mergeable() thinks it's not mergeable with the previous bio_vec,
    leading to a large number of bio_vecs being used. Use a slightly more
    obvious test that the two pages are compatible with each other.
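
    A sketch of the new test, where bv_end stands for bv->bv_offset +
    bv->bv_len (exact context assumed):

    /* Mergeable only if both ranges resolve to the same page;
     * this also works for compound pages and offsets larger
     * than PAGE_SIZE. */
    return (bv->bv_page + bv_end / PAGE_SIZE) ==
           (page + off / PAGE_SIZE);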

    Fixes: 52d52d1c98a9 ("block: only allow contiguous page structs in a bio_vec")
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Matthew Wilcox (Oracle)
     

01 Aug, 2020

1 commit


01 Jul, 2020

1 commit


29 Jun, 2020

5 commits


24 Jun, 2020

1 commit

  • Make use of the struct_size() helper instead of an open-coded version
    in order to avoid any potential type mistakes.

    This code was detected with the help of Coccinelle, and audited and
    fixed manually.
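
    For illustration, assuming an allocation of a structure with a
    flexible array member (struct shape hypothetical):

    struct bio_map_data {
            struct iov_iter iter;
            struct iovec iov[];     /* flexible array member */
    };

    /* Open-coded size computation, prone to type mistakes: */
    bmd = kmalloc(sizeof(struct bio_map_data) +
                  sizeof(struct iovec) * nr_segs, gfp_mask);

    /* With the helper from <linux/overflow.h>, which also
     * guards the multiplication against overflow: */
    bmd = kmalloc(struct_size(bmd, iov, nr_segs), gfp_mask);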

    Signed-off-by: Gustavo A. R. Silva
    Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83
    Signed-off-by: Jens Axboe

    Gustavo A. R. Silva
     

05 Jun, 2020

1 commit

  • The status can be trivially derived from the bio itself. That also
    avoids callers like NVMe incorrectly passing a blk_status_t instead
    of the errno, and avoids the overhead of translating the blk_status_t
    to the errno in the I/O completion fast path when no tracing is
    enabled.
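
    Sketch of the idea: the tracepoint derives the errno from the bio
    itself, and only when tracing actually fires (tracepoint plumbing
    assumed):

    TP_fast_assign(
            __entry->dev   = bio_dev(bio);
            __entry->error = blk_status_to_errno(bio->bi_status);
    )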

    Fixes: 35fe0d12c8a3 ("nvme: trace bio completion")
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

27 May, 2020

2 commits


19 May, 2020

1 commit


14 May, 2020

1 commit

  • We must have some way of letting a storage device driver know what
    encryption context it should use for en/decrypting a request. However,
    it's the upper layers (like the filesystem/fscrypt) that know about
    and manage encryption contexts. As such, when the upper layer submits
    a bio
    to the block layer, and this bio eventually reaches a device driver with
    support for inline encryption, the device driver will need to have been
    told the encryption context for that bio.

    We want to communicate the encryption context from the upper layer to the
    storage device along with the bio, when the bio is submitted to the block
    layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
    represent an encryption context (note that we can't use the bi_private
    field in struct bio to do this because that field does not function to pass
    information across layers in the storage stack). We also introduce various
    functions to manipulate the bio_crypt_ctx and make the bio/request merging
    logic aware of the bio_crypt_ctx.

    We also make changes to blk-mq to make it handle bios with encryption
    contexts. blk-mq can merge many bios into the same request. These bios need
    to have contiguous data unit numbers (the necessary changes to blk-merge
    are also made to ensure this) - as such, it suffices to keep the data unit
    number of just the first bio, since that's all a storage driver needs to
    infer the data unit number to use for each data block in each bio in a
    request. blk-mq keeps track of the encryption context to be used for all
    the bios in a request with the request's rq_crypt_ctx. When the first bio
    is added to an empty request, blk-mq will program the encryption context
    of that bio into the request_queue's keyslot manager, and store the
    returned keyslot in the request's rq_crypt_ctx. All the functions to
    operate on encryption contexts are in blk-crypto.c.

    Upper layers only need to call bio_crypt_set_ctx with the encryption key,
    algorithm and data_unit_num; they don't have to worry about getting a
    keyslot for each encryption context, as blk-mq/blk-crypto handles that.
    Blk-crypto also makes it possible for request-based layered devices like
    dm-rq to make use of inline encryption hardware by cloning the
    rq_crypt_ctx and programming a keyslot in the new request_queue when
    necessary.

    Note that any user of the block layer can submit bios with an
    encryption context, such as filesystems, device-mapper targets, etc.
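
    A sketch of how an upper layer might attach a context (key setup
    omitted; first_dun is a placeholder):

    u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE] = { first_dun };

    /* Attach the encryption context; keyslot programming is
     * handled by blk-mq/blk-crypto when the bio reaches hardware
     * with inline encryption support. */
    bio_crypt_set_ctx(bio, key, dun, GFP_NOIO);
    submit_bio(bio);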

    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Satya Tangirala
     

13 May, 2020

3 commits

  • Export bio_release_pages and bio_iov_iter_get_pages, so they can be used
    from modular code.

    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     
  • Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
    block device. This is a no-merge write operation.

    A zone append write BIO must:
    * Target a zoned block device
    * Have a sector position indicating the start sector of the target zone
    * The target zone must be a sequential write zone
    * The BIO must not cross a zone boundary
    * The BIO must not be split, to ensure that a single range of LBAs
    is written with a single command.

    Implement these checks in generic_make_request_checks() using the
    helper function blk_check_zone_append(). To avoid write append BIO
    splitting, introduce the new max_zone_append_sectors queue limit
    attribute and ensure that a BIO size is always lower than this limit.
    Export this new limit through sysfs and check these limits in bio_full().

    Also, when a LLDD can't dispatch a request to a specific zone, it
    will return BLK_STS_ZONE_RESOURCE, indicating that this request needs
    to be delayed, e.g. because the zone it will be dispatched to is
    still write-locked. If this happens, set the request aside on a local
    list and continue trying to dispatch other requests, such as READ
    requests or WRITE/ZONE_APPEND requests targeting other zones. This
    way we can
    still keep a high queue depth without starving other requests even if
    one request can't be served due to zone write-locking.

    Finally, make sure that the bio sector position indicates the actual
    write position as indicated by the device on completion.
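
    A sketch of issuing a zone append write (allocation and error
    handling elided):

    bio = bio_alloc(GFP_KERNEL, nr_pages);
    bio_set_dev(bio, bdev);
    /* The sector position indicates the start of the target zone. */
    bio->bi_iter.bi_sector = zone_start_sector;
    bio->bi_opf = REQ_OP_ZONE_APPEND;

    ret = submit_bio_wait(bio);

    /* On completion, bi_sector is updated to the position the
     * device actually wrote the data to. */
    written_sector = bio->bi_iter.bi_sector;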

    Signed-off-by: Keith Busch
    [ jth: added zone-append specific add_page and merge_page helpers ]
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • Rename __bio_add_pc_page() to bio_add_hw_page() and explicitly pass in a
    max_sectors argument.

    This max_sectors argument can be used to specify constraints from the
    hardware.
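
    For reference, the renamed helper's signature would look roughly
    like:

    int bio_add_hw_page(struct request_queue *q, struct bio *bio,
                        struct page *page, unsigned int len,
                        unsigned int offset, unsigned int max_sectors,
                        bool *same_page);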

    Signed-off-by: Christoph Hellwig
    [ jth: rebased and made public for blk-map.c ]
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Daniel Wagner
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

28 Mar, 2020

1 commit

  • The bio_map_* helpers are just the low-level helpers for the
    blk_rq_map_* APIs. Move them together for better logical grouping,
    as there isn't much overlap with other code in bio.c.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

25 Mar, 2020

3 commits

  • This is bio layer functionality and not related to buffer heads.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Column "time_in_queue" in diskstats is supposed to show the total
    waiting time of all requests, i.e. the value should be equal to the
    sum of times from the other columns. But this is not true, because
    "time_in_queue" is counted separately in jiffies rather than in
    nanoseconds like the other times.

    This patch removes the redundant counter for "time_in_queue" and
    shows the total time of read, write, discard and flush requests.
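
    Roughly, the column can then be derived from the existing
    nanosecond counters (field names assumed):

    /* time_in_queue, in milliseconds, as the sum of all types: */
    div_u64(stat.nsecs[STAT_READ] + stat.nsecs[STAT_WRITE] +
            stat.nsecs[STAT_DISCARD] + stat.nsecs[STAT_FLUSH],
            NSEC_PER_MSEC)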

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     
  • Currently io_ticks is approximated by adding one at each start and
    end of a request if the jiffies counter has changed. This works
    perfectly for requests shorter than a jiffy, or if one request
    starts/ends in each jiffy.

    If the disk executes just one request at a time and the requests are
    longer than two jiffies, then only the first and last jiffies will be
    accounted.

    The fix is simple: at the end of a request, add to io_ticks the
    jiffies passed since the last update, rather than just one jiffy.
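
    A sketch of the fixed accounting (function shape assumed):

    static void update_io_ticks(struct hd_struct *part,
                                unsigned long now, bool end)
    {
            unsigned long stamp = READ_ONCE(part->stamp);

            if (unlikely(stamp != now)) {
                    if (likely(cmpxchg(&part->stamp, stamp, now) == stamp))
                            /* On request end, add all jiffies passed
                             * since the last update, not just one. */
                            __part_stat_add(part, io_ticks,
                                            end ? now - stamp : 1);
            }
    }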

    Example: a common HDD executes random 4k read requests in around 12ms.

    fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
    iostat -x 10 sdb

    Note the change in iostat's "%util" from 8,43% to 99,99% before/after
    the patch:

    Before:

    Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
    sdb 0,00 0,00 82,60 0,00 330,40 0,00 8,00 0,96 12,09 12,09 0,00 1,02 8,43

    After:

    Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
    sdb 0,00 0,00 82,50 0,00 330,00 0,00 8,00 1,00 12,10 12,10 0,00 12,12 99,99

    Now io_ticks does not lose time between the start and end of
    requests, but for queue-depth > 1 some I/O time between adjacent
    starts might be lost.

    For load estimation, "%util" is not as useful as the average queue
    length, but it clearly shows how often the disk queue is completely
    empty.

    Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     

24 Mar, 2020

1 commit


18 Mar, 2020

1 commit

  • submit_bio_wait() can be called from ioctl(BLKSECDISCARD), which may
    take a long time to complete; as Salman mentioned, a 4K BLKSECDISCARD
    takes up to 100 seconds on some devices. Also, any block I/O
    operation that occurs after the BLKSECDISCARD is submitted will
    potentially be affected by the hung task timeouts.

    Another report is that a task hang can be observed when running mkfs
    over raid10, which has a small max discard sectors limit because of
    its chunk size.

    So prevent hung_check from firing by taking the same approach used in
    blk_execute_rq(): the wake-up interval is set to half the hung_check
    timer period, which keeps the overhead low enough.
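
    A sketch of the wait loop in submit_bio_wait() (completion setup
    assumed):

    unsigned long hang_check = sysctl_hung_task_timeout_secs;

    /* Prevent hung task checks from firing during very long I/O. */
    if (hang_check)
            while (!wait_for_completion_io_timeout(&done,
                            hang_check * (HZ / 2)))
                    ;
    else
            wait_for_completion_io(&done);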

    Cc: Salman Qazi
    Cc: Jesse Barnes
    Cc: Bart Van Assche
    Link: https://lkml.org/lkml/2020/2/12/1193
    Reported-by: Salman Qazi
    Reviewed-by: Jesse Barnes
    Reviewed-by: Salman Qazi
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

09 Jan, 2020

1 commit

  • Commit 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod")
    adds bio_truncate() for handling bio EOD. However, bio_truncate()
    doesn't use the 'op' parameter passed in by guard_bio_eod's callers.

    So bio_truncate() may retrieve the wrong 'op', and the zeroing of
    pages may not be done for READ bios.

    Fix this issue by moving guard_bio_eod() after bio_set_op_attrs() in
    submit_bh_wbc() so that bio_truncate() can always retrieve the
    correct op info.

    Meanwhile, remove the 'op' parameter from guard_bio_eod() because it
    isn't used any more.
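
    A sketch of the reordering in submit_bh_wbc() (context assumed):

    /* Set the op first so bio_truncate() can see it via bio_op()... */
    bio_set_op_attrs(bio, op, op_flags);

    /* ...then take care of bh's that straddle the end of the device. */
    guard_bio_eod(bio);

    submit_bio(bio);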

    Cc: Carlos Maiolino
    Cc: linux-fsdevel@vger.kernel.org
    Fixes: 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod")
    Signed-off-by: Ming Lei

    Fold in kerneldoc and bio_op() change.

    Signed-off-by: Jens Axboe

    Ming Lei
     

29 Dec, 2019

1 commit

  • Some filesystems, such as vfat, may send a bio which crosses the
    device boundary, and worse, an IO request starting within device
    boundaries can contain more than one segment past EOD.

    Commit dce30ca9e3b6 ("fs: fix guard_bio_eod to check for real EOD
    errors") tries to fix this issue by returning -EIO for this
    situation. However, that makes fs code lose the chance to handle -EIO
    itself, and then sync_inodes_sb() may hang forever.

    Also, the current truncation of the last segment is dangerous because
    it updates the last bvec: the bvec table is then no longer immutable,
    and fs bio users may not retrieve the truncated pages via
    bio_for_each_segment_all() in their .end_io callback.

    Fix this issue by supporting multi-segment truncation. The approach
    is simpler, as shown in the sketch after this list:

    - just update the bio size, since the block layer can build correct
    bvecs from the updated bio size. The bvec table then becomes truly
    immutable.

    - zero all truncated segments for read bios
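
    A sketch of this multi-segment truncation (close to the approach
    described above):

    void bio_truncate(struct bio *bio, unsigned new_size)
    {
            struct bio_vec bv;
            struct bvec_iter iter;
            unsigned int done = 0;

            if (new_size >= bio->bi_iter.bi_size)
                    return;

            if (bio_op(bio) == REQ_OP_READ) {
                    /* Zero every fully or partially truncated segment. */
                    bio_for_each_segment(bv, bio, iter) {
                            if (done + bv.bv_len > new_size) {
                                    unsigned offset = done < new_size ?
                                                      new_size - done : 0;

                                    zero_user(bv.bv_page,
                                              bv.bv_offset + offset,
                                              bv.bv_len - offset);
                            }
                            done += bv.bv_len;
                    }
            }

            /* Just shrink the size; the bvec table stays untouched. */
            bio->bi_iter.bi_size = new_size;
    }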

    Cc: Carlos Maiolino
    Cc: linux-fsdevel@vger.kernel.org
    Fixes: dce30ca9e3b6 ("fs: fix guard_bio_eod to check for real EOD errors")
    Reported-by: syzbot+2b9e54155c8c25d8d165@syzkaller.appspotmail.com
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

10 Dec, 2019

1 commit

  • This partially reverts commit e3a5d8e386c3fb973fa75f2403622a8f3640ec06.

    Commit e3a5d8e386c3 ("check bi_size overflow before merge") adds a bio_full
    check to __bio_try_merge_page. This will cause __bio_try_merge_page to fail
    when the last bi_io_vec has been reached. Instead, what we want here is only
    the bi_size overflow check.
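
    The intended check, sketched (inside __bio_try_merge_page(), after
    page_is_mergeable() succeeds):

    /* Only guard bi_size against overflow; do not fail merges
     * just because the bvec table is full. */
    if (bio->bi_iter.bi_size > UINT_MAX - len)
            return false;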

    Fixes: e3a5d8e386c3 ("block: check bi_size overflow before merge")
    Cc: stable@vger.kernel.org # v5.4+
    Reviewed-by: Ming Lei
    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Jens Axboe

    Andreas Gruenbacher
     

06 Dec, 2019

1 commit

  • 7c20f11680a4 ("bio-integrity: stop abusing bi_end_io") moves
    bio_integrity_free from bio_uninit() to bio_integrity_verify_fn()
    and bio_endio(). This looks wrong because a bio may be freed without
    bio_endio() being called; for example, blk_rq_unprep_clone() is
    called from dm_mq_queue_rq() when the underlying queue of dm-mpath
    is busy.

    So commit 7c20f11680a4 causes a memory leak of bio integrity data.

    Fix this issue by re-adding bio_integrity_free() to bio_uninit().
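
    A sketch of the re-added call (function context assumed):

    void bio_uninit(struct bio *bio)
    {
            bio_disassociate_blkg(bio);

            /* Free integrity data even when the bio is freed without
             * going through bio_endio(), e.g. via blk_rq_unprep_clone(). */
            if (bio_integrity(bio))
                    bio_integrity_free(bio);
    }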

    Fixes: 7c20f11680a4 ("bio-integrity: stop abusing bi_end_io")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Justin Tee

    Add commit log, and simplify/fix the original patch written by Justin.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Justin Tee
     

12 Nov, 2019

1 commit

  • __bio_try_merge_page() may merge a page into a bio without a
    bio_full() check, causing bi_size to overflow.

    The overflow typically ends up with sd_init_command() warning on a
    zero-segment request, with a call trace like this:

    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 1986 at drivers/scsi/scsi_lib.c:1025 scsi_init_io+0x156/0x180
    CPU: 2 PID: 1986 Comm: kworker/2:1H Kdump: loaded Not tainted 5.4.0-rc7 #1
    Workqueue: kblockd blk_mq_run_work_fn
    RIP: 0010:scsi_init_io+0x156/0x180
    RSP: 0018:ffffa11487663bf0 EFLAGS: 00010246
    RAX: 00000000002be0a0 RBX: ffff8e6e9ff30118 RCX: 0000000000000000
    RDX: 00000000ffffffe1 RSI: 0000000000000000 RDI: ffff8e6e9ff30118
    RBP: ffffa11487663c18 R08: ffffa11487663d28 R09: ffff8e6e9ff30150
    R10: 0000000000000001 R11: 0000000000000000 R12: ffff8e6e9ff30000
    R13: 0000000000000001 R14: ffff8e74a1cf1800 R15: ffff8e6e9ff30000
    FS: 0000000000000000(0000) GS:ffff8e6ea7680000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fff18cf0fe8 CR3: 0000000659f0a001 CR4: 00000000001606e0
    Call Trace:
    sd_init_command+0x326/0xb40 [sd_mod]
    scsi_queue_rq+0x502/0xaa0
    ? blk_mq_get_driver_tag+0xe7/0x120
    blk_mq_dispatch_rq_list+0x256/0x5a0
    ? elv_rb_del+0x24/0x30
    ? deadline_remove_request+0x7b/0xc0
    blk_mq_do_dispatch_sched+0xa3/0x140
    blk_mq_sched_dispatch_requests+0xfb/0x170
    __blk_mq_run_hw_queue+0x81/0x130
    blk_mq_run_work_fn+0x1b/0x20
    process_one_work+0x179/0x390
    worker_thread+0x4f/0x3e0
    kthread+0x105/0x140
    ? max_active_store+0x80/0x80
    ? kthread_bind+0x20/0x20
    ret_from_fork+0x35/0x40
    ---[ end trace f9036abf5af4a4d3 ]---
    blk_update_request: I/O error, dev sdd, sector 2875552 op 0x1:(WRITE) flags 0x0 phys_seg 0 prio class 0
    XFS (sdd1): writeback error on sector 2875552

    __bio_try_merge_page() should check for the overflow before actually
    doing the merge.

    Fixes: 07173c3ec276c ("block: enable multipage bvecs")
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Jens Axboe

    Junichi Nomura
     

22 Aug, 2019

3 commits


14 Aug, 2019

1 commit

  • psi tracks the time tasks wait for refaulting pages to become
    uptodate, but it does not track the time spent submitting the IO. The
    submission part can be significant if backing storage is contended or
    when cgroup throttling (io.latency) is in effect - a lot of time is
    spent in submit_bio(). In that case, we underreport memory pressure.

    Annotate submit_bio() to account submission time as memory stall when
    the bio is reading userspace workingset pages.
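
    A sketch of the annotation (assuming a BIO_WORKINGSET flag set when
    workingset pages are added to the bio):

    if (unlikely(bio_op(bio) == REQ_OP_READ &&
                 bio_flagged(bio, BIO_WORKINGSET))) {
            unsigned long pflags;
            blk_qc_t ret;

            /* Count submission time as memory stall. */
            psi_memstall_enter(&pflags);
            ret = generic_make_request(bio);
            psi_memstall_leave(&pflags);

            return ret;
    }

    return generic_make_request(bio);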

    Tested-by: Suren Baghdasaryan
    Signed-off-by: Johannes Weiner
    Signed-off-by: Jens Axboe

    Johannes Weiner
     

06 Aug, 2019

1 commit


05 Aug, 2019

1 commit