17 Jul, 2020

1 commit

  • This patch improves discard bio split for address and size alignment in
    __blkdev_issue_discard(). Aligned discard bios may help the underlying
    device controller perform better discard and internal garbage
    collection, and avoid unnecessary internal fragmentation.

    The current discard bio split algorithm in __blkdev_issue_discard() may
    leave non-discarded fragments on the device, even when the discard
    bio's LBA and size are both aligned to the device's discard granularity.

    Here are the example steps to reproduce the above problem.
    - On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
    with thin mode and give it to a Linux virtual machine.
    - Inside the Linux virtual machine, if the 51GB virtual disk shows up as
    /dev/sdb, fill data into the first 50GB by,
    # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
    - Discard the 50GB range from offset 0 on /dev/sdb,
    # blkdiscard /dev/sdb -o 0 -l 53687091200
    - Observe the underlying mapping status of the device
    # sg_get_lba_status /dev/sdb -m 1048 --lba=0
    descriptor LBA: 0x0000000000000000 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000000000800 blocks: 16773120 deallocated
    descriptor LBA: 0x0000000000fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000001000000 blocks: 8386560 deallocated
    descriptor LBA: 0x00000000017ff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000001800000 blocks: 8386560 deallocated
    descriptor LBA: 0x0000000001fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000002000000 blocks: 8386560 deallocated
    descriptor LBA: 0x00000000027ff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000002800000 blocks: 8386560 deallocated
    descriptor LBA: 0x0000000002fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000003000000 blocks: 8386560 deallocated
    descriptor LBA: 0x00000000037ff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000003800000 blocks: 8386560 deallocated
    descriptor LBA: 0x0000000003fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000004000000 blocks: 8386560 deallocated
    descriptor LBA: 0x00000000047ff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000004800000 blocks: 8386560 deallocated
    descriptor LBA: 0x0000000004fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000005000000 blocks: 8386560 deallocated
    descriptor LBA: 0x00000000057ff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000005800000 blocks: 8386560 deallocated
    descriptor LBA: 0x0000000005fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000006000000 blocks: 6291456 deallocated
    descriptor LBA: 0x0000000006600000 blocks: 0 deallocated

    Although the discard bio starts at LBA 0 and has a size of 50<<30
    bytes, both perfectly aligned to the discard granularity, the above
    list shows many unexpected still-mapped 1MB (2048 sectors) internal
    fragments.

    The problem is in __blkdev_issue_discard(): an improper split
    algorithm produces bio sizes which are not aligned to the discard
    granularity.

    25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
    26 sector_t nr_sects, gfp_t gfp_mask, int flags,
    27 struct bio **biop)
    28 {
    29 struct request_queue *q = bdev_get_queue(bdev);
    [snipped]
    57 while (nr_sects) {
    58 sector_t req_sects = min(nr_sects,
    59 bio_allowed_max_sectors(q));
    60
    61 BUG_ON((req_sects << 9) > UINT_MAX);
    62
    63 bio = blk_next_bio(bio, 0, gfp_mask);
    64 bio->bi_iter.bi_sector = sector;
    65 bio_set_dev(bio, bdev);
    66 bio_set_op_attrs(bio, op, 0);
    67
    68 bio->bi_iter.bi_size = req_sects << 9;
    69 sector += req_sects;
    70 nr_sects -= req_sects;
    [snipped]
    79 }
    80
    81 *biop = bio;
    82 return 0;
    83 }
    84 EXPORT_SYMBOL(__blkdev_issue_discard);

    At lines 58-59, to discard a 50GB range, req_sects is set to the return
    value of bio_allowed_max_sectors(q), which is 8388607 sectors. In the
    above case the discard granularity is 2048 sectors; although the start
    LBA and discard length are aligned to the discard granularity,
    req_sects never has a chance to be aligned to it. This is why
    still-mapped 2048-sector fragments are left in every 4GB or 8GB range.

    If req_sects at line 58 is set to a value aligned to discard_granularity
    and close to UINT_MAX, then all subsequent split bios inside the device
    driver are (almost) aligned to the discard_granularity of the device
    queue, and the still-mapped 2048-sector fragments disappear.

    This patch introduces bio_aligned_discard_max_sectors() to return the
    value which is aligned to q->limits.discard_granularity and closest to
    UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with this
    new routine to decide a more proper split bio length.

    But we still need to handle the situation where the discard start LBA
    is not aligned to q->limits.discard_granularity; otherwise, even when
    the length is aligned, the current code may still leave a 2048-sector
    fragment around every 4GB range. Therefore, to calculate req_sects, the
    start LBA of the discard range is checked first (including the
    partition offset); if it is not aligned to the discard granularity, the
    first split location is chosen so that the following bio has its
    bi_sector aligned to the discard granularity. Then there won't be
    still-mapped fragments in the middle of the discard range.
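
    A minimal user-space sketch of the split-size logic described above
    (illustrative only, not the kernel patch itself; here the granularity
    and start LBA are in sectors, and the start LBA already includes the
    partition offset):

        #include <stdint.h>
        #include <limits.h>

        /*
         * Largest sector count that fits in bio->bi_iter.bi_size (a
         * 32-bit byte count) and is aligned to the discard granularity.
         * For a 2048-sector granularity: (UINT_MAX >> 9) = 8388607 and
         * rounding down gives 8386560, which matches the size of the
         * deallocated extents in the listing above.
         */
        static uint64_t aligned_discard_max_sectors(uint64_t granularity)
        {
                return ((UINT_MAX >> 9) / granularity) * granularity;
        }

        /* Split size for the next discard bio in the loop. */
        static uint64_t next_req_sects(uint64_t start, uint64_t nr_sects,
                                       uint64_t granularity)
        {
                uint64_t max_sects = aligned_discard_max_sectors(granularity);
                uint64_t req = nr_sects < max_sects ? nr_sects : max_sects;
                uint64_t offset = start % granularity;

                /*
                 * Misaligned start: end this bio at the next granularity
                 * boundary so every following bio starts aligned.
                 */
                if (offset && granularity - offset < req)
                        req = granularity - offset;
                return req;
        }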

    The above is how this patch improves discard bio alignment in
    __blkdev_issue_discard(). Now with this patch, after a discard with the
    same command line mentioned previously, sg_get_lba_status returns,
    descriptor LBA: 0x0000000000000000 blocks: 106954752 deallocated
    descriptor LBA: 0x0000000006600000 blocks: 0 deallocated

    We can see there are no 2048-sector fragments anymore; everything is
    clean.

    Reported-and-tested-by: Acshai Manoj
    Signed-off-by: Coly Li
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ming Lei
    Reviewed-by: Xiao Ni
    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Enzo Matsumiya
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Coly Li
     

02 Jul, 2020

1 commit

  • This reverts the following commits:

    37f4a24c2469a10a4c16c641671bd766e276cf9f
    723bf178f158abd1ce6069cb049581b3cb003aab
    36a3df5a4574d5ddf59804fcd0c4e9654c514d9a

    The last one is the culprit, but we have to go a bit deeper to get this
    to revert cleanly. There's been a report that this breaks some MMC
    setups [1], and also causes an issue with swap [2]. Until this can be
    figured out, revert the offending commits.

    [1] https://lore.kernel.org/linux-block/57fb09b1-54ba-f3aa-f82c-d709b0e6b281@samsung.com/
    [2] https://lore.kernel.org/linux-block/20200702043721.GA1087@lca.pw/

    Reported-by: Marek Szyprowski
    Reported-by: Qian Cai
    Signed-off-by: Jens Axboe

    Jens Axboe
     

29 Jun, 2020

1 commit

  • blkcg_bio_issue_check is a giant inline function that does three
    entirely different things. Factor out the blk-cgroup related bio
    initialization into a new helper, and open code the sequence in the
    only caller, relying on the fact that all the actual functionality is
    stubbed out for non-cgroup builds.

    Acked-by: Tejun Heo
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Jun, 2020

2 commits

  • We were only creating the request_queue debugfs_dir for
    make_request block drivers (multiqueue), but never for
    request-based block drivers. We did this as we were only
    creating non-blktrace additional debugfs files on that directory
    for make_request drivers. However, since blktrace *always* creates
    that directory anyway, we special-cased the use of that directory
    for blktrace. Other than being an eyesore, this exposes
    request-based block drivers to the same fragile debugfs race that
    used to exist with make_request block drivers: if we start adding
    files onto that directory, we can later race into a double removal
    of dentries on the directory if we don't deal with this carefully
    in blktrace.

    Instead, just simplify things by always creating the request_queue
    debugfs_dir on request_queue registration. Rename the mutex also to
    reflect the fact that this is used outside of the blktrace context.
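
    A hedged sketch of the registration-time setup described above
    (assumptions: the renamed lock is called debugfs_mutex here, and
    blk_debugfs_root is the block layer's top-level debugfs directory):

        mutex_lock(&q->debugfs_mutex);
        q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
                                            blk_debugfs_root);
        mutex_unlock(&q->debugfs_mutex);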

    Signed-off-by: Luis Chamberlain
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Luis Chamberlain
     
  • Move the call to blk_should_fake_timeout out of blk_mq_complete_request
    and into the drivers, skipping call sites that are obvious error
    handlers, and remove the now superfluous blk_mq_force_complete_rq helper.
    This ensures we don't keep injecting errors into completions that just
    terminate the Linux request after the hardware has been reset or the
    command has been aborted.
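
    A hedged sketch of the resulting driver-side completion pattern (the
    surrounding function is illustrative, not from any particular driver):

        static void mydrv_complete_rq(struct request *rq)
        {
                /*
                 * Fault injection may ask us to pretend this completion
                 * was lost, so that the timeout path gets exercised.
                 */
                if (blk_should_fake_timeout(rq->q))
                        return;
                blk_mq_complete_request(rq);
        }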

    Reviewed-by: Daniel Wagner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

05 Jun, 2020

1 commit

  • For optimized block readers not holding a mutex, the "number of sectors"
    64-bit value is protected from tearing on 32-bit architectures by a
    sequence counter.

    Disable preemption before entering that sequence counter's write side
    critical section. Otherwise, the read side can preempt the write side
    section and spin for the entire scheduler tick. If the reader belongs to
    a real-time scheduling class, it can spin forever and the kernel will
    livelock.
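
    A hedged sketch of the fixed write-side pattern (the field and
    seqcount names are illustrative, not the exact patch):

        preempt_disable();
        write_seqcount_begin(&part->nr_sects_seq);
        part->nr_sects = new_nr_sects;    /* 64-bit store, no tearing */
        write_seqcount_end(&part->nr_sects_seq);
        preempt_enable();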

    Fixes: c83f6bf98dc1 ("block: add partition resize function to blkpg ioctl")
    Cc:
    Signed-off-by: Ahmed S. Darwish
    Reviewed-by: Sebastian Andrzej Siewior
    Signed-off-by: Jens Axboe

    Ahmed S. Darwish
     


19 May, 2020

4 commits

  • The flush_queue_delayed flag was introduced to hold the queue if a
    flush is running for a non-queueable flush drive by commit 3ac0cc450870
    ("hold queue if flush is running for non-queueable flush drive"). But
    the non-mq parts of the flush code were removed by commit 7e992f847a08
    ("block: remove non mq parts from the flush code"), along with the last
    usage of the flush_queue_delayed flag. Thus remove the now-unused
    flush_queue_delayed flag.

    Signed-off-by: Baolin Wang
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Baolin Wang
     
  • part_inc_in_flight and part_dec_in_flight only have one caller each, and
    those callers are purely for bio based drivers. Merge each function into
    its only caller, and remove the superfluous blk-mq checks.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • blk_mq_make_request currently needs to grab a q_usage_counter
    reference when allocating a request. This is because the block layer
    grabs one before calling blk_mq_make_request, but also releases it as
    soon as blk_mq_make_request returns. Remove the blk_queue_exit call
    after blk_mq_make_request returns, and instead let it consume the
    reference. This works perfectly fine for the block layer caller; just
    device mapper needs an extra reference, as the old problem still
    persists there. Open code blk_queue_enter_live in device mapper,
    as there should be no other callers and this allows better documenting
    why we do a non-try get.
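
    A hedged sketch of the open-coded blk_queue_enter_live in device
    mapper (the enclosing context is illustrative): since a reference is
    already held while the queue is live, a plain non-try get is safe:

        /*
         * blk_queue_enter_live() open-coded: the queue cannot go away
         * here, so a non-try percpu reference get is safe.
         */
        percpu_ref_get(&q->q_usage_counter);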

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

14 May, 2020

1 commit

  • We must have some way of letting a storage device driver know what
    encryption context it should use for en/decrypting a request. However,
    it's the upper layers (like the filesystem/fscrypt) that know about and
    manage encryption contexts. As such, when the upper layer submits a bio
    to the block layer, and this bio eventually reaches a device driver with
    support for inline encryption, the device driver will need to have been
    told the encryption context for that bio.

    We want to communicate the encryption context from the upper layer to the
    storage device along with the bio, when the bio is submitted to the block
    layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
    represent an encryption context (note that we can't use the bi_private
    field in struct bio to do this because that field does not function to pass
    information across layers in the storage stack). We also introduce various
    functions to manipulate the bio_crypt_ctx and make the bio/request merging
    logic aware of the bio_crypt_ctx.

    We also make changes to blk-mq to make it handle bios with encryption
    contexts. blk-mq can merge many bios into the same request. These bios need
    to have contiguous data unit numbers (the necessary changes to blk-merge
    are also made to ensure this) - as such, it suffices to keep the data unit
    number of just the first bio, since that's all a storage driver needs to
    infer the data unit number to use for each data block in each bio in a
    request. blk-mq keeps track of the encryption context to be used for all
    the bios in a request with the request's rq_crypt_ctx. When the first bio
    is added to an empty request, blk-mq will program the encryption context
    of that bio into the request_queue's keyslot manager, and store the
    returned keyslot in the request's rq_crypt_ctx. All the functions to
    operate on encryption contexts are in blk-crypto.c.

    Upper layers only need to call bio_crypt_set_ctx with the encryption key,
    algorithm and data_unit_num; they don't have to worry about getting a
    keyslot for each encryption context, as blk-mq/blk-crypto handles that.
    Blk-crypto also makes it possible for request-based layered devices like
    dm-rq to make use of inline encryption hardware by cloning the
    rq_crypt_ctx and programming a keyslot in the new request_queue when
    necessary.

    Note that any user of the block layer can submit bios with an
    encryption context, such as filesystems, device-mapper targets, etc.
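
    A hedged sketch of the upper-layer usage described above (the exact
    blk_crypto_init_key() and bio_crypt_set_ctx() signatures shown here
    are assumptions for illustration):

        struct blk_crypto_key key;
        u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE] = { first_dun };

        /* raw_key is derived by the upper layer, e.g. by fscrypt. */
        blk_crypto_init_key(&key, raw_key, BLK_ENCRYPTION_MODE_AES_256_XTS,
                            dun_bytes, data_unit_size);
        bio_crypt_set_ctx(bio, &key, dun, GFP_NOIO);
        submit_bio(bio);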

    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Satya Tangirala
     

13 May, 2020

2 commits

  • Rename __bio_add_pc_page() to bio_add_hw_page() and explicitly pass in a
    max_sectors argument.

    This max_sectors argument can be used to specify constraints from the
    hardware.
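
    The new helper's shape, as a hedged sketch (parameter names are
    approximate):

        int bio_add_hw_page(struct request_queue *q, struct bio *bio,
                            struct page *page, unsigned int len,
                            unsigned int offset, unsigned int max_sectors,
                            bool *same_page);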

    Signed-off-by: Christoph Hellwig
    [ jth: rebased and made public for blk-map.c ]
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Daniel Wagner
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The gendisk can't go away while there is IO activity, so there is no
    need to hold part0's refcount in the IO path.

    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Yufen Yu
    Cc: Christoph Hellwig
    Cc: Hou Tao
    Signed-off-by: Jens Axboe

    Ming Lei
     


28 Mar, 2020

2 commits

  • The bio_map_* helpers are just the low-level helpers for the
    blk_rq_map_* APIs. Move them together for better logical grouping,
    as there isn't much overlap with other code in bio.c.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Current make_request based drivers use either blk_alloc_queue_node or
    blk_alloc_queue to allocate a queue, and then set up the make_request_fn
    function pointer and a few parameters using the blk_queue_make_request
    helper. Simplify this by passing the make_request pointer to
    blk_alloc_queue, and while at it merge the _node variant into the main
    helper by always passing a node_id, and remove the superfluous gfp_mask
    parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case.
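
    A hedged before/after sketch of the allocation API change (the driver
    callback name is illustrative):

        /* before */
        q = blk_alloc_queue(GFP_KERNEL);
        blk_queue_make_request(q, mydrv_make_request);

        /* after */
        q = blk_alloc_queue(mydrv_make_request, NUMA_NO_NODE);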

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     


21 Dec, 2019

1 commit

  • Avoid the following false positive lockdep complaint, which is
    triggered by running test nvme/012 from the blktests suite:

    ============================================
    WARNING: possible recursive locking detected
    5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Not tainted
    --------------------------------------------
    ksoftirqd/1/16 is trying to acquire lock:
    000000000282032e (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    but task is already holding lock:
    00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&(&fq->mq_flush_lock)->rlock);
    lock(&(&fq->mq_flush_lock)->rlock);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    1 lock held by ksoftirqd/1/16:
    #0: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    stack backtrace:
    CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    dump_stack+0x67/0x90
    __lock_acquire.cold.45+0x2b4/0x313
    lock_acquire+0x98/0x160
    _raw_spin_lock_irqsave+0x3b/0x80
    flush_end_io+0x4e/0x1d0
    blk_mq_complete_request+0x76/0x110
    nvmet_req_complete+0x15/0x110 [nvmet]
    nvmet_bio_done+0x27/0x50 [nvmet]
    blk_update_request+0xd7/0x2d0
    blk_mq_end_request+0x1a/0x100
    blk_flush_complete_seq+0xe5/0x350
    flush_end_io+0x12f/0x1d0
    blk_done_softirq+0x9f/0xd0
    __do_softirq+0xca/0x440
    run_ksoftirqd+0x24/0x50
    smpboot_thread_fn+0x113/0x1e0
    kthread+0x121/0x140
    ret_from_fork+0x3a/0x50
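
    A hedged sketch of the usual cure for this kind of false positive
    (assumed here from the flush queue fields): give each flush queue's
    lock its own lockdep class, so nested acquisition of two different
    fq locks is no longer reported as recursion on one class:

        lockdep_register_key(&fq->key);
        lockdep_set_class(&fq->mq_flush_lock, &fq->key);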

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

06 Dec, 2019

1 commit

  • 7c20f11680a4 ("bio-integrity: stop abusing bi_end_io") moved
    bio_integrity_free from bio_uninit() to bio_integrity_verify_fn()
    and bio_endio(). This is wrong because a bio may be freed without
    bio_endio() ever being called; for example, blk_rq_unprep_clone() is
    called from dm_mq_queue_rq() when the underlying queue of dm-mpath
    is busy.

    So commit 7c20f11680a4 causes a memory leak of bio integrity data.

    Fix this issue by re-adding bio_integrity_free() to bio_uninit().
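
    A hedged sketch of the fix described above (the rest of the teardown
    is elided):

        void bio_uninit(struct bio *bio)
        {
                /* ... other teardown ... */
                if (bio_integrity(bio))
                        bio_integrity_free(bio);
        }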

    Fixes: 7c20f11680a4 ("bio-integrity: stop abusing bi_end_io")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Justin Tee

    Add commit log, and simplify/fix the original patch written by Justin.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Justin Tee