05 Dec, 2020

1 commit

  • Commit 882ec4e609c1 ("dm table: stack 'chunk_sectors' limit to account
    for target-specific splitting") caused a couple of regressions:
    1) Using lcm_not_zero() when stacking chunk_sectors was a bug, because
    chunk_sectors must reflect the most limited of all devices in the
    IO stack.
    2) DM targets that set max_io_len but do _not_ provide an
    .iterate_devices method no longer had their IO split properly.

    And commit 5091cdec56fa ("dm: change max_io_len() to use
    blk_max_size_offset()") also caused a regression where DM no longer
    supported varied (per-target) IO splitting. The implication is
    potentially severely reduced performance for IO stacks that use a DM
    target like dm-cache to hide the performance limitations of a slower
    device (e.g. one that requires 4K IO splitting).

    Coming full circle: fix all these issues by no longer stacking
    chunk_sectors via ti->max_io_len in dm_calculate_queue_limits(),
    adding an optional chunk_sectors override argument to
    blk_max_size_offset(), and updating DM's max_io_len() to pass
    ti->max_io_len to its blk_max_size_offset() call.

    Passing in an optional chunk_sectors override to blk_max_size_offset()
    allows for code reuse of block's centralized calculation for max IO
    size based on provided offset and split boundary.

    Fixes: 882ec4e609c1 ("dm table: stack 'chunk_sectors' limit to account for target-specific splitting")
    Fixes: 5091cdec56fa ("dm: change max_io_len() to use blk_max_size_offset()")
    Cc: stable@vger.kernel.org
    Reported-by: John Dorminy
    Reported-by: Bruce Johnston
    Reported-by: Kirill Tkhai
    Reviewed-by: John Dorminy
    Signed-off-by: Mike Snitzer
    Reviewed-by: Jens Axboe

    Mike Snitzer
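
    As a minimal sketch of that centralized calculation (a userspace toy,
    not the kernel's blk_max_size_offset() itself; names and values are
    illustrative, assuming a power-of-2 chunk_sectors):

    #include <stdio.h>

    /* Max IO size at a given offset: the distance to the next chunk
     * boundary, capped by max_sectors. All values in 512-byte sectors. */
    static unsigned int max_io_sectors(unsigned long long offset,
                                       unsigned int chunk_sectors,
                                       unsigned int max_sectors)
    {
        unsigned int left = chunk_sectors - (offset & (chunk_sectors - 1));

        return left < max_sectors ? left : max_sectors;
    }

    int main(void)
    {
        /* e.g. a target that requires 4K (8-sector) IO splitting */
        printf("%u\n", max_io_sectors(5, 8, 256)); /* 3: stop at sector 8 */
        printf("%u\n", max_io_sectors(8, 8, 256)); /* 8: a full chunk */
        return 0;
    }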
     

22 Aug, 2020

1 commit

  • A previous commit aligning splits to physical block sizes inadvertently
    modified one return case such that it now returns 0-length splits when
    the number of sectors doesn't exceed the physical offset. This later
    hits a BUG in bio_split(). Restore the previous working behavior.

    Fixes: 9cc5169cd478b ("block: Improve physical block alignment of split bios")
    Reported-by: Eric Deal
    Signed-off-by: Keith Busch
    Cc: Bart Van Assche
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Keith Busch
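
    The failure mode, as a hedged standalone sketch (illustrative names;
    all values in 512-byte sectors): rounding the allowed length down to a
    physical block boundary can reach zero when the IO doesn't extend past
    the first boundary, so the unaligned length must be returned instead.

    #include <stdio.h>

    /* Pick a split length that ends on a physical block boundary, but
     * fall back to the unaligned length rather than returning 0. */
    static unsigned int split_sectors(unsigned int offset,
                                      unsigned int max_sectors,
                                      unsigned int pbs_sectors)
    {
        unsigned int end = offset + max_sectors;
        unsigned int aligned_end = end & ~(pbs_sectors - 1);

        if (aligned_end > offset)
            return aligned_end - offset;
        return max_sectors; /* the restored behavior */
    }

    int main(void)
    {
        /* 8 KB physical blocks (16 sectors), IO starting 4 sectors in */
        printf("%u\n", split_sectors(4, 8, 16)); /* 8, not 0 */
        return 0;
    }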
     

17 Aug, 2020

1 commit

  • When queue_max_discard_segments(q) is 1, blk_discard_mergable() will
    return false for a discard request, and normal request merging is
    applied. However, only queue_max_segments() is checked, so the max
    discard segment limit isn't respected.

    Check the max discard segment limit in the request merge code to fix
    the issue.

    This fixes discard request failures with virtio_blk.

    Fixes: 69840466086d ("block: fix the DISCARD request merge")
    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Stefano Garzarella
    Signed-off-by: Jens Axboe

    Ming Lei
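
    The idea, sketched as a standalone toy (invented names, not the
    kernel's merge code): the discard-specific segment limit must be
    consulted when the request being merged is a discard.

    #include <stdbool.h>
    #include <stdio.h>

    struct toy_limits {
        unsigned int max_segments;
        unsigned int max_discard_segments;
    };

    /* Pick the limit that matches the request type. */
    static bool may_add_segment(const struct toy_limits *l,
                                unsigned int nr_segs, bool is_discard)
    {
        unsigned int max = is_discard ? l->max_discard_segments
                                      : l->max_segments;

        return nr_segs + 1 <= max;
    }

    int main(void)
    {
        struct toy_limits l = { .max_segments = 128,
                                .max_discard_segments = 1 };

        /* a discard already holding one segment must not grow */
        printf("%d\n", may_add_segment(&l, 1, true));  /* 0 */
        printf("%d\n", may_add_segment(&l, 1, false)); /* 1 */
        return 0;
    }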
     

05 Aug, 2020

1 commit

  • Pull uninitialized_var() macro removal from Kees Cook:
    "This is long overdue, and has hidden too many bugs over the years. The
    series has several "by hand" fixes, and then a trivial treewide
    replacement.

    - Clean up non-trivial uses of uninitialized_var()

    - Update documentation and checkpatch for uninitialized_var() removal

    - Treewide removal of uninitialized_var()"

    * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    compiler: Remove uninitialized_var() macro
    treewide: Remove uninitialized_var() usage
    checkpatch: Remove awareness of uninitialized_var() macro
    mm/debug_vm_pgtable: Remove uninitialized_var() usage
    f2fs: Eliminate usage of uninitialized_var() macro
    media: sur40: Remove uninitialized_var() usage
    KVM: PPC: Book3S PR: Remove uninitialized_var() usage
    clk: spear: Remove uninitialized_var() usage
    clk: st: Remove uninitialized_var() usage
    spi: davinci: Remove uninitialized_var() usage
    ide: Remove uninitialized_var() usage
    rtlwifi: rtl8192cu: Remove uninitialized_var() usage
    b43: Remove uninitialized_var() usage
    drbd: Remove uninitialized_var() usage
    x86/mm/numa: Remove uninitialized_var() usage
    docs: deprecated.rst: Add uninitialized_var()

    Linus Torvalds
     

17 Jul, 2020

1 commit

  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
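
    For reference, the kind of rewrite the script performs (a made-up
    snippet, not taken from any particular driver):

    /* before: the macro hid potentially-uninitialized uses */
    unsigned long uninitialized_var(flags); /* GCC be quiet */

    /* after: a plain declaration the compiler can fully analyze */
    unsigned long flags;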
     

26 Jun, 2020

1 commit

  • Currently blk-mq does not report any event when two requests get merged
    in the elevator. This then results in a difficult-to-understand sequence
    of events like:

    ...
    8,0 34 1579 0.608765271 2718 I WS 215023504 + 40 [dbench]
    8,0 34 1584 0.609184613 2719 A WS 215023544 + 56
    Signed-off-by: Jens Axboe

    Jan Kara
     

14 May, 2020

1 commit

  • We must have some way of letting a storage device driver know what
    encryption context it should use for en/decrypting a request. However,
    it's the upper layers (like the filesystem/fscrypt) that know about and
    manage encryption contexts. As such, when the upper layer submits a bio
    to the block layer, and this bio eventually reaches a device driver with
    support for inline encryption, the device driver will need to have been
    told the encryption context for that bio.

    We want to communicate the encryption context from the upper layer to the
    storage device along with the bio, when the bio is submitted to the block
    layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
    represent an encryption context (we can't use the bi_private field in
    struct bio for this, because that field isn't meant to pass information
    across layers in the storage stack). We also introduce various
    functions to manipulate the bio_crypt_ctx and make the bio/request merging
    logic aware of the bio_crypt_ctx.

    We also make changes to blk-mq to make it handle bios with encryption
    contexts. blk-mq can merge many bios into the same request. These bios need
    to have contiguous data unit numbers (the necessary changes to blk-merge
    are also made to ensure this) - as such, it suffices to keep the data unit
    number of just the first bio, since that's all a storage driver needs to
    infer the data unit number to use for each data block in each bio in a
    request. blk-mq keeps track of the encryption context to be used for all
    the bios in a request with the request's rq_crypt_ctx. When the first bio
    is added to an empty request, blk-mq will program the encryption context
    of that bio into the request_queue's keyslot manager, and store the
    returned keyslot in the request's rq_crypt_ctx. All the functions to
    operate on encryption contexts are in blk-crypto.c.

    Upper layers only need to call bio_crypt_set_ctx with the encryption key,
    algorithm and data_unit_num; they don't have to worry about getting a
    keyslot for each encryption context, as blk-mq/blk-crypto handles that.
    Blk-crypto also makes it possible for request-based layered devices like
    dm-rq to make use of inline encryption hardware by cloning the
    rq_crypt_ctx and programming a keyslot in the new request_queue when
    necessary.

    Note that any user of the block layer can submit bios with an
    encryption context, such as filesystems, device-mapper targets, etc.

    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Satya Tangirala
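
    As a rough illustration of the contiguity rule (a userspace toy with
    invented names, not the blk-crypto API): bios can share one
    request-level context only when their data unit numbers run
    back-to-back.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct toy_crypt_ctx { uint64_t dun; }; /* first data unit number */

    struct toy_bio {
        uint32_t data_units; /* length in data units */
        struct toy_crypt_ctx ctx;
    };

    /* Mergeable only if b starts at the data unit right after a ends;
     * then keeping a's DUN in the request is enough for the driver to
     * infer the DUN of every data block in the request. */
    static bool dun_contiguous(const struct toy_bio *a,
                               const struct toy_bio *b)
    {
        return a->ctx.dun + a->data_units == b->ctx.dun;
    }

    int main(void)
    {
        struct toy_bio a = { .data_units = 8, .ctx = { .dun = 100 } };
        struct toy_bio b = { .data_units = 4, .ctx = { .dun = 108 } };

        printf("mergeable: %d\n", dun_contiguous(&a, &b)); /* 1 */
        return 0;
    }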
     

23 Apr, 2020

4 commits

  • There are only two callers of blk_rq_map_sg/__blk_rq_map_sg that set
    the dma_pad value in the queue. Move the handling into those callers
    instead of burdening the common code, and move the ->extra_len field
    from struct request to struct scsi_cmnd.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Don't burden the common block code with specifics of the libata DMA
    draining mechanism. Instead move most of the code to the scsi midlayer.

    That also means the nr_phys_segments adjustments in the blk-mq fast path
    can go away entirely, given that SCSI never looks at nr_phys_segments
    after mapping the request to a scatterlist.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • To be able to move some of the special purpose hacks in blk_rq_map_sg
    into the callers we need a variant that returns the last mapped
    S/G list element to the caller. Add that variant as __blk_rq_map_sg
    and make blk_rq_map_sg a trivial inline wrapper around it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
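
    The shape of the change, sketched as a standalone toy (invented names;
    the kernel's signatures differ): the extended variant reports the last
    mapped element, and the old interface becomes a trivial wrapper.

    #include <stdio.h>

    struct toy_sg { unsigned int len; };

    /* Extended variant: also hands the last mapped element back, so
     * callers can append e.g. padding or drain entries to it. */
    static int toy_map_sg(const unsigned int *lens, int n,
                          struct toy_sg *out, struct toy_sg **last)
    {
        int i;

        for (i = 0; i < n; i++) {
            out[i].len = lens[i];
            *last = &out[i];
        }
        return n;
    }

    /* Old interface: a trivial wrapper around the extended variant. */
    static int toy_map_sg_simple(const unsigned int *lens, int n,
                                 struct toy_sg *out)
    {
        struct toy_sg *last = NULL;

        return toy_map_sg(lens, n, out, &last);
    }

    int main(void)
    {
        unsigned int lens[] = { 4096, 512 };
        struct toy_sg sg[2];

        printf("%d\n", toy_map_sg_simple(lens, 2, sg)); /* 2 */
        return 0;
    }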
     
  • RQF_COPY_USER is set for bios where the passthrough request mapping
    helpers decided that bounce buffering is required. It is then used to
    pad the scatterlist for drivers that require it. But given that
    non-passthrough requests are by definition aligned, and directly mapped
    pass-through requests must be aligned, it is not actually required at
    all.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

15 Jan, 2020

1 commit

  • Commit 429120f3df2d started taking a segment's start dma address into
    account when computing the max segment size, using the 'unsigned long'
    data type. However, the segment mask may be 0xffffffff, so the computed
    segment size may overflow in case of a zero physical address on a
    32-bit arch.

    Fix the issue by returning queue_max_segment_size() directly when that
    happens.

    Fixes: 429120f3df2d ("block: fix splitting segments on boundary masks")
    Reported-by: Guenter Roeck
    Tested-by: Guenter Roeck
    Cc: Christoph Hellwig
    Tested-by: Steven Rostedt (VMware)
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
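
    The overflow is easy to reproduce in isolation (a standalone demo of
    the arithmetic, not the kernel function itself):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* 32-bit arithmetic, as on an arch where 'unsigned long' is
         * 32 bits: boundary mask 0xffffffff, physical address 0 */
        uint32_t mask = 0xffffffffu;
        uint32_t addr = 0;
        uint32_t size = mask - (addr & mask) + 1; /* wraps around */

        printf("%u\n", size); /* prints 0, not 4G */
        return 0;
    }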
     

30 Dec, 2019

1 commit

  • We ran into a problem with an mpt3sas based controller, where we would
    see random (and hard to reproduce) file corruption. The issue seemed
    specific to this controller, but wasn't specific to the file system.
    After a lot of debugging, we found out that it's caused by segments
    spanning a 4G memory boundary. This shouldn't happen, as the default
    setting for segment boundary masks is 4G.

    Turns out there are two issues in get_max_segment_size():

    1) The default segment boundary mask is bypassed

    2) The segment start address isn't taken into account when checking
    the segment boundary limit

    Fix these two issues by removing the bypass of the segment boundary
    check even if the mask is set to the default value, and taking into
    account the actual start address of the request when checking if a
    segment needs splitting.

    Cc: stable@vger.kernel.org # v5.1+
    Reviewed-by: Chris Mason
    Tested-by: Chris Mason
    Fixes: dcebd755926b ("block: use bio_for_each_bvec() to compute multi-page bvec count")
    Signed-off-by: Ming Lei

    Dropped const on the page pointer, ppc page_to_phys() doesn't mark the
    page as const...

    Signed-off-by: Jens Axboe

    Ming Lei
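
    The boundary check itself is simple once the start address is included
    (a standalone sketch with illustrative names):

    #include <stdint.h>
    #include <stdio.h>

    /* A segment may not cross a (mask + 1)-sized boundary; whether it
     * does depends on the start address, not just the length. */
    static int crosses_boundary(uint64_t start, uint64_t len, uint64_t mask)
    {
        return (start & ~mask) != ((start + len - 1) & ~mask);
    }

    int main(void)
    {
        uint64_t mask = 0xffffffffULL; /* default 4G boundary mask */

        /* an 8K segment starting 4K below a 4G boundary needs a split */
        printf("%d\n", crosses_boundary(0xfffff000ULL, 8192, mask)); /* 1 */
        printf("%d\n", crosses_boundary(0x00001000ULL, 8192, mask)); /* 0 */
        return 0;
    }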
     

08 Nov, 2019

2 commits

  • 64K PAGE_SIZE is popular on ARM64 and other ARCHs, and 64K is probably
    big enough to break some devices, so change the logic to split the bio
    if the only bvec's length is > SZ_4K instead of PAGE_SIZE.

    Fixes: fa5322872187 ("block: avoid blk_bio_segment_split for small I/O operations")
    Cc: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Some device may set the segment boundary as PAGE_SIZE - 1. If the bvec
    crosses pages, and meantime its length is <= PAGE_SIZE, the bio still
    needs to be split because the bvec crosses a segment boundary.

    Fixes: fa5322872187 ("block: avoid blk_bio_segment_split for small I/O operations")
    Cc: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

05 Aug, 2019

5 commits

  • Consider the following example:
    * The logical block size is 4 KB.
    * The physical block size is 8 KB.
    * max_sectors equals (16 KB >> 9) sectors.
    * A non-aligned 4 KB and an aligned 64 KB bio are merged into a single
    non-aligned 68 KB bio.

    The current behavior is to split such a bio into (16 KB + 16 KB + 16 KB
    + 16 KB + 4 KB). None of these five bios starts at a physical block
    boundary.

    This patch ensures that such a bio is split into four aligned and
    one non-aligned bio instead of being split into five non-aligned bios.
    This improves performance because most block devices can handle aligned
    requests faster than non-aligned requests.

    Since the physical block size is larger than or equal to the logical
    block size, this patch preserves the guarantee that the returned
    value is a multiple of the logical block size.

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
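
    Working through the example above (a standalone sketch; all values in
    512-byte sectors, names illustrative): ending each split on a physical
    block boundary leaves only the first piece misaligned.

    #include <stdio.h>

    int main(void)
    {
        unsigned int off = 8;   /* 4 KB misaligned start */
        unsigned int len = 136; /* 68 KB bio             */
        unsigned int max = 32;  /* max_sectors: 16 KB    */
        unsigned int pbs = 16;  /* physical block: 8 KB  */

        while (len) {
            unsigned int n = len < max ? len : max;
            unsigned int end = (off + n) & ~(pbs - 1); /* align down */

            if (end > off) /* avoid a 0-length split */
                n = end - off;
            printf("split: %u sectors at %u (%saligned)\n",
                   n, off, off % pbs ? "mis" : "");
            off += n;
            len -= n;
        }
        return 0;
    }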
     
  • Move the max_sectors check into bvec_split_segs() such that a single
    call to that function can do all the necessary checks. This patch
    optimizes the fast path further, namely the case where a bvec fits in a
    page.

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Simplify this function by removing two if-tests. Other than requiring
    that the @sectors pointer is not NULL, this patch does not change the
    behavior of bvec_split_segs().

    Reviewed-by: Johannes Thumshirn
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Since what the bio splitting functions do is nontrivial, document these
    functions.

    Reviewed-by: Johannes Thumshirn
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Make it clear to the compiler and also to humans that the functions
    that query request queue properties do not modify any member of the
    request_queue data structure.

    Reviewed-by: Johannes Thumshirn
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
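
    A small illustration of the pattern (toy types, not the kernel's):
    const-qualifying the pointer documents the intent and lets the
    compiler reject accidental writes in query helpers.

    #include <stdio.h>

    struct toy_limits { unsigned int max_sectors; };

    static unsigned int toy_max_sectors(const struct toy_limits *l)
    {
        /* l->max_sectors = 0; would now fail to compile */
        return l->max_sectors;
    }

    int main(void)
    {
        struct toy_limits l = { .max_sectors = 255 };

        printf("%u\n", toy_max_sectors(&l));
        return 0;
    }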
     

21 Jun, 2019

3 commits

  • Now that we don't need to assign the front/back segment sizes, we can
    stop duplicating the segs assignment for the split vs no-split case and
    remove a whole chunk of boilerplate code.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Return the segment count and let the callers assign it, which makes
    the code a little more obvious. Also pass the request instead of q plus
    bio chain, allowing for the use of rq_for_each_bvec.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We only need the number of segments in the blk-mq submission path.
    Remove the field from struct bio, and return it from a variant of
    blk_queue_split instead, so that it can be passed as an argument to
    those functions that need the value.

    This also means we stop recounting segments except for cloning
    and partial segments.

    To keep the number of arguments in this hot path down, remove
    pointless struct request_queue arguments from any of the functions
    that had it and grew a nr_segs argument.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 May, 2019

3 commits

  • At this point these fields aren't used for anything, so we can remove
    them.

    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We fundamentally do not have a maximum segment size for devices with a
    virt boundary. So don't bother checking it, especially given that the
    existing checks didn't properly work to start with, as we never fully
    updated the front/back segment size and missed the bi_seg_front_size
    that would have been required for some cases.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Currently ll_merge_requests_fn, unlike all other merge functions,
    reduces nr_phys_segments by one if the last segment of the previous
    request and the first segment of the next request are contiguous. While
    this seems like a nice solution to avoid building smaller than possible
    requests, it causes a mismatch between the segments actually present
    in the request and those iterated over by the bvec iterators, including
    __rq_for_each_bio. This can, for example, mistrigger the single segment
    optimization in the nvme-pci driver, and might lead to a mismatched
    nr_phys_segments count when recalculating the number of segments
    when inserting a cloned request.

    We could possibly work around this by making the bvec iterators take
    the front and back segment size into account, but that would require
    moving them from the bio to the bio_iter and spreading this mess
    over all users of bvecs. Or we could simply remove this optimization
    under the assumption that most users already build good enough bvecs,
    and that the bio merge path never cared about this optimization
    either. The latter is what this patch does.

    Fixes: dff824b2aadb ("nvme-pci: optimize mapping of small single segment requests")
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig