05 Aug, 2019

5 commits

  • Consider the following example:
    * The logical block size is 4 KB.
    * The physical block size is 8 KB.
    * max_sectors equals (16 KB >> 9) sectors.
    * A non-aligned 4 KB and an aligned 64 KB bio are merged into a single
    non-aligned 68 KB bio.

    The current behavior is to split such a bio into (16 KB + 16 KB + 16 KB
    + 16 KB + 4 KB). None of these five bios starts on a physical block
    boundary.

    This patch ensures that such a bio is split into four aligned bios and
    one non-aligned bio instead of five non-aligned bios (the arithmetic is
    sketched after this entry). This improves performance because most
    block devices handle aligned requests faster than non-aligned requests.

    Since the physical block size is larger than or equal to the logical
    block size, this patch preserves the guarantee that the returned
    value is a multiple of the logical block size.

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
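
    As a rough illustration, here is a minimal userspace C model of the
    split-size computation described above (the function name, constants
    and loop are illustrative, not the kernel source):

        #include <stdio.h>

        #define SECTOR_SHIFT 9

        /* Reduce the split size so that the *next* split starts on a
         * physical block boundary, falling back to a logical-block
         * multiple when that is impossible. */
        static unsigned get_max_io_size(unsigned long long start_sector,
                                        unsigned max_sectors,
                                        unsigned pbs, unsigned lbs)
        {
            unsigned start_offset = start_sector & (pbs - 1);
            unsigned aligned = (max_sectors + start_offset) & ~(pbs - 1);

            if (aligned > start_offset)
                return aligned - start_offset;
            return max_sectors & ~(lbs - 1);
        }

        int main(void)
        {
            unsigned long long sector = 4096 >> SECTOR_SHIFT; /* 4 KB into an 8 KB block */
            unsigned remaining = (68 * 1024) >> SECTOR_SHIFT; /* the 68 KB bio */
            unsigned max_sectors = (16 * 1024) >> SECTOR_SHIFT;
            unsigned pbs = 8192 >> SECTOR_SHIFT, lbs = 4096 >> SECTOR_SHIFT;

            while (remaining) {
                unsigned chunk = get_max_io_size(sector, max_sectors, pbs, lbs);

                if (chunk > remaining)
                    chunk = remaining;
                printf("split: %2u KB, %s\n", (chunk << SECTOR_SHIFT) / 1024,
                       (sector & (pbs - 1)) ? "unaligned" : "aligned");
                sector += chunk;
                remaining -= chunk;
            }
            return 0; /* prints 12 KB unaligned, then 16+16+16+8 KB aligned */
        }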
     
  • Move the max_sectors check into bvec_split_segs() such that a single
    call to that function can perform all the necessary checks. This
    further optimizes the fast path, namely the case where a bvec fits in
    a single page.

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Simplify this function by removing two if-tests. Other than requiring
    that the @sectors pointer is not NULL, this patch does not change the
    behavior of bvec_split_segs().

    Reviewed-by: Johannes Thumshirn
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Since what the bio splitting functions do is nontrivial, document these
    functions.

    Reviewed-by: Johannes Thumshirn
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Make it clear to the compiler, and also to human readers, that the
    functions that query request queue properties do not modify any member
    of the request_queue data structure (a const-qualified sketch follows
    this entry).

    Reviewed-by: Johannes Thumshirn
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
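
    Not the kernel's definitions, just a minimal compilable sketch of the
    pattern: a pointer-to-const parameter makes the read-only contract
    visible to both the compiler and the reader.

        struct toy_queue {
            unsigned int max_sectors;
        };

        /* Query helper over a pointer-to-const: any write through 'q'
         * would be a compile-time error. */
        static unsigned int toy_queue_max_sectors(const struct toy_queue *q)
        {
            return q->max_sectors;
        }

        int main(void)
        {
            struct toy_queue q = { .max_sectors = 2560 };

            return toy_queue_max_sectors(&q) == 2560 ? 0 : 1;
        }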
     

21 Jun, 2019

3 commits

  • Now that we don't need to assign the front/back segment sizes, we can
    duplicate the segs assignment for the split vs. no-split case and
    remove a whole chunk of boilerplate code.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Return the segment count and let the callers assign it, which makes
    the code a little more obvious. Also pass the request instead of q
    plus the bio chain, allowing for the use of rq_for_each_bvec.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We only need the number of segments in the blk-mq submission path.
    Remove the field from struct bio and return the count from a variant
    of blk_queue_split instead, so that it can be passed as an argument to
    the functions that need the value.

    This also means we stop recounting segments except for cloning
    and partial segments.

    To keep the number of arguments in this hot path down, remove the
    pointless struct request_queue argument from every function that had
    one and grew a nr_segs argument (the resulting API shape is sketched
    after this entry).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
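
    A toy model of the resulting API shape (all names and the counting
    logic here are stand-ins, not the kernel's):

        #include <stdio.h>

        struct toy_bio { unsigned len; };

        /* The split helper returns the segment count through an out
         * parameter instead of caching it in the bio. */
        static void toy_queue_split(struct toy_bio *bio, unsigned *nr_segs)
        {
            *nr_segs = bio->len / 4096 + 1; /* stand-in for real counting */
        }

        /* Callers thread the count through as an argument. */
        static void toy_submit(struct toy_bio *bio, unsigned nr_segs)
        {
            printf("submit bio of %u bytes, %u segments\n", bio->len, nr_segs);
        }

        int main(void)
        {
            struct toy_bio bio = { .len = 68 * 1024 };
            unsigned nr_segs;

            toy_queue_split(&bio, &nr_segs);
            toy_submit(&bio, nr_segs);
            return 0;
        }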
     

24 May, 2019

3 commits

  • At this point these fields aren't used for anything, so we can remove
    them.

    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We fundamentally do not have a maximum segment size for devices with a
    virt boundary. So don't bother checking it, especially given that the
    existing checks didn't work properly to start with: we never fully
    update the front/back segment size and miss the bi_seg_front_size
    update that would have been required for some cases.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Currently ll_merge_requests_fn, unlike all other merge functions,
    reduces nr_phys_segments by one if the last segment of the previous
    request and the first segment of the next request are contiguous (the
    contiguity test is sketched after this entry). While this seems like a
    nice solution for avoiding smaller-than-possible requests, it causes a
    mismatch between the segments actually present in the request and
    those iterated over by the bvec iterators, including __rq_for_each_bio.
    This can, for example, mis-trigger the single segment optimization in
    the nvme-pci driver, and might lead to a mismatching nr_phys_segments
    count when the segments are recalculated while inserting a cloned
    request.

    We could possibly work around this by making the bvec iterators take
    the front and back segment sizes into account, but that would require
    moving them from the bio to the bio_iter and spreading this mess over
    all users of bvecs. Or we could simply remove this optimization under
    the assumption that most users already build good enough bvecs, and
    that the bio merge path never cared about this optimization either.
    The latter is what this patch does.

    Fixes: dff824b2aadb ("nvme-pci: optimize mapping of small single segment requests")
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
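
    A hedged sketch of the contiguity test the removed optimization keyed
    on, reduced to plain physical address ranges (the kernel works on
    bio_vecs):

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        struct seg {
            uint64_t phys; /* physical address of the segment start */
            uint32_t len;  /* length in bytes */
        };

        /* True when the previous request's last segment ends exactly
         * where the next request's first segment begins. */
        static bool segs_physically_contiguous(const struct seg *prev,
                                               const struct seg *next)
        {
            return prev->phys + prev->len == next->phys;
        }

        int main(void)
        {
            struct seg a = { .phys = 0x1000, .len = 0x1000 };
            struct seg b = { .phys = 0x2000, .len = 0x1000 };

            printf("%d\n", segs_physically_contiguous(&a, &b)); /* 1 */
            return 0;
        }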
     

22 Apr, 2019

1 commit

  • While we generally allow scatterlists to have offsets larger than page
    size for an entry, and other subsystems like the crypto code make use of
    that, the block layer isn't quite ready for that. Flip the switch back
    to avoid them for now, and revisit that decision early in a merge window
    once the known offenders are fixed.

    Fixes: 8a96a0e40810 ("block: rewrite blk_bvec_map_sg to avoid a nth_page call")
    Reviewed-by: Ming Lei
    Tested-by: Guenter Roeck
    Reported-by: Guenter Roeck
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

09 Apr, 2019

1 commit

  • Commit f6970f83ef79 ("block: don't check if adjacent bvecs in one bio can
    be mergeable") changes bvec merge by only considering two bvecs from
    different bios. However, if the former bio doesn't include any I/O
    bvec, then the following warning may be triggered:

    warning: ‘bvec.bv_offset’ may be used uninitialized in this function [-Wmaybe-uninitialized]

    In practice, it shouldn't be triggered.

    Fix it by adding a check on the former bio (sketched after this
    entry); the check shouldn't add any cost given that 'bio->bi_iter' is
    likely to be cache-hot.

    Reported-by: Jens Axboe
    Fixes: f6970f83ef79 ("block: don't check if adjacent bvecs in one bio can be mergeable")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
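
    A toy model of the added guard (illustrative names; the real code
    inspects the bvecs of two struct bio):

        #include <stdbool.h>
        #include <stdio.h>

        struct toy_bio { unsigned size; };

        /* Guard before reading the former bio's last bvec: an empty bio
         * has no bvec to read, which is what the -Wmaybe-uninitialized
         * warning was pointing at. */
        static bool bios_will_gap(const struct toy_bio *prev,
                                  const struct toy_bio *next)
        {
            if (!prev->size)
                return false; /* nothing to gap against */
            /* ... would compare prev's last and next's first bvec ... */
            return true;      /* stand-in result */
        }

        int main(void)
        {
            struct toy_bio empty = { 0 }, full = { 4096 };

            printf("%d\n", bios_will_gap(&empty, &full)); /* 0 */
            return 0;
        }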
     

07 Mar, 2019

1 commit

  • blk_recount_segments() can be called from bio_add_pc_page() to
    calculate how many segments this bio will have after a page is added
    to it. If the resulting segment count exceeds the queue limit, the
    added page is removed again.

    This try-and-fix policy requires blk_recount_segments()
    (__blk_recalc_rq_segments) to not consider the segment count limit.
    Unfortunately bvec_split_segs() does check this limit, which causes a
    too-small segment count to be returned to bio_add_pc_page(); the page
    may then still be added to the bio even though the segment count limit
    is already broken.

    Fix this issue by not considering the segment count limit when
    calculating the bio's segment count (a toy model of the try-and-fix
    policy follows this entry).

    Fixes: dcebd755926b ("block: use bio_for_each_bvec() to compute multi-page bvec count")
    Cc: Christoph Hellwig
    Cc: Omar Sandoval
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
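
    A toy model of the try-and-fix policy (a bio reduced to a bare segment
    count; names are hypothetical):

        #include <stdbool.h>
        #include <stdio.h>

        struct toy_bio { unsigned nr_segs; };

        /* The recount must report the true count, *not* one clamped to
         * the queue limit, or the fix-up below can never trigger. */
        static unsigned recount_segments(const struct toy_bio *bio)
        {
            return bio->nr_segs;
        }

        static bool try_add_page(struct toy_bio *bio, unsigned max_segments)
        {
            bio->nr_segs++;                           /* "try": add the page */
            if (recount_segments(bio) > max_segments) {
                bio->nr_segs--;                       /* "fix": remove it */
                return false;
            }
            return true;
        }

        int main(void)
        {
            struct toy_bio bio = { .nr_segs = 127 };

            printf("%d\n", try_add_page(&bio, 128)); /* 1: fits */
            printf("%d\n", try_add_page(&bio, 128)); /* 0: over the limit */
            return 0;
        }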
     

03 Mar, 2019

1 commit

  • When the current bvec can be merged into the first segment, the bio's
    front segment size has to be updated.

    However, dcebd755926b doesn't consider that case, so the bio's front
    segment size may not be correct.

    This patch fixes this issue.

    Cc: Christoph Hellwig
    Cc: Omar Sandoval
    Fixes: dcebd755926b ("block: use bio_for_each_bvec() to compute multi-page bvec count")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

20 Feb, 2019

1 commit

  • rq->bio can be NULL in some cases, such as for a flush request, so
    don't read bio->bi_seg_front_size until the bio has been checked for
    validity (a guarded access is sketched after this entry).

    Cc: Bart Van Assche
    Reported-by: Bart Van Assche
    Fixes: dcebd755926b0f39dd1e ("block: use bio_for_each_bvec() to compute multi-page bvec count")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
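
    A minimal sketch of the guarded access (toy types; the real code
    checks rq->bio before touching bi_seg_front_size):

        #include <stddef.h>
        #include <stdio.h>

        struct toy_bio { unsigned seg_front_size; };
        struct toy_rq  { struct toy_bio *bio; };

        /* rq->bio may be NULL, e.g. for a flush request. */
        static unsigned front_size_or_zero(const struct toy_rq *rq)
        {
            return rq->bio ? rq->bio->seg_front_size : 0;
        }

        int main(void)
        {
            struct toy_rq flush = { .bio = NULL };

            printf("%u\n", front_size_or_zero(&flush)); /* 0, no crash */
            return 0;
        }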
     

15 Feb, 2019

4 commits

  • Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after
    splitting"), the physical segment count is mainly figured out in
    blk_queue_split() for the fast path, and the BIO_SEG_VALID flag is set
    there too.

    Now only blk_recount_segments() and blk_recalc_rq_segments() use this
    flag.

    Basically blk_recount_segments() is bypassed in the fast path given
    that BIO_SEG_VALID is set in blk_queue_split().

    As for the other user, blk_recalc_rq_segments():

    - it runs in the partial completion branch of blk_update_request,
    which is an unusual case

    - it runs in blk_cloned_rq_check_limits(), which is still not a big
    problem if the flag is killed, since dm-rq is the only user.

    Multi-page bvecs are enabled now; not doing S/G merging is rather
    pointless with the current setup of the I/O path, as it isn't going to
    save a significant number of cycles.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • It is more efficient to use bio_for_each_bvec() to map the sg list;
    meanwhile we have to consider splitting multi-page bvecs as done in
    blk_bio_segment_split().

    Reviewed-by: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • First, it is more efficient to use bio_for_each_bvec() in both
    blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how
    many multi-page bvecs there are in the bio.

    Secondly, once bio_for_each_bvec() is used, a bvec may need to be
    split because its length can be much longer than the max segment size,
    so we have to split the big bvec into several segments (the arithmetic
    is sketched after this entry).

    Thirdly, when splitting a multi-page bvec into segments, the max
    segment limit may be reached, so bio splitting needs to be considered
    in this situation too.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
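
    A sketch of the second point: how many device segments one big bvec
    contributes when each segment is capped at the max segment size (the
    real bvec_split_segs() also honors the per-bio segment count limit):

        #include <stdio.h>

        /* DIV_ROUND_UP of the bvec length over the segment size cap. */
        static unsigned bvec_nr_segs(unsigned bvec_len, unsigned max_seg_size)
        {
            return (bvec_len + max_seg_size - 1) / max_seg_size;
        }

        int main(void)
        {
            /* a 256 KB multi-page bvec vs a 64 KB max segment size */
            printf("%u segments\n", bvec_nr_segs(256 * 1024, 64 * 1024)); /* 4 */
            return 0;
        }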
     
  • It is wrong to use bio->bi_vcnt to figure out how many segments there
    are in the bio, even when the CLONED flag isn't set on the bio,
    because the bio may have been split or advanced.

    So always use bio_segments() in blk_recount_segments(); this shouldn't
    cause any performance loss now because the physical segment count is
    figured out in blk_queue_split(), and BIO_SEG_VALID is set at the same
    time, since bdced438acd83ad83a6c ("block: setup bi_phys_segments after
    splitting").

    Reviewed-by: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Fixes: 76d8137a3113 ("blk-merge: recaculate segment if it isn't less than max segments")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

23 Jan, 2019

1 commit

  • Besides blk_queue_split(), bio_split() is used for splitting bios too,
    and the remainder bio is often resubmitted to the queue via
    generic_make_request(). So the same queue-enter recursion exists in
    this case too. Unfortunately commit cd4a4ae4683dc2 doesn't help this
    case.

    This patch covers the above case by setting BIO_QUEUE_ENTERED before
    calling q->make_request_fn (the flag's lifetime is sketched after this
    entry).

    In theory the per-bio flag simulates a stack variable, so it is fine
    to clear it after q->make_request_fn returns; in particular, the same
    bio can't be submitted from another context.

    Fixes: cd4a4ae4683dc2 ("block: don't use blocking queue entered for recursive bio submits")
    Cc: Tetsuo Handa
    Cc: NeilBrown
    Reviewed-by: Mike Snitzer
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
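
    A toy userspace model of the flag's lifetime as described above (the
    flag name follows the commit; everything else is illustrative):

        #include <stdbool.h>
        #include <stdio.h>

        struct toy_bio { bool queue_entered; /* BIO_QUEUE_ENTERED */ };

        /* Queue entry blocks unless the bio is flagged as already inside. */
        static void queue_enter(struct toy_bio *bio)
        {
            printf(bio->queue_entered ? "non-blocking enter\n"
                                      : "blocking enter\n");
        }

        /* The driver may split the bio and resubmit the remainder, which
         * re-enters the queue recursively. */
        static void make_request(struct toy_bio *bio)
        {
            queue_enter(bio);
        }

        int main(void)
        {
            struct toy_bio bio = { false };

            queue_enter(&bio);         /* first entry: may block normally */
            bio.queue_entered = true;  /* set before q->make_request_fn   */
            make_request(&bio);        /* recursion cannot deadlock now   */
            bio.queue_entered = false; /* clear after the call returns    */
            return 0;
        }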
     

29 Dec, 2018

1 commit

  • Pull SCSI updates from James Bottomley:
    "This is mostly update of the usual drivers: smarpqi, lpfc, qedi,
    megaraid_sas, libsas, zfcp, mpt3sas, hisi_sas.

    Additionally, we have a pile of annotation, unused variable and minor
    updates.

    The big API change is the updates for Christoph's DMA rework which
    include removing the DISABLE_CLUSTERING flag.

    And finally there are a couple of target tree updates"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (259 commits)
    scsi: isci: request: mark expected switch fall-through
    scsi: isci: remote_node_context: mark expected switch fall-throughs
    scsi: isci: remote_device: Mark expected switch fall-throughs
    scsi: isci: phy: Mark expected switch fall-through
    scsi: iscsi: Capture iscsi debug messages using tracepoints
    scsi: myrb: Mark expected switch fall-throughs
    scsi: megaraid: fix out-of-bound array accesses
    scsi: mpt3sas: mpt3sas_scsih: Mark expected switch fall-through
    scsi: fcoe: remove set but not used variable 'port'
    scsi: smartpqi: call pqi_free_interrupts() in pqi_shutdown()
    scsi: smartpqi: fix build warnings
    scsi: smartpqi: update driver version
    scsi: smartpqi: add ofa support
    scsi: smartpqi: increase fw status register read timeout
    scsi: smartpqi: bump driver version
    scsi: smartpqi: add smp_utils support
    scsi: smartpqi: correct lun reset issues
    scsi: smartpqi: correct volume status
    scsi: smartpqi: do not offline disks for transient did no connect conditions
    scsi: smartpqi: allow for larger raid maps
    ...

    Linus Torvalds
     

19 Dec, 2018

1 commit

  • Now that the SCSI layer has replaced the use of the cluster flag with
    segment size limits and the DMA boundary, we can remove the cluster
    flag from the block layer.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Signed-off-by: Martin K. Petersen

    Christoph Hellwig
     

10 Dec, 2018

2 commits

  • We want to convert to per-cpu in_flight counters.

    The function part_round_stats needs the in_flight counter every jiffy;
    it would be too costly to sum all the per-cpu variables every jiffy,
    so it must be deleted. part_round_stats is used to calculate two
    counters: time_in_queue and io_ticks.

    time_in_queue can be calculated without part_round_stats, by adding the
    duration of the I/O when the I/O ends (the value is almost as exact as the
    previously calculated value, except that time for in-progress I/Os is not
    counted).

    io_ticks can be approximated by increasing the value whenever an I/O
    is started or ended and the jiffies value has changed (sketched after
    this entry). If the I/Os take less than a jiffy, the value is as exact
    as the previously calculated value. If the I/Os take more than a
    jiffy, io_ticks can drift behind the previously calculated value.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mikulas Patocka
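
    A toy model of the io_ticks approximation described above (jiffies
    passed in by hand; names are illustrative):

        #include <stdio.h>

        static unsigned long io_ticks;
        static unsigned long last_counted = (unsigned long)-1;

        /* Bump io_ticks whenever an I/O starts or ends in a jiffy that
         * has not been counted yet: exact for sub-jiffy I/Os, may drift
         * behind for I/Os spanning many jiffies. */
        static void update_io_ticks(unsigned long now)
        {
            if (now != last_counted) {
                io_ticks++;
                last_counted = now;
            }
        }

        int main(void)
        {
            update_io_ticks(100); /* I/O A starts */
            update_io_ticks(100); /* I/O B starts: same jiffy, no bump */
            update_io_ticks(101); /* I/O A ends in the next jiffy */
            printf("io_ticks = %lu\n", io_ticks); /* 2 */
            return 0;
        }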
     
  • All of part_stat_* and related methods are used with preemption
    disabled, so there is no need to pass the cpu around to all of them.
    Just call smp_processor_id() as needed.

    Suggested-by: Jens Axboe
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

05 Dec, 2018

1 commit

  • Pull in v4.20-rc5, solving a conflict we'll otherwise get in aio.c and
    also picking up the merge fix that went into mainline, which users are
    hitting when testing for-4.21/block and/or for-next.

    * tag 'v4.20-rc5': (664 commits)
    Linux 4.20-rc5
    PCI: Fix incorrect value returned from pcie_get_speed_cap()
    MAINTAINERS: Update linux-mips mailing list address
    ocfs2: fix potential use after free
    mm/khugepaged: fix the xas_create_range() error path
    mm/khugepaged: collapse_shmem() do not crash on Compound
    mm/khugepaged: collapse_shmem() without freezing new_page
    mm/khugepaged: minor reorderings in collapse_shmem()
    mm/khugepaged: collapse_shmem() remember to clear holes
    mm/khugepaged: fix crashes due to misaccounted holes
    mm/khugepaged: collapse_shmem() stop if punched or truncated
    mm/huge_memory: fix lockdep complaint on 32-bit i_size_read()
    mm/huge_memory: splitting set mapping+index before unfreeze
    mm/huge_memory: rename freeze_page() to unmap_page()
    initramfs: clean old path before creating a hardlink
    kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace
    psi: make disabling/enabling easier for vendor kernels
    proc: fixup map_files test on arm
    debugobjects: avoid recursive calls with kmemleak
    userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set
    ...

    Jens Axboe
     

01 Dec, 2018

1 commit

  • There are actually two kinds of discard merge:

    - one is the normal discard merge, just like a normal read/write
    request; call it single-range discard

    - the other is the multi-range discard, where
    queue_max_discard_segments(rq->q) > 1

    For the former case, queue_max_discard_segments(rq->q) is 1, and we
    should handle this kind of discard merge like a normal read/write
    request (the case split is sketched after this entry).

    This patch fixes the following kernel panic issue [1], which is caused
    by not removing the single-range discard request from the elevator
    queue.

    Guangwu has one raid discard test case, in which this issue is a bit
    easier to trigger, and I verified that this patch can fix the kernel
    panic issue in Guangwu's test case.

    [1] kernel panic log from Jens's report

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000148
    PGD 0 P4D 0.
    Oops: 0000 [#1] SMP PTI
    CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted \
    4.20.0-rc3-00649-ge64d9a554a91-dirty #14 Hardware name: Wiwynn \
    Leopard-Orv2/Leopard-DDR BW, BIOS LBM08 03/03/2017 Workqueue: kblockd \
    blk_mq_run_work_fn RIP: \
    0010:blk_mq_get_driver_tag+0x81/0x120 Code: 24 \
    10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 \
    0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 \
    f6 87 b0 00 00 00 02 RSP: 0018:ffffc90004aabd30 EFLAGS: 00010246 \
    RAX: 0000000000000003 RBX: ffff888465ea1300 RCX: ffffc90004aabde8
    RDX: 00000000ffffffff RSI: ffffc90004aabde8 RDI: 0000000000000000
    RBP: 0000000000000000 R08: ffff888465ea1348 R09: 0000000000000000
    R10: 0000000000001000 R11: 00000000ffffffff R12: ffff888465ea1300
    R13: 0000000000000000 R14: ffff888465ea1348 R15: ffff888465d10000
    FS: 0000000000000000(0000) GS:ffff88846f9c0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000148 CR3: 000000000220a003 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    blk_mq_dispatch_rq_list+0xec/0x480
    ? elv_rb_del+0x11/0x30
    blk_mq_do_dispatch_sched+0x6e/0xf0
    blk_mq_sched_dispatch_requests+0xfa/0x170
    __blk_mq_run_hw_queue+0x5f/0xe0
    process_one_work+0x154/0x350
    worker_thread+0x46/0x3c0
    kthread+0xf5/0x130
    ? process_one_work+0x350/0x350
    ? kthread_destroy_worker+0x50/0x50
    ret_from_fork+0x1f/0x30
    Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel \
    kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii \
    cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq \
    button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme \
    nvme_core fuse sg loop efivarfs autofs4 CR2: 0000000000000148 \

    ---[ end trace 340a1fb996df1b9b ]---
    RIP: 0010:blk_mq_get_driver_tag+0x81/0x120
    Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 \
    00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 8b 87 48 01 00 00 8b 40 04 39 43 \
    20 72 37 f6 87 b0 00 00 00 02

    Fixes: 445251d0f4d329a ("blk-mq: fix discard merge with scheduler attached")
    Reported-by: Jens Axboe
    Cc: Guangwu Zhang
    Cc: Christoph Hellwig
    Cc: Jianchao Wang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
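
    The case split from the top of this entry, as a one-line predicate
    (illustrative; the real check is queue_max_discard_segments(rq->q) > 1):

        #include <stdbool.h>
        #include <stdio.h>

        /* A queue that supports only one discard segment must treat a
         * discard merge like a normal read/write merge, including taking
         * the merged request off the elevator queue. */
        static bool is_multi_range_discard(unsigned max_discard_segments)
        {
            return max_discard_segments > 1;
        }

        int main(void)
        {
            printf("single-range: %d\n", !is_multi_range_discard(1));
            printf("multi-range:  %d\n",  is_multi_range_discard(8));
            return 0;
        }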
     

20 Nov, 2018

1 commit

  • Growing a high-priority request by merging it with a lower-priority
    BIO or request increases the request execution time. This is the
    opposite of the desired effect of high I/O priorities, namely low I/O
    latencies. To fix this, prevent merging of requests and BIOs that have
    different I/O priorities (the added condition is sketched after this
    entry).

    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
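
    A minimal sketch of the added condition (illustrative; the real check
    compares the ioprio of the request and the bio being merged):

        #include <stdbool.h>
        #include <stdio.h>

        /* Refuse to merge two I/O units whose priorities differ, so a
         * high-priority request is never grown with low-priority data. */
        static bool ioprio_allows_merge(unsigned short a, unsigned short b)
        {
            return a == b;
        }

        int main(void)
        {
            printf("%d %d\n", ioprio_allows_merge(1, 1),
                              ioprio_allows_merge(1, 3)); /* 1 0 */
            return 0;
        }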
     

19 Nov, 2018

1 commit

  • Merge in -rc3 to resolve a few conflicts, but also to get a few
    important fixes that have gone into mainline since the block 4.21
    branch was forked off (most notably the SCSI queue issue, which is
    both a conflict AND a needed fix).

    Signed-off-by: Jens Axboe

    Jens Axboe