04 Mar, 2016

1 commit


23 Jan, 2016

1 commit

  • After commit e36f62042880 (block: split bios to max possible length),
    a bio can be split in the middle of a vector entry, so it is easy
    to split out a bio whose size isn't aligned with the block size,
    especially when the block size is bigger than 512.

    This patch fixes the issue by making the max io size aligned
    to logical block size.

    Fixes: e36f62042880 (block: split bios to max possible length)
    Reported-by: Stefan Haberland
    Cc: Keith Busch
    Suggested-by: Linus Torvalds
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
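
A minimal sketch of the alignment idea described above, assuming the logical block size is a power of two. The function name `align_max_io_size` is hypothetical and is not the kernel's helper; it only illustrates rounding a byte limit down to a block-size multiple.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round a byte limit down to a multiple of the logical block size so
 * that a split never ends in the middle of a logical block.
 * Assumption: lbs is a nonzero power of two (512, 4096, ...).
 */
uint32_t align_max_io_size(uint32_t max_bytes, uint32_t lbs)
{
    assert(lbs != 0 && (lbs & (lbs - 1)) == 0);
    return max_bytes & ~(lbs - 1);
}
```

With a 4096-byte logical block, a 6144-byte limit becomes 4096 bytes, so the front part of a split always covers whole blocks.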
     

20 Jan, 2016

1 commit

  • Pull core block updates from Jens Axboe:
    "We don't have a lot of core changes this time around, it's mostly in
    drivers, which will come in a subsequent pull.

    The core changes include:

    - blk-mq
    - Prep patch from Christoph, changing blk_mq_alloc_request() to
    take flags instead of just using gfp_t for sleep/nosleep.
    - Doc patch from me, clarifying the difference between legacy
    and blk-mq for timer usage.
    - Fixes from Raghavendra for memory-less numa nodes, and a reuse
    of CPU masks.

    - Cleanup from Geliang Tang, using offset_in_page() instead of open
    coding it.

    - From Ilya, rename request_queue slab so it reflects what it holds,
    and a fix for proper use of bdgrab/put.

    - A real fix for the split across stripe boundaries from Keith. We
    yanked a broken version of this from 4.4-rc final, this one works.

    - From Mike Krinkin, emit a trace message when we split.

    - From Wei Tang, two small cleanups, not explicitly clearing memory
    that is already cleared"

    * 'for-4.5/core' of git://git.kernel.dk/linux-block:
    block: use bd{grab,put}() instead of open-coding
    block: split bios to max possible length
    block: add call to split trace point
    blk-mq: Avoid memoryless numa node encoded in hctx numa_node
    blk-mq: Reuse hardware context cpumask for tags
    blk-mq: add a flags parameter to blk_mq_alloc_request
    Revert "blk-flush: Queue through IO scheduler when flush not required"
    block: clarify blk_add_timer() use case for blk-mq
    bio: use offset_in_page macro
    block: do not initialise statics to 0 or NULL
    block: do not initialise globals to 0 or NULL
    block: rename request_queue slab cache

    Linus Torvalds
     

13 Jan, 2016

1 commit

  • This splits bio in the middle of a vector to form the largest possible
    bio at the h/w's desired alignment, and guarantees the bio being split
    will have some data.

    The criterion for splitting is changed from the max sectors to the h/w's
    optimal sector alignment if it is provided. For h/w that advertises its
    block storage's underlying chunk size, it's a big performance win to not
    submit commands that cross chunk boundaries. If sector alignment is not
    provided, this patch uses the max sectors as before.

    This addresses the performance issue that commit d380561113 attempted to
    fix; that commit was reverted due to a splitting logic error.

    Signed-off-by: Keith Busch
    Cc: Jens Axboe
    Cc: Ming Lei
    Cc: Kent Overstreet
    Cc: # 4.4.x-
    Signed-off-by: Jens Axboe

    Keith Busch
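
The chunk-boundary rule can be sketched as follows. This is an illustration under stated assumptions, not the kernel code: `max_sectors_at` is a hypothetical name, the chunk size is assumed to be a power of two, and a chunk size of 0 stands for "not advertised".

```c
#include <assert.h>
#include <stdint.h>

/*
 * How many sectors may be issued starting at `sector` without crossing
 * the next chunk boundary, capped at max_sectors.
 * chunk_sectors == 0 means the device did not advertise a chunk size,
 * in which case only max_sectors applies.
 */
uint32_t max_sectors_at(uint64_t sector, uint32_t chunk_sectors,
                        uint32_t max_sectors)
{
    uint32_t to_boundary;

    if (chunk_sectors == 0)
        return max_sectors;

    /* Assumption: chunk size is a power of two. */
    assert((chunk_sectors & (chunk_sectors - 1)) == 0);

    /* Always in the range 1..chunk_sectors, so a split is never empty. */
    to_boundary = chunk_sectors - (uint32_t)(sector & (chunk_sectors - 1));
    return to_boundary < max_sectors ? to_boundary : max_sectors;
}
```

Because the distance to the boundary is at least one sector, the front part of a split always carries some data as long as `max_sectors` is nonzero.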
     

09 Jan, 2016

1 commit

  • This reverts commit d3805611130af9b911e908af9f67a3f64f4f0914.

    If we end up splitting on the first segment, we don't adjust
    the sector count. That results in hitting a BUG() when attempting
    to split 0 sectors.

    As this is just a performance issue and not a regression since
    the 4.3 release, let's just revert this change. That gives us more
    time to test a real fix for 4.5, which would be marked for
    stable anyway.

    Jens Axboe
     

23 Dec, 2015

1 commit

  • For h/w that advertises its block storage's underlying chunk size, it's
    a big performance win to not submit commands that cross chunk
    boundaries. This patch uses that criterion if it is provided. If it is
    not provided, this patch uses the max sectors as before.

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

04 Dec, 2015

1 commit

  • There is a split tracepoint that is supposed to be called when a
    bio is split, and it was called in the bio_split function until
    commit 4b1faf931650d4a35b2a ("block: Kill bio_pair_split()").
    But now, no one reports splits, so this patch adds a call to
    trace_block_split back in blk_queue_split right after the split.

    Signed-off-by: Mike Krinkin
    Signed-off-by: Jens Axboe

    Mike Krinkin
     

01 Dec, 2015

1 commit


24 Nov, 2015

3 commits

  • We have seen lots of reports of this kind of issue, so add a
    warning in blk-merge so that it can be triggered easily, instead
    of depending on a warning/bug from drivers.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Commit bdced438acd83a (block: setup bi_phys_segments after
    splitting) introduced the computation of bio->bi_phys_segments
    during bio splitting.

    Unfortunately, neither bio->bi_seg_front_size nor
    bio->bi_seg_back_size is computed, so too many physical segments
    may be obtained for one request, since both are used to check
    whether a segment spanning two bios is possible.

    This patch fixes the issue by computing the two variables in
    blk_bio_segment_split().

    Fixes: bdced438acd83a(block: setup bi_phys_segments after splitting)
    Reported-by: Michael Ellerman
    Reported-by: Mark Salter
    Tested-by: Laurent Dufour
    Tested-by: Mark Salter
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Inside blk_bio_segment_split(), the previous bvec pointer (bvprvp)
    always points to the iterator local variable, which is obviously
    wrong, so fix it by pointing it to the local variable 'bvprv'.

    Fixes: 5014c311baa2b(block: fix bogus compiler warnings in blk-merge.c)
    Cc: stable@kernel.org #4.3
    Reported-by: Michael Ellerman
    Reported-by: Mark Salter
    Tested-by: Laurent Dufour
    Tested-by: Mark Salter
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
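
The pointer mistake described in the last entry above can be illustrated with a simplified segment counter. This is only a sketch: `struct range` and `count_segments` are hypothetical stand-ins, not the kernel's bio_vec handling. The variable names `bv`, `bvprv`, and `bvprvp` mirror the ones named in the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a vector entry: a physical address range. */
struct range {
    unsigned long start;
    unsigned long len;
};

/* Count segments, merging entries that are physically contiguous. */
size_t count_segments(const struct range *v, size_t n)
{
    struct range bv, bvprv;
    struct range *bvprvp = NULL;    /* NULL until a previous entry exists */
    size_t i, nsegs = 0;

    for (i = 0; i < n; i++) {
        bv = v[i];                  /* iterator-local, overwritten each pass */

        if (!bvprvp || bvprvp->start + bvprvp->len != bv.start)
            nsegs++;                /* not contiguous: start a new segment */

        bvprv = bv;                 /* keep a copy of the entry just handled */
        bvprvp = &bvprv;            /* point at the copy, never at bv itself */
    }
    return nsegs;
}
```

If `bvprvp` pointed at `bv` instead, the "previous" entry would silently become the current one on the next pass, and contiguous entries would never be merged.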
     

22 Oct, 2015

2 commits


17 Sep, 2015

1 commit

  • biovecs have been immutable since v3.13, so it isn't necessary
    to allocate biovecs for the new cloned bios. This saves one extra
    biovec allocation/copy; that allocation is often not fixed-length
    and is a bit more expensive.

    For example, if the 'max_sectors_kb' of null_blk's queue is set
    to 16 (32 sectors) via sysfs just to force more splits, this patch
    can increase throughput by about 70% in the sequential read test
    over null_blk (direct io, bs: 1M).

    Cc: Christoph Hellwig
    Cc: Kent Overstreet
    Cc: Ming Lin
    Cc: Dongsu Park
    Signed-off-by: Ming Lei

    This fixes a performance regression introduced by commit 54efd50bfd,
    and allows us to take full advantage of the fact that we have immutable
    bio_vecs. Hand applied, as it rejected violently with commit
    5014c311baa2.

    Signed-off-by: Jens Axboe

    Ming Lei
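
A rough sketch of why immutable vectors make a copy unnecessary when splitting. All types and names here (`struct io`, `io_clone_fast`, `io_trim`, `io_advance`) are hypothetical simplifications, not the kernel's bio API: the clone shares the vector array and only the per-clone iterator changes.

```c
#include <assert.h>
#include <stddef.h>

struct seg  { size_t len; };                    /* one vector entry          */
struct iter { size_t idx, done, size; };        /* position + bytes remaining */
struct io   { const struct seg *vec; struct iter it; };

/* Clone without allocating: the vector array is shared, not copied. */
struct io io_clone_fast(const struct io *src)
{
    return *src;
}

/* Limit an I/O to its first `bytes` bytes; only the iterator changes. */
void io_trim(struct io *io, size_t bytes)
{
    assert(bytes <= io->it.size);
    io->it.size = bytes;
}

/* Skip the first `bytes` bytes; the shared vector array is untouched. */
void io_advance(struct io *io, size_t bytes)
{
    assert(bytes <= io->it.size);
    io->it.size -= bytes;
    bytes += io->it.done;
    while (bytes && bytes >= io->vec[io->it.idx].len) {
        bytes -= io->vec[io->it.idx].len;
        io->it.idx++;
    }
    io->it.done = bytes;
}
```

A split is then "clone, trim the clone, advance the original": both halves keep pointing at the same array, which is where the saved allocation and copy come from.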
     

11 Sep, 2015

1 commit

  • If a driver sets the block queue virtual boundary mask, it means that
    it cannot handle gaps so we must not allow those in the integrity
    payload as well.

    Signed-off-by: Sagi Grimberg

    Fixed up by me to have duplicate integrity merge functions, depending
    on whether block integrity is enabled or not. Fixes a compilation
    issue with CONFIG_BLK_DEV_INTEGRITY unset.

    Signed-off-by: Jens Axboe

    Sagi Grimberg
     

04 Sep, 2015

1 commit

  • We are checking for gaps to the previous bio_vec, which can
    only detect back merge gaps. Moreover, at the point where
    we check for a gap, we don't know if we will attempt a back
    or a front merge. Thus, check for a gap to prev in a back merge
    attempt and check for a gap to next in a front merge attempt.

    Signed-off-by: Jens Axboe
    [sagig: Minor rename change]
    Signed-off-by: Sagi Grimberg

    Jens Axboe
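
The direction-dependent check can be sketched like this, assuming a fixed 4096-byte page and hypothetical helper names (`gap_between`, `back_merge_has_gap`, `front_merge_has_gap`). The point is only which vector plays the "previous" role in each merge direction.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumption: fixed page size */

struct vec {
    unsigned int offset;
    unsigned int len;
};

/* A gap exists if `next` does not start at 0 or `prev` ends mid-page. */
bool gap_between(struct vec prev, struct vec next)
{
    return next.offset != 0 ||
           ((prev.offset + prev.len) & (SKETCH_PAGE_SIZE - 1)) != 0;
}

/* Back merge: the bio is appended, so the request's last vector comes first. */
bool back_merge_has_gap(struct vec req_last, struct vec bio_first)
{
    return gap_between(req_last, bio_first);
}

/* Front merge: the bio is prepended, so the bio's last vector comes first. */
bool front_merge_has_gap(struct vec bio_last, struct vec req_first)
{
    return gap_between(bio_last, req_first);
}
```

A bio ending in a 512-byte vector can be back-merged onto a page-aligned request without a gap, yet front-merging the same bio would create one, which is why a single "gap to previous" check is not enough.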
     

03 Sep, 2015

2 commits

  • The compiler can't figure out that bvprv is initialized whenever 'prev'
    is set to 1 as well. Use a pointer to bvprv instead, setting it to NULL
    initially, and get rid of the 'prev' tracking. This dumbs it down
    enough that gcc is happy.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Pull SG updates from Jens Axboe:
    "This contains a set of scatter-gather related changes/fixes for 4.3:

    - Add support for limited chaining of sg tables even for
    architectures that do not set ARCH_HAS_SG_CHAIN. From Christoph.

    - Add sg chain support to target_rd. From Christoph.

    - Fixup open coded sg->page_link in crypto/omap-sham. From
    Christoph.

    - Fixup open coded crypto ->page_link manipulation. From Dan.

    - Also from Dan, automated fixup of manual sg_unmark_end()
    manipulations.

    - Also from Dan, automated fixup of open coded sg_phys()
    implementations.

    - From Robert Jarzmik, addition of an sg table splitting helper that
    drivers can use"

    * 'for-4.3/sg' of git://git.kernel.dk/linux-block:
    lib: scatterlist: add sg splitting function
    scatterlist: use sg_phys()
    crypto/omap-sham: remove an open coded access to ->page_link
    scatterlist: remove open coded sg_unmark_end instances
    crypto: replace scatterwalk_sg_chain with sg_chain
    target/rd: always chain S/G list
    scatterlist: allow limited chaining without ARCH_HAS_SG_CHAIN

    Linus Torvalds
     

02 Sep, 2015

1 commit


20 Aug, 2015

1 commit

  • The SG_GAPS queue flag caused checks for bio vector alignment against
    PAGE_SIZE, but the device may have different constraints. This patch
    adds a queue limit that a driver with such constraints can set, allowing
    requests that would otherwise have been unnecessarily split. The new
    gaps check takes the request_queue as a parameter to simplify the logic
    around invoking this function.

    This new limit makes the queue flag redundant, so remove it and
    all of its users. Device-mappers will inherit the correct settings
    through blk_stack_limits().

    Signed-off-by: Keith Busch
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
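
A sketch of a gap check driven by a per-queue boundary mask rather than a hard-coded PAGE_SIZE. The function name `vec_gap_to_prev` and its parameter list are hypothetical, and the exact condition used in the kernel may differ; a mask of 0 is taken to mean "no restriction".

```c
#include <assert.h>
#include <stdbool.h>

/*
 * virt_boundary_mask is (boundary - 1) for a power-of-two boundary,
 * for example 4095 for 4 KiB. A mask of 0 disables the check.
 */
bool vec_gap_to_prev(unsigned long virt_boundary_mask,
                     unsigned int prev_offset, unsigned int prev_len,
                     unsigned int next_offset)
{
    if (virt_boundary_mask == 0)
        return false;

    /* Gap if the previous entry ends, or the next one starts, off-boundary. */
    return (next_offset & virt_boundary_mask) != 0 ||
           ((prev_offset + prev_len) & virt_boundary_mask) != 0;
}
```

With a 4 KiB mask, a previous entry of 512 bytes counts as a gap; a device that only needs 512-byte alignment can use a smaller mask and keep such requests intact.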
     

17 Aug, 2015

1 commit


14 Aug, 2015

2 commits

  • As generic_make_request() is now able to handle arbitrarily sized bios,
    it's no longer necessary for each individual block driver to define its
    own ->merge_bvec_fn() callback. Remove every invocation completely.

    Cc: Jens Axboe
    Cc: Lars Ellenberg
    Cc: drbd-user@lists.linbit.com
    Cc: Jiri Kosina
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Neil Brown
    Cc: linux-raid@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: "Martin K. Petersen"
    Acked-by: NeilBrown (for the 'md' bits)
    Acked-by: Mike Snitzer
    Signed-off-by: Kent Overstreet
    [dpark: also remove ->merge_bvec_fn() in dm-thin as well as
    dm-era-target, and resolve merge conflicts]
    Signed-off-by: Dongsu Park
    Signed-off-by: Ming Lin
    Signed-off-by: Jens Axboe

    Kent Overstreet
     
  • The way the block layer is currently written, it goes to great lengths
    to avoid having to split bios; upper layer code (such as bio_add_page())
    checks what the underlying device can handle and tries to always create
    bios that don't need to be split.

    But this approach becomes unwieldy and eventually breaks down with
    stacked devices and devices with dynamic limits, and it adds a lot of
    complexity. If the block layer could split bios as needed, we could
    eliminate a lot of complexity elsewhere - particularly in stacked
    drivers. Code that creates bios can then create whatever size bios are
    convenient, and more importantly stacked drivers don't have to deal with
    both their own bio size limitations and the limitations of the
    (potentially multiple) devices underneath them. In the future this will
    let us delete merge_bvec_fn and a bunch of other code.

    We do this by adding calls to blk_queue_split() to the various
    make_request functions that need it - a few can already handle arbitrary
    size bios. Note that we add the call _after_ any call to
    blk_queue_bounce(); this means that blk_queue_split() and
    blk_recalc_rq_segments() don't need to be concerned with bouncing
    affecting segment merging.

    Some make_request_fn() callbacks were simple enough to audit and verify
    they don't need blk_queue_split() calls. The skipped ones are:

    * nfhd_make_request (arch/m68k/emu/nfblock.c)
    * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
    * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
    * brd_make_request (ramdisk - drivers/block/brd.c)
    * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
    * loop_make_request
    * null_queue_bio
    * bcache's make_request fns

    Some others are almost certainly safe to remove now, but will be left
    for future patches.

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Ming Lei
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Lars Ellenberg
    Cc: drbd-user@lists.linbit.com
    Cc: Jiri Kosina
    Cc: Geoff Levand
    Cc: Jim Paris
    Cc: Philip Kelleher
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Oleg Drokin
    Cc: Andreas Dilger
    Acked-by: NeilBrown (for the 'md/md.c' bits)
    Acked-by: Mike Snitzer
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Kent Overstreet
    [dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
    Signed-off-by: Dongsu Park
    Signed-off-by: Ming Lin
    Signed-off-by: Jens Axboe

    Kent Overstreet
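
The idea of letting the submitter build an arbitrarily large I/O and splitting it later can be sketched as a simple loop. This is not blk_queue_split() itself: `split_io` and `struct piece` are hypothetical, and only a single max-sectors limit is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct piece {
    uint64_t sector;       /* start sector of this piece */
    uint32_t nr_sectors;   /* length of this piece       */
};

/*
 * Cut [sector, sector + nr_sectors) into pieces of at most max_sectors.
 * Returns the number of pieces written to `out` (at most out_cap; any
 * remainder beyond out_cap is not emitted in this sketch).
 */
size_t split_io(uint64_t sector, uint32_t nr_sectors, uint32_t max_sectors,
                struct piece *out, size_t out_cap)
{
    size_t n = 0;

    assert(max_sectors > 0);
    while (nr_sectors > 0 && n < out_cap) {
        uint32_t len = nr_sectors < max_sectors ? nr_sectors : max_sectors;

        out[n].sector = sector;
        out[n].nr_sectors = len;
        n++;

        sector += len;
        nr_sectors -= len;
    }
    return n;
}
```

A 300-sector I/O at sector 100 with a 128-sector limit yields pieces of 128, 128, and 44 sectors; the code that created the I/O never needed to know the limit.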
     

29 Jul, 2015

1 commit

  • Some places use helpers now, others don't. We only have the 'is set'
    helper, add helpers for setting and clearing flags too.

    It was a bit of a mess of atomic vs non-atomic access. With
    BIO_UPTODATE gone, we don't have any risk of concurrent access to the
    flags. So relax the restriction and don't make any of them atomic. The
    flags that do have serialization issues (reffed and chained), we
    already handle those separately.

    Signed-off-by: Jens Axboe

    Jens Axboe
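
A minimal sketch of the three non-atomic helpers described above. The struct and function names (`io_desc`, `io_flagged`, `io_set_flag`, `io_clear_flag`) are hypothetical; they only show the shape of test/set/clear helpers over a plain flags word.

```c
#include <assert.h>
#include <stdbool.h>

struct io_desc {
    unsigned int flags;
};

/* Test a flag bit. Plain load: callers must not race on these flags. */
bool io_flagged(const struct io_desc *io, unsigned int bit)
{
    assert(bit < 32);
    return (io->flags & (1U << bit)) != 0;
}

/* Set a flag bit with a plain (non-atomic) read-modify-write. */
void io_set_flag(struct io_desc *io, unsigned int bit)
{
    assert(bit < 32);
    io->flags |= 1U << bit;
}

/* Clear a flag bit, also non-atomically. */
void io_clear_flag(struct io_desc *io, unsigned int bit)
{
    assert(bit < 32);
    io->flags &= ~(1U << bit);
}
```

Non-atomic updates are only safe because, as the message states, no flag handled this way is modified concurrently; flags with serialization needs are handled separately.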
     

30 May, 2015

1 commit


20 Mar, 2015

1 commit

  • Use the right array index to reference the last
    element of rq->biotail->bi_io_vec[]

    Signed-off-by: Wenbo Wang
    Reviewed-by: Chong Yuan
    Fixes: 66cb45aa41315 ("block: add support for limiting gaps in SG lists")
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Wenbo Wang
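
For illustration, the indexing rule behind this fix: with `cnt` entries, the last one lives at index `cnt - 1`, not `cnt`. The helper `last_vec` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct vec {
    unsigned int offset;
    unsigned int len;
};

/* Return the last valid entry of an array holding `cnt` entries. */
const struct vec *last_vec(const struct vec *v, size_t cnt)
{
    assert(cnt > 0);        /* an empty array has no last element */
    return &v[cnt - 1];     /* v[cnt] would be one past the end   */
}
```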
     

12 Feb, 2015

2 commits


12 Nov, 2014

1 commit


22 Oct, 2014

1 commit

  • The problem was introduced by commit 764f612c6c3c231b (blk-merge:
    don't compute bi_phys_segments from bi_vcnt for cloned bio):
    merging is needed if the current number of segments isn't less
    than max segments.

    Strictly speaking, bio->bi_vcnt shouldn't be used here since
    it may not be accurate for either a cloned bio or the bio it was
    cloned from, but bio_segments() is a bit expensive, and bi_vcnt is
    still the biggest number, so the approach should work.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

10 Oct, 2014

1 commit


27 Sep, 2014

1 commit


03 Sep, 2014

1 commit

  • QUEUE_FLAG_NO_SG_MERGE is set by default for blk-mq devices,
    so the computed bio->bi_phys_segments may be bigger than
    queue_max_segments(q) for blk-mq devices, and drivers will
    fail to handle that case. For example, the BUG_ON() in
    virtio_queue_rq() can be triggered for virtio-blk:

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1359146

    This patch fixes the issue by ignoring the QUEUE_FLAG_NO_SG_MERGE
    flag if the computed bio->bi_phys_segments is bigger than
    queue_max_segments(q). The regression was caused by commit
    05f1dd53152173 (block: add queue flag for disabling SG merging).

    Reported-by: Kick In
    Tested-by: Chris J Arges
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
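
A simplified sketch of the fallback this fix describes: take the cheap "one segment per vector" count only while it fits within the queue's segment limit, otherwise do the full merge pass. `struct range`, `count_merged`, and `recount_segments` are hypothetical names, and contiguity is reduced to adjacent address ranges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct range {
    unsigned long start;
    unsigned long len;
};

/* Slow path: merge physically contiguous neighbours into one segment. */
size_t count_merged(const struct range *v, size_t n)
{
    size_t i, nsegs = 0;

    for (i = 0; i < n; i++)
        if (i == 0 || v[i - 1].start + v[i - 1].len != v[i].start)
            nsegs++;
    return nsegs;
}

/*
 * Fast path: with merging disabled, report the vector count as-is,
 * but only if that count does not exceed what the driver can accept.
 */
size_t recount_segments(const struct range *v, size_t n,
                        bool no_sg_merge, size_t max_segments)
{
    if (no_sg_merge && n <= max_segments)
        return n;
    return count_merged(v, n);
}
```

Four contiguous pages are reported as 4 segments when the limit is 8, but are merged down to 1 when the limit is 2, so the driver never sees more segments than it advertised.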
     

25 Jun, 2014

1 commit

  • Another restriction inherited for NVMe - those devices don't support
    SG lists that have "gaps" in them. Gaps refers to cases where the
    previous SG entry doesn't end on a page boundary. For NVMe, all SG
    entries must start at offset 0 (except the first) and end on a page
    boundary (except the last).

    Signed-off-by: Jens Axboe

    Jens Axboe
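
The rule stated above (every entry except the first starts at offset 0, every entry except the last ends on a page boundary) can be checked over a whole list like this. The sketch assumes a fixed 4096-byte page; `struct sg_entry` and `sg_list_has_gaps` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumption: fixed page size */

struct sg_entry {
    unsigned int offset;   /* offset into the page */
    unsigned int len;      /* length in bytes      */
};

/* Return true if the list violates the "no gaps" rule. */
bool sg_list_has_gaps(const struct sg_entry *sg, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        /* Every entry except the first must start at offset 0. */
        if (i > 0 && sg[i].offset != 0)
            return true;
        /* Every entry except the last must end on a page boundary. */
        if (i + 1 < n && (sg[i].offset + sg[i].len) % SKETCH_PAGE_SIZE != 0)
            return true;
    }
    return false;
}
```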
     

29 May, 2014

1 commit

  • If devices are not SG starved, we waste a lot of time potentially
    collapsing SG segments. Enough that 1.5% of the CPU time goes
    to this, at only 400K IOPS. Add a queue flag, QUEUE_FLAG_NO_SG_MERGE,
    which just returns the number of vectors in a bio instead of looping
    over all segments and checking for collapsible ones.

    Add a BLK_MQ_F_SG_MERGE flag so that drivers can opt-in on the sg
    merging, if they so desire.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Feb, 2014

1 commit

  • Immutable biovecs changed the way biovecs are interpreted - drivers no
    longer use bi_vcnt, they have to go by bi_iter.bi_size (to allow for
    using part of an existing segment without modifying it).

    This breaks with discards and write_same bios, since for those bi_size
    has nothing to do with segments in the biovec. So for now, we need a
    fairly gross hack - we fortunately know that there will never be more
    than one segment for the entire request, so we can special case
    discard/write_same.

    Signed-off-by: Kent Overstreet
    Tested-by: Hugh Dickins
    Signed-off-by: Jens Axboe

    Kent Overstreet
     

04 Dec, 2013

1 commit


27 Nov, 2013

1 commit


24 Nov, 2013

2 commits

  • bio_iovec_idx() and __bio_iovec() don't have any valid uses anymore -
    previous users have been converted to bio_iovec_iter() or other methods.

    __BVEC_END() has to go too - the bvec array can't be used directly for
    the last biovec because we might only be using the first portion of it,
    we have to iterate over the bvec array with bio_for_each_segment() which
    checks against the current value of bi_iter.bi_size.

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe

    Kent Overstreet
     
  • More prep work for immutable biovecs - with immutable bvecs drivers
    won't be able to use the biovec directly, they'll need to use helpers
    that take into account bio->bi_iter.bi_bvec_done.

    This updates callers for the new usage without changing the
    implementation yet.

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Geert Uytterhoeven
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Ed L. Cashin"
    Cc: Nick Piggin
    Cc: Lars Ellenberg
    Cc: Jiri Kosina
    Cc: Paul Clements
    Cc: Jim Paris
    Cc: Geoff Levand
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Joshua Morris
    Cc: Philip Kelleher
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Neil Brown
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux390@de.ibm.com
    Cc: Nagalakshmi Nandigama
    Cc: Sreekanth Reddy
    Cc: support@lsi.com
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: Alexander Viro
    Cc: Steven Whitehouse
    Cc: Herton Ronaldo Krzesinski
    Cc: Tejun Heo
    Cc: Andrew Morton
    Cc: Guo Chao
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Matthew Wilcox
    Cc: Keith Busch
    Cc: Stephen Hemminger
    Cc: Quoc-Son Anh
    Cc: Sebastian Ott
    Cc: Nitin Gupta
    Cc: Minchan Kim
    Cc: Jerome Marchand
    Cc: Seth Jennings
    Cc: "Martin K. Petersen"
    Cc: Mike Snitzer
    Cc: Vivek Goyal
    Cc: "Darrick J. Wong"
    Cc: Chris Metcalf
    Cc: Jan Kara
    Cc: linux-m68k@lists.linux-m68k.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: drbd-user@lists.linbit.com
    Cc: nbd-general@lists.sourceforge.net
    Cc: cbe-oss-dev@lists.ozlabs.org
    Cc: xen-devel@lists.xensource.com
    Cc: virtualization@lists.linux-foundation.org
    Cc: linux-raid@vger.kernel.org
    Cc: linux-s390@vger.kernel.org
    Cc: DL-MPTFusionLinux@lsi.com
    Cc: linux-scsi@vger.kernel.org
    Cc: devel@driverdev.osuosl.org
    Cc: linux-fsdevel@vger.kernel.org
    Cc: cluster-devel@redhat.com
    Cc: linux-mm@kvack.org
    Acked-by: Geoff Levand

    Kent Overstreet
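
A sketch of why the vector array cannot be read directly once an iterator is in play. The types and helpers (`struct seg`, `struct iter`, `iter_cur`, `count_visible_segments`) are hypothetical simplifications: the iterator carries the index, the bytes already consumed in the current entry, and the bytes remaining overall, and each entry is viewed through it.

```c
#include <assert.h>

struct seg  { unsigned int offset, len; };
struct iter { unsigned int idx, done, size; };  /* done: bytes used in vec[idx] */

/* The current entry as seen through the iterator, clipped on both ends. */
struct seg iter_cur(const struct seg *vec, struct iter it)
{
    struct seg s;
    unsigned int avail;

    assert(it.done < vec[it.idx].len);  /* assumption: no empty entries */
    avail = vec[it.idx].len - it.done;
    s.offset = vec[it.idx].offset + it.done;
    s.len = avail < it.size ? avail : it.size;
    return s;
}

/*
 * Count the entries actually covered by the iterator. The caller must
 * ensure it.size does not exceed the data available from it.idx onward.
 */
unsigned int count_visible_segments(const struct seg *vec, struct iter it)
{
    unsigned int n = 0;

    while (it.size > 0) {
        struct seg s = iter_cur(vec, it);

        it.size -= s.len;
        it.done += s.len;
        if (it.done == vec[it.idx].len) {   /* entry fully consumed */
            it.idx++;
            it.done = 0;
        }
        n++;
    }
    return n;
}
```

With three 4096-byte entries and an iterator that starts 1024 bytes in and covers 5000 bytes, only two entries are visible and the second is used only partially, so neither the array length nor the raw last entry describes the I/O.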