26 Sep, 2018

1 commit

  • [ Upstream commit 42c9cdfe1e11e083dceb0f0c4977b758cf7403b9 ]

    Set max_discard_segments to USHRT_MAX in blk_set_stacking_limits() so
    that blk_stack_limits() can stack up this limit for stacked devices.

    before:

    $ cat /sys/block/nvme0n1/queue/max_discard_segments
    256
    $ cat /sys/block/dm-0/queue/max_discard_segments
    1

    after:

    $ cat /sys/block/nvme0n1/queue/max_discard_segments
    256
    $ cat /sys/block/dm-0/queue/max_discard_segments
    256
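
    In sketch form, the fix is two lines in block/blk-settings.c
    (recalled from the upstream patch; treat as illustrative rather
    than the exact diff):

    /* blk_set_stacking_limits(): start permissive so stacking can inherit */
    lim->max_discard_segments = USHRT_MAX;

    /* blk_stack_limits(): stack it like the other segment limits */
    t->max_discard_segments = min_not_zero(t->max_discard_segments,
                                           b->max_discard_segments);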

    Fixes: 1e739730c5b9e ("block: optionally merge discontiguous discard bios into a single request")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

09 Feb, 2017

1 commit

  • Add a new merge strategy that merges discard bios into a request until the
    maximum number of discard ranges (or the maximum discard size) is reached
    from the plug merging code. I/O scheduler merging is not wired up yet
    but might also be useful, although not for fast devices like NVMe which
    are the only user for now.

    Note that for now we don't support limiting the size of each discard range,
    but if needed that can be added later.
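
    A rough sketch of the limit checks on the plug-merge path (helper
    names follow the mainline patch, but this is illustrative, not the
    full merge function):

    static bool can_merge_discard(struct request_queue *q,
                                  struct request *req, struct bio *bio)
    {
            unsigned short segments = blk_rq_nr_discard_segments(req);

            /* stop once the device's max number of discard ranges is hit */
            if (segments >= queue_max_discard_segments(q))
                    return false;

            /* or once the maximum discard size would be exceeded */
            if (blk_rq_sectors(req) + bio_sectors(bio) >
                blk_rq_get_max_sectors(req, blk_rq_pos(req)))
                    return false;

            return true;
    }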

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

02 Feb, 2017

1 commit

  • We will want to have struct backing_dev_info allocated separately
    from struct request_queue. As a first step, add a pointer to
    backing_dev_info to request_queue and convert all users touching it.
    No functional changes in this patch.
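
    The shape of the conversion, as a sketch (only the access path
    changes at this step):

    /* before: the bdi is embedded, users take its address */
    struct backing_dev_info *bdi = &q->backing_dev_info;

    /* after: request_queue carries a pointer, users just read it */
    struct backing_dev_info *bdi = q->backing_dev_info;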

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

14 Dec, 2016

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the main block pull request this series. Contrary to previous
    release, I've kept the core and driver changes in the same branch. We
    always ended up having dependencies between the two for obvious
    reasons, so makes more sense to keep them together. That said, I'll
    probably try and keep more topical branches going forward, especially
    for cycles that end up being as busy as this one.

    The major parts of this pull request are:

    - Improved support for O_DIRECT on block devices, with a small
    private implementation instead of using the pig that is
    fs/direct-io.c. From Christoph.

    - Request completion tracking in a scalable fashion. This is utilized
    by two components in this pull, the new hybrid polling and the
    writeback queue throttling code.

    - Improved support for polling with O_DIRECT, adding a hybrid mode
    that combines pure polling with an initial sleep. From me.

    - Support for automatic throttling of writeback queues on the block
    side. This uses feedback from the device completion latencies to
    scale the queue on the block side up or down. From me.

    - Support for SMR drives in the block layer and for SD. From Hannes
    and Shaun.

    - Multi-connection support for nbd. From Josef.

    - Cleanup of request and bio flags, so we have a clear split between
    which are bio (or rq) private, and which ones are shared. From
    Christoph.

    - A set of patches from Bart, that improve how we handle queue
    stopping and starting in blk-mq.

    - Support for WRITE_ZEROES from Chaitanya.

    - Lightnvm updates from Javier/Matias.

    - Support for FC for the nvme-over-fabrics code. From James Smart.

    - A bunch of fixes from a whole slew of people, too many to name
    here"

    * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
    blk-stat: fix a few cases of missing batch flushing
    blk-flush: run the queue when inserting blk-mq flush
    elevator: make the rqhash helpers exported
    blk-mq: abstract out blk_mq_dispatch_rq_list() helper
    blk-mq: add blk_mq_start_stopped_hw_queue()
    block: improve handling of the magic discard payload
    blk-wbt: don't throttle discard or write zeroes
    nbd: use dev_err_ratelimited in io path
    nbd: reset the setup task for NBD_CLEAR_SOCK
    nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
    nvme-fabrics: Add target support for FC transport
    nvme-fabrics: Add host support for FC transport
    nvme-fabrics: Add FC transport LLDD api definitions
    nvme-fabrics: Add FC transport FC-NVME definitions
    nvme-fabrics: Add FC transport error codes to nvme.h
    Add type 0x28 NVME type code to scsi fc headers
    nvme-fabrics: patch target code in prep for FC transport support
    nvme-fabrics: set sqe.command_id in core not transports
    parser: add u64 number parser
    nvme-rdma: align to generic ib_event logging helper
    ...

    Linus Torvalds
     

13 Dec, 2016

1 commit

  • We ran into a funky issue, where someone doing 256K buffered reads saw
    128K requests at the device level. Turns out it is read-ahead capping
    the request size, since we use 128K as the default setting. This
    doesn't make a lot of sense - if someone is issuing 256K reads, they
    should see 256K reads, regardless of the read-ahead setting, if the
    underlying device can support a 256K read in a single command.

    This patch introduces a bdi hint, io_pages. This is the soft max IO
    size for the lower level; I've hooked it up to the bdev settings here.
    Read-ahead is modified to issue the maximum of the user request size,
    and the read-ahead max size, but capped to the max request size on the
    device side. The latter is done to avoid reading ahead too much, if the
    application asks for a huge read. With this patch, the kernel behaves
    like the application expects.
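
    The sizing rule boils down to something like this sketch (names are
    illustrative, not the exact mainline code):

    static unsigned long ra_request_size(struct backing_dev_info *bdi,
                                         unsigned long ra_pages,
                                         unsigned long req_size)
    {
            unsigned long max_pages = ra_pages;     /* read-ahead setting */

            /* let a large user read override read-ahead, capped to the
             * device-side soft max request size */
            if (req_size > max_pages && bdi->io_pages > max_pages)
                    max_pages = min(req_size, (unsigned long)bdi->io_pages);

            return max_pages;
    }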

    Link: http://lkml.kernel.org/r/1479498073-8657-1-git-send-email-axboe@fb.com
    Signed-off-by: Jens Axboe
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

01 Dec, 2016

1 commit

  • This adds a new block layer operation to zero out a range of
    LBAs. This allows zeroing to be implemented for devices that don't use
    either discard with a predictable zero pattern or WRITE SAME of zeroes.
    The prominent example of that is NVMe with the Write Zeroes command,
    but in the future, this should also help with improving the way
    zeroing discards work. For this operation, a suitable entry is exported
    in sysfs which indicates the maximum number of bytes allowed in one
    write zeroes operation by the device.
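
    A sketch of issuing such an operation, split by the device limit
    (helper names are illustrative, not the exact API):

    int issue_write_zeroes(struct block_device *bdev, sector_t sector,
                           sector_t nr_sects, gfp_t gfp_mask)
    {
            sector_t max_sectors = bdev_write_zeroes_sectors(bdev);

            if (!max_sectors)
                    return -EOPNOTSUPP;     /* device lacks the capability */

            while (nr_sects) {
                    sector_t len = min(nr_sects, max_sectors);
                    struct bio *bio = bio_alloc(gfp_mask, 0);

                    bio->bi_iter.bi_sector = sector;
                    bio->bi_bdev = bdev;
                    bio->bi_iter.bi_size = len << 9;  /* no data pages needed */
                    bio_set_op_attrs(bio, REQ_OP_WRITE_ZEROES, 0);
                    submit_bio(bio);        /* completion handling omitted */

                    sector += len;
                    nr_sects -= len;
            }
            return 0;
    }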

    Signed-off-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     

11 Nov, 2016

1 commit

  • Enable throttling of buffered writeback to make it a lot smoother,
    with far less impact on other system activity. Background writeback
    should be, by definition, background activity. The fact that we
    flush huge bundles of it at a time means that it potentially has
    heavy impacts on foreground workloads, which isn't ideal. We can't
    easily limit the sizes of writes that we do, since that would impact
    file system layout in the presence of delayed allocation. So just
    throttle back buffered writeback, unless someone is waiting for it.

    The algorithm for when to throttle takes its inspiration from the
    CoDel network scheduling algorithm. Like CoDel, blk-wb monitors
    the minimum latencies of requests over a window of time. In that
    window of time, if the minimum latency of any request exceeds a
    given target, then a scale count is incremented and the queue depth
    is shrunk. The next monitoring window is shrunk accordingly. Unlike
    CoDel, if we hit a window that exhibits good behavior, then we
    simply decrement the scale count and re-calculate the limits for that
    scale value. This prevents us from oscillating between a
    close-to-ideal value and max all the time, instead remaining in the
    windows where we get good behavior.

    Unlike CoDel, blk-wb allows the scale count to go negative. This
    happens if we primarily have writes going on. Unlike positive
    scale counts, this doesn't change the size of the monitoring window.
    When the heavy writers finish, blk-wb quickly snaps back to its
    stable state of a zero scale count.
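
    In pseudo-C, the per-window step looks roughly like this (names are
    illustrative; the real code derives the depth limits from the scale
    count rather than shifting them directly):

    struct wb_state {
            int scale_step;
            unsigned int depth, max_depth;
            u64 target_lat, win_nsec;
    };

    static void wb_window_done(struct wb_state *s, u64 win_min_lat)
    {
            if (win_min_lat > s->target_lat) {
                    s->scale_step++;        /* latency too high */
                    s->depth = max(1U, s->depth / 2);
                    s->win_nsec /= 2;       /* shrink the next window */
            } else {
                    s->scale_step--;        /* good window */
                    s->depth = min(s->max_depth, s->depth * 2);
            }
    }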

    The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
    target to be met. It defaults to 2 msec for non-rotational storage, and
    75 msec for rotational storage. Setting this value to '0' disables
    blk-wb. Generally, a user would not have to touch this setting.

    We don't enable WBT on devices that are managed with CFQ, and have
    a non-root block cgroup attached. If we have a proportional share setup
    on this particular disk, then the wbt throttling will interfere with
    that. We don't have a strong need for wbt for that case, since we will
    rely on CFQ doing that for us.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

06 Nov, 2016

1 commit

  • For blk-mq, ->nr_requests does track queue depth, at least at init
    time. But for the older queue paths, it's simply a soft setting.
    On top of that, it's generally larger than the hardware setting
    on purpose, to allow backup of requests for merging.

    Fill a hole in struct request_queue with a 'queue_depth' member that
    drivers can set to more closely inform the block layer of the real
    queue depth.
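
    The hook itself is tiny; sketched from memory (mainline names it
    blk_set_queue_depth()):

    void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
    {
            q->queue_depth = depth;         /* the hole filled in the struct */
    }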

    Signed-off-by: Jens Axboe
    Reviewed-by: Jan Kara

    Jens Axboe
     

19 Oct, 2016

2 commits

  • Signed-off-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Shaun Tancheff
    Tested-by: Shaun Tancheff
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     
  • Add the zoned queue limit to indicate the zoning model of a block device.
    Defined values are 0 (BLK_ZONED_NONE) for regular block devices,
    1 (BLK_ZONED_HA) for host-aware zoned block devices and 2 (BLK_ZONED_HM)
    for host-managed zoned block devices. The standards-defined drive-managed
    model is not represented here since these block devices do not provide
    any command for accessing zone information. Drive-managed devices will
    be reported as BLK_ZONED_NONE.

    The helper functions blk_queue_zoned_model and bdev_zoned_model return
    the zoned limit and the functions blk_queue_is_zoned and bdev_is_zoned
    return a boolean for callers to test if a block device is zoned.

    The zoned attribute is also exported as a string to applications via
    sysfs. BLK_ZONED_NONE shows as "none", BLK_ZONED_HA as "host-aware" and
    BLK_ZONED_HM as "host-managed".
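
    Sketched from the description above (matching the helpers it names):

    enum blk_zoned_model {
            BLK_ZONED_NONE, /* regular block device */
            BLK_ZONED_HA,   /* host-aware zoned block device */
            BLK_ZONED_HM,   /* host-managed zoned block device */
    };

    static inline bool blk_queue_is_zoned(struct request_queue *q)
    {
            switch (blk_queue_zoned_model(q)) {
            case BLK_ZONED_HA:
            case BLK_ZONED_HM:
                    return true;
            default:
                    return false;
            }
    }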

    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Shaun Tancheff
    Tested-by: Shaun Tancheff
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

13 Apr, 2016

2 commits

  • We don't have any drivers left using it, so kill it off. Update
    documentation to use the newer blk_queue_write_cache().

    Signed-off-by: Jens Axboe
    Reviewed-by: Christoph Hellwig

    Jens Axboe
     
  • Add an internal helper and flag for setting whether a queue has
    write back caching, or write through (or none). Add a sysfs file
    to show this as well, and make it changeable from user space.

    This will replace the (awkward) blk_queue_flush() interface that
    drivers currently use to inform the block layer of write cache state
    and capabilities.
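
    Typical driver usage of the new helper (signature as in mainline;
    the values are example settings):

    /* device has a volatile write back cache and supports FUA */
    blk_queue_write_cache(q, true, true);

    /* write through, or no cache at all */
    blk_queue_write_cache(q, false, false);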

    Signed-off-by: Jens Axboe
    Reviewed-by: Christoph Hellwig

    Jens Axboe
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular
    case, especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too
    much breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is the revert of changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach.
    I'll fix them manually in a separate patch. Comments and
    documentation will also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

12 Feb, 2016

1 commit

  • The new queue limit is not used by the majority of block drivers, and
    should be initialized to 0 for the driver's requested settings to be used.

    Signed-off-by: Keith Busch
    Acked-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     

26 Nov, 2015

1 commit

  • Commit 4f258a46346c ("sd: Fix maximum I/O size for BLOCK_PC requests")
    had the unfortunate side-effect of removing an implicit clamp to
    BLK_DEF_MAX_SECTORS for REQ_TYPE_FS requests in the block layer
    code. This caused problems for some SMR drives.

    Debugging this issue revealed a few problems with the existing
    infrastructure since the block layer didn't know how to deal with
    device-imposed limits, only limits set by the I/O controller.

    - Introduce a new queue limit, max_dev_sectors, which is used by the
    ULD to signal the maximum sectors for a REQ_TYPE_FS request.

    - Ensure that max_dev_sectors is correctly stacked and taken into
    account when overriding max_sectors through sysfs.

    - Rework sd_read_block_limits() so it saves the max_xfer and opt_xfer
    values for later processing.

    - In sd_revalidate() set the queue's max_dev_sectors based on the
    MAXIMUM TRANSFER LENGTH value in the Block Limits VPD. If this value
    is not reported, fall back to a cap based on the CDB TRANSFER LENGTH
    field size.

    - In sd_revalidate(), use OPTIMAL TRANSFER LENGTH from the Block Limits
    VPD--if reported and sane--to signal the preferred device transfer
    size for FS requests. Otherwise use BLK_DEF_MAX_SECTORS.

    - blk_limits_max_hw_sectors() is no longer used and can be removed.
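
    Taken together, the max_sectors derivation inside
    blk_queue_max_hw_sectors() reduces to roughly this (sketched from
    memory; illustrative only):

    /* honor both the controller limit and the device limit */
    max_sectors = min_not_zero(max_hw_sectors, limits->max_dev_sectors);
    max_sectors = min_t(unsigned int, max_sectors, BLK_DEF_MAX_SECTORS);
    limits->max_sectors = max_sectors;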

    Signed-off-by: Martin K. Petersen
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=93581
    Reviewed-by: Christoph Hellwig
    Tested-by: sweeneygj@gmx.com
    Tested-by: Arzeets
    Tested-by: David Eisner
    Tested-by: Mario Kicherer
    Signed-off-by: Martin K. Petersen

    Martin K. Petersen
     

03 Sep, 2015

1 commit

  • Pull core block updates from Jens Axboe:
    "This first core part of the block IO changes contains:

    - Cleanup of the bio IO error signaling from Christoph. We used to
    rely on the uptodate bit and passing around of an error, now we
    store the error in the bio itself.

    - Improvement of the above from myself, by shrinking the bio size
    down again to fit in two cachelines on x86-64.

    - Revert of the max_hw_sectors cap removal from a revision again,
    from Jeff Moyer. This caused performance regressions in various
    tests. Reinstate the limit, bump it to a more reasonable size
    instead.

    - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
    Most devices have huge trim limits, which can cause nasty latencies
    when deleting files. Enable the admin to configure the size down.
    We will look into having a more sane default instead of UINT_MAX
    sectors.

    - Improvement of the SGP gaps logic from Keith Busch.

    - Enable the block core to handle arbitrarily sized bios, which
    enables a nice simplification of bio_add_page() (which is an IO hot
    path). From Kent.

    - Improvements to the partition io stats accounting, making it
    faster. From Ming Lei.

    - Also from Ming Lei, a basic fixup for overflow of the sysfs pending
    file in blk-mq, as well as a fix for a blk-mq timeout race
    condition.

    - Ming Lin has been carrying Kent's above-mentioned patches forward
    for a while, and testing them. Ming also did a few fixes around
    that.

    - Sasha Levin found and fixed a use-after-free problem introduced by
    the bio->bi_error changes from Christoph.

    - Small blk cgroup cleanup from Viresh Kumar"

    * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
    blk: Fix bio_io_vec index when checking bvec gaps
    block: Replace SG_GAPS with new queue limits mask
    block: bump BLK_DEF_MAX_SECTORS to 2560
    Revert "block: remove artifical max_hw_sectors cap"
    blk-mq: fix race between timeout and freeing request
    blk-mq: fix buffer overflow when reading sysfs file of 'pending'
    Documentation: update notes in biovecs about arbitrarily sized bios
    block: remove bio_get_nr_vecs()
    fs: use helper bio_add_page() instead of open coding on bi_io_vec
    block: kill merge_bvec_fn() completely
    md/raid5: get rid of bio_fits_rdev()
    md/raid5: split bio for chunk_aligned_read
    block: remove split code in blkdev_issue_{discard,write_same}
    btrfs: remove bio splitting and merge_bvec_fn() calls
    bcache: remove driver private bio splitting code
    block: simplify bio_add_page()
    block: make generic_make_request handle arbitrarily sized bios
    blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
    block: don't access bio->bi_error after bio_put()
    block: shrink struct bio down to 2 cache lines again
    ...

    Linus Torvalds
     

20 Aug, 2015

1 commit

  • The SG_GAPS queue flag caused checks for bio vector alignment against
    PAGE_SIZE, but the device may have different constraints. This patch
    adds a queue limit so a driver with such constraints can set it to allow
    requests that would otherwise have been unnecessarily split. The new gaps check
    takes the request_queue as a parameter to simplify the logic around
    invoking this function.

    This new limit makes the queue flag redundant, so remove it and all of
    its usage. Device-mappers will inherit the correct settings through
    blk_stack_limits().
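
    Mainline names the new limit virt_boundary_mask; the gap check then
    reduces to a sketch like:

    static inline bool bvec_gap_to_prev(struct request_queue *q,
                                        struct bio_vec *prev,
                                        unsigned int offset)
    {
            if (!queue_virt_boundary(q))
                    return false;   /* driver set no constraint */

            return offset ||
                   ((prev->bv_offset + prev->bv_len) & queue_virt_boundary(q));
    }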

    Signed-off-by: Keith Busch
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     

19 Aug, 2015

1 commit

  • This reverts commit 34b48db66e08ca1c1bc07cf305d672ac940268dc.
    That commit caused performance regressions for streaming I/O
    workloads on a number of different storage devices, from
    SATA disks to external RAID arrays. It also managed to
    trip up some buggy firmware in at least one drive, causing
    data corruption.

    The next patch will bump the default max_sectors_kb value to
    1280, which will accommodate a 10-data-disk stripe write
    with chunk size 128k. In the testing I've done using iozone,
    fio, and aio-stress, a value of 1280 does not show a big
    performance difference from 512. This will hopefully still
    help the software RAID setup that Christoph saw the original
    performance gains with while still not regressing other
    storage configurations.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

14 Aug, 2015

1 commit

  • As generic_make_request() is now able to handle arbitrarily sized bios,
    it's no longer necessary for each individual block driver to define its
    own ->merge_bvec_fn() callback. Remove every invocation completely.

    Cc: Jens Axboe
    Cc: Lars Ellenberg
    Cc: drbd-user@lists.linbit.com
    Cc: Jiri Kosina
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Neil Brown
    Cc: linux-raid@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: "Martin K. Petersen"
    Acked-by: NeilBrown (for the 'md' bits)
    Acked-by: Mike Snitzer
    Signed-off-by: Kent Overstreet
    [dpark: also remove ->merge_bvec_fn() in dm-thin as well as
    dm-era-target, and resolve merge conflicts]
    Signed-off-by: Dongsu Park
    Signed-off-by: Ming Lin
    Signed-off-by: Jens Axboe

    Kent Overstreet
     

13 Aug, 2015

1 commit

  • Commit bcdb247c6b6a ("sd: Limit transfer length") clamped the maximum
    size of an I/O request to the MAXIMUM TRANSFER LENGTH field in the BLOCK
    LIMITS VPD. This had the unfortunate effect of also limiting the maximum
    size of non-filesystem requests sent to the device through sg/bsg.

    Avoid using blk_queue_max_hw_sectors() and set the max_sectors queue
    limit directly.

    Also update the comment in blk_limits_max_hw_sectors() to clarify that
    max_hw_sectors defines the limit for the I/O controller only.

    Signed-off-by: Martin K. Petersen
    Reported-by: Brian King
    Tested-by: Brian King
    Cc: stable@vger.kernel.org # 3.17+
    Signed-off-by: James Bottomley

    Martin K. Petersen
     

17 Jul, 2015

1 commit

  • Lots of devices support huge discard sizes these days. Depending
    on how the device handles them internally, huge discards can
    introduce massive latencies (hundreds of msec) on the device side.

    We have a sysfs file, discard_max_bytes, that advertises the max
    hardware supported discard size. Make this writeable, and split
    the settings into a soft and hard limit. This can be set from
    'discard_granularity' and up to the hardware limit.

    Add a new sysfs file, 'discard_max_hw_bytes', that shows the hw
    set limit.
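
    The store side amounts to a clamp; as a sketch (variable names are
    illustrative, not the exact sysfs handler):

    /* the value must be aligned to the discard granularity ... */
    if (max_discard_bytes & (q->limits.discard_granularity - 1))
            return -EINVAL;

    /* ... and the soft limit cannot exceed the hardware limit */
    max_discard_sectors = max_discard_bytes >> 9;
    if (max_discard_sectors > q->limits.max_hw_discard_sectors)
            max_discard_sectors = q->limits.max_hw_discard_sectors;
    q->limits.max_discard_sectors = max_discard_sectors;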

    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jens Axboe
     

31 Mar, 2015

1 commit

  • Linux 3.19 commit 69c953c ("lib/lcm.c: lcm(n,0)=lcm(0,n) is 0, not n")
    caused blk_stack_limits() to not properly stack queue_limits for stacked
    devices (e.g. DM).

    Fix this regression by establishing lcm_not_zero() and switching
    blk_stack_limits() over to using it.

    DM uses blk_set_stacking_limits() to establish the initial top-level
    queue_limits that are then built up based on underlying devices' limits
    using blk_stack_limits(). In the case of optimal_io_size (io_opt)
    blk_set_stacking_limits() establishes a default value of 0. With commit
    69c953c, lcm(0, n) is no longer n, which compromises proper stacking of
    the underlying devices' io_opt.
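
    lcm_not_zero() treats a zero argument as "no limit set" instead of
    letting it poison the result (recalled from lib/lcm.c; treat as a
    sketch):

    unsigned long lcm_not_zero(unsigned long a, unsigned long b)
    {
            unsigned long l = lcm(a, b);

            if (l)
                    return l;

            return (b ? : a);       /* whichever is non-zero, if any */
    }

    /* blk_stack_limits() then does:
     *         t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
     */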

    Test:
    $ modprobe scsi_debug dev_size_mb=10 num_tgts=1 opt_blks=1536
    $ cat /sys/block/sde/queue/optimal_io_size
    786432
    $ dmsetup create node --table "0 100 linear /dev/sde 0"

    Before this fix:
    $ cat /sys/block/dm-5/queue/optimal_io_size
    0

    After this fix:
    $ cat /sys/block/dm-5/queue/optimal_io_size
    786432

    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org # 3.19+
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

22 Oct, 2014

1 commit

  • Set max_sectors to the value the driver provides as the hardware
    limit by default. Linux has had proper I/O throttling for a long
    time and doesn't rely on an artificially small maximum I/O size
    anymore. By not limiting the I/O size by default we remove an
    annoying tuning step required for most Linux installations.

    Note that both the user, and if absolutely required the driver can still
    impose a limit for FS requests below max_hw_sectors_kb.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

09 Oct, 2014

1 commit

  • The math in both blk_stack_limits() and queue_limit_alignment_offset()
    assumes that a block device's io_min (aka minimum_io_size) is always a
    power-of-2. Fix the math so that it works for non-power-of-2 io_min.

    This issue (of alignment_offset != 0) became apparent when testing
    dm-thinp with a thinp blocksize that matches a RAID6 stripesize of
    1280K. Commit fdfb4c8c1 ("dm thin: set minimum_io_size to pool's data
    block size") unlocked the potential for alignment_offset != 0 due to
    the dm-thin-pool's io_min possibly being a non-power-of-2.
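
    The fixed math uses a true remainder rather than a power-of-two
    mask; roughly (recalled from the patch, treat as a sketch):

    static int queue_limit_alignment_offset(struct queue_limits *lim,
                                            sector_t sector)
    {
            unsigned int granularity = max(lim->physical_block_size, lim->io_min);
            unsigned int alignment;

            /* sector_div() handles any granularity, not just powers of two */
            alignment = sector_div(sector, granularity >> 9) << 9;

            return (granularity + lim->alignment_offset - alignment) % granularity;
    }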

    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

11 Jun, 2014

1 commit

  • Commit 762380ad9322 added support for chunk sizes and no merging
    across them, but it broke the rule of always allowing the addition
    of a single page to an empty bio. So relax the restriction a bit to
    allow for that, similarly to what we have always done.

    This fixes a crash with mkfs.xfs and 512b sector sizes on NVMe.

    Reported-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     

06 Jun, 2014

1 commit

  • Some drivers have different limits on what size a request should
    optimally be, depending on the offset of the request. Similar to
    dividing a device into chunks. Add a setting that allows the driver
    to inform the block layer of such a chunk size. The block layer will
    then prevent merging across the chunks.

    This is needed to optimally support NVMe with a non-zero stripe size.
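
    The cap reduces to a sketch like the following, assuming
    chunk_sectors is a power of two:

    static inline unsigned int blk_max_size_offset(struct request_queue *q,
                                                   sector_t offset)
    {
            if (!q->limits.chunk_sectors)
                    return q->limits.max_sectors;   /* no chunking set */

            /* sectors remaining until the next chunk boundary */
            return q->limits.chunk_sectors -
                   (offset & (q->limits.chunk_sectors - 1));
    }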

    Signed-off-by: Jens Axboe

    Jens Axboe
     

09 Jan, 2014

1 commit

  • Now that we've got code for raid5/6 stripe awareness, bcache just needs
    to know about the stripes and when writing partial stripes is expensive
    - we probably don't want to enable this optimization for raid1 or 10,
    even though they have stripes. So add a flag to queue_limits.

    Signed-off-by: Kent Overstreet

    Kent Overstreet
     

14 Nov, 2013

1 commit

  • Pull block IO core updates from Jens Axboe:
    "This is the pull request for the core changes in the block layer for
    3.13. It contains:

    - The new blk-mq request interface.

    This is a new and more scalable queueing model that marries the
    best part of the request based interface we currently have (which
    is fully featured, but scales poorly) and the bio based "interface"
    which the new drivers for high IOPS devices end up using because
    it's much faster than the request based one.

    The bio interface has no block layer support, since it taps into
    the stack much earlier. This means that drivers end up having to
    implement a lot of functionality on their own, like tagging,
    timeout handling, requeue, etc. The blk-mq interface provides all
    these. Some drivers even provide a switch to select bio or rq and
    have code to handle both, since things like merging only work in
    the rq model and hence is faster for some workloads. This is a
    huge mess. Conversion of these drivers nets us a substantial code
    reduction. Initial results on converting SCSI to this model even
    shows an 8x improvement on single queue devices. So while the
    model was intended to work on the newer multiqueue devices, it has
    substantial improvements for "classic" hardware as well. This code
    has gone through extensive testing and development, it's now ready
    to go. A pull request to convert virtio-blk to this model will be
    coming as well, with more drivers scheduled for 3.14 conversion.

    - Two blktrace fixes from Jan and Chen Gang.

    - A plug merge fix from Alireza Haghdoost.

    - Conversion of __get_cpu_var() from Christoph Lameter.

    - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.

    - A fix for a race between request completion and the timeout
    handling from Jeff Moyer. This is what caused the merge conflict
    with blk-mq/core, in case you are looking at that.

    - A dm stacking fix from Mike Snitzer.

    - A code consolidation fix and duplicated code removal from Kent
    Overstreet.

    - A handful of block bug fixes from Mikulas Patocka, fixing a loop
    crash and memory corruption on blk cg.

    - Elevator switch bug fix from Tomoki Sekiyama.

    A heads-up that I had to rebase this branch. Initially the immutable
    bio_vecs had been queued up for inclusion, but a week later, it became
    clear that it wasn't fully cooked yet. So the decision was made to
    pull this out and postpone it until 3.14. It was a straight forward
    rebase, just pruning out the immutable series and the later fixes of
    problems with it. The rest of the patches applied directly and no
    further changes were made"

    * 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
    block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
    block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
    block: Do not call sector_div() with a 64-bit divisor
    kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
    block: Consolidate duplicated bio_trim() implementations
    block: Use rw_copy_check_uvector()
    block: Enable sysfs nomerge control for I/O requests in the plug list
    block: properly stack underlying max_segment_size to DM device
    elevator: acquire q->sysfs_lock in elevator_change()
    elevator: Fix a race in elevator switching and md device initialization
    block: Replace __get_cpu_var uses
    bdi: test bdi_init failure
    block: fix a probe argument to blk_register_region
    loop: fix crash if blk_alloc_queue fails
    blk-core: Fix memory corruption if blkcg_init_queue fails
    block: fix race between request completion and timeout handling
    blktrace: Send BLK_TN_PROCESS events to all running traces
    blk-mq: don't disallow request merges for req->special being set
    blk-mq: mq plug list breakage
    blk-mq: fix for flush deadlock
    ...

    Linus Torvalds
     

09 Nov, 2013

1 commit

  • Without this patch all DM devices will default to BLK_MAX_SEGMENT_SIZE
    (65536) even if the underlying device(s) have a larger value -- this is
    due to blk_stack_limits() using min_not_zero() when stacking the
    max_segment_size limit.

    An underlying device advertises a max_segment_size of:
    1073741824

    before patch, the DM device reports:
    65536

    after patch:
    1073741824
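
    The fix itself is a one-liner in blk_set_stacking_limits(), sketched
    from memory:

    lim->max_segment_size = UINT_MAX;  /* let blk_stack_limits() inherit */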

    Reported-by: Lukasz Flis
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org # v3.3+
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

20 Sep, 2012

1 commit

  • The WRITE SAME command supported on some SCSI devices allows the same
    block to be efficiently replicated throughout a block range. Only a
    single logical block is transferred from the host and the storage device
    writes the same data to all blocks described by the I/O.

    This patch implements support for WRITE SAME in the block layer. The
    blkdev_issue_write_same() function can be used by filesystems and block
    drivers to replicate a buffer across a block range. This can be used to
    efficiently initialize software RAID devices, etc.
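
    The entry point, with a usage sketch (signature as in mainline; the
    call site is illustrative):

    int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
                                sector_t nr_sects, gfp_t gfp_mask,
                                struct page *page);

    /* e.g. replicate one page of initialization data across a region */
    int err = blkdev_issue_write_same(bdev, 0, nr_sects, GFP_KERNEL, page);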

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

01 Aug, 2012

1 commit

  • blk_set_stacking_limits is intended to allow stacking drivers to build
    up the limits of the stacked device based on the underlying devices'
    limits. But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024)
    doesn't allow the stacking driver to inherit a max_sectors larger than
    1024 -- due to blk_stack_limits' use of min_not_zero.

    It is now clear that this artificial limit is getting in the way so
    change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows
    stacking drivers like dm-multipath to inherit 'max_sectors' from the
    underlying paths).

    Reported-by: Vijay Chauhan
    Tested-by: Vijay Chauhan
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

11 Jan, 2012

1 commit

  • Stacking driver queue limits are typically bounded exclusively by the
    capabilities of the low level devices, not by the stacking driver
    itself.

    This patch introduces blk_set_stacking_limits() which has more liberal
    metrics than the default queue limits function. This allows us to
    inherit topology parameters from bottom devices without manually
    tweaking the default limits in each driver prior to calling the stacking
    function.

    Since there is now a clear distinction between stacking and low-level
    devices, blk_set_default_limits() has been modified to carry the more
    conservative values that we used to manually set in
    blk_queue_make_request().
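
    In sketch form (values as introduced here; later commits relax some
    of them, as the entries above show):

    void blk_set_stacking_limits(struct queue_limits *lim)
    {
            blk_set_default_limits(lim);

            /* inherit as much as possible from the component devices */
            lim->max_segments = USHRT_MAX;
            lim->max_hw_sectors = UINT_MAX;
            lim->max_sectors = BLK_DEF_MAX_SECTORS;
    }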

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

18 May, 2011

1 commit

  • In some cases we would end up stacking discard_zeroes_data incorrectly.
    Fix this by enabling the feature by default for stacking drivers and
    clearing it for low-level drivers. Incorporating a device that does not
    support dzd will then cause the feature to be disabled in the stacking
    driver.

    Also ensure that the maximum discard value does not overflow when
    exported in sysfs and return 0 in the alignment and dzd fields for
    devices that don't support discard.
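
    The stacking rule becomes a logical AND; in sketch form, inside
    blk_stack_limits():

    /* any component that can't zero on discard disables the feature */
    t->discard_zeroes_data &= b->discard_zeroes_data;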

    Reported-by: Lukas Czerner
    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

07 May, 2011

1 commit

  • The flush request isn't queueable on some drives. Add a flag to let
    the driver notify the block layer about this. We can optimize flush
    performance with that knowledge.
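
    The flag and its setter are tiny; sketched from memory:

    void blk_queue_flush_queueable(struct request_queue *q, bool queueable)
    {
            q->flush_not_queueable = !queueable;
    }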

    Stable: 2.6.39 only

    Cc: stable@kernel.org
    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    shaohua.li@intel.com