27 Oct, 2021

1 commit

  • Commit a33df75c6328 ("block: use an xarray for disk->part_tbl") changed
    the partition existence check for host-aware zoned block devices from a
    disk_has_partitions() helper call to an emptiness check of the xarray
    disk->part_tbl. However, disk->part_tbl always holds a single entry for
    disk->part0 and never becomes empty. As a result, host-aware zoned
    devices were always judged to have partitions, which made the sysfs
    queue/zoned attribute report "none" instead of "host-aware" regardless
    of whether the devices actually had partitions.

    This also triggered DEBUG_LOCKS_WARN_ON(lock->magic != lock) for
    sdkp->rev_mutex in the SCSI layer when the kernel detected a host-aware
    zoned device. Since the block layer handled host-aware zoned devices as
    non-zoned devices, the SCSI layer never got a chance to initialize the
    mutex for zone revalidation, so the warning was triggered.

    To fix these issues, call the helper function disk_has_partitions() in
    place of the disk->part_tbl emptiness check. Since the function was
    removed by commit a33df75c6328, reimplement it to walk through the
    entries in the xarray disk->part_tbl.
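
    A minimal sketch of such a reimplementation, along the lines the
    message describes (the exact upstream fix may differ in detail): walk
    the xarray and count only real partitions, so the ever-present
    disk->part0 entry no longer matters.

        static bool disk_has_partitions(struct gendisk *disk)
        {
                struct block_device *part;
                unsigned long idx;
                bool ret = false;

                rcu_read_lock();
                /* disk->part0 is always present; look for real partitions. */
                xa_for_each(&disk->part_tbl, idx, part) {
                        if (bdev_is_partition(part)) {
                                ret = true;
                                break;
                        }
                }
                rcu_read_unlock();

                return ret;
        }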

    Fixes: a33df75c6328 ("block: use an xarray for disk->part_tbl")
    Signed-off-by: Shin'ichiro Kawasaki
    Cc: stable@vger.kernel.org # v5.14+
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20211026060115.753746-1-shinichiro.kawasaki@wdc.com
    Signed-off-by: Jens Axboe

    Shin'ichiro Kawasaki

24 Aug, 2021

1 commit

  • Replace the magic lookup through the kobject tree with an explicit
    backpointer, given that the device model links are set up and torn
    down at times when I/O is still possible, leading to potential
    NULL or invalid pointer dereferences.
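
    A sketch of the idea, assuming the backpointer lives on the
    request_queue and is set at allocation time (field placement and
    helper name are illustrative, per the commit description):

        struct request_queue {
                /* ... */
                struct gendisk  *disk;  /* explicit backpointer to the disk */
                /* ... */
        };

        /* Replaces a lookup that climbed the kobject tree to find the
         * gendisk, which could race with device model link teardown. */
        static inline struct gendisk *queue_to_disk(struct request_queue *q)
        {
                return q->disk;
        }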

    Fixes: edb0872f44ec ("block: move the bdi from the request_queue to the gendisk")
    Reported-by: syzbot
    Signed-off-by: Christoph Hellwig
    Tested-by: Sven Schnelle
    Link: https://lore.kernel.org/r/20210816134624.GA24234@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig

10 Aug, 2021

2 commits

  • The backing device information only makes sense for file system I/O,
    and thus belongs in the gendisk rather than the lower-level
    request_queue structure. Move it there.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Link: https://lore.kernel.org/r/20210809141744.1203023-5-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
  • .. and rename the function to disk_update_readahead. This is in
    preparation for moving the BDI from the request_queue to the gendisk.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Link: https://lore.kernel.org/r/20210809141744.1203023-3-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig

09 May, 2021

1 commit

  • This reverts commit cd2c7545ae1beac3b6aae033c7f31193b3255946.

    Alex reports that the commit causes corruption with LUKS on ext4. Revert
    it for now so that this can be investigated properly.

    Link: https://lore.kernel.org/linux-block/1620493841.bxdq8r5haw.none@localhost/
    Reported-by: Alex Xu (Hello71)
    Signed-off-by: Jens Axboe

    Jens Axboe

08 May, 2021

1 commit

  • Pull block fixes from Jens Axboe:

    - dasd spelling fixes (Bhaskar)

    - Limit bio max size on multi-page bvecs to the hardware limit, to
    avoid overly large bios (and hence latencies). Originally queued for
    the merge window, but needed a fix and was dropped from the initial
    pull (Changheun)

    - NVMe pull request (Christoph):
        - reset the bdev to ns head when failover (Daniel Wagner)
        - remove unsupported command noise (Keith Busch)
        - misc passthrough improvements (Kanchan Joshi)
        - fix controller ioctl through ns_head (Minwoo Im)
        - fix controller timeouts during reset (Tao Chiu)

    - rnbd fixes/cleanups (Gioh, Md, Dima)

    - Fix iov_iter re-expansion (yangerkun)

    * tag 'block-5.13-2021-05-07' of git://git.kernel.dk/linux-block:
    block: reexpand iov_iter after read/write
    nvmet: remove unsupported command noise
    nvme-multipath: reset bdev to ns head when failover
    nvme-pci: fix controller reset hang when racing with nvme_timeout
    nvme: move the fabrics queue ready check routines to core
    nvme: avoid memset for passthrough requests
    nvme: add nvme_get_ns helper
    nvme: fix controller ioctl through ns_head
    bio: limit bio max size
    RDMA/rtrs: fix uninitialized symbol 'cnt'
    s390: dasd: Mundane spelling fixes
    block/rnbd: Remove all likely and unlikely
    block/rnbd-clt: Check the return value of the function rtrs_clt_query
    block/rnbd: Fix style issues
    block/rnbd-clt: Change queue_depth type in rnbd_clt_session to size_t

    Linus Torvalds

07 May, 2021

1 commit

  • My UEK-derived config has 1030 files depending on pagemap.h before this
    change. Afterwards, just 326 files need to be rebuilt when I touch
    pagemap.h. I think blkdev.h is probably included too widely, but
    untangling that dependency is harder and this solves my problem. x86
    allmodconfig builds, but there may be implicit include problems on other
    architectures.

    Link: https://lkml.kernel.org/r/20210309195747.283796-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Acked-by: Dan Williams [nvdimm]
    Acked-by: Jens Axboe [block]
    Reviewed-by: Christoph Hellwig
    Acked-by: Coly Li [bcache]
    Acked-by: Martin K. Petersen [scsi]
    Reviewed-by: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)

04 May, 2021

1 commit

  • bio size can grow up to 4GB when multi-page bvec is enabled, but this
    sometimes leads to inefficient behavior. In the case of large chunk
    direct I/O - a 32MB chunk read in user space - all pages for the 32MB
    are merged into one bio structure if the pages' physical addresses are
    contiguous. This delays submission until the merge completes, so the
    bio max size should be limited to a proper size.

    When a 32MB chunk read with the direct I/O option comes from userspace,
    the current kernel behavior in the do_direct_IO() loop is shown in the
    timeline below.

    | bio merge for 32MB. total 8,192 pages are merged.
    | total elapsed time is over 2ms.
    |------------------ ... ----------------------->|
    | 8,192 pages are merged into one bio.
    | at this time, the first bio submit is done.
    | 1 bio is split into 32 read requests and issued.
    |--------------->
    |--------------->
    |--------------->
    ......
    |--------------->
    |--------------->|
    total 19ms elapsed to complete the 32MB read from the device. |

    If the bio max size is limited to 1MB, the behavior changes as below.

    | bio merge for 1MB. 256 pages are merged for each bio.
    | in total, 32 bios will be made.
    | total elapsed time is over 2ms, the same as before.
    | but the first bio submit happens fast, after about 100us.
    |--->|--->|--->|---> ... -->|--->|--->|--->|--->|
    | 256 pages are merged into each bio.
    | at this time, the first bio submit is done.
    | and 1 read request is issued per bio.
    |--------------->
    |--------------->
    |--------------->
    ......
    |--------------->
    |--------------->|
    total 17ms elapsed to complete the 32MB read from the device. |

    As a result, read requests are issued sooner when the bio max size is
    limited. With the current multi-page bvec behavior a super large bio
    can be created, and that delays issuing the first I/O request.
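
    A sketch of the idea (the helper name and derivation are illustrative,
    not necessarily the exact upstream interface): cap how large a bio may
    grow at the queue's max_sectors limit instead of the absolute 4GB
    ceiling.

        /* Hypothetical helper: the size at which page merging should stop
         * and the bio should be submitted, derived from the queue limits. */
        static inline unsigned int bio_max_size(struct bio *bio)
        {
                struct request_queue *q = bio->bi_bdev->bd_disk->queue;

                return queue_max_sectors(q) << SECTOR_SHIFT;
        }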

    Signed-off-by: Changheun Lee
    Reviewed-by: Bart Van Assche
    Link: https://lore.kernel.org/r/20210503095203.29076-1-nanich.lee@samsung.com
    Signed-off-by: Jens Axboe

    Changheun Lee

06 Apr, 2021

2 commits

  • Get rid of all the PFN arithmetic and just use an enum for the two
    remaining options, and use PageHighMem for the actual bounce decision.

    Add a fast path to entirely avoid the call for the common case of a queue
    not using the legacy bouncing code.
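
    The two remaining options map naturally onto an enum, per the
    description above (a sketch):

        enum blk_bounce {
                BLK_BOUNCE_NONE,        /* never bounce (the common case) */
                BLK_BOUNCE_HIGH,        /* bounce pages in high memory */
        };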

    Signed-off-by: Christoph Hellwig
    Acked-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Link: https://lore.kernel.org/r/20210331073001.46776-8-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
  • Remove the BLK_BOUNCE_ISA support now that all users are gone.

    Signed-off-by: Christoph Hellwig
    Acked-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Link: https://lore.kernel.org/r/20210331073001.46776-7-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig

24 Feb, 2021

1 commit

  • We get I/O errors when we run md-raid1 on top of dm-integrity on top
    of a ramdisk.
    device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff
    device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff
    device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1
    device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1
    device-mapper: integrity: Bio not aligned on 8 sectors: 0x8048, 0xff
    device-mapper: integrity: Bio not aligned on 8 sectors: 0x8147, 0xff
    device-mapper: integrity: Bio not aligned on 8 sectors: 0x8246, 0xff
    device-mapper: integrity: Bio not aligned on 8 sectors: 0x8345, 0xbb

    The ramdisk device has logical_block_size 512 and max_sectors 255. The
    dm-integrity device uses logical_block_size 4096 and it doesn't affect the
    "max_sectors" value - thus, it inherits 255 from the ramdisk. So, we have
    a device with max_sectors not aligned on logical_block_size.

    The md-raid device sees that the underlying leg has max_sectors 255 and
    it will split the bios on a 255-sector boundary, making the bios
    unaligned on logical_block_size.

    In order to fix the bug, we round down max_sectors to logical_block_size.
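
    A sketch of the rounding implied by the description, assuming it is
    applied wherever max_sectors limits are set and stacked:

        /* Round a sector limit down to a logical-block-size multiple,
         * keeping at least one page worth of sectors. */
        static unsigned int blk_round_down_sectors(unsigned int sectors,
                                                   unsigned int lbs)
        {
                sectors = round_down(sectors, lbs >> SECTOR_SHIFT);
                if (sectors < PAGE_SIZE >> SECTOR_SHIFT)
                        sectors = PAGE_SIZE >> SECTOR_SHIFT;
                return sectors;
        }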

    Cc: stable@vger.kernel.org
    Reviewed-by: Ming Lei
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Jens Axboe

    Mikulas Patocka

10 Feb, 2021

2 commits

  • Introduce the internal function blk_queue_clear_zone_settings() to
    clean up all limits and resources related to zoned block devices. This
    new function is called from blk_queue_set_zoned() when a disk zoned
    model is set to BLK_ZONED_NONE. This particular case can happen when a
    partition is created on a host-aware SCSI disk.

    Signed-off-by: Damien Le Moal
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Damien Le Moal
  • Per ZBC and ZAC specifications, host-managed SMR hard-disks mandate that
    all writes into sequential write required zones be aligned to the device
    physical block size. However, NVMe ZNS does not have this constraint and
    allows write operations into sequential zones to be aligned to the
    device logical block size. This inconsistency does not help with
    software portability across device types.

    To solve this, introduce the zone_write_granularity queue limit to
    indicate the alignment constraint, in bytes, of write operations into
    zones of a zoned block device. This new limit is exported as a
    read-only sysfs queue attribute, and the helper
    blk_queue_zone_write_granularity() is introduced for drivers to set it.

    The function blk_queue_set_zoned() is modified to set this new limit to
    the device logical block size by default. NVMe ZNS devices as well as
    zoned nullb devices use this default value as is. The scsi disk driver
    is modified to execute the blk_queue_zone_write_granularity() helper to
    set the zone write granularity of host-managed SMR disks to the disk
    physical block size.

    The accessor functions queue_zone_write_granularity() and
    bdev_zone_write_granularity() are also introduced.
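
    A sketch of the setter, assuming it simply stores the value in the
    queue limits after a sanity check (illustrative):

        void blk_queue_zone_write_granularity(struct request_queue *q,
                                              unsigned int size)
        {
                if (WARN_ON_ONCE(!blk_queue_is_zoned(q)))
                        return;

                q->limits.zone_write_granularity = size;
        }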

    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Damien Le Moal

17 Dec, 2020

1 commit

  • Pull block driver updates from Jens Axboe:
    "Nothing major in here:

    - NVMe pull request from Christoph:
        - nvmet passthrough improvements (Chaitanya Kulkarni)
        - fcloop error injection support (James Smart)
        - read-only support for zoned namespaces without Zone Append
          (Javier González)
        - improve some error messages (Minwoo Im)
        - reject I/O to offline fabrics namespaces (Victor Gladkov)
        - PCI queue allocation cleanups (Niklas Schnelle)
        - remove an unused allocation in nvmet (Amit Engel)
        - a Kconfig spelling fix (Colin Ian King)
        - nvme_req_qid simplification (Baolin Wang)

    - MD pull request from Song:
        - Fix race condition in md_ioctl() (Dae R. Jeong)
        - Initialize read_slot properly for raid10 (Kevin Vigor)
        - Code cleanup (Pankaj Gupta)
        - md-cluster resync/reshape fix (Zhao Heming)

    - Move null_blk into its own directory (Damien Le Moal)

    - null_blk zone and discard improvements (Damien Le Moal)

    - bcache race fix (Dongsheng Yang)

    - Set of rnbd fixes/improvements (Gioh Kim, Guoqing Jiang, Jack Wang,
    Lutz Pogrell, Md Haris Iqbal)

    - lightnvm NULL pointer deref fix (tangzhenhao)

    - sr in_interrupt() removal (Sebastian Andrzej Siewior)

    - FC endpoint security support for s390/dasd (Jan Höppner, Sebastian
    Ott, Vineeth Vijayan). From the s390 arch guys, arch bits included
    as it made it easier for them to funnel the feature through the
    block driver tree.

    - Follow up fixes (Colin Ian King)"

    * tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block: (64 commits)
    block: drop dead assignments in loop_init()
    sr: Remove in_interrupt() usage in sr_init_command().
    sr: Switch the sector size back to 2048 if sr_read_sector() changed it.
    cdrom: Reset sector_size back it is not 2048.
    drivers/lightnvm: fix a null-ptr-deref bug in pblk-core.c
    null_blk: Move driver into its own directory
    null_blk: Allow controlling max_hw_sectors limit
    null_blk: discard zones on reset
    null_blk: cleanup discard handling
    null_blk: Improve implicit zone close
    null_blk: improve zone locking
    block: Align max_hw_sectors to logical blocksize
    null_blk: Fail zone append to conventional zones
    null_blk: Fix zone size initialization
    bcache: fix race between setting bdev state to none and new write request direct to backing
    block/rnbd: fix a null pointer dereference on dev->blk_symlink_name
    block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name
    block/rnbd: call kobject_put in the failure path
    Documentation/ABI/rnbd-srv: add document for force_close
    block/rnbd-srv: close a mapped device from server side.
    ...

    Linus Torvalds

08 Dec, 2020

1 commit

  • Block device drivers do not have to call blk_queue_max_hw_sectors() to
    set a limit on request size if the default limit BLK_SAFE_MAX_SECTORS
    is acceptable. However, this limit (255 sectors) may not be aligned to
    the device logical block size, in which case it cannot be used as is as
    the maximum request size. This is the case for the null_blk device
    driver.

    Modify blk_queue_max_hw_sectors() to make sure that the request size
    limits specified by the max_hw_sectors and max_sectors queue limits
    are always aligned to the device logical block size. Additionally, to
    avoid introducing a dependence on the execution order of this function
    with blk_queue_logical_block_size(), also modify
    blk_queue_logical_block_size() to perform the same alignment when the
    logical block size is set after max_hw_sectors.
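
    Sketched, the tail of blk_queue_max_hw_sectors() then aligns both
    limits to the logical block size (illustrative of the described
    behavior):

        max_hw_sectors = round_down(max_hw_sectors,
                                    limits->logical_block_size >> SECTOR_SHIFT);
        limits->max_hw_sectors = max_hw_sectors;

        max_sectors = min_not_zero(limits->max_dev_sectors, max_hw_sectors);
        max_sectors = min_t(unsigned int, max_sectors, BLK_DEF_MAX_SECTORS);
        max_sectors = round_down(max_sectors,
                                 limits->logical_block_size >> SECTOR_SHIFT);
        limits->max_sectors = max_sectors;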

    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Damien Le Moal

02 Dec, 2020

1 commit

  • commit 22ada802ede8 ("block: use lcm_not_zero() when stacking
    chunk_sectors") broke chunk_sectors limit stacking. chunk_sectors must
    reflect the most limited of all devices in the IO stack.

    Otherwise malformed IO may result. E.g.: prior to this fix,
    ->chunk_sectors = lcm_not_zero(8, 128) would result in
    blk_max_size_offset() splitting IO at 128 sectors rather than the
    required more restrictive 8 sectors.

    And since commit 07d098e6bbad ("block: allow 'chunk_sectors' to be
    non-power-of-2") care must be taken to properly stack chunk_sectors to
    be compatible with the possibility that a non-power-of-2 chunk_sectors
    may be stacked. This is why gcd() is used instead of reverting back
    to using min_not_zero().
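
    The stacking rule after this fix, in essence (a sketch of the relevant
    line in blk_stack_limits()):

        /* gcd() keeps the result compatible with both devices even for
         * non-power-of-2 values, and handles a zero on either side:
         * gcd(8, 128) = 8, whereas lcm_not_zero(8, 128) = 128. */
        t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);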

    Fixes: 22ada802ede8 ("block: use lcm_not_zero() when stacking chunk_sectors")
    Fixes: 07d098e6bbad ("block: allow 'chunk_sectors' to be non-power-of-2")
    Reported-by: John Dorminy
    Reported-by: Bruce Johnston
    Signed-off-by: Mike Snitzer
    Reviewed-by: John Dorminy
    Cc: stable@vger.kernel.org
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Mike Snitzer

14 Oct, 2020

1 commit

  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds

25 Sep, 2020

1 commit

  • Drivers shouldn't really mess with the readahead size, as that is a VM
    concept. Instead set it based on the optimal I/O size by lifting the
    algorithm from the md driver when registering the disk. Also set
    bdi->io_pages there as well by applying the same scheme based on
    max_sectors. To ensure the limits work well for stacking drivers a
    new helper is added to update the readahead limits from the block
    limits, which is also called from disk_stack_limits.
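
    The scheme, sketched (assuming this is the helper that the 10 Aug, 2021
    entry above later renames to disk_update_readahead; constants per the
    description):

        void blk_queue_update_readahead(struct request_queue *q)
        {
                /* Readahead follows twice the optimal I/O size, with the
                 * VM default as a floor; io_pages follows max_sectors. */
                q->backing_dev_info->ra_pages =
                        max(queue_io_opt(q) * 2 / PAGE_SIZE, VM_READAHEAD_PAGES);
                q->backing_dev_info->io_pages =
                        queue_max_sectors(q) >> (PAGE_SHIFT - 9);
        }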

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Jan Kara
    Reviewed-by: Mike Snitzer
    Reviewed-by: Martin K. Petersen
    Acked-by: Coly Li
    Signed-off-by: Jens Axboe

    Christoph Hellwig

24 Sep, 2020

2 commits

  • It is possible, albeit unlikely, for a block device to have a
    non-power-of-2 chunk_sectors (e.g. a 10+2 RAID6 with 128K chunk_sectors
    results in a full-stripe size of 1280K). This causes the RAID6's io_opt
    to be advertised as 1280K, and a stacked device _could_ then be made to
    use a blocksize, aka chunk_sectors, that matches the non-power-of-2
    io_opt of the underlying RAID6 -- resulting in the stacked device's
    chunk_sectors being a non-power-of-2.

    Update blk_queue_chunk_sectors() and blk_max_size_offset() to
    accommodate drivers that need a non-power-of-2 chunk_sectors.
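
    A sketch of the blk_max_size_offset() change (illustrative): keep the
    cheap mask for the power-of-2 case, and fall back to a division
    otherwise.

        static inline unsigned int blk_max_size_offset(struct request_queue *q,
                                                       sector_t offset)
        {
                unsigned int chunk_sectors = q->limits.chunk_sectors;

                if (!chunk_sectors)
                        return q->limits.max_sectors;

                if (likely(is_power_of_2(chunk_sectors)))
                        chunk_sectors -= offset & (chunk_sectors - 1);
                else
                        chunk_sectors -= sector_div(offset, chunk_sectors);

                return min(q->limits.max_sectors, chunk_sectors);
        }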

    Reviewed-by: Ming Lei
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
  • Like 'io_opt', blk_stack_limits() should stack 'chunk_sectors' using
    lcm_not_zero() rather than min_not_zero() -- otherwise the final
    'chunk_sectors' could result in sub-optimal alignment of IO to
    component devices in the IO stack.

    Also, if 'chunk_sectors' isn't a multiple of 'physical_block_size'
    then it is a bug in the driver and the device should be flagged as
    'misaligned'.

    Reviewed-by: Ming Lei
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer

16 Sep, 2020

1 commit

  • When CONFIG_BLK_DEV_ZONED is disabled, allow using host-aware ZBC disks
    as regular disks. In this case, ensure that command completion is
    correctly executed by changing sd_zbc_complete() to return good_bytes
    instead of 0, which caused a hang during device probe (endless retries).

    When CONFIG_BLK_DEV_ZONED is enabled and a host-aware disk is detected
    to have partitions, it will be used as a regular disk. In this case,
    make sure not to do anything in sd_zbc_revalidate_zones() as that
    triggers warnings.

    Since all these different cases result in subtle settings of the disk queue
    zoned model, introduce the block layer helper function
    blk_queue_set_zoned() to generically implement setting up the effective
    zoned model according to the disk type, the presence of partitions on the
    disk and CONFIG_BLK_DEV_ZONED configuration.
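
    A sketch of what such a helper can look like, following the description
    (the exact upstream logic may differ):

        void blk_queue_set_zoned(struct gendisk *disk, enum blk_zoned_model model)
        {
                switch (model) {
                case BLK_ZONED_HM:
                        /* Host-managed requires zoned support built in. */
                        WARN_ON_ONCE(!IS_ENABLED(CONFIG_BLK_DEV_ZONED));
                        break;
                case BLK_ZONED_HA:
                        /* A partitioned host-aware disk, or any host-aware
                         * disk without CONFIG_BLK_DEV_ZONED, is used as a
                         * regular disk. */
                        if (!IS_ENABLED(CONFIG_BLK_DEV_ZONED) ||
                            disk_has_partitions(disk))
                                model = BLK_ZONED_NONE;
                        break;
                case BLK_ZONED_NONE:
                default:
                        break;
                }

                disk->queue->limits.zoned = model;
        }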

    Link: https://lore.kernel.org/r/20200915073347.832424-2-damien.lemoal@wdc.com
    Fixes: b72053072c0b ("block: allow partitions on host aware zone devices")
    Cc:
    Reported-by: Borislav Petkov
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Damien Le Moal
    Signed-off-by: Martin K. Petersen

    Damien Le Moal

21 Jul, 2020

3 commits

  • This function is just a tiny wrapper around blk_stack_limits. Open code
    it in the two callers.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Damien Le Moal
    Tested-by: Damien Le Moal
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
  • This function is just a tiny wrapper around blk_stack_limits and has
    two callers. Simplify the stack a bit by open coding it in the two
    callers.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Damien Le Moal
    Tested-by: Damien Le Moal
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
  • Lift the code from device mapper into blk_stack_limits to inherit the
    stacking limitations. This ensures we do the right thing for all
    stacked zoned block devices.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Damien Le Moal
    Tested-by: Damien Le Moal
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig

13 May, 2020

1 commit

  • Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
    block device. This is a no-merge write operation.

    A zone append write BIO must:
    * target a zoned block device
    * have a sector position indicating the start sector of the target zone
    * target a sequential write zone
    * not cross a zone boundary
    * not be split, to ensure that a single range of LBAs is written with a
      single command.

    Implement these checks in generic_make_request_checks() using the
    helper function blk_check_zone_append(). To avoid write append BIO
    splitting, introduce the new max_zone_append_sectors queue limit
    attribute and ensure that a BIO size is always lower than this limit.
    Export this new limit through sysfs and check these limits in bio_full().

    Also when a LLDD can't dispatch a request to a specific zone, it
    will return BLK_STS_ZONE_RESOURCE indicating this request needs to
    be delayed, e.g. because the zone it will be dispatched to is still
    write-locked. If this happens, set the request aside in a local list
    to continue trying to dispatch requests such as READ requests or
    WRITE/ZONE_APPEND requests targeting other zones. This way we can
    still keep a high queue depth without starving other requests, even if
    one request can't be served due to zone write-locking.

    Finally, make sure that the bio sector position indicates the actual
    write position as indicated by the device on completion.
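
    The new queue limit, as a hedged sketch (assuming the effective append
    limit is also bounded by the existing max_hw_sectors limit):

        static inline unsigned int
        queue_max_zone_append_sectors(struct request_queue *q)
        {
                return min(q->limits.max_hw_sectors,
                           q->limits.max_zone_append_sectors);
        }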

    Signed-off-by: Keith Busch
    [ jth: added zone-append specific add_page and merge_page helpers ]
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Keith Busch

23 Apr, 2020

1 commit

  • Don't burden the common block code with specifics of the libata DMA
    draining mechanism. Instead move most of the code to the scsi midlayer.

    That also means the nr_phys_segments adjustments in the blk-mq fast path
    can go away entirely, given that SCSI never looks at nr_phys_segments
    after mapping the request to a scatterlist.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig

31 Mar, 2020

1 commit

  • Pull block driver updates from Jens Axboe:

    - floppy driver cleanup series from Willy

    - NVMe updates and fixes (Various)

    - null_blk trace improvements (Chaitanya)

    - bcache fixes (Coly)

    - md fixes (via Song)

    - loop block size change optimizations (Martijn)

    - scnprintf() use (Takashi)

    * tag 'for-5.7/drivers-2020-03-29' of git://git.kernel.dk/linux-block: (81 commits)
    null_blk: add trace in null_blk_zoned.c
    null_blk: add tracepoint helpers for zoned mode
    block: add a zone condition debug helper
    nvme: cleanup namespace identifier reporting in nvme_init_ns_head
    nvme: rename __nvme_find_ns_head to nvme_find_ns_head
    nvme: refactor nvme_identify_ns_descs error handling
    nvme-tcp: Add warning on state change failure at nvme_tcp_setup_ctrl
    nvme-rdma: Add warning on state change failure at nvme_rdma_setup_ctrl
    nvme: Fix controller creation races with teardown flow
    nvme: Make nvme_uninit_ctrl symmetric to nvme_init_ctrl
    nvme: Fix ctrl use-after-free during sysfs deletion
    nvme-pci: Re-order nvme_pci_free_ctrl
    nvme: Remove unused return code from nvme_delete_ctrl_sync
    nvme: Use nvme_state_terminal helper
    nvme: release ida resources
    nvme: Add compat_ioctl handler for NVME_IOCTL_SUBMIT_IO
    nvmet-tcp: optimize tcp stack TX when data digest is used
    nvme-fabrics: Use scnprintf() for avoiding potential buffer overflow
    nvme-multipath: do not reset on unknown status
    nvmet-rdma: allocate RW ctxs according to mdts
    ...

    Linus Torvalds

28 Mar, 2020

1 commit

  • Current make_request based drivers use either blk_alloc_queue_node or
    blk_alloc_queue to allocate a queue, and then set up the make_request_fn
    function pointer and a few parameters using the blk_queue_make_request
    helper. Simplify this by passing the make_request pointer to
    blk_alloc_queue, and while at it merge the _node variant into the main
    helper by always passing a node_id, and remove the superfluous gfp_mask
    parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case.
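
    The resulting interface, sketched from the description (signature
    illustrative):

        /* One allocator: callers pass their make_request handler and a NUMA
         * node id; blk-mq uses the lower-level __blk_alloc_queue instead. */
        struct request_queue *blk_alloc_queue(make_request_fn make_request,
                                              int node_id);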

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig

18 Mar, 2020

1 commit

  • Field bdi->io_pages, added in commit 9491ae4aade6 ("mm: don't cap
    request size based on read-ahead setting"), removes an unneeded split
    of read requests.

    Stacked drivers do not call blk_queue_max_hw_sectors(). Instead they set
    the limits of their devices via blk_set_stacking_limits() +
    disk_stack_limits(). Field bdi->io_pages stays zero until the user sets
    max_sectors_kb via sysfs.

    This patch updates io_pages after merging limits in disk_stack_limits().

    Commit c6d6e9b0f6b4 ("dm: do not allow readahead to limit IO size") fixed
    the same problem for device-mapper devices, this one fixes MD RAIDs.
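
    The essence of the change, sketched (assuming it lands at the end of
    disk_stack_limits(), after the limits are merged):

        /* t is the stacked (top) disk's request_queue; keep bdi->io_pages
         * in sync with the merged max_sectors limit. */
        t->backing_dev_info->io_pages =
                t->limits.max_sectors >> (PAGE_SHIFT - 9);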

    Fixes: 9491ae4aade6 ("mm: don't cap request size based on read-ahead setting")
    Reviewed-by: Paul Menzel
    Reviewed-by: Bob Liu
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Song Liu

    Konstantin Khlebnikov

16 Jan, 2020

1 commit

  • Logical block size has type unsigned short. That means that it can be
    at most 32768. However, there are architectures that can run with 64k
    pages (for example arm64), and on these architectures it may be
    possible to create block devices with a 64k block size.

    For example, on an architecture with 64k pages, mount will fail with
    the following error, because it tries to read the superblock using a
    2-sector access:
    device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
    EXT4-fs (dm-0): unable to read superblock

    This patch changes the logical block size from unsigned short to
    unsigned int to avoid the overflow.
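
    The change itself is a type widening in the queue limits (a sketch):

        struct queue_limits {
                /* ... */
                unsigned int            logical_block_size;  /* was unsigned short */
                /* ... */
        };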

    Cc: stable@vger.kernel.org
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Ming Lei
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Jens Axboe

    Mikulas Patocka

20 Sep, 2019

1 commit

  • Pull dma-mapping updates from Christoph Hellwig:

    - add dma-mapping and block layer helpers to take care of IOMMU merging
    for mmc plus subsequent fixups (Yoshihiro Shimoda)

    - rework handling of the pgprot bits for remapping (me)

    - take care of the dma direct infrastructure for swiotlb-xen (me)

    - improve the dma noncoherent remapping infrastructure (me)

    - better defaults for ->mmap, ->get_sgtable and ->get_required_mask
    (me)

    - cleanup mmaping of coherent DMA allocations (me)

    - various misc cleanups (Andy Shevchenko, me)

    * tag 'dma-mapping-5.4' of git://git.infradead.org/users/hch/dma-mapping: (41 commits)
    mmc: renesas_sdhi_internal_dmac: Add MMC_CAP2_MERGE_CAPABLE
    mmc: queue: Fix bigger segments usage
    arm64: use asm-generic/dma-mapping.h
    swiotlb-xen: merge xen_unmap_single into xen_swiotlb_unmap_page
    swiotlb-xen: simplify cache maintainance
    swiotlb-xen: use the same foreign page check everywhere
    swiotlb-xen: remove xen_swiotlb_dma_mmap and xen_swiotlb_dma_get_sgtable
    xen: remove the exports for xen_{create,destroy}_contiguous_region
    xen/arm: remove xen_dma_ops
    xen/arm: simplify dma_cache_maint
    xen/arm: use dev_is_dma_coherent
    xen/arm: consolidate page-coherent.h
    xen/arm: use dma-noncoherent.h calls for xen-swiotlb cache maintainance
    arm: remove wrappers for the generic dma remap helpers
    dma-mapping: introduce a dma_common_find_pages helper
    dma-mapping: always use VM_DMA_COHERENT for generic DMA remap
    vmalloc: lift the arm flag for coherent mappings to common code
    dma-mapping: provide a better default ->get_required_mask
    dma-mapping: remove the dma_declare_coherent_memory export
    remoteproc: don't allow modular build
    ...

    Linus Torvalds

06 Sep, 2019

1 commit

  • Introduce the definition of elevator features through the
    elevator_features flags in the elevator_type structure. Each flag can
    represent a feature supported by an elevator. The first feature defined
    by this patch is support for zoned block device sequential write
    constraint with the flag ELEVATOR_F_ZBD_SEQ_WRITE, which is implemented
    by the mq-deadline elevator using zone write locking.

    Other possible features are IO priorities, write hints, latency targets
    or single-LUN dual-actuator disks (for which the elevator could maintain
    one LBA ordered list per actuator).

    The required_elevator_features field is also added to the request_queue
    structure to allow a device driver to specify elevator feature flags
    that an elevator must support for the correct operation of the device
    (e.g. device drivers for zoned block devices can have the
    ELEVATOR_F_ZBD_SEQ_WRITE flag as a required feature).
    The helper function blk_queue_required_elevator_features() is
    defined for setting this new field.

    With these two new fields in place, the elevator functions
    elevator_match() and elevator_find() are modified to allow a user to set
    only an elevator with a set of features that satisfies the device
    required features. Elevators not matching the device requirements are
    not shown in the device sysfs queue/scheduler file to prevent their use.

    The "none" elevator can always be selected as before.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal

27 Jul, 2019

1 commit

  • We should only set the max segment size to unlimited if we actually
    have a virt boundary. Otherwise we accidentally clear that limit
    when called from the SCSI midlayer, which always calls
    blk_queue_virt_boundary, even if that mask is 0.
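
    Sketched, the guard looks like this (illustrative):

        void blk_queue_virt_boundary(struct request_queue *q,
                                     unsigned long mask)
        {
                q->limits.virt_boundary_mask = mask;

                /* Devices with a virt boundary don't use segments as such,
                 * so an unlimited segment size is safe -- but only when a
                 * boundary mask is actually set. */
                if (mask)
                        q->limits.max_segment_size = UINT_MAX;
        }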

    Fixes: 7ad388d8e4c7 ("scsi: core: add a host / host template field for the virt boundary")
    Reported-by: Guenter Roeck
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig

24 May, 2019

1 commit

  • We currently fail to update the front/back segment size in the bio when
    deciding to allow an otherwise gappy segment to a device with a virt
    boundary. The reason why this did not cause problems is that devices
    with a virt boundary fundamentally don't use segments as we know them
    and thus don't care. Make that assumption formal by forcing an
    unlimited segment size in this case.

    Fixes: f6970f83ef79 ("block: don't check if adjacent bvecs in one bio can be mergeable")
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Christoph Hellwig

01 May, 2019

1 commit