03 Aug, 2020

1 commit

  • When we lose a device for whatever reason while (re)scanning zones, we
    trip over a NULL pointer in blk_revalidate_zone_cb, like in the following
    log:

    sd 0:0:0:0: [sda] 3418095616 4096-byte logical blocks: (14.0 TB/12.7 TiB)
    sd 0:0:0:0: [sda] 52156 zones of 65536 logical blocks
    sd 0:0:0:0: [sda] Write Protect is off
    sd 0:0:0:0: [sda] Mode Sense: 37 00 00 08
    sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    sd 0:0:0:0: [sda] REPORT ZONES start lba 1065287680 failed
    sd 0:0:0:0: [sda] REPORT ZONES: Result: hostbyte=0x00 driverbyte=0x08
    sd 0:0:0:0: [sda] Sense Key : 0xb [current]
    sd 0:0:0:0: [sda] ASC=0x0 ASCQ=0x6
    sda: failed to revalidate zones
    sd 0:0:0:0: [sda] 0 4096-byte logical blocks: (0 B/0 B)
    sda: detected capacity change from 14000519643136 to 0
    ==================================================================
    BUG: KASAN: null-ptr-deref in blk_revalidate_zone_cb+0x1b7/0x550
    Write of size 8 at addr 0000000000000010 by task kworker/u4:1/58

    CPU: 1 PID: 58 Comm: kworker/u4:1 Not tainted 5.8.0-rc1 #692
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
    Workqueue: events_unbound async_run_entry_fn
    Call Trace:
    dump_stack+0x7d/0xb0
    ? blk_revalidate_zone_cb+0x1b7/0x550
    kasan_report.cold+0x5/0x37
    ? blk_revalidate_zone_cb+0x1b7/0x550
    check_memory_region+0x145/0x1a0
    blk_revalidate_zone_cb+0x1b7/0x550
    sd_zbc_parse_report+0x1f1/0x370
    ? blk_req_zone_write_trylock+0x200/0x200
    ? sectors_to_logical+0x60/0x60
    ? blk_req_zone_write_trylock+0x200/0x200
    ? blk_req_zone_write_trylock+0x200/0x200
    sd_zbc_report_zones+0x3c4/0x5e0
    ? sd_dif_config_host+0x500/0x500
    blk_revalidate_disk_zones+0x231/0x44d
    ? _raw_write_lock_irqsave+0xb0/0xb0
    ? blk_queue_free_zone_bitmaps+0xd0/0xd0
    sd_zbc_read_zones+0x8cf/0x11a0
    sd_revalidate_disk+0x305c/0x64e0
    ? __device_add_disk+0x776/0xf20
    ? read_capacity_16.part.0+0x1080/0x1080
    ? blk_alloc_devt+0x250/0x250
    ? create_object.isra.0+0x595/0xa20
    ? kasan_unpoison_shadow+0x33/0x40
    sd_probe+0x8dc/0xcd2
    really_probe+0x20e/0xaf0
    __driver_attach_async_helper+0x249/0x2d0
    async_run_entry_fn+0xbe/0x560
    process_one_work+0x764/0x1290
    ? _raw_read_unlock_irqrestore+0x30/0x30
    worker_thread+0x598/0x12f0
    ? __kthread_parkme+0xc6/0x1b0
    ? schedule+0xed/0x2c0
    ? process_one_work+0x1290/0x1290
    kthread+0x36b/0x440
    ? kthread_create_worker_on_cpu+0xa0/0xa0
    ret_from_fork+0x22/0x30
    ==================================================================

    When the device is already gone we end up with the following scenario:
    The device's capacity is 0 and thus the number of zones will be 0 as well. When
    allocating the bitmap for the conventional zones, we then trip over a NULL
    pointer.

    So if we encounter a zoned block device with a capacity of 0, do not
    attempt to revalidate the zone sizes.
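
    The guard described above can be sketched in user space as follows. This
    is only a model: struct disk_model, revalidate_zones_model() and the -EIO
    return value are assumptions made for illustration, not the kernel's code.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the fields the check depends on. */
struct disk_model {
    uint64_t capacity_sectors;  /* 0 once the device is gone */
    uint64_t zone_sectors;      /* zone size in sectors */
};

/*
 * Returns the number of zones that would be revalidated, or a negative
 * errno value when revalidation must not be attempted. The choice of
 * -EIO for the zero-capacity case is an assumption of this sketch.
 */
static long revalidate_zones_model(const struct disk_model *disk)
{
    if (disk->capacity_sectors == 0)
        return -EIO;    /* device lost: zero zones, nothing to allocate */
    if (disk->zone_sectors == 0)
        return -EINVAL;
    return (long)((disk->capacity_sectors + disk->zone_sectors - 1) /
                  disk->zone_sectors);
}
```

    Bailing out before any bitmap is sized from the zone count avoids the
    zero-length allocation that led to the NULL pointer in the log.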

    Fixes: 6c6b35491422 ("block: set the zone size in blk_revalidate_disk_zones atomically")
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     

08 Jul, 2020

1 commit

  • In the zoned storage model, the sectors within a zone are typically all
    writeable. With the introduction of the Zoned Namespace (ZNS) Command
    Set in the NVM Express organization, the model was extended to have a
    specific writeable capacity.

    Extend the zone descriptor data structure with a zone capacity field to
    indicate to the user how many sectors in a zone are writeable.

    Introduce backward compatibility in the zone report ioctl by extending
    the zone report header data structure with a flags field to indicate if
    the capacity field is available.
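
    A consumer of the report could honor the flag roughly as sketched below.
    The structure layouts, the REP_FLAG_CAPACITY name and the fallback to the
    zone length are assumptions for illustration; they are not copied from
    the real header.

```c
#include <stdint.h>

/* Hypothetical flag: set when the capacity field of each zone is valid. */
#define REP_FLAG_CAPACITY (1u << 0)

/* Simplified zone descriptor with the added capacity field. */
struct zone_desc {
    uint64_t start;     /* first sector of the zone */
    uint64_t len;       /* zone length in sectors */
    uint64_t capacity;  /* writeable sectors, valid only if flagged */
};

/* Simplified report header with the added flags field. */
struct zone_report_hdr {
    uint32_t nr_zones;
    uint32_t flags;
};

/*
 * Number of writeable sectors in a zone. Without the flag (older
 * kernel), assume the whole zone is writeable.
 */
static uint64_t zone_writeable_sectors(const struct zone_report_hdr *hdr,
                                       const struct zone_desc *zone)
{
    if (hdr->flags & REP_FLAG_CAPACITY)
        return zone->capacity;
    return zone->len;
}
```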

    Reviewed-by: Jens Axboe
    Reviewed-by: Javier González
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Himanshu Madhani
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Daniel Wagner
    Signed-off-by: Matias Bjørling
    Signed-off-by: Christoph Hellwig

    Matias Bjørling
     

13 May, 2020

2 commits

  • Modify the interface of blk_revalidate_disk_zones() to add an optional
    driver callback function that a driver can use to extend processing
    done during zone revalidation. The callback, if defined, is executed
    with the device request queue frozen, after all zones have been
    inspected.
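
    The shape of such an optional callback can be modeled in user space as
    below. All names (struct zoned_disk, revalidate_zones_cb_model(),
    example_driver_update()) are hypothetical; the sketch only shows the
    ordering: inspect zones, freeze, update, run the callback if one was
    given, unfreeze.

```c
#include <stddef.h>

/* Hypothetical stand-in for the disk and its request queue state. */
struct zoned_disk {
    unsigned int nr_zones;
    int queue_frozen;
    int driver_saw_frozen;  /* recorded by the example callback */
};

typedef void (*update_driver_data_fn)(struct zoned_disk *disk);

static int revalidate_zones_cb_model(struct zoned_disk *disk,
                                     unsigned int nr_zones,
                                     update_driver_data_fn update_driver_data)
{
    /* Zone inspection would happen here, before the queue is frozen. */
    disk->queue_frozen = 1;
    disk->nr_zones = nr_zones;
    if (update_driver_data)         /* the callback is optional */
        update_driver_data(disk);
    disk->queue_frozen = 0;
    return 0;
}

/* Example driver callback: records whether the queue was frozen. */
static void example_driver_update(struct zoned_disk *disk)
{
    disk->driver_saw_frozen = disk->queue_frozen;
}
```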

    Signed-off-by: Damien Le Moal
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Introduce blk_req_zone_write_trylock(), which either grabs the write-lock
    for a sequential zone or returns false, if the zone is already locked.
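
    The trylock semantics can be illustrated with a C11 atomic bitmap. This
    is a user-space sketch with hypothetical names, assuming one lock bit per
    zone; it is not the kernel implementation.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Try to take the write lock of zone zno. Returns true if this caller
 * set the bit, false if the zone was already locked.
 */
static bool zone_write_trylock(atomic_ulong *wlock_bitmap, unsigned int zno)
{
    unsigned long mask = 1UL << (zno % BITS_PER_WORD);
    unsigned long old;

    old = atomic_fetch_or(&wlock_bitmap[zno / BITS_PER_WORD], mask);
    return !(old & mask);
}

/* Release the write lock of zone zno. */
static void zone_write_unlock(atomic_ulong *wlock_bitmap, unsigned int zno)
{
    unsigned long mask = 1UL << (zno % BITS_PER_WORD);

    atomic_fetch_and(&wlock_bitmap[zno / BITS_PER_WORD], ~mask);
}
```

    Because the test and the set happen in one atomic operation, two callers
    racing for the same zone cannot both see success.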

    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     

31 Mar, 2020

1 commit

  • Pull block driver updates from Jens Axboe:

    - floppy driver cleanup series from Willy

    - NVMe updates and fixes (Various)

    - null_blk trace improvements (Chaitanya)

    - bcache fixes (Coly)

    - md fixes (via Song)

    - loop block size change optimizations (Martijn)

    - scnprintf() use (Takashi)

    * tag 'for-5.7/drivers-2020-03-29' of git://git.kernel.dk/linux-block: (81 commits)
    null_blk: add trace in null_blk_zoned.c
    null_blk: add tracepoint helpers for zoned mode
    block: add a zone condition debug helper
    nvme: cleanup namespace identifier reporting in nvme_init_ns_head
    nvme: rename __nvme_find_ns_head to nvme_find_ns_head
    nvme: refactor nvme_identify_ns_descs error handling
    nvme-tcp: Add warning on state change failure at nvme_tcp_setup_ctrl
    nvme-rdma: Add warning on state change failure at nvme_rdma_setup_ctrl
    nvme: Fix controller creation races with teardown flow
    nvme: Make nvme_uninit_ctrl symmetric to nvme_init_ctrl
    nvme: Fix ctrl use-after-free during sysfs deletion
    nvme-pci: Re-order nvme_pci_free_ctrl
    nvme: Remove unused return code from nvme_delete_ctrl_sync
    nvme: Use nvme_state_terminal helper
    nvme: release ida resources
    nvme: Add compat_ioctl handler for NVME_IOCTL_SUBMIT_IO
    nvmet-tcp: optimize tcp stack TX when data digest is used
    nvme-fabrics: Use scnprintf() for avoiding potential buffer overflow
    nvme-multipath: do not reset on unknown status
    nvmet-rdma: allocate RW ctxs according to mdts
    ...

    Linus Torvalds
     

28 Mar, 2020

1 commit


12 Mar, 2020

1 commit

  • Check for overflow in addition before checking for end-of-block-device.

    Steps to reproduce:

    #define _GNU_SOURCE 1
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/ioctl.h>
    #include <fcntl.h>

    typedef unsigned long long __u64;

    struct blk_zone_range {
            __u64 sector;
            __u64 nr_sectors;
    };

    #define BLKRESETZONE _IOW(0x12, 131, struct blk_zone_range)

    int main(void)
    {
            int fd = open("/dev/nullb0", O_RDWR|O_DIRECT);
            /* sector + nr_sectors wraps around to 0 */
            struct blk_zone_range zr = {4096, 0xfffffffffffff000ULL};
            ioctl(fd, BLKRESETZONE, &zr);
            return 0;
    }

    BUG: KASAN: null-ptr-deref in submit_bio_wait+0x74/0xe0
    Write of size 8 at addr 0000000000000040 by task a.out/1590

    CPU: 8 PID: 1590 Comm: a.out Not tainted 5.6.0-rc1-00019-g359c92c02bfa #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014
    Call Trace:
    dump_stack+0x76/0xa0
    __kasan_report.cold+0x5/0x3e
    kasan_report+0xe/0x20
    submit_bio_wait+0x74/0xe0
    blkdev_zone_mgmt+0x26f/0x2a0
    blkdev_zone_mgmt_ioctl+0x14b/0x1b0
    blkdev_ioctl+0xb28/0xe60
    block_ioctl+0x69/0x80
    ksys_ioctl+0x3af/0xa50
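
    The ordering of the two checks can be sketched as below. The function
    name and the boolean return are assumptions for illustration; the point
    is only that the wrap-around test has to come before the comparison with
    the device capacity.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Validate a zone range [sector, sector + nr_sectors) against the
 * device capacity, all expressed in sectors.
 */
static bool zone_range_valid(uint64_t sector, uint64_t nr_sectors,
                             uint64_t capacity)
{
    uint64_t end = sector + nr_sectors;  /* wraps modulo 2^64 */

    if (end < sector)       /* the addition overflowed */
        return false;
    if (end > capacity)     /* range runs past the end of the device */
        return false;
    return true;
}
```

    With the reproducer's values, 4096 + 0xfffffffffffff000 wraps to 0, so a
    check against the capacity alone would accept the range.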

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alexey Dobriyan (SK hynix)
    Signed-off-by: Jens Axboe

    Alexey Dobriyan
     

09 Jan, 2020

1 commit

  • In the current implementation, the final zone-mgmt request is issued with
    submit_bio_wait(), which marks the bio REQ_SYNC. This is needed since
    immediate action is expected for zone-mgmt requests as these are
    blocking operations. This also bypasses the scheduler in
    blk_mq_make_request() and dispatches the request directly into the
    hw ctx.

    This patch marks all the chained bios REQ_SYNC so that the non-final
    bios get the same behavior.

    Reviewed-by: Damien Le Moal
    Reviewed-by: Bob Liu
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     

04 Dec, 2019

1 commit

  • The current zone revalidation code has a major problem in that it
    doesn't update the zone size and q->nr_zones atomically, leading
    to a short window where an out of bounds access to the zone arrays
    is possible.

    To fix this, move the setting of the zone size into the critical
    section of blk_revalidate_disk_zones so that it gets updated together
    with the zone bitmaps and q->nr_zones. This also slightly simplifies
    the caller as it deduces the zone size from the report_zones.

    This change also allows checking for a power of two zone size in generic
    code.

    Reported-by: Hans Holmberg
    Reviewed-by: Javier González
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

03 Dec, 2019

5 commits


13 Nov, 2019

5 commits

  • Avoid the need to allocate a potentially large array of struct blk_zone
    in the block layer by switching the ->report_zones method interface to
    a callback model. Now the caller simply supplies a callback that is
    executed on each reported zone, and private data for it.
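
    The callback model can be sketched in user space as follows. Every name
    here (struct zone_info, report_zones_cb, report_zones_model(),
    sum_len_cb()) is hypothetical; the sketch only shows that the reporter
    needs no zone array, because each zone is handed to the caller's callback
    together with the caller's private data.

```c
#include <stdint.h>

/* Minimal zone description passed to the callback. */
struct zone_info {
    uint64_t start;
    uint64_t len;
};

typedef int (*report_zones_cb)(const struct zone_info *zone,
                               unsigned int idx, void *data);

/*
 * Report up to nr_zones zones of a device with equally sized zones.
 * Returns the number of zones reported, or the first non-zero value
 * returned by the callback.
 */
static int report_zones_model(uint64_t capacity, uint64_t zone_sectors,
                              unsigned int nr_zones,
                              report_zones_cb cb, void *data)
{
    unsigned int i;

    for (i = 0; i < nr_zones; i++) {
        struct zone_info zone;
        int ret;

        zone.start = i * zone_sectors;
        if (zone.start >= capacity)
            break;              /* no more zones on the device */
        zone.len = zone_sectors;

        ret = cb(&zone, i, data);
        if (ret)
            return ret;
    }
    return (int)i;
}

/* Example callback: accumulate the total length of the reported zones. */
static int sum_len_cb(const struct zone_info *zone, unsigned int idx,
                      void *data)
{
    (void)idx;
    *(uint64_t *)data += zone->len;
    return 0;
}
```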

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Shin'ichiro Kawasaki
    Signed-off-by: Damien Le Moal
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • No known partitioning tool supports zoned block devices, especially the
    host managed flavor with strong sequential write constraints.
    Furthermore, there are also no known user nor use cases for partitioned
    zoned block devices.

    This patch removes partition device creation for zoned block devices,
    which allows simplifying the processing of zone commands for zoned
    block devices. A warning is added if a partition table is found on the
    device.

    For report zones operations no zone sector information remapping is
    necessary anymore, simplifying the code. Of note is that remapping of
    zone reports for DM targets is still necessary as done by
    dm_remap_zone_report().

    Similarly, remapping of a zone reset bio is not necessary anymore.
    Testing for the applicability of the zone reset all request also becomes
    simpler and only needs to check that the number of sectors of the
    requested zone range is equal to the disk capacity.
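
    The simplified applicability test can be written as a one-line predicate.
    The function name is hypothetical, and the additional requirement that
    the range starts at sector 0 is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * A "reset all zones" request is applicable when the requested range
 * covers the whole disk. Without partitions there is no partition
 * offset or partition size to take into account.
 */
static bool can_use_reset_all(uint64_t sector, uint64_t nr_sectors,
                              uint64_t capacity)
{
    return sector == 0 && nr_sectors == capacity;
}
```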

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • All kernel users of blkdev_report_zones() as well as applications using
    ioctl(BLKREPORTZONE) expect to potentially get fewer zone
    descriptors than requested. As such, the internal report
    zones command execution loop implemented by blk_report_zones() is
    not necessary and can even be harmful to performance by causing the
    execution of inefficient small zone report commands to service the
    remainder of a requested zone array.

    This patch removes blk_report_zones(), simplifying the code. Also
    remove a now incorrect comment in dm_blk_report_zones().

    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Javier Gonzalez
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • blk_revalidate_disk_zones is never called for non-zoned devices. Just
    return early and warn instead of trying to handle this case.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • For ZBC and ZAC zoned devices, the scsi driver revalidation processing
    implemented by sd_revalidate_disk() includes a call to
    sd_zbc_read_zones() which executes a full disk zone report used to
    check that all zones of the disk are the same size. This processing is
    followed by a call to blk_revalidate_disk_zones(), used to initialize
    the device request queue zone bitmaps (zone type and zone write lock
    bitmaps). To do so, blk_revalidate_disk_zones() also executes a full
    device zone report to obtain zone types. As a result, the entire
    zoned block device revalidation process includes two full device zone
    reports.

    By moving the zone size checks into blk_revalidate_disk_zones(), this
    process can be optimized to a single full device zone report, leading to
    shorter device scan and revalidation times. This patch implements this
    optimization, reducing the original full device zone report implemented
    in sd_zbc_check_zones() to a single, small, report zones command
    execution to obtain the size of the first zone of the device. Checks
    whether all zones of the device are the same size as the first zone
    size are moved to the generic blk_check_zone() function called from
    blk_revalidate_disk_zones().

    This optimization also has the following benefits:
    1) Fewer memory allocations in the scsi layer during disk revalidation
    as the potentially large buffer for zone report execution is not
    needed.
    2) Zone checks are implemented in a generic manner, reducing the burden
    on device drivers, which only need to obtain the zone size and check
    that this size is a power of 2 number of LBAs. Any new type of zoned
    block device will benefit from this.
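
    The two checks mentioned above (power of 2 zone size, all zones the same
    size as the first one) can be sketched as below. The function names and
    the array-based interface are assumptions for illustration; the generic
    code described here inspects zones one at a time rather than from an
    array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool is_power_of_two(uint64_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* zone_len[i] is the length of zone i in logical blocks. */
static bool zone_sizes_ok(const uint64_t *zone_len, size_t nr_zones)
{
    size_t i;

    if (nr_zones == 0 || !is_power_of_two(zone_len[0]))
        return false;

    /* Every zone must match the size of the first zone. */
    for (i = 1; i < nr_zones; i++) {
        if (zone_len[i] != zone_len[0])
            return false;
    }
    return true;
}
```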

    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

07 Nov, 2019

4 commits

  • Introduce three new ioctl commands BLKOPENZONE, BLKCLOSEZONE and
    BLKFINISHZONE to allow applications to control the condition of zones
    on a zoned block device through the execution of the REQ_OP_ZONE_OPEN,
    REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH operations.
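
    From user space, the new commands could be issued along the lines of the
    sketch below. It assumes a Linux system with <linux/blkzoned.h>; the
    fallback definitions cover older headers, and the wrapper name
    zone_ioctl() is hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

/* Fallback definitions for headers that predate these commands. */
#ifndef BLKOPENZONE
#define BLKOPENZONE   _IOW(0x12, 134, struct blk_zone_range)
#define BLKCLOSEZONE  _IOW(0x12, 135, struct blk_zone_range)
#define BLKFINISHZONE _IOW(0x12, 136, struct blk_zone_range)
#endif

/*
 * Apply a zone management command (BLKOPENZONE, BLKCLOSEZONE or
 * BLKFINISHZONE) to the zones in [sector, sector + nr_sectors).
 * Returns 0 on success or a negative errno value.
 */
static int zone_ioctl(int fd, unsigned long cmd, uint64_t sector,
                      uint64_t nr_sectors)
{
    struct blk_zone_range range = {
        .sector = sector,
        .nr_sectors = nr_sectors,
    };

    if (ioctl(fd, cmd, &range) < 0)
        return -errno;
    return 0;
}
```

    A caller would open the zoned block device and pass a zone-aligned range,
    for example zone_ioctl(fd, BLKFINISHZONE, zone_start, zone_len).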

    Contains contributions from Matias Bjorling, Hans Holmberg,
    Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig.

    Reviewed-by: Javier González
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ajay Joshi
    Signed-off-by: Matias Bjorling
    Signed-off-by: Hans Holmberg
    Signed-off-by: Dmitry Fomichev
    Signed-off-by: Keith Busch
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Ajay Joshi
     
  • Zoned block devices (ZBC and ZAC devices) allow an explicit control
    over the condition (state) of zones. The operations allowed are:
    * Open a zone: Transition to open condition to indicate that a zone will
    actively be written
    * Close a zone: Transition to closed condition to release the drive
    resources used for writing to a zone
    * Finish a zone: Transition an open or closed zone to the full
    condition to prevent write operations

    To enable this control for in-kernel zoned block device users, define
    the new request operations REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE
    and REQ_OP_ZONE_FINISH as well as the generic function
    blkdev_zone_mgmt() for submitting these operations on a range of zones.
    This results in blkdev_reset_zones() removal and replacement with this
    new zone management function. Users of blkdev_reset_zones() (f2fs and
    dm-zoned) are updated accordingly.

    Contains contributions from Matias Bjorling, Hans Holmberg,
    Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig.

    Reviewed-by: Javier González
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ajay Joshi
    Signed-off-by: Matias Bjorling
    Signed-off-by: Hans Holmberg
    Signed-off-by: Dmitry Fomichev
    Signed-off-by: Keith Busch
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Ajay Joshi
     
  • There is no need for the function __blkdev_reset_all_zones() as
    REQ_OP_ZONE_RESET_ALL can be handled directly in blkdev_reset_zones()
    bio loop with an early break from the loop. This patch removes this
    function and modifies blkdev_reset_zones(), simplifying the code.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • REQ_OP_ZONE_RESET operations cannot be merged as these bios and requests
    do not have a size and are never sequential due to the zone start sector
    position required for their execution. As a result, there is no point in
    using a plug around blkdev_reset_zones() bio issuing loop. This patch
    removes this unnecessary plugging.

    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Javier González
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

05 Aug, 2019

1 commit

  • This implements REQ_OP_ZONE_RESET_ALL as a special case of the block
    device zone reset operations where we simply issue a bio with the
    newly introduced req op.

    We issue this req op when the number of sectors is equal to the device's
    partition's number of sectors and the device has no partitions.

    We also add support so that blk_op_str() can print the new reset-all
    zone operation.

    This patch also adds a generic make request check for the newly
    introduced REQ_OP_ZONE_RESET_ALL req_opf. We simply return an error
    when the queue is zoned and the reset-all flag is not set for
    REQ_OP_ZONE_RESET_ALL.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Damien Le Moal
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     

12 Jul, 2019

2 commits

  • Limit the size of the struct blk_zone array used in
    blk_revalidate_disk_zones() to avoid memory allocation failures leading
    to disk revalidation failure. Also further reduce the likelihood of
    such failures by using kvcalloc() (that is vmalloc()) instead of
    allocating contiguous pages with alloc_pages().

    Fixes: 515ce6061312 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation")
    Fixes: e76239a3748c ("block: add a report_zones method")
    Cc: stable@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Only GFP_KERNEL and GFP_NOIO are used with blkdev_report_zones(). In
    preparation of using vmalloc() for large report buffer and zone array
    allocations used by this function, remove its "gfp_t gfp_mask" argument
    and rely on the caller context to use memalloc_noio_save/restore() where
    necessary (block layer zone revalidation and dm-zoned I/O error path).

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

10 Jul, 2019

1 commit

  • For large values of the number of zones reported and/or large zone
    sizes, the sector increment calculated with

    blk_queue_zone_sectors(q) * n

    in blk_report_zones() loop can overflow the unsigned int type used for
    the calculation as both "n" and blk_queue_zone_sectors() value are
    unsigned int. E.g. for a device with 256 MB zones (524288 sectors),
    overflow happens with 8192 or more zones reported.

    Changing the return type of blk_queue_zone_sectors() to sector_t fixes
    this problem and avoids overflow problems for all other callers of this
    helper too. The same change is also applied to the bdev_zone_sectors()
    helper.
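
    The arithmetic of the example can be checked with a small user-space
    sketch. The function names are hypothetical, and uint64_t stands in for a
    64-bit sector_t, which is an assumption about the configuration.

```c
#include <stdint.h>

/* Both operands are 32 bits wide: the product wraps modulo 2^32. */
static uint32_t sector_increment_32(uint32_t zone_sectors, uint32_t n)
{
    return zone_sectors * n;
}

/* With a 64-bit zone size the multiplication is done in 64 bits. */
static uint64_t sector_increment_64(uint64_t zone_sectors, uint32_t n)
{
    return zone_sectors * n;
}
```

    For 524288-sector zones and 8192 zones the product is exactly 2^32, so
    the 32-bit version yields 0 while the 64-bit version yields 4294967296.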

    Fixes: e76239a3748c ("block: add a report_zones method")
    Cc: stable@vger.kernel.org
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

01 May, 2019

1 commit


29 Dec, 2018

1 commit

  • Pull block updates from Jens Axboe:
    "This is the main pull request for block/storage for 4.21.

    Larger than usual, it was a busy round with lots of goodies queued up.
    Most notable is the removal of the old IO stack, which has been a long
    time coming. No new features for a while, everything coming in this
    week has all been fixes for things that were previously merged.

    This contains:

    - Use atomic counters instead of semaphores for mtip32xx (Arnd)

    - Cleanup of the mtip32xx request setup (Christoph)

    - Fix for circular locking dependency in loop (Jan, Tetsuo)

    - bcache (Coly, Guoju, Shenghui)
    * Optimizations for writeback caching
    * Various fixes and improvements

    - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith)
    * host and target support for NVMe over TCP
    * Error log page support
    * Support for separate read/write/poll queues
    * Much improved polling
    * discard OOM fallback
    * Tracepoint improvements

    - lightnvm (Hans, Hua, Igor, Matias, Javier)
    * Igor added packed metadata to pblk. Now drives without metadata
    per LBA can be used as well.
    * Fix from Geert on uninitialized value on chunk metadata reads.
    * Fixes from Hans and Javier to pblk recovery and write path.
    * Fix from Hua Su to fix a race condition in the pblk recovery
    code.
    * Scan optimization added to pblk recovery from Zhoujie.
    * Small geometry cleanup from me.

    - Conversion of the last few drivers that used the legacy path to
    blk-mq (me)

    - Removal of legacy IO path in SCSI (me, Christoph)

    - Removal of legacy IO stack and schedulers (me)

    - Support for much better polling, now without interrupts at all.
    blk-mq adds support for multiple queue maps, which enables us to
    have a map per type. This in turn enables nvme to have separate
    completion queues for polling, which can then be interrupt-less.
    Also means we're ready for async polled IO, which is hopefully
    coming in the next release.

    - Killing of (now) unused block exports (Christoph)

    - Unification of the blk-rq-qos and blk-wbt wait handling (Josef)

    - Support for zoned testing with null_blk (Masato)

    - sx8 conversion to per-host tag sets (Christoph)

    - IO priority improvements (Damien)

    - mq-deadline zoned fix (Damien)

    - Ref count blkcg series (Dennis)

    - Lots of blk-mq improvements and speedups (me)

    - sbitmap scalability improvements (me)

    - Make core inflight IO accounting per-cpu (Mikulas)

    - Export timeout setting in sysfs (Weiping)

    - Cleanup the direct issue path (Jianchao)

    - Export blk-wbt internals in block debugfs for easier debugging
    (Ming)

    - Lots of other fixes and improvements"

    * tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits)
    kyber: use sbitmap add_wait_queue/list_del wait helpers
    sbitmap: add helpers for add/del wait queue handling
    block: save irq state in blkg_lookup_create()
    dm: don't reuse bio for flushes
    nvme-pci: trace SQ status on completions
    nvme-rdma: implement polling queue map
    nvme-fabrics: allow user to pass in nr_poll_queues
    nvme-fabrics: allow nvmf_connect_io_queue to poll
    nvme-core: optionally poll sync commands
    block: make request_to_qc_t public
    nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
    nvme-tcp: fix endianess annotations
    nvmet-tcp: fix endianess annotations
    nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
    nvme-pci: only set nr_maps to 2 if poll queues are supported
    nvmet: use a macro for default error location
    nvmet: fix comparison of a u16 with -1
    blk-mq: enable IO poll if .nr_queues of type poll > 0
    blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
    blk-mq: skip zero-queue maps in blk_mq_map_swqueue
    ...

    Linus Torvalds
     

12 Dec, 2018

1 commit

  • null_blk_zoned creation fails if the number of zones specified is equal to or is
    smaller than 64 due to a memory allocation failure in blk_alloc_zones(). With
    such a small number of zones, the required memory size for all zones descriptors
    fits in a single page, and the page order for alloc_pages_node() is zero. Allow
    this value in blk_alloc_zones() for the allocation to succeed.
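
    The page order arithmetic can be reproduced as below. The helper name is
    hypothetical, and the 64-byte descriptor size and 4096-byte page size
    used in the note that follows are assumptions consistent with the 64-zone
    threshold stated above.

```c
#include <stddef.h>

/*
 * Smallest order such that (page_size << order) bytes hold size bytes.
 * An order of zero means a single page.
 */
static unsigned int page_order(size_t size, size_t page_size)
{
    size_t pages = (size + page_size - 1) / page_size;
    unsigned int order = 0;

    while (((size_t)1 << order) < pages)
        order++;
    return order;
}
```

    With those sizes, 64 descriptors occupy exactly one page (order 0), while
    65 descriptors need two pages (order 1). An order of 0 is therefore a
    valid result and must not be treated as a failure.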

    Fixes: bf5054569653 "block: Introduce blk_revalidate_disk_zones()"
    Reviewed-by: Damien Le Moal
    Signed-off-by: Shin'ichiro Kawasaki
    Signed-off-by: Jens Axboe

    Shin'ichiro Kawasaki
     

16 Nov, 2018

1 commit

  • Various spots check for q->mq_ops being non-NULL. Provide
    a helper to do this instead.

    Where the ->mq_ops != NULL check is redundant, remove it.

    Since mq == rq-based now that legacy is gone, get rid of
    queue_is_rq_based() and just use queue_is_mq() everywhere.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

26 Oct, 2018

5 commits

  • Drivers exposing zoned block devices have to initialize and maintain
    correctness (i.e. revalidate) of the device zone bitmaps attached to
    the device request queue (seq_zones_bitmap and seq_zones_wlock).

    To simplify coding this, introduce a generic helper function
    blk_revalidate_disk_zones() suitable for most (and likely all) cases.
    This new function always updates the seq_zones_bitmap and seq_zones_wlock
    bitmaps as well as the queue nr_zones field when called for a disk
    using a request based queue. For a disk using a BIO based queue, only
    the number of zones is updated since these queues do not have
    schedulers and so do not need the zone bitmaps.

    With this change, the zone bitmap initialization code in sd_zbc.c can be
    replaced with a call to this function in sd_zbc_read_zones(), which is
    called from the disk revalidate block operation method.

    A call to blk_revalidate_disk_zones() is also added to the null_blk
    driver for devices created with the zoned mode enabled.

    Finally, to ensure that zoned devices created with dm-linear or
    dm-flakey expose the correct number of zones through sysfs, a call to
    blk_revalidate_disk_zones() is added to dm_table_set_restrictions().

    The zone bitmaps allocated and initialized with
    blk_revalidate_disk_zones() are freed automatically from
    __blk_release_queue() using the block internal function
    blk_queue_free_zone_bitmaps().

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Mike Snitzer
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Dispatching a report zones command through the request queue is a major
    pain due to the command reply payload rewriting necessary. Given that
    blkdev_report_zones() is executing everything synchronously, implement
    report zones as a block device file operation instead, allowing major
    simplification of the code in many places.

    sd, null-blk, dm-linear and dm-flakey being the only block device
    drivers supporting exposing zoned block devices, these drivers are
    modified to provide the device side implementation of the
    report_zones() block device file operation.

    For device mappers, a new report_zones() target type operation is
    defined so that the upper block layer calls blkdev_report_zones() can
    be propagated down to the underlying devices of the dm targets.
    Implementation for this new operation is added to the dm-linear and
    dm-flakey targets.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    [Damien]
    * Changed method block_device argument to gendisk
    * Various bug fixes and improvements
    * Added support for null_blk, dm-linear and dm-flakey.
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Mike Snitzer
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • There is no need to synchronously execute all REQ_OP_ZONE_RESET BIOs
    necessary to reset a range of zones. Similarly to what is done for
    discard BIOs in blk-lib.c, all zone reset BIOs can be chained and
    executed asynchronously and a synchronous call done only for the last
    BIO of the chain.

    Modify blkdev_reset_zones() to operate similarly to
    blkdev_issue_discard() using the next_bio() helper for chaining BIOs. To
    avoid code duplication of that function in blk_zoned.c, rename
    next_bio() into blk_next_bio() and declare it as a block internal
    function in blk.h.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • There is no point in allocating more zone descriptors than the number of
    zones a block device has for doing a zone report. Avoid doing that in
    blkdev_report_zones_ioctl() by limiting the number of zone descriptors
    allocated internally to process the user request.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Introduce the blkdev_nr_zones() helper function to get the total
    number of zones of a zoned block device. This number is always 0 for a
    regular block device (q->limits.zoned == BLK_ZONED_NONE case).

    Replace hard-coded number of zones calculation in dmz_get_zoned_device()
    with a call to this helper.
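
    The calculation such a helper performs can be sketched as follows. The
    function name and parameters are hypothetical, and rounding up to count
    a smaller last zone is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Total number of zones of a device. A regular (non-zoned) device
 * has no zones.
 */
static unsigned int nr_zones_model(uint64_t capacity, uint64_t zone_sectors,
                                   bool zoned)
{
    if (!zoned || zone_sectors == 0)
        return 0;
    /* Round up so that a smaller last zone is counted. */
    return (unsigned int)((capacity + zone_sectors - 1) / zone_sectors);
}
```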

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

09 Jul, 2018

1 commit


13 Jun, 2018

1 commit

  • The kvmalloc() function has a 2-factor argument form, kvmalloc_array(). This
    patch replaces cases of:

    kvmalloc(a * b, gfp)

    with:

    kvmalloc_array(a, b, gfp)

    as well as handling cases of:

    kvmalloc(a * b * c, gfp)

    with:

    kvmalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kvmalloc_array(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kvmalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.
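
    The reason for preferring the 2-factor form is that the multiplication
    can be checked for overflow before allocating. A user-space analogue,
    using malloc() and a hypothetical helper name, might look like this:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Allocate an array of n elements of the given size. Returns NULL
 * instead of allocating a too-small buffer when n * size would
 * overflow size_t.
 */
static void *alloc_array_checked(size_t n, size_t size)
{
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;    /* n * size does not fit in size_t */
    return malloc(n * size);
}
```

    An open-coded malloc(n * size) would silently wrap in the overflow case
    and return a buffer far smaller than the caller expects.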

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kvmalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kvmalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kvmalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kvmalloc
    + kvmalloc_array
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kvmalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvmalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvmalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvmalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvmalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kvmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kvmalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kvmalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kvmalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kvmalloc(C1 * C2 * C3, ...)
    |
    kvmalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvmalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvmalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvmalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kvmalloc(sizeof(THING) * C2, ...)
    |
    kvmalloc(sizeof(TYPE) * C2, ...)
    |
    kvmalloc(C1 * C2 * C3, ...)
    |
    kvmalloc(C1 * C2, ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
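
    The point of the conversion above is that an open-coded a * b can wrap
    around and yield an undersized buffer, whereas the array forms check the
    multiplication. A minimal userspace sketch of that idea follows. The names
    checked_array_size() and checked_array3_size() are hypothetical stand-ins,
    not the kernel's implementation, and the sketch assumes a GCC- or
    Clang-compatible compiler for __builtin_mul_overflow():

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /*
     * Hypothetical userspace stand-in for a 2-factor size helper.
     * On overflow it saturates to SIZE_MAX, a size no allocator can
     * satisfy, so the allocation fails instead of silently returning
     * a buffer that is too small.
     */
    static size_t checked_array_size(size_t a, size_t b)
    {
    	size_t bytes;

    	if (__builtin_mul_overflow(a, b, &bytes))
    		return SIZE_MAX;
    	return bytes;
    }

    /* Same idea for three factors: check each multiplication in turn. */
    static size_t checked_array3_size(size_t a, size_t b, size_t c)
    {
    	size_t ab, bytes;

    	if (__builtin_mul_overflow(a, b, &ab))
    		return SIZE_MAX;
    	if (__builtin_mul_overflow(ab, c, &bytes))
    		return SIZE_MAX;
    	return bytes;
    }

    /* Small self-check showing the normal and the saturating case. */
    static void self_check(void)
    {
    	assert(checked_array_size(4, 8) == 32);
    	assert(checked_array_size(SIZE_MAX, 2) == SIZE_MAX);
    	assert(checked_array3_size(2, 3, 4) == 24);
    }
    ```

    A caller would pass the result straight to the allocator and treat a NULL
    return as the failure path, as with any other allocation failure.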
     

23 May, 2018

1 commit

  • Prevent complaints like the following from appearing in the kernel log
    when the number of zones is sufficiently large:

    fio: page allocation failure: order:9, mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
    Call Trace:
    dump_stack+0x63/0x88
    warn_alloc+0xf5/0x190
    __alloc_pages_slowpath+0x8f0/0xb0d
    __alloc_pages_nodemask+0x242/0x260
    alloc_pages_current+0x6a/0xb0
    kmalloc_order+0x18/0x50
    kmalloc_order_trace+0x26/0xb0
    __kmalloc+0x20e/0x220
    blkdev_report_zones_ioctl+0xa5/0x1a0
    blkdev_ioctl+0x1ba/0x930
    block_ioctl+0x41/0x50
    do_vfs_ioctl+0xaa/0x610
    SyS_ioctl+0x79/0x90
    do_syscall_64+0x79/0x1b0
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls")
    Signed-off-by: Bart Van Assche
    Cc: Shaun Tancheff
    Cc: Damien Le Moal
    Cc: Christoph Hellwig
    Cc: Martin K. Petersen
    Cc: Hannes Reinecke
    Cc:
    Signed-off-by: Jens Axboe

    Bart Van Assche
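
    For a sense of scale, "order:9" in the warning above means the page
    allocator was asked for 2^9 physically contiguous pages. The sketch below
    works through that arithmetic. The 4 KiB page size and the 64-byte
    per-zone descriptor are assumptions made for illustration (the commit
    message states neither), and alloc_order() is a hypothetical helper:

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    #define SKETCH_PAGE_SIZE      4096UL /* assumption: 4 KiB pages */
    #define SKETCH_ZONE_DESC_SIZE 64UL   /* assumption: bytes per zone descriptor */

    /*
     * Smallest order such that (SKETCH_PAGE_SIZE << order) >= bytes,
     * i.e. how many times the page size must be doubled to cover the
     * request. The second loop condition stops before the shift could
     * overflow for absurdly large requests.
     */
    static unsigned int alloc_order(size_t bytes)
    {
    	unsigned int order = 0;
    	size_t span = SKETCH_PAGE_SIZE;

    	while (span < bytes && span <= SIZE_MAX / 2) {
    		span <<= 1;
    		order++;
    	}
    	return order;
    }

    /* Self-check: one page is order 0, one byte more needs order 1. */
    static void order_self_check(void)
    {
    	assert(alloc_order(SKETCH_PAGE_SIZE) == 0);
    	assert(alloc_order(SKETCH_PAGE_SIZE + 1) == 1);
    	/* 32768 descriptors * 64 bytes = 2 MiB = 512 contiguous pages */
    	assert(alloc_order(32768 * SKETCH_ZONE_DESC_SIZE) == 9);
    }
    ```

    Under these assumptions, a report buffer for 32768 zones already needs a
    2 MiB physically contiguous region, which is hard to find on a fragmented
    system.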
     

09 Mar, 2018

1 commit