31 Mar, 2020

1 commit

  • Pull block driver updates from Jens Axboe:

    - floppy driver cleanup series from Willy

    - NVMe updates and fixes (Various)

    - null_blk trace improvements (Chaitanya)

    - bcache fixes (Coly)

    - md fixes (via Song)

    - loop block size change optimizations (Martijn)

    - scnprintf() use (Takashi)

    * tag 'for-5.7/drivers-2020-03-29' of git://git.kernel.dk/linux-block: (81 commits)
    null_blk: add trace in null_blk_zoned.c
    null_blk: add tracepoint helpers for zoned mode
    block: add a zone condition debug helper
    nvme: cleanup namespace identifier reporting in nvme_init_ns_head
    nvme: rename __nvme_find_ns_head to nvme_find_ns_head
    nvme: refactor nvme_identify_ns_descs error handling
    nvme-tcp: Add warning on state change failure at nvme_tcp_setup_ctrl
    nvme-rdma: Add warning on state change failure at nvme_rdma_setup_ctrl
    nvme: Fix controller creation races with teardown flow
    nvme: Make nvme_uninit_ctrl symmetric to nvme_init_ctrl
    nvme: Fix ctrl use-after-free during sysfs deletion
    nvme-pci: Re-order nvme_pci_free_ctrl
    nvme: Remove unused return code from nvme_delete_ctrl_sync
    nvme: Use nvme_state_terminal helper
    nvme: release ida resources
    nvme: Add compat_ioctl handler for NVME_IOCTL_SUBMIT_IO
    nvmet-tcp: optimize tcp stack TX when data digest is used
    nvme-fabrics: Use scnprintf() for avoiding potential buffer overflow
    nvme-multipath: do not reset on unknown status
    nvmet-rdma: allocate RW ctxs according to mdts
    ...

    Linus Torvalds
     

28 Mar, 2020

1 commit


12 Mar, 2020

1 commit

  • Check for overflow in addition before checking for end-of-block-device.

    Steps to reproduce:

    #define _GNU_SOURCE 1
    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    typedef unsigned long long __u64;

    struct blk_zone_range {
        __u64 sector;
        __u64 nr_sectors;
    };

    #define BLKRESETZONE _IOW(0x12, 131, struct blk_zone_range)

    int main(void)
    {
        int fd = open("/dev/nullb0", O_RDWR|O_DIRECT);
        struct blk_zone_range zr = {4096, 0xfffffffffffff000ULL};
        ioctl(fd, BLKRESETZONE, &zr);
        return 0;
    }

    BUG: KASAN: null-ptr-deref in submit_bio_wait+0x74/0xe0
    Write of size 8 at addr 0000000000000040 by task a.out/1590

    CPU: 8 PID: 1590 Comm: a.out Not tainted 5.6.0-rc1-00019-g359c92c02bfa #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014
    Call Trace:
    dump_stack+0x76/0xa0
    __kasan_report.cold+0x5/0x3e
    kasan_report+0xe/0x20
    submit_bio_wait+0x74/0xe0
    blkdev_zone_mgmt+0x26f/0x2a0
    blkdev_zone_mgmt_ioctl+0x14b/0x1b0
    blkdev_ioctl+0xb28/0xe60
    block_ioctl+0x69/0x80
    ksys_ioctl+0x3af/0xa50

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alexey Dobriyan (SK hynix)
    Signed-off-by: Jens Axboe

    Alexey Dobriyan
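
The check described in this entry can be illustrated with a minimal userspace sketch. It is not the kernel's code: the helper name `zone_range_valid` and its `capacity` parameter are hypothetical, and it only shows why the wrap-around test has to come before the end-of-device comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from the kernel): returns true when the
 * range [sector, sector + nr_sectors) is non-empty, does not wrap around
 * and ends within a device of `capacity` sectors.
 */
static bool zone_range_valid(uint64_t sector, uint64_t nr_sectors,
                             uint64_t capacity)
{
    /* Unsigned addition wraps modulo 2^64 instead of trapping. */
    uint64_t end = sector + nr_sectors;

    /* A wrapped (or empty) range shows up as end <= sector. */
    if (end <= sector)
        return false;

    /* Only now is the end-of-device comparison meaningful. */
    return end <= capacity;
}
```

With the values from the reproducer above (sector 4096, nr_sectors 0xfffffffffffff000), the sum wraps to 0, so a check that only compares the end against the capacity would accept the range.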
     

09 Jan, 2020

1 commit

  • In the current implementation, the final zone-mgmt request is issued with
    submit_bio_wait(), which marks the bio REQ_SYNC. This is needed since
    immediate action is expected for zone-mgmt requests, as these are
    blocking operations. This also bypasses the scheduler in
    blk_mq_make_request() and dispatches the request directly into the
    hw ctx.

    This patch marks all the chained bios REQ_SYNC so that the non-final
    bios get the above-mentioned behavior as well.

    Reviewed-by: Damien Le Moal
    Reviewed-by: Bob Liu
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     

04 Dec, 2019

1 commit

  • The current zone revalidation code has a major problem in that it
    doesn't update the zone size and q->nr_zones atomically, leading
    to a short window where an out of bounds access to the zone arrays
    is possible.

    To fix this, move the setting of the zone size into the critical
    section of blk_revalidate_disk_zones() so that it gets updated together
    with the zone bitmaps and q->nr_zones. This also slightly simplifies
    the caller, as the zone size is now deduced from report_zones.

    This change also allows checking for a power-of-two zone size in generic
    code.

    Reported-by: Hans Holmberg
    Reviewed-by: Javier González
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

03 Dec, 2019

5 commits


13 Nov, 2019

5 commits

  • Avoid the need to allocate a potentially large array of struct blk_zone
    in the block layer by switching the ->report_zones method interface to
    a callback model. Now the caller simply supplies a callback that is
    executed on each reported zone, and private data for it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Shin'ichiro Kawasaki
    Signed-off-by: Damien Le Moal
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Christoph Hellwig
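
The callback model described in this entry can be sketched in plain C. This is a simplified userspace illustration, not the kernel interface: `struct zone_desc`, `report_zones_cb`, `report_zones` and `sum_len_cb` are hypothetical names, and the toy "driver" assumes equally sized zones.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a zone descriptor (hypothetical). */
struct zone_desc {
    uint64_t start; /* first sector of the zone */
    uint64_t len;   /* zone length in sectors */
};

/* Called once per reported zone; a nonzero return stops the report. */
typedef int (*report_zones_cb)(const struct zone_desc *zone,
                               unsigned int idx, void *data);

/*
 * Toy report function: builds one descriptor at a time on the stack and
 * hands it to the callback, so no descriptor array is ever allocated.
 * Returns the number of zones reported, or the callback's nonzero value.
 */
static int report_zones(uint64_t capacity, uint64_t zone_len, uint64_t sector,
                        unsigned int nr_zones, report_zones_cb cb, void *data)
{
    unsigned int idx = 0;

    if (zone_len == 0)
        return -1;

    sector -= sector % zone_len; /* align down to the zone start */
    while (idx < nr_zones && sector < capacity) {
        struct zone_desc zone = { .start = sector, .len = zone_len };
        int ret;

        if (capacity - sector < zone_len)
            zone.len = capacity - sector; /* smaller last zone */

        ret = cb(&zone, idx, data);
        if (ret)
            return ret;

        sector += zone.len;
        idx++;
    }
    return (int)idx;
}

/* Example callback: accumulate zone lengths into the private data. */
static int sum_len_cb(const struct zone_desc *zone, unsigned int idx,
                      void *data)
{
    (void)idx;
    *(uint64_t *)data += zone->len;
    return 0;
}
```

A caller passes a pointer to its own accumulator as the private data, for example `report_zones(capacity, zone_len, 0, 16, sum_len_cb, &total)`.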
     
  • No known partitioning tool supports zoned block devices, especially the
    host managed flavor with strong sequential write constraints.
    Furthermore, there are no known users of, or use cases for, partitioned
    zoned block devices.

    This patch removes partition device creation for zoned block devices,
    which allows simplifying the processing of zone commands for zoned
    block devices. A warning is added if a partition table is found on the
    device.

    For report zones operations no zone sector information remapping is
    necessary anymore, simplifying the code. Of note is that remapping of
    zone reports for DM targets is still necessary as done by
    dm_remap_zone_report().

    Similarly, remapping of a zone reset bio is not necessary anymore.
    Testing for the applicability of the zone reset all request also becomes
    simpler and only needs to check that the number of sectors of the
    requested zone range is equal to the disk capacity.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • All kernel users of blkdev_report_zones(), as well as applications using
    it through ioctl(BLKREPORTZONE), expect to potentially get fewer zone
    descriptors than requested. As such, the internal report zones command
    execution loop implemented by blk_report_zones() is not necessary and
    can even hurt performance by executing inefficient small report zones
    commands to service the remainder of a requested zone array.

    This patch removes blk_report_zones(), simplifying the code. Also
    remove a now incorrect comment in dm_blk_report_zones().

    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Javier Gonzalez
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • blk_revalidate_disk_zones is never called for non-zoned devices. Just
    return early and warn instead of trying to handle this case.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • For ZBC and ZAC zoned devices, the scsi driver revalidation processing
    implemented by sd_revalidate_disk() includes a call to
    sd_zbc_read_zones() which executes a full disk zone report used to
    check that all zones of the disk are the same size. This processing is
    followed by a call to blk_revalidate_disk_zones(), used to initialize
    the device request queue zone bitmaps (zone type and zone write lock
    bitmaps). To do so, blk_revalidate_disk_zones() also executes a full
    device zone report to obtain zone types. As a result, the entire
    zoned block device revalidation process includes two full device zone
    reports.

    By moving the zone size checks into blk_revalidate_disk_zones(), this
    process can be optimized to a single full device zone report, leading to
    shorter device scan and revalidation times. This patch implements this
    optimization, reducing the original full device zone report implemented
    in sd_zbc_check_zones() to a single, small, report zones command
    execution to obtain the size of the first zone of the device. Checks
    whether all zones of the device are the same size as the first zone
    size are moved to the generic blk_check_zone() function called from
    blk_revalidate_disk_zones().

    This optimization also has the following benefits:
    1) Fewer memory allocations in the scsi layer during disk revalidation,
       as the potentially large buffer for zone report execution is not
       needed.
    2) Zone checks are implemented in a generic manner, reducing the burden
       on device drivers, which only need to obtain the zone size and check
       that this size is a power of 2 number of LBAs. Any new type of zoned
       block device will benefit from this.

    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Damien Le Moal
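
The generic zone checks described in this entry (every zone the same size as the first one, and that size a power of 2) can be sketched as a single pass over the reported zone lengths. This is an illustrative userspace sketch; `is_power_of_two` and `zones_uniform` are hypothetical names, not kernel functions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when n has exactly one bit set. */
static bool is_power_of_two(uint64_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Single pass over the zone lengths (in sectors): the first zone must have
 * a power-of-two size and every other zone must match the first one.
 */
static bool zones_uniform(const uint64_t *zone_len, size_t nr_zones)
{
    if (nr_zones == 0 || !is_power_of_two(zone_len[0]))
        return false;

    for (size_t i = 1; i < nr_zones; i++) {
        if (zone_len[i] != zone_len[0])
            return false;
    }
    return true;
}
```

Because the size comparison happens while the zones are being walked anyway, no second full report is needed just to validate zone sizes.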
     

07 Nov, 2019

4 commits

  • Introduce three new ioctl commands BLKOPENZONE, BLKCLOSEZONE and
    BLKFINISHZONE to allow applications to control the condition of zones
    on a zoned block device through the execution of the REQ_OP_ZONE_OPEN,
    REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH operations.

    Contains contributions from Matias Bjorling, Hans Holmberg,
    Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig.

    Reviewed-by: Javier González
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ajay Joshi
    Signed-off-by: Matias Bjorling
    Signed-off-by: Hans Holmberg
    Signed-off-by: Dmitry Fomichev
    Signed-off-by: Keith Busch
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Ajay Joshi
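
From user space, the three ioctls introduced here take a `struct blk_zone_range`, like the BLKRESETZONE reproducer earlier in this log. The wrapper below is a minimal sketch that assumes kernel headers new enough to define these ioctls in `<linux/blkzoned.h>`; the function name `zone_ioctl` is hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

/*
 * Apply BLKOPENZONE, BLKCLOSEZONE or BLKFINISHZONE to the zones covering
 * [sector, sector + nr_sectors) of an already opened zoned block device.
 * Returns 0 on success or a negative errno value on failure.
 */
static int zone_ioctl(int fd, unsigned long cmd, uint64_t sector,
                      uint64_t nr_sectors)
{
    struct blk_zone_range range = {
        .sector = sector,
        .nr_sectors = nr_sectors,
    };

    if (ioctl(fd, cmd, &range) < 0)
        return -errno;
    return 0;
}
```

A call such as `zone_ioctl(fd, BLKFINISHZONE, zone_start, zone_sectors)` would then finish a single zone, where `zone_start` and `zone_sectors` are placeholders for a zone-aligned range of the device.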
     
  • Zoned block devices (ZBC and ZAC devices) allow an explicit control
    over the condition (state) of zones. The operations allowed are:
    * Open a zone: Transition to open condition to indicate that a zone will
    actively be written
    * Close a zone: Transition to closed condition to release the drive
    resources used for writing to a zone
    * Finish a zone: Transition an open or closed zone to the full
    condition to prevent write operations

    To enable this control for in-kernel zoned block device users, define
    the new request operations REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE
    and REQ_OP_ZONE_FINISH as well as the generic function
    blkdev_zone_mgmt() for submitting these operations on a range of zones.
    This results in the removal of blkdev_reset_zones() and its replacement
    with this new zone management function. Users of blkdev_reset_zones()
    (f2fs and dm-zoned) are updated accordingly.

    Contains contributions from Matias Bjorling, Hans Holmberg,
    Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig.

    Reviewed-by: Javier González
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ajay Joshi
    Signed-off-by: Matias Bjorling
    Signed-off-by: Hans Holmberg
    Signed-off-by: Dmitry Fomichev
    Signed-off-by: Keith Busch
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Ajay Joshi
     
  • There is no need for the function __blkdev_reset_all_zones() as
    REQ_OP_ZONE_RESET_ALL can be handled directly in blkdev_reset_zones()
    bio loop with an early break from the loop. This patch removes this
    function and modifies blkdev_reset_zones(), simplifying the code.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
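
The loop structure described in this entry, one reset per zone with an early break when a single reset-all operation covers the whole range, can be sketched as follows. This is a userspace model under stated assumptions: `struct reset_stats` and `reset_zones` are hypothetical, the counters stand in for bio submission, and the range is assumed to be zone-aligned and already validated.

```c
#include <stdbool.h>
#include <stdint.h>

/* Counts the operations a real implementation would submit. */
struct reset_stats {
    unsigned int reset_one; /* per-zone reset operations */
    unsigned int reset_all; /* whole-device reset operations */
};

static struct reset_stats reset_zones(uint64_t capacity, uint64_t zone_sectors,
                                      bool has_reset_all, uint64_t sector,
                                      uint64_t nr_sectors)
{
    struct reset_stats stats = { 0, 0 };
    uint64_t end = sector + nr_sectors;

    if (zone_sectors == 0)
        return stats;

    while (sector < end) {
        /* The whole device is covered: one operation, then leave the loop. */
        if (has_reset_all && sector == 0 && nr_sectors == capacity) {
            stats.reset_all++;
            break;
        }

        /* Otherwise issue one reset for the zone starting at `sector`. */
        stats.reset_one++;
        sector += zone_sectors;
    }
    return stats;
}
```

Handling the reset-all case inside the loop means there is a single code path for issuing operations, which is the simplification this entry describes.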
     
  • REQ_OP_ZONE_RESET operations cannot be merged as these bios and requests
    do not have a size and are never sequential due to the zone start sector
    position required for their execution. As a result, there is no point in
    using a plug around blkdev_reset_zones() bio issuing loop. This patch
    removes this unnecessary plugging.

    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Javier González
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

05 Aug, 2019

1 commit

  • This implements REQ_OP_ZONE_RESET_ALL as a special case of the block
    device zone reset operations, where we simply issue a bio with the
    newly introduced req op.

    We issue this req op when the requested number of sectors is equal to
    the number of sectors of the whole device and the device has no
    partitions.

    We also add support so that blk_op_str() can print the new reset-all
    zone operation.

    This patch also adds a generic make request check for the newly
    introduced REQ_OP_ZONE_RESET_ALL req_opf. We simply return an error
    when the queue is zoned and the reset-all flag is not set for
    REQ_OP_ZONE_RESET_ALL.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Damien Le Moal
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     

12 Jul, 2019

2 commits

  • Limit the size of the struct blk_zone array used in
    blk_revalidate_disk_zones() to avoid memory allocation failures leading
    to disk revalidation failure. Also further reduce the likelihood of
    such failures by using kvcalloc() (that is vmalloc()) instead of
    allocating contiguous pages with alloc_pages().

    Fixes: 515ce6061312 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation")
    Fixes: e76239a3748c ("block: add a report_zones method")
    Cc: stable@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Only GFP_KERNEL and GFP_NOIO are used with blkdev_report_zones(). In
    preparation of using vmalloc() for large report buffer and zone array
    allocations used by this function, remove its "gfp_t gfp_mask" argument
    and rely on the caller context to use memalloc_noio_save/restore() where
    necessary (block layer zone revalidation and dm-zoned I/O error path).

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

10 Jul, 2019

1 commit

  • For large values of the number of zones reported and/or large zone
    sizes, the sector increment calculated with

    blk_queue_zone_sectors(q) * n

    in blk_report_zones() loop can overflow the unsigned int type used for
    the calculation as both "n" and blk_queue_zone_sectors() value are
    unsigned int. E.g. for a device with 256 MB zones (524288 sectors),
    overflow happens with 8192 or more zones reported.

    Changing the return type of blk_queue_zone_sectors() to sector_t fixes
    this problem and avoids overflow problems for all other callers of this
    helper too. The same change is also applied to the bdev_zone_sectors()
    helper.

    Fixes: e76239a3748c ("block: add a report_zones method")
    Cc: stable@vger.kernel.org
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
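
The arithmetic in this entry can be checked with a small sketch. Here `uint32_t` stands in for unsigned int and `uint64_t` for a 64-bit sector_t, which are assumptions of the example, and both function names are hypothetical.

```c
#include <stdint.h>

/* Both operands are 32 bits wide: the product wraps modulo 2^32. */
static uint32_t sector_increment_32(uint32_t zone_sectors, uint32_t nr_zones)
{
    return zone_sectors * nr_zones;
}

/* Widening one operand first makes the multiplication 64 bits wide. */
static uint64_t sector_increment_64(uint32_t zone_sectors, uint32_t nr_zones)
{
    return (uint64_t)zone_sectors * nr_zones;
}
```

For 256 MB zones (524288 = 2^19 sectors) and 8192 = 2^13 zones, the product is exactly 2^32, so the 32-bit version yields 0 while the 64-bit version yields 4294967296.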
     

01 May, 2019

1 commit


29 Dec, 2018

1 commit

  • Pull block updates from Jens Axboe:
    "This is the main pull request for block/storage for 4.21.

    Larger than usual, it was a busy round with lots of goodies queued up.
    Most notable is the removal of the old IO stack, which has been a long
    time coming. No new features for a while, everything coming in this
    week has all been fixes for things that were previously merged.

    This contains:

    - Use atomic counters instead of semaphores for mtip32xx (Arnd)

    - Cleanup of the mtip32xx request setup (Christoph)

    - Fix for circular locking dependency in loop (Jan, Tetsuo)

    - bcache (Coly, Guoju, Shenghui)
    * Optimizations for writeback caching
    * Various fixes and improvements

    - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith)
    * host and target support for NVMe over TCP
    * Error log page support
    * Support for separate read/write/poll queues
    * Much improved polling
    * discard OOM fallback
    * Tracepoint improvements

    - lightnvm (Hans, Hua, Igor, Matias, Javier)
    * Igor added packed metadata to pblk. Now drives without metadata
    per LBA can be used as well.
    * Fix from Geert on uninitialized value on chunk metadata reads.
    * Fixes from Hans and Javier to pblk recovery and write path.
    * Fix from Hua Su to fix a race condition in the pblk recovery
    code.
    * Scan optimization added to pblk recovery from Zhoujie.
    * Small geometry cleanup from me.

    - Conversion of the last few drivers that used the legacy path to
    blk-mq (me)

    - Removal of legacy IO path in SCSI (me, Christoph)

    - Removal of legacy IO stack and schedulers (me)

    - Support for much better polling, now without interrupts at all.
    blk-mq adds support for multiple queue maps, which enables us to
    have a map per type. This in turn enables nvme to have separate
    completion queues for polling, which can then be interrupt-less.
    Also means we're ready for async polled IO, which is hopefully
    coming in the next release.

    - Killing of (now) unused block exports (Christoph)

    - Unification of the blk-rq-qos and blk-wbt wait handling (Josef)

    - Support for zoned testing with null_blk (Masato)

    - sx8 conversion to per-host tag sets (Christoph)

    - IO priority improvements (Damien)

    - mq-deadline zoned fix (Damien)

    - Ref count blkcg series (Dennis)

    - Lots of blk-mq improvements and speedups (me)

    - sbitmap scalability improvements (me)

    - Make core inflight IO accounting per-cpu (Mikulas)

    - Export timeout setting in sysfs (Weiping)

    - Cleanup the direct issue path (Jianchao)

    - Export blk-wbt internals in block debugfs for easier debugging
    (Ming)

    - Lots of other fixes and improvements"

    * tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits)
    kyber: use sbitmap add_wait_queue/list_del wait helpers
    sbitmap: add helpers for add/del wait queue handling
    block: save irq state in blkg_lookup_create()
    dm: don't reuse bio for flushes
    nvme-pci: trace SQ status on completions
    nvme-rdma: implement polling queue map
    nvme-fabrics: allow user to pass in nr_poll_queues
    nvme-fabrics: allow nvmf_connect_io_queue to poll
    nvme-core: optionally poll sync commands
    block: make request_to_qc_t public
    nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
    nvme-tcp: fix endianess annotations
    nvmet-tcp: fix endianess annotations
    nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
    nvme-pci: only set nr_maps to 2 if poll queues are supported
    nvmet: use a macro for default error location
    nvmet: fix comparison of a u16 with -1
    blk-mq: enable IO poll if .nr_queues of type poll > 0
    blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
    blk-mq: skip zero-queue maps in blk_mq_map_swqueue
    ...

    Linus Torvalds
     

12 Dec, 2018

1 commit

  • null_blk_zoned creation fails if the number of zones specified is equal to or is
    smaller than 64 due to a memory allocation failure in blk_alloc_zones(). With
    such a small number of zones, the required memory size for all zones descriptors
    fits in a single page, and the page order for alloc_pages_node() is zero. Allow
    this value in blk_alloc_zones() for the allocation to succeed.

    Fixes: bf5054569653 ("block: Introduce blk_revalidate_disk_zones()")
    Reviewed-by: Damien Le Moal
    Signed-off-by: Shin'ichiro Kawasaki
    Signed-off-by: Jens Axboe

    Shin'ichiro Kawasaki
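
The relationship between the number of zones and the page order can be sketched as below. The 4096-byte page size and the 64-byte zone descriptor size are assumptions of this example (chosen because they make 64 descriptors fill exactly one page, as the entry describes), and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED      4096u /* assumption: 4 KiB pages */
#define ZONE_DESC_SIZE_ASSUMED 64u   /* assumption: 64-byte descriptors */

/* Smallest order such that (page size << order) >= size. */
static unsigned int order_for_size(size_t size)
{
    unsigned int order = 0;
    size_t span = PAGE_SIZE_ASSUMED;

    /* The second condition keeps the shift from overflowing. */
    while (span < size && span <= SIZE_MAX / 2) {
        span <<= 1;
        order++;
    }
    return order;
}

/* Page order needed to hold the descriptors of nr_zones zones. */
static unsigned int order_for_zones(size_t nr_zones)
{
    return order_for_size(nr_zones * ZONE_DESC_SIZE_ASSUMED);
}
```

Under these assumptions any device with 64 or fewer zones needs order 0, so code that treats an order of zero as an error rejects a perfectly valid single-page allocation.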
     

16 Nov, 2018

1 commit

  • Various spots check for q->mq_ops being non-NULL, but provide
    a helper to do this instead.

    Where the ->mq_ops != NULL check is redundant, remove it.

    Since mq == rq-based now that legacy is gone, get rid of the
    queue_is_rq_based() and just use queue_is_mq() everywhere.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

26 Oct, 2018

5 commits

  • Drivers exposing zoned block devices have to initialize and maintain
    correctness (i.e. revalidate) of the device zone bitmaps attached to
    the device request queue (seq_zones_bitmap and seq_zones_wlock).

    To simplify coding this, introduce a generic helper function
    blk_revalidate_disk_zones() suitable for most (and likely all) cases.
    This new function always updates the seq_zones_bitmap and seq_zones_wlock
    bitmaps as well as the queue nr_zones field when called for a disk
    using a request based queue. For a disk using a BIO based queue, only
    the number of zones is updated since these queues do not have
    schedulers and so do not need the zone bitmaps.

    With this change, the zone bitmap initialization code in sd_zbc.c can be
    replaced with a call to this function in sd_zbc_read_zones(), which is
    called from the disk revalidate block operation method.

    A call to blk_revalidate_disk_zones() is also added to the null_blk
    driver for devices created with the zoned mode enabled.

    Finally, to ensure that zoned devices created with dm-linear or
    dm-flakey expose the correct number of zones through sysfs, a call to
    blk_revalidate_disk_zones() is added to dm_table_set_restrictions().

    The zone bitmaps allocated and initialized with
    blk_revalidate_disk_zones() are freed automatically from
    __blk_release_queue() using the block internal function
    blk_queue_free_zone_bitmaps().

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Mike Snitzer
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
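
The idea of a per-queue zone bitmap, one bit per zone marking the sequential zones, can be illustrated with a small userspace sketch. It does not reproduce the kernel helper: the enum, the bitmap helpers and `build_seq_zones_bitmap` are hypothetical names, and `calloc()` stands in for the kernel allocator.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

enum zone_type { ZONE_CONVENTIONAL, ZONE_SEQUENTIAL };

/* Allocate a zeroed bitmap with one bit per zone; release with free(). */
static unsigned long *zone_bitmap_alloc(size_t nr_zones)
{
    size_t words = (nr_zones + BITS_PER_WORD - 1) / BITS_PER_WORD;

    return calloc(words ? words : 1, sizeof(unsigned long));
}

static void zone_bitmap_set(unsigned long *map, size_t zno)
{
    map[zno / BITS_PER_WORD] |= 1UL << (zno % BITS_PER_WORD);
}

static int zone_bitmap_test(const unsigned long *map, size_t zno)
{
    return (int)((map[zno / BITS_PER_WORD] >> (zno % BITS_PER_WORD)) & 1UL);
}

/* Set the bit of every sequential zone; returns NULL if allocation fails. */
static unsigned long *build_seq_zones_bitmap(const enum zone_type *types,
                                             size_t nr_zones)
{
    unsigned long *map = zone_bitmap_alloc(nr_zones);

    if (!map)
        return NULL;

    for (size_t i = 0; i < nr_zones; i++) {
        if (types[i] == ZONE_SEQUENTIAL)
            zone_bitmap_set(map, i);
    }
    return map;
}
```

A second bitmap of the same size, initially all zeroes, could play the role of the write-lock bitmap mentioned above.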
     
  • Dispatching a report zones command through the request queue is a major
    pain due to the command reply payload rewriting necessary. Given that
    blkdev_report_zones() is executing everything synchronously, implement
    report zones as a block device file operation instead, allowing major
    simplification of the code in many places.

    Since sd, null_blk, dm-linear and dm-flakey are the only block device
    drivers that expose zoned block devices, these drivers are modified
    to provide the device side implementation of the report_zones() block
    device file operation.

    For device mappers, a new report_zones() target type operation is
    defined so that the upper block layer calls blkdev_report_zones() can
    be propagated down to the underlying devices of the dm targets.
    Implementation for this new operation is added to the dm-linear and
    dm-flakey targets.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    [Damien]
    * Changed method block_device argument to gendisk
    * Various bug fixes and improvements
    * Added support for null_blk, dm-linear and dm-flakey.
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Mike Snitzer
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • There is no need to synchronously execute all REQ_OP_ZONE_RESET BIOs
    necessary to reset a range of zones. Similarly to what is done for
    discard BIOs in blk-lib.c, all zone reset BIOs can be chained and
    executed asynchronously and a synchronous call done only for the last
    BIO of the chain.

    Modify blkdev_reset_zones() to operate similarly to
    blkdev_issue_discard() using the next_bio() helper for chaining BIOs. To
    avoid code duplication of that function in blk_zoned.c, rename
    next_bio() into blk_next_bio() and declare it as a block internal
    function in blk.h.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • There is no point in allocating more zone descriptors than the number of
    zones a block device has for doing a zone report. Avoid doing that in
    blkdev_report_zones_ioctl() by limiting the number of zone descriptors
    allocated internally to process the user request.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Introduce the blkdev_nr_zones() helper function to get the total
    number of zones of a zoned block device. This number is always 0 for a
    regular block device (q->limits.zoned == BLK_ZONED_NONE case).

    Replace hard-coded number of zones calculation in dmz_get_zoned_device()
    with a call to this helper.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
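
The zone count helper from this entry, together with the descriptor limit from the preceding one, can be sketched as below. This is an illustration only: both function names are hypothetical, a zone size of 0 is used here to represent a regular (non-zoned) device, and rounding up so that a smaller last zone is counted is an assumption of the sketch.

```c
#include <stdint.h>

/* Number of zones of a device; 0 when the device is not zoned. */
static unsigned int nr_zones(uint64_t capacity, uint64_t zone_sectors)
{
    if (zone_sectors == 0)
        return 0;

    /* Round up so that a smaller last zone is still counted. */
    return (unsigned int)((capacity + zone_sectors - 1) / zone_sectors);
}

/* Never allocate more descriptors than the device has zones. */
static unsigned int zones_to_allocate(unsigned int requested,
                                      unsigned int device_zones)
{
    return requested < device_zones ? requested : device_zones;
}
```

A report request for 100 zones on a 4-zone device would then allocate only 4 descriptors.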
     

09 Jul, 2018

1 commit


13 Jun, 2018

1 commit

  • The kvmalloc() function has a 2-factor argument form, kvmalloc_array(). This
    patch replaces cases of:

    kvmalloc(a * b, gfp)

    with:

    kvmalloc_array(a, b, gfp)

    as well as handling cases of:

    kvmalloc(a * b * c, gfp)

    with:

    kvmalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kvmalloc_array(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kvmalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kvmalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kvmalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kvmalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kvmalloc
    + kvmalloc_array
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kvmalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvmalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvmalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvmalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvmalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kvmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kvmalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kvmalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kvmalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kvmalloc(C1 * C2 * C3, ...)
    |
    kvmalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvmalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvmalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvmalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kvmalloc(sizeof(THING) * C2, ...)
    |
    kvmalloc(sizeof(TYPE) * C2, ...)
    |
    kvmalloc(C1 * C2 * C3, ...)
    |
    kvmalloc(C1 * C2, ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
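
    The semantic patch above replaces open-coded products with array3_size()
    and kvmalloc_array() so that an overflowing multiplication cannot turn
    into an undersized allocation. A minimal userspace sketch of that idea
    follows, assuming the GCC/Clang builtin __builtin_mul_overflow. The names
    sketch_array3_size and sketch_alloc_array are hypothetical stand-ins and
    not the kernel helpers themselves.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for array3_size():
 * returns a * b * c, or SIZE_MAX if the product overflows size_t,
 * so that a later allocation of that size fails instead of succeeding
 * with a truncated length.
 */
static size_t sketch_array3_size(size_t a, size_t b, size_t c)
{
    size_t bytes;

    if (__builtin_mul_overflow(a, b, &bytes))
        return SIZE_MAX;
    if (__builtin_mul_overflow(bytes, c, &bytes))
        return SIZE_MAX;
    return bytes;
}

/*
 * Hypothetical stand-in for kvmalloc_array(): the element count and the
 * element size are passed separately, and the request is refused when
 * n * size does not fit in size_t.
 */
static void *sketch_alloc_array(size_t n, size_t size)
{
    size_t bytes;

    if (__builtin_mul_overflow(n, size, &bytes))
        return NULL;
    return malloc(bytes);
}
```

    Passing the count and the element size as separate arguments is what
    lets the helper see the overflow; once the caller has already multiplied
    them, the wrapped result looks like any other small size.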
     

23 May, 2018

1 commit

  • Avoid that complaints similar to the following appear in the kernel log
    if the number of zones is sufficiently large:

    fio: page allocation failure: order:9, mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
    Call Trace:
    dump_stack+0x63/0x88
    warn_alloc+0xf5/0x190
    __alloc_pages_slowpath+0x8f0/0xb0d
    __alloc_pages_nodemask+0x242/0x260
    alloc_pages_current+0x6a/0xb0
    kmalloc_order+0x18/0x50
    kmalloc_order_trace+0x26/0xb0
    __kmalloc+0x20e/0x220
    blkdev_report_zones_ioctl+0xa5/0x1a0
    blkdev_ioctl+0x1ba/0x930
    block_ioctl+0x41/0x50
    do_vfs_ioctl+0xaa/0x610
    SyS_ioctl+0x79/0x90
    do_syscall_64+0x79/0x1b0
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls")
    Signed-off-by: Bart Van Assche
    Cc: Shaun Tancheff
    Cc: Damien Le Moal
    Cc: Christoph Hellwig
    Cc: Martin K. Petersen
    Cc: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Bart Van Assche
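
    The "order:9" in the log above means a single physically contiguous
    request of 2^9 pages. The arithmetic can be sketched as follows, assuming
    4 KiB pages and a 64-byte zone descriptor; both values are assumptions
    for illustration and are not stated in the commit. The helper names are
    hypothetical.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE      4096UL /* assumption: 4 KiB pages */
#define SKETCH_ZONE_DESC_SIZE 64UL   /* assumption: bytes per zone descriptor */

/*
 * Smallest order such that (SKETCH_PAGE_SIZE << order) >= bytes.
 * Intended for small illustrative sizes only; it does not guard
 * against the shift overflowing for absurdly large requests.
 */
static unsigned int sketch_alloc_order(size_t bytes)
{
    unsigned int order = 0;
    size_t span = SKETCH_PAGE_SIZE;

    while (span < bytes) {
        span <<= 1;
        order++;
    }
    return order;
}

/* Order of one contiguous buffer holding nr_zones zone descriptors. */
static unsigned int sketch_report_order(size_t nr_zones)
{
    return sketch_alloc_order(nr_zones * SKETCH_ZONE_DESC_SIZE);
}
```

    Under these assumptions a buffer for 32768 zones is 2 MiB, which is an
    order-9 request. The zone count of the device that produced the log is
    not given in the commit message.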
     

09 Mar, 2018

1 commit


06 Jan, 2018

1 commit

  • Components relying only on the request_queue structure for accessing
    block devices (e.g. I/O schedulers) have a limited knowledge of the
    device characteristics. In particular, the device capacity cannot be
    easily discovered, which for a zoned block device also results in the
    inability to easily know the number of zones of the device (the zone
    size is indicated by the chunk_sectors field of the queue limits).

    Introduce the nr_zones field to the request_queue structure to simplify
    access to this information. Also, add the bitmap seq_zone_bitmap which
    indicates which zones of the device are sequential zones (write
    preferred or write required) and the bitmap seq_zones_wlock which
    indicates if a zone is write locked, that is, if a write request
    targeting a zone was dispatched to the device. These fields are
    initialized by the low level block device driver (sd.c for ZBC/ZAC
    disks). They are not initialized by stacking drivers (device mappers)
    handling zoned block devices (e.g. dm-linear).

    Using this, I/O schedulers can introduce zone write locking to control
    request dispatching to a zoned block device and avoid write request
    reordering by limiting to at most a single write request per zone
    outside of the scheduler at any time.

    Based on previous patches from Damien Le Moal.

    Signed-off-by: Christoph Hellwig
    [Damien]
    * Fixed comments and indentation in blkdev.h
    * Changed helper functions
    * Fixed this commit message
    Signed-off-by: Damien Le Moal
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Christoph Hellwig
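
    The zone write locking described above can be sketched in userspace as
    follows. This is a simplified, non-atomic illustration: struct
    sketch_queue and every sketch_* helper are hypothetical, the field names
    follow the spelling used in the commit message, and a real scheduler
    would need atomic bit operations or a lock around the test-and-set.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical subset of the per-queue zone information. */
struct sketch_queue {
    uint64_t chunk_sectors;          /* zone size in 512B sectors */
    unsigned int nr_zones;           /* number of zones of the device */
    unsigned long *seq_zone_bitmap;  /* bit set: zone is sequential */
    unsigned long *seq_zones_wlock;  /* bit set: a write was dispatched */
};

static bool sketch_test_bit(const unsigned long *map, unsigned int nr)
{
    return (map[nr / SKETCH_BITS_PER_LONG] >>
            (nr % SKETCH_BITS_PER_LONG)) & 1UL;
}

static void sketch_set_bit(unsigned long *map, unsigned int nr)
{
    map[nr / SKETCH_BITS_PER_LONG] |= 1UL << (nr % SKETCH_BITS_PER_LONG);
}

static void sketch_clear_bit(unsigned long *map, unsigned int nr)
{
    map[nr / SKETCH_BITS_PER_LONG] &= ~(1UL << (nr % SKETCH_BITS_PER_LONG));
}

/* Zone number of the zone containing the given sector. */
static unsigned int sketch_zone_no(const struct sketch_queue *q,
                                   uint64_t sector)
{
    return (unsigned int)(sector / q->chunk_sectors);
}

/*
 * Try to write-lock the zone targeted by a write at 'sector'.
 * Returns true if the write may be dispatched now.
 */
static bool sketch_zone_write_trylock(struct sketch_queue *q, uint64_t sector)
{
    unsigned int zno = sketch_zone_no(q, sector);

    if (!sketch_test_bit(q->seq_zone_bitmap, zno))
        return true;   /* not a sequential zone: no ordering constraint */
    if (sketch_test_bit(q->seq_zones_wlock, zno))
        return false;  /* another write to this zone is outstanding */
    sketch_set_bit(q->seq_zones_wlock, zno);
    return true;
}

/* Release the zone write lock once the write has completed. */
static void sketch_zone_write_unlock(struct sketch_queue *q, uint64_t sector)
{
    sketch_clear_bit(q->seq_zones_wlock, sketch_zone_no(q, sector));
}
```

    Holding back a write whenever the trylock fails keeps at most one write
    per sequential zone outside of the scheduler, which is the property the
    commit message relies on to avoid reordering.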
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different lifetime rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
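
    The combination of a disk pointer and a partition index described above
    can be sketched as follows. This is an illustrative userspace model
    only: all sketch_* types and functions are hypothetical, and the bounds
    handling is an assumption rather than a description of
    generic_make_request.

```c
#include <stdint.h>

#define SKETCH_MAX_PARTS 4

/* Hypothetical partition descriptor, in 512B sectors. */
struct sketch_part {
    uint64_t start_sect;
    uint64_t nr_sects;
};

/* Hypothetical disk: index 0 stands for the whole device. */
struct sketch_disk {
    struct sketch_part part[SKETCH_MAX_PARTS];
};

/* Hypothetical I/O descriptor carrying the disk and a partition index. */
struct sketch_bio {
    struct sketch_disk *disk;
    unsigned int partno;
    uint64_t sector;      /* relative to the partition until remapped */
};

/*
 * Translate a partition-relative sector into a whole-device sector.
 * Returns 0 on success and -1 if the partition index or the sector is
 * out of range.
 */
static int sketch_partition_remap(struct sketch_bio *bio)
{
    const struct sketch_part *p;

    if (bio->partno == 0)
        return 0;                 /* already whole-device relative */
    if (bio->partno >= SKETCH_MAX_PARTS)
        return -1;

    p = &bio->disk->part[bio->partno];
    if (bio->sector >= p->nr_sects)
        return -1;

    bio->sector += p->start_sect;
    bio->partno = 0;              /* mark as remapped */
    return 0;
}
```

    In this model nothing beyond the disk and a small integer is needed to
    submit I/O, so no separately managed per-open object has to exist.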
     

12 Jan, 2017

1 commit

  • All block device data fields and functions returning a number of 512B
    sectors are by convention named xxx_sectors while names in the form
    xxx_size are generally used for a number of bytes. The blk_queue_zone_size
    and bdev_zone_size functions were not following this convention so rename
    them.

    No functional change is introduced by this patch.

    Signed-off-by: Damien Le Moal

    Collapsed the two patches, they were nonsensically split and broke
    bisection.

    Signed-off-by: Jens Axboe

    Damien Le Moal
     

25 Oct, 2016

1 commit

  • The blkdev_report_zones function produces a harmless warning when
    -Wmaybe-uninitialized is set, after gcc gets a little confused
    about the multiple 'goto' here:

    block/blk-zoned.c: In function 'blkdev_report_zones':
    block/blk-zoned.c:188:13: error: 'nz' may be used uninitialized in this function [-Werror=maybe-uninitialized]

    Moving the assignment to nr_zones makes this a little simpler
    while also avoiding the warning reliably. I'm removing the
    extraneous initialization of 'int ret' in the same patch, as
    that is semi-related and could cause an uninitialized use of
    that variable to not produce a warning.

    Fixes: 6a0cb1bc106f ("block: Implement support for zoned block devices")
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Shaun Tancheff
    Reviewed-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Arnd Bergmann