22 May, 2020

1 commit


19 May, 2020

13 commits


17 May, 2020

2 commits

  • Currently, informational messages within the block trace do not include
    the PID of the process reporting the message. With BFQ it is sometimes
    useful to have this information, and there's no good reason to omit it
    from the trace. So just fill in the pid when generating a note message.
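
    A minimal sketch of the idea; trace_note()'s exact argument list is an
    assumption based on kernel/trace/blktrace.c and is not verified here:

        /*
         * Hedged sketch: when emitting a BLK_TN_MESSAGE note, pass the
         * reporting task's pid instead of 0 so it shows up in the trace.
         */
        trace_note(bt, current->pid, BLK_TN_MESSAGE, buf, n, cgid);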

    Signed-off-by: Jan Kara
    Reviewed-by: Chaitanya Kulkarni
    Acked-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

14 May, 2020

7 commits

  • Blk-crypto delegates crypto operations to inline encryption hardware
    when available. The separately configurable blk-crypto-fallback contains
    a software fallback to the kernel crypto API - when enabled, blk-crypto
    will use this fallback for en/decryption when inline encryption hardware
    is not available.

    This means upper layers do not have to worry about whether the underlying
    device supports inline encryption before specifying an encryption context
    for a bio. It also allows for testing
    without actual inline encryption hardware - in particular, it makes it
    possible to test the inline encryption code in ext4 and f2fs simply by
    running xfstests with the inlinecrypt mount option, which in turn allows
    for things like the regular upstream regression testing of ext4 to cover
    the inline encryption code paths.

    For more details, refer to Documentation/block/inline-encryption.rst.
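
    A hedged sketch of the dispatch decision; the check helper here is
    illustrative, and blk_crypto_fallback_bio_prep() is assumed from the
    fallback code, not quoted from it:

        /*
         * Hedged sketch, not the actual blk-crypto code: if the queue can
         * handle the bio's crypto configuration in hardware, let the inline
         * encryption hardware do the work; otherwise hand the bio to the
         * software fallback built on the kernel crypto API.
         */
        if (hw_supports_bio_crypt_ctx(q, bio))          /* illustrative check */
                return true;                            /* hardware path */
        return blk_crypto_fallback_bio_prep(&bio);      /* software fallback */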

    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Signed-off-by: Jens Axboe

    Satya Tangirala
     
  • Whenever a device supports blk-integrity, make the kernel pretend that
    the device doesn't support inline encryption (essentially by setting the
    keyslot manager in the request queue to NULL).

    There's no hardware currently that supports both integrity and inline
    encryption. However, it seems possible that there will be such hardware
    in the near future (like the NVMe key per I/O support that might support
    both inline encryption and PI).

    But properly integrating both features is not trivial, and without
    real hardware that implements both, it is difficult to tell whether the
    majority of hardware that supports both will get it right.
    So it seems best not to support both features together right now, and
    to decide what to do at probe time.
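
    A minimal sketch of that probe-time decision; the placement and the
    warning text are assumptions, only the "set the keyslot manager to NULL"
    part comes from the description above:

        /*
         * Hedged sketch: when an integrity profile is registered for a disk,
         * pretend the device has no inline encryption hardware by dropping
         * the request queue's keyslot manager.
         */
        #ifdef CONFIG_BLK_INLINE_ENCRYPTION
                if (disk->queue->ksm) {
                        pr_warn("%s: integrity and inline encryption not supported together, disabling inline encryption\n",
                                disk->disk_name);
                        disk->queue->ksm = NULL;
                }
        #endif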

    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Satya Tangirala
     
  • We must have some way of letting a storage device driver know what
    encryption context it should use for en/decrypting a request. However,
    it's the upper layers (like the filesystem/fscrypt) that know about and
    manage encryption contexts. As such, when the upper layer submits a bio
    to the block layer, and this bio eventually reaches a device driver with
    support for inline encryption, the device driver will need to have been
    told the encryption context for that bio.

    We want to communicate the encryption context from the upper layer to the
    storage device along with the bio, when the bio is submitted to the block
    layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
    represent an encryption context (note that we can't use the bi_private
    field in struct bio for this, because that field is not meant to pass
    information across layers in the storage stack). We also introduce various
    functions to manipulate the bio_crypt_ctx and make the bio/request merging
    logic aware of the bio_crypt_ctx.

    We also make changes to blk-mq to make it handle bios with encryption
    contexts. blk-mq can merge many bios into the same request. These bios need
    to have contiguous data unit numbers (the necessary changes to blk-merge
    are also made to ensure this) - as such, it suffices to keep the data unit
    number of just the first bio, since that's all a storage driver needs to
    infer the data unit number to use for each data block in each bio in a
    request. blk-mq keeps track of the encryption context to be used for all
    the bios in a request with the request's rq_crypt_ctx. When the first bio
    is added to an empty request, blk-mq will program the encryption context
    of that bio into the request_queue's keyslot manager, and store the
    returned keyslot in the request's rq_crypt_ctx. All the functions to
    operate on encryption contexts are in blk-crypto.c.

    Upper layers only need to call bio_crypt_set_ctx with the encryption key,
    algorithm and data_unit_num; they don't have to worry about getting a
    keyslot for each encryption context, as blk-mq/blk-crypto handles that.
    Blk-crypto also makes it possible for request-based layered devices like
    dm-rq to make use of inline encryption hardware by cloning the
    rq_crypt_ctx and programming a keyslot in the new request_queue when
    necessary.

    Note that any user of the block layer can submit bios with an
    encryption context, such as filesystems, device-mapper targets, etc.
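
    A hedged sketch of how an upper layer might attach an encryption context
    to a bio; the bio_crypt_set_ctx() parameter list shown is an assumption
    based on the description above:

        /*
         * Hedged sketch of an upper layer (e.g. fscrypt) tagging a bio with
         * an encryption context before submission.  Getting a keyslot for the
         * context is handled later by blk-mq/blk-crypto, not here.
         */
        u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE] = { first_data_unit_number };

        bio_crypt_set_ctx(bio, key /* struct blk_crypto_key * */, dun,
                          GFP_NOIO);
        submit_bio(bio);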

    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Satya Tangirala
     
  • Inline Encryption hardware allows software to specify an encryption context
    (an encryption key, crypto algorithm, data unit num, data unit size) along
    with a data transfer request to a storage device, and the inline encryption
    hardware will use that context to en/decrypt the data. The inline
    encryption hardware is part of the storage device, and it conceptually sits
    on the data path between system memory and the storage device.

    Inline Encryption hardware implementations often function around the
    concept of "keyslots". These implementations often have a limited number
    of "keyslots", each of which can hold a key (we say that a key can be
    "programmed" into a keyslot). Requests made to the storage device may have
    a keyslot and a data unit number associated with them, and the inline
    encryption hardware will en/decrypt the data in the requests using the key
    programmed into that associated keyslot and the data unit number specified
    with the request.

    As keyslots are limited, programming keys may be expensive in many
    implementations, and multiple requests may use exactly the same encryption
    context, we introduce a Keyslot Manager to manage keyslots efficiently.

    We also introduce a blk_crypto_key, which will represent the key that's
    programmed into keyslots managed by keyslot managers. The keyslot manager
    also functions as the interface that upper layers will use to program keys
    into inline encryption hardware. For more information on the Keyslot
    Manager, refer to documentation found in block/keyslot-manager.c and
    linux/keyslot-manager.h.
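
    A hedged sketch of the two sides of the interface; the struct and function
    names approximate block/keyslot-manager.c as described above and may
    differ in detail, and the my_hw_* driver hooks are hypothetical:

        /*
         * Hedged sketch.  A driver with inline encryption hardware supplies
         * low-level operations for programming/evicting keys in its keyslots.
         */
        static const struct blk_ksm_ll_ops my_ksm_ops = {
                .keyslot_program = my_hw_program_key,   /* write key into a slot */
                .keyslot_evict   = my_hw_evict_key,     /* remove key from a slot */
        };

        /*
         * The block layer asks the keyslot manager for a slot holding the
         * bio's blk_crypto_key before dispatch, and releases it afterwards.
         */
        struct blk_ksm_keyslot *slot;

        if (blk_ksm_get_slot_for_key(q->ksm, key, &slot) != BLK_STS_OK)
                return;         /* defer or fail the request */
        /* ... dispatch the request using the keyslot number from 'slot' ... */
        blk_ksm_put_slot(slot);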

    Co-developed-by: Eric Biggers
    Signed-off-by: Eric Biggers
    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Satya Tangirala
     
  • The blk-crypto framework adds support for inline encryption. There are
    numerous changes throughout the storage stack. This patch documents the
    main design choices in the block layer, the API presented to users of
    the block layer (like fscrypt or layered devices) and the API presented
    to drivers for adding support for inline encryption.

    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Signed-off-by: Jens Axboe

    Satya Tangirala
     
  • When the QoS targets are met and nothing is being throttled, there's
    no way to tell how saturated the underlying device is - it could be
    almost entirely idle, at the cusp of saturation or anywhere in between.
    Given that there's no information, it's best to keep vrate as-is in
    this state. Before 7cd806a9a953 ("iocost: improve nr_lagging
    handling"), this was the case - if the device wasn't missing QoS
    targets and nothing was being throttled, busy_level was reset to zero.

    While fixing nr_lagging handling, 7cd806a9a953 ("iocost: improve
    nr_lagging handling") broke this. Now, while the device is hitting
    QoS targets and nothing is being throttled, vrate keeps getting
    adjusted according to the existing busy_level.

    If vrate started low, this led to vrate climbing until it hit the
    maximum whenever there was an IO issuer with limited request
    concurrency: vrate keeps getting adjusted upwards until the issuer can
    issue IOs without being throttled. From then on, QoS targets keep
    getting met, nothing on the system needs throttling, and vrate keeps
    getting increased due to the existing busy_level.

    This patch makes the following changes to the busy_level logic.

    * Reset busy_level if nr_shortages is zero to avoid the above
    scenario.

    * Make non-zero nr_lagging block lowering busy_level but still clear
    positive busy_level if there's a clear non-saturation signal - QoS
    targets are met and nr_shortages is zero. nr_lagging's role is to
    prevent adjusting vrate upwards while there are long-running
    commands; it shouldn't keep busy_level positive while there's a
    clear non-saturation signal.

    * Restructure code for clarity and add comments.
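
    A hedged sketch of the resulting busy_level rules; the variable names
    follow this message, not necessarily the exact code in block/blk-iocost.c:

        /* Hedged sketch of the rules above, not the actual blk-iocost code. */
        if (missing_qos_targets) {
                /* saturated: push busy_level up so vrate gets lowered */
                ioc->busy_level = max(ioc->busy_level, 0);
                ioc->busy_level++;
        } else if (nr_shortages && !nr_lagging) {
                /* issuers throttled while targets are met: raise vrate */
                ioc->busy_level = min(ioc->busy_level, 0);
                ioc->busy_level--;
        } else if (!nr_shortages) {
                /* no saturation signal either way: don't let vrate drift */
                ioc->busy_level = 0;
        }
        /* otherwise nr_lagging != 0 blocks lowering busy_level any further */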

    Signed-off-by: Tejun Heo
    Reported-by: Andy Newell
    Fixes: 7cd806a9a953 ("iocost: improve nr_lagging handling")
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk_io_schedule() isn't called from a performance-sensitive code path,
    and it is easier to maintain if exported as a symbol.

    Also, blk_io_schedule() is only called by CONFIG_BLOCK code, so it is
    safe to do it this way. This also fixes a build failure when CONFIG_BLOCK
    is off.

    Cc: Christoph Hellwig
    Fixes: e6249cdd46e4 ("block: add blk_io_schedule() for avoiding task hung in sync dio")
    Reported-by: Satya Tangirala
    Tested-by: Satya Tangirala
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

13 May, 2020

16 commits

  • Synchronous direct I/O to a sequential write only zone can be issued using
    the new REQ_OP_ZONE_APPEND request operation. As dispatching multiple
    BIOs can potentially result in reordering, we cannot support asynchronous
    IO via this interface.

    We also can only dispatch up to queue_max_zone_append_sectors() via the
    new zone-append method and have to return a short write back to user-space
    in case an IO larger than queue_max_zone_append_sectors() has been issued.
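
    A hedged sketch of the size capping described above; this is illustrative
    only, not the actual zonefs/iomap submission code:

        /*
         * Hedged sketch: anything beyond what a single zone append command
         * can carry is reported back to user-space as a short write.
         */
        unsigned int max_bytes =
                queue_max_zone_append_sectors(q) << SECTOR_SHIFT;
        size_t count = min_t(size_t, iov_iter_count(iter), max_bytes);

        bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC;
        /* ... map 'count' bytes of the iter into the bio, submit, wait ... */
        return count;   /* may be smaller than requested: short write */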

    Signed-off-by: Johannes Thumshirn
    Acked-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     
  • Export bio_release_pages and bio_iov_iter_get_pages, so they can be used
    from modular code.

    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     
  • Support REQ_OP_ZONE_APPEND requests for null_blk devices with zoned
    mode enabled. Use the internally tracked zone write pointer position
    as the actual write position and return it using the command request
    __sector field in the case of an mq device and using the command BIO
    sector in the case of a BIO device.

    Signed-off-by: Damien Le Moal
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Emulate ZONE_APPEND for SCSI disks using a regular WRITE(16) command
    with a start LBA set to the target zone write pointer position.

    In order to always know the write pointer position of a sequential write
    zone, the write pointer of all zones is tracked using an array of 32-bit
    zone write pointer offsets attached to the scsi disk structure. Each
    entry of the array indicates a zone write pointer position relative to
    the zone start sector. The write pointer offsets are maintained in sync
    with the device as follows:
    1) the write pointer offset of a zone is reset to 0 when a
    REQ_OP_ZONE_RESET command completes.
    2) the write pointer offset of a zone is set to the zone size when a
    REQ_OP_ZONE_FINISH command completes.
    3) the write pointer offset of a zone is incremented by the number of
    512B sectors written when a write, write same or a zone append
    command completes.
    4) the write pointer offset of all zones is reset to 0 when a
    REQ_OP_ZONE_RESET_ALL command completes.
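
    A hedged sketch of the completion-time bookkeeping described in points
    1)-4) above; the field and lock names are assumptions, not the actual
    sd_zbc.c code:

        /* Hedged sketch: keep the cached write pointer offsets in sync. */
        spin_lock_irqsave(&sdkp->zones_wp_offset_lock, flags);
        switch (req_op(rq)) {
        case REQ_OP_ZONE_RESET:
                sdkp->zones_wp_offset[zno] = 0;
                break;
        case REQ_OP_ZONE_FINISH:
                sdkp->zones_wp_offset[zno] = zone_sectors;      /* zone size */
                break;
        case REQ_OP_WRITE:
        case REQ_OP_WRITE_SAME:
        case REQ_OP_ZONE_APPEND:
                sdkp->zones_wp_offset[zno] += good_bytes >> SECTOR_SHIFT;
                break;
        case REQ_OP_ZONE_RESET_ALL:
                memset(sdkp->zones_wp_offset, 0,
                       sdkp->nr_zones * sizeof(sdkp->zones_wp_offset[0]));
                break;
        }
        spin_unlock_irqrestore(&sdkp->zones_wp_offset_lock, flags);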

    Since the block layer does not write lock zones for zone append
    commands, to ensure a sequential ordering of the regular write commands
    used for the emulation, the target zone of a zone append command is
    locked when the function sd_zbc_prepare_zone_append() is called from
    sd_setup_read_write_cmnd(). If the zone write lock cannot be obtained
    (e.g. a zone append is in-flight or a regular write has already locked
    the zone), the zone append command dispatching is delayed by returning
    BLK_STS_ZONE_RESOURCE.

    To avoid the need for write locking all zones for REQ_OP_ZONE_RESET_ALL
    requests, use a spinlock to protect accesses and modifications of the
    zone write pointer offsets. This spinlock is initialized from sd_probe()
    using the new function sd_zbc_init().

    Co-developed-by: Damien Le Moal
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     
  • Factor sanity checks for zoned commands from sd_zbc_setup_zone_mgmt_cmnd().

    This will help with the introduction of an emulated ZONE_APPEND command.

    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     
  • Modify the interface of blk_revalidate_disk_zones() to add an optional
    driver callback function that a driver can use to extend processing
    done during zone revalidation. The callback, if defined, is executed
    with the device request queue frozen, after all zones have been
    inspected.
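
    A hedged sketch of the extended interface; the exact prototype is an
    assumption based on the description above:

        /*
         * Hedged sketch: optional driver callback, run with the queue frozen
         * after all zones have been inspected.
         */
        int blk_revalidate_disk_zones(struct gendisk *disk,
                        void (*update_driver_data)(struct gendisk *disk));

        /* a driver that needs no extra processing simply passes NULL */
        ret = blk_revalidate_disk_zones(disk, NULL);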

    Signed-off-by: Damien Le Moal
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Introduce blk_req_zone_write_trylock(), which either grabs the write-lock
    for a sequential zone or returns false, if the zone is already locked.
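
    A hedged usage sketch following the description above (the surrounding
    dispatch code is omitted):

        /*
         * Hedged sketch: a driver that cannot take the zone write lock defers
         * the command instead of dispatching it out of order.
         */
        if (!blk_req_zone_write_trylock(rq))
                return BLK_STS_ZONE_RESOURCE;   /* retry dispatch later */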

    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     
  • Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
    block device. This is a no-merge write operation.

    A zone append write BIO must:
    * Target a zoned block device
    * Have a sector position indicating the start sector of the target zone
    * Target a sequential write zone
    * Not cross a zone boundary
    * Not be split, to ensure that a single range of LBAs is written with a
    single command.

    Implement these checks in generic_make_request_checks() using the
    helper function blk_check_zone_append(). To avoid zone append BIO
    splitting, introduce the new max_zone_append_sectors queue limit
    attribute and ensure that a BIO's size never exceeds this limit.
    Export this new limit through sysfs and check these limits in bio_full().

    Also, when an LLDD can't dispatch a request to a specific zone, it
    will return BLK_STS_ZONE_RESOURCE, indicating that this request needs
    to be delayed, e.g. because the zone it would be dispatched to is still
    write-locked. If this happens, set the request aside in a local list
    and continue trying to dispatch requests such as READ requests or
    WRITE/ZONE_APPEND requests targeting other zones. This way we can
    still keep a high queue depth without starving other requests, even if
    one request can't be served due to zone write-locking.

    Finally, make sure that the bio sector position indicates the actual
    write position as indicated by the device on completion.
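
    A hedged sketch of a submitter's view of the new operation; simplified,
    with names taken from the description above:

        /*
         * Hedged sketch: the bio targets the start of the zone; the device
         * picks the actual write location and the block layer reports it
         * back through the bio's sector on completion.
         */
        bio->bi_opf = REQ_OP_ZONE_APPEND;
        bio->bi_iter.bi_sector = zone_start_sector;
        WARN_ON_ONCE(bio_sectors(bio) > queue_max_zone_append_sectors(q));

        ret = submit_bio_wait(bio);
        if (!ret)
                written_sector = bio->bi_iter.bi_sector; /* where it landed */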

    Signed-off-by: Keith Busch
    [ jth: added zone-append specific add_page and merge_page helpers ]
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • Rename __bio_add_pc_page() to bio_add_hw_page() and explicitly pass in a
    max_sectors argument.

    This max_sectors argument can be used to specify constraints from the
    hardware.
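
    A hedged sketch of the renamed helper; the parameter list is an assumption
    based on the description above:

        int bio_add_hw_page(struct request_queue *q, struct bio *bio,
                            struct page *page, unsigned int len,
                            unsigned int offset, unsigned int max_sectors,
                            bool *same_page);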

    Signed-off-by: Christoph Hellwig
    [ jth: rebased and made public for blk-map.c ]
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Daniel Wagner
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • blk_queue_zone_is_seq() and blk_queue_zone_no() have not been called with
    CONFIG_BLK_DEV_ZONED disabled until now.

    The introduction of REQ_OP_ZONE_APPEND will change this, so we need to
    provide noop fallbacks for the !CONFIG_BLK_DEV_ZONED case.
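
    A hedged sketch of what such fallbacks look like; the return values are
    assumptions for the !CONFIG_BLK_DEV_ZONED case:

        #else /* CONFIG_BLK_DEV_ZONED */
        static inline unsigned int blk_queue_zone_no(struct request_queue *q,
                                                     sector_t sector)
        {
                return 0;
        }
        static inline bool blk_queue_zone_is_seq(struct request_queue *q,
                                                 sector_t sector)
        {
                return false;
        }
        #endif /* CONFIG_BLK_DEV_ZONED */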

    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     
  • Sync dio can be big, or may take a long time in discard or in case of
    IO failure.

    We have already prevented task hung warnings in submit_bio_wait() and
    blk_execute_rq(), so apply the same trick to prevent them from being
    triggered by sync dio.

    Add a blk_io_schedule() helper and use io_schedule_timeout() in it to
    prevent the task hung warning.
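
    A hedged sketch of the helper along the lines described above; the exact
    timeout derivation is an assumption:

        /*
         * Hedged sketch: sleep, but wake up well before the hung task
         * detector would fire, so long sync dio doesn't trigger a warning.
         */
        static inline void blk_io_schedule(void)
        {
                unsigned long timeout = sysctl_hung_task_timeout_secs * HZ / 2;

                if (timeout)
                        io_schedule_timeout(timeout);
                else
                        io_schedule();
        }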

    Signed-off-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Cc: Salman Qazi
    Cc: Jesse Barnes
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The gendisk can't go away while there is IO activity, so don't hold
    part0's refcount in the IO path.

    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Yufen Yu
    Cc: Christoph Hellwig
    Cc: Hou Tao
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Put all fields accessed in the IO path together at the beginning of
    the struct, so that they can all be fetched in a single cacheline.

    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Yufen Yu
    Cc: Christoph Hellwig
    Cc: Hou Tao
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The 'nr_sects_seq' seqcount is only needed in the case of 32-bit SMP,
    so only define it for 32-bit SMP.
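
    A hedged sketch of the conditional definition inside the partition
    structure; the surrounding fields are omitted and the exact preprocessor
    condition is an assumption:

        #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
                seqcount_t nr_sects_seq;        /* only needed on 32-bit SMP */
        #endif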

    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Yufen Yu
    Cc: Christoph Hellwig
    Cc: Hou Tao
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • delete_partition() clears the cached .last_lookup partition. However,
    the .last_lookup cache may be repopulated by one IO path after it has
    been cleared by delete_partition(). Another IO path may then use the
    cached, being-deleted partition after hd_struct_free() has been called,
    triggering a use-after-free on the cached partition.

    Fix the issue with the following approach:

    1) always get the partition's refcount via hd_struct_try_get() before
    setting .last_lookup

    2) move clearing .last_lookup from delete_partition() to hd_struct_free(),
    which is the release handler of the partition's percpu-refcount, so that no
    IO path can cache a partition being deleted via .last_lookup.

    This is an alternative to Yufen's patch [1], which adds overhead in the
    fast path through an indirect lookup that may introduce one extra
    cacheline access in the IO path. This patch also relies on
    percpu-refcount protection, and it is easier to understand and verify.

    [1] https://lore.kernel.org/linux-block/20200109013551.GB9655@ming.t460p/T/#t
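
    A hedged sketch of step 1) above; the names follow the message and the
    surrounding partition-lookup code is simplified:

        /*
         * Hedged sketch: only cache a partition in .last_lookup if a
         * reference could be grabbed, so a partition whose refcount is
         * already dying can never be (re)cached by the IO path.
         */
        part = __disk_get_part(disk, partno);
        if (part && hd_struct_try_get(part))
                rcu_assign_pointer(ptbl->last_lookup, part);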

    Reported-by: Yufen Yu
    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Christoph Hellwig
    Cc: Hou Tao
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • When we increase the hardware queue count, blk_mq_update_queue_map()
    resets the mapping between cpus and hardware queues based on the new
    hardware queue count (set->nr_hw_queues). The mapping cannot be reset
    if blk_mq_realloc_hw_ctxs() encounters an error, but the fallback flow
    continues using it; blk_mq_map_swqueue() then touches invalid memory,
    because the mapping points to the wrong hctx.

    blktest block/030:

    null_blk: module loaded
    Increasing nr_hw_queues to 8 fails, fallback to 1
    ==================================================================
    BUG: KASAN: null-ptr-deref in blk_mq_map_swqueue+0x2f2/0x830
    Read of size 8 at addr 0000000000000128 by task nproc/8541

    CPU: 5 PID: 8541 Comm: nproc Not tainted 5.7.0-rc4-dbg+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
    Call Trace:
    dump_stack+0xa5/0xe6
    __kasan_report.cold+0x65/0xbb
    kasan_report+0x45/0x60
    check_memory_region+0x15e/0x1c0
    __kasan_check_read+0x15/0x20
    blk_mq_map_swqueue+0x2f2/0x830
    __blk_mq_update_nr_hw_queues+0x3df/0x690
    blk_mq_update_nr_hw_queues+0x32/0x50
    nullb_device_submit_queues_store+0xde/0x160 [null_blk]
    configfs_write_file+0x1c4/0x250 [configfs]
    __vfs_write+0x4c/0x90
    vfs_write+0x14b/0x2d0
    ksys_write+0xdd/0x180
    __x64_sys_write+0x47/0x50
    do_syscall_64+0x6f/0x310
    entry_SYSCALL_64_after_hwframe+0x49/0xb3
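
    A hedged sketch of the fix in the fallback path; the function names come
    from the message and trace above, the surrounding code and the failure
    check are simplified assumptions:

        /*
         * Hedged sketch: if blk_mq_realloc_hw_ctxs() failed to grow the
         * queues, fall back to the previous count and redo the cpu <-> hctx
         * mapping, so blk_mq_map_swqueue() never walks a mapping that was
         * built for the larger queue count.
         */
        if (hctxs_realloc_failed) {
                pr_warn("Increasing nr_hw_queues to %d fails, fallback to %d\n",
                        set->nr_hw_queues, prev_nr_hw_queues);
                set->nr_hw_queues = prev_nr_hw_queues;
                blk_mq_update_queue_map(set);
        }
        blk_mq_map_swqueue(q);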

    Signed-off-by: Weiping Zhang
    Tested-by: Bart van Assche
    Signed-off-by: Jens Axboe

    Weiping Zhang
     

11 May, 2020

1 commit