29 Oct, 2020

1 commit


24 Oct, 2020

2 commits


14 Oct, 2020

2 commits

  • A zoned device with limited resources to open or activate zones may
    return an error when the host exceeds those limits. The same command may
    be successful if retried later, but the host needs to wait for specific
    zone states before it should expect a retry to succeed. Have the block
    layer provide an appropriate status for these conditions so applications
    can distinguish this error for special handling.

    Cc: linux-api@vger.kernel.org
    Cc: Niklas Cassel
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Damien Le Moal
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
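    A minimal driver-side sketch of returning these statuses, assuming the
    BLK_STS_ZONE_OPEN_RESOURCE and BLK_STS_ZONE_ACTIVE_RESOURCE names this
    series adds; the helper and its arguments are illustrative only:

        #include <linux/blk_types.h>

        /* Map a device's zone resource exhaustion to the new statuses so
         * the host knows the command may succeed if retried later. */
        static blk_status_t example_zone_limit_to_status(bool open_limit_hit,
                                                         bool active_limit_hit)
        {
                if (open_limit_hit)
                        return BLK_STS_ZONE_OPEN_RESOURCE;
                if (active_limit_hit)
                        return BLK_STS_ZONE_ACTIVE_RESOURCE;
                return BLK_STS_IOERR;
        }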
     
  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

27 Sep, 2020

1 commit


25 Sep, 2020

1 commit

  • commit 7b6620d7db56 ("block: remove REQ_NOWAIT_INLINE") removed the
    REQ_NOWAIT_INLINE related code, but the diff wasn't applied to
    blk_types.h somehow.

    Then commit 2771cefeac49 ("block: remove the REQ_NOWAIT_INLINE flag")
    removed the REQ_NOWAIT_INLINE flag while the BLK_QC_T_EAGAIN flag still
    remains.

    Fixes: 7b6620d7db56 ("block: remove REQ_NOWAIT_INLINE")
    Signed-off-by: Jeffle Xu
    Signed-off-by: Jens Axboe

    Jeffle Xu
     

24 Sep, 2020

1 commit


02 Sep, 2020

5 commits

  • Replace bd_invalidate with a new BDEV_NEED_PART_SCAN flag in a bd_flags
    variable to better describe the condition.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Josef Bacik
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • kdev_t is long gone, so we don't need to comment that a field isn't one.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Just check if there is private data, in which case the bio must have
    originated from bio_copy_user_iov.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We can simply use a boolean flag in the bio_map_data data structure
    instead.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Two different callers use two different mutexes for updating the
    block device size, which obviously doesn't help to actually protect
    against concurrent updates from the different callers. In addition
    one of the locks, bd_mutex is rather prone to deadlocks with other
    parts of the block stack that use it for high level synchronization.

    Switch to using a new spinlock protecting just the size updates, as
    that is all we need, and make sure everyone does the update through
    the proper helper.

    This fixes a bug reported with nvme, where revalidating disks during a
    hot removal operation can currently deadlock on bd_mutex.

    Reported-by: Xianting Tian
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

07 Aug, 2020

1 commit


17 Jul, 2020

1 commit

  • Currently REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are defined as
    the even numbers 6 and 8, so such zone reset bios are treated as READ
    bios by bio_data_dir(), which is obviously misleading.

    The macro bio_data_dir() is defined in include/linux/bio.h as:

        #define bio_data_dir(bio) \
            (op_is_write(bio_op(bio)) ? WRITE : READ)

    And op_is_write() is defined in include/linux/blk_types.h as:

        static inline bool op_is_write(unsigned int op)
        {
            return (op & 1);
        }

    The convention of op_is_write() is that when there is a data transfer
    the op code should be an odd number, and it is treated as a write op.
    bio_data_dir() treats a bio's direction as READ if op_is_write() reports
    false, and WRITE if it reports true.

    Because REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL are even numbers,
    reporting them as READ bios via bio_data_dir() is misleading and might
    be wrong, even though they don't transfer data. These two commands
    reset the write pointers of the zones being reset, and all content
    after the reset write pointer becomes invalid and inaccessible, so
    they are clearly not READ bios in any sense.

    This patch changes REQ_OP_ZONE_RESET from 6 to 15, and changes
    REQ_OP_ZONE_RESET_ALL from 8 to 17. Now bios with these two op codes
    are treated as WRITE by bio_data_dir(). Although they don't transfer
    data, we keep them consistent with REQ_OP_DISCARD and
    REQ_OP_WRITE_ZEROES, with the intuition that they change on-media
    content and should be WRITE requests.

    Signed-off-by: Coly Li
    Reviewed-by: Damien Le Moal
    Reviewed-by: Chaitanya Kulkarni
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Jens Axboe
    Cc: Johannes Thumshirn
    Cc: Keith Busch
    Cc: Shaun Tancheff
    Signed-off-by: Jens Axboe

    Coly Li
     

01 Jul, 2020

4 commits


24 Jun, 2020

1 commit


18 Jun, 2020

1 commit


17 May, 2020

1 commit


14 May, 2020

1 commit

  • We must have some way of letting a storage device driver know what
    encryption context it should use for en/decrypting a request. However,
    it's the upper layers (like the filesystem/fscrypt) that know about and
    manage encryption contexts. As such, when the upper layer submits a bio
    to the block layer, and this bio eventually reaches a device driver with
    support for inline encryption, the device driver will need to have been
    told the encryption context for that bio.

    We want to communicate the encryption context from the upper layer to the
    storage device along with the bio, when the bio is submitted to the block
    layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
    represent an encryption context (note that we can't use the bi_private
    field in struct bio to do this because that field does not function to pass
    information across layers in the storage stack). We also introduce various
    functions to manipulate the bio_crypt_ctx and make the bio/request merging
    logic aware of the bio_crypt_ctx.

    We also make changes to blk-mq to make it handle bios with encryption
    contexts. blk-mq can merge many bios into the same request. These bios need
    to have contiguous data unit numbers (the necessary changes to blk-merge
    are also made to ensure this) - as such, it suffices to keep the data unit
    number of just the first bio, since that's all a storage driver needs to
    infer the data unit number to use for each data block in each bio in a
    request. blk-mq keeps track of the encryption context to be used for all
    the bios in a request with the request's rq_crypt_ctx. When the first bio
    is added to an empty request, blk-mq will program the encryption context
    of that bio into the request_queue's keyslot manager, and store the
    returned keyslot in the request's rq_crypt_ctx. All the functions to
    operate on encryption contexts are in blk-crypto.c.

    Upper layers only need to call bio_crypt_set_ctx with the encryption key,
    algorithm and data_unit_num; they don't have to worry about getting a
    keyslot for each encryption context, as blk-mq/blk-crypto handles that.
    Blk-crypto also makes it possible for request-based layered devices like
    dm-rq to make use of inline encryption hardware by cloning the
    rq_crypt_ctx and programming a keyslot in the new request_queue when
    necessary.

    Note that any user of the block layer can submit bios with an
    encryption context, such as filesystems, device-mapper targets, etc.

    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Satya Tangirala
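    A short sketch of the upper-layer side described above, assuming the
    upstream bio_crypt_set_ctx() signature (bio, key, DUN array, gfp flags)
    with the algorithm carried inside the blk_crypto_key; key setup via
    blk_crypto_init_key() and bio construction are elided, and the wrapper
    name is illustrative:

        #include <linux/bio.h>
        #include <linux/blk-crypto.h>

        static void example_submit_encrypted(struct bio *bio,
                                             const struct blk_crypto_key *key,
                                             u64 first_dun)
        {
                u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE] = { first_dun };

                /* Attach the encryption context; blk-mq/blk-crypto handle
                 * keyslot programming when the bio reaches the hardware. */
                bio_crypt_set_ctx(bio, key, dun, GFP_NOIO);
                submit_bio(bio);
        }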
     

13 May, 2020

1 commit

  • Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
    block device. This is a no-merge write operation.

    A zone append write BIO must:
    * Target a zoned block device
    * Have a sector position indicating the start sector of the target zone
    * The target zone must be a sequential write zone
    * The BIO must not cross a zone boundary
    * The BIO must not be split, to ensure that a single range of LBAs
    is written with a single command.

    Implement these checks in generic_make_request_checks() using the
    helper function blk_check_zone_append(). To avoid write append BIO
    splitting, introduce the new max_zone_append_sectors queue limit
    attribute and ensure that a BIO size is always lower than this limit.
    Export this new limit through sysfs and check these limits in bio_full().

    Also when an LLDD can't dispatch a request to a specific zone, it
    will return BLK_STS_ZONE_RESOURCE indicating this request needs to
    be delayed, e.g. because the zone it will be dispatched to is still
    write-locked. If this happens, set the request aside in a local list
    to continue trying to dispatch requests such as READ requests or
    WRITE/ZONE_APPEND requests targeting other zones. This way we can
    still keep a high queue depth without starving other requests even if
    one request can't be served due to zone write-locking.

    Finally, make sure that the bio sector position indicates the actual
    write position as indicated by the device on completion.

    Signed-off-by: Keith Busch
    [ jth: added zone-append specific add_page and merge_page helpers ]
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Keith Busch
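    A minimal sketch of a zone append submitter, assuming a bio whose data
    pages are already mapped; only the REQ_OP_ZONE_APPEND specific pieces
    are shown, and the function names are illustrative:

        #include <linux/bio.h>

        static void example_zone_append_end_io(struct bio *bio)
        {
                /* On completion, bi_sector holds the sector the device
                 * actually wrote the data to. */
                pr_debug("zone append landed at sector %llu\n",
                         (unsigned long long)bio->bi_iter.bi_sector);
                bio_put(bio);
        }

        static void example_submit_zone_append(struct bio *bio,
                                               sector_t zone_start)
        {
                bio->bi_opf = REQ_OP_ZONE_APPEND;
                /* The sector position indicates the start of the target zone. */
                bio->bi_iter.bi_sector = zone_start;
                bio->bi_end_io = example_zone_append_end_io;
                submit_bio(bio);
        }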
     

29 Apr, 2020

1 commit


20 Apr, 2020

1 commit


19 Apr, 2020

1 commit

  • The current codebase makes use of the zero-length array language
    extension to the C90 standard, but the preferred mechanism to declare
    variable-length types such as these ones is a flexible array member[1][2],
    introduced in C99:

    struct foo {
            int stuff;
            struct boo array[];
    };

    By making use of the mechanism above, we will get a compiler warning
    in case the flexible array does not occur last in the structure, which
    will help us prevent some kind of undefined behavior bugs from being
    inadvertently introduced[3] to the codebase from now on.

    Also, notice that dynamic memory allocations won't be affected by
    this change:

    "Flexible array members have incomplete type, and so the sizeof operator
    may not be applied. As a quirk of the original implementation of
    zero-length arrays, sizeof evaluates to zero."[1]

    This issue was found with the help of Coccinelle.

    [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
    [2] https://github.com/KSPP/linux/issues/21
    [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
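    As an aside, allocations of such structures are usually sized with
    struct_size() from <linux/overflow.h>, since sizeof() now covers only
    the fixed part; the element type and function name below are
    illustrative placeholders:

        #include <linux/overflow.h>
        #include <linux/slab.h>

        struct boo { int data; };       /* placeholder element type */

        struct foo {
                int stuff;
                struct boo array[];     /* flexible array member */
        };

        static struct foo *example_alloc_foo(size_t n)
        {
                struct foo *p;

                /* Fixed header plus n trailing array elements, with
                 * overflow checking built into struct_size(). */
                p = kmalloc(struct_size(p, array, n), GFP_KERNEL);
                return p;
        }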
     

25 Jan, 2020

1 commit

  • Add a device-mapper target "dm-default-key" which assigns an encryption
    key to bios that aren't for the contents of an encrypted file.

    This ensures that all blocks on-disk will be encrypted with some key,
    without the performance hit of file contents being encrypted twice when
    fscrypt (File-Based Encryption) is used.

    It is only appropriate to use dm-default-key when key configuration is
    tightly controlled, like it is in Android, such that all fscrypt keys
    are at least as hard to compromise as the default key.

    Compared to the original version of dm-default-key, this has been
    modified to use the new vendor-independent inline encryption framework
    (which works even when no inline encryption hardware is present), the
    table syntax has been changed to match dm-crypt, and support for
    specifying Adiantum encryption has been added. These changes also mean
    that dm-default-key now always explicitly specifies the DUN (the IV).

    Also, to handle f2fs moving blocks of encrypted files around without the
    key, and to handle ext4 and f2fs filesystems mounted without
    '-o inlinecrypt', the mapping logic is no longer "set a key on the bio
    if it doesn't have one already", but rather "set a key on the bio unless
    the bio has the bi_skip_dm_default_key flag set". Filesystems set this
    flag on *all* bios for encrypted file contents, regardless of whether
    they are encrypting/decrypting the file using inline encryption or the
    traditional filesystem-layer encryption, or moving the raw data.

    For the bi_skip_dm_default_key flag, a new field in struct bio is used
    rather than a bit in bi_opf so that fscrypt_set_bio_crypt_ctx() can set
    the flag, minimizing the changes needed to filesystems. (bi_opf is
    usually overwritten after fscrypt_set_bio_crypt_ctx() is called.)

    Bug: 137270441
    Bug: 147814592
    Change-Id: I69c9cd1e968ccf990e4ad96e5115b662237f5095
    Signed-off-by: Eric Biggers

    Eric Biggers
     

09 Dec, 2019

1 commit


22 Nov, 2019

1 commit

  • Requests that trigger flushing of the volatile writeback cache to disk
    (barriers) have a significant effect on overall performance.

    The block layer has a sophisticated engine for combining several flush
    requests into one, but there are no statistics for the actual flushes
    executed by the disk. Requests which trigger flushes are usually
    barriers - zero-size writes.

    This patch adds two iostat counters into /sys/class/block/$dev/stat and
    /proc/diskstats - count of completed flush requests and their total time.

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     

07 Nov, 2019

1 commit

  • Zoned block devices (ZBC and ZAC devices) allow an explicit control
    over the condition (state) of zones. The operations allowed are:
    * Open a zone: Transition to open condition to indicate that a zone will
    actively be written
    * Close a zone: Transition to closed condition to release the drive
    resources used for writing to a zone
    * Finish a zone: Transition an open or closed zone to the full
    condition to prevent write operations

    To enable this control for in-kernel zoned block device users, define
    the new request operations REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE
    and REQ_OP_ZONE_FINISH as well as the generic function
    blkdev_zone_mgmt() for submitting these operations on a range of zones.
    This results in the removal of blkdev_reset_zones() and its replacement
    with this new zone management function. Users of blkdev_reset_zones()
    (f2fs and dm-zoned) are updated accordingly.

    Contains contributions from Matias Bjorling, Hans Holmberg,
    Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig.

    Reviewed-by: Javier González
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ajay Joshi
    Signed-off-by: Matias Bjorling
    Signed-off-by: Hans Holmberg
    Signed-off-by: Dmitry Fomichev
    Signed-off-by: Keith Busch
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Ajay Joshi
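    A sketch of an in-kernel caller explicitly opening a single zone,
    assuming the signature introduced here takes the block device, the
    operation, the starting sector, the number of sectors and gfp flags;
    the wrapper name is illustrative:

        #include <linux/blkdev.h>

        static int example_open_zone(struct block_device *bdev,
                                     sector_t zone_start, sector_t zone_len)
        {
                /* Transition one zone to the open condition. */
                return blkdev_zone_mgmt(bdev, REQ_OP_ZONE_OPEN, zone_start,
                                        zone_len, GFP_KERNEL);
        }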
     

31 Oct, 2019

1 commit

  • We must have some way of letting a storage device driver know what
    encryption context it should use for en/decrypting a request. However,
    it's the filesystem/fscrypt that knows about and manages encryption
    contexts. As such, when the filesystem layer submits a bio to the block
    layer, and this bio eventually reaches a device driver with support for
    inline encryption, the device driver will need to have been told the
    encryption context for that bio.

    We want to communicate the encryption context from the filesystem layer
    to the storage device along with the bio, when the bio is submitted to the
    block layer. To do this, we add a struct bio_crypt_ctx to struct bio, which
    can represent an encryption context (note that we can't use the bi_private
    field in struct bio to do this because that field does not function to pass
    information across layers in the storage stack). We also introduce various
    functions to manipulate the bio_crypt_ctx and make the bio/request merging
    logic aware of the bio_crypt_ctx.

    Bug: 137270441
    Test: tested as series; see Ie1b77f7615d6a7a60fdc9105c7ab2200d17636a8
    Change-Id: I479de9ec13758f1978b34d897e6956e680caeb92
    Signed-off-by: Satya Tangirala
    Link: https://patchwork.kernel.org/patch/11214719/

    Satya Tangirala
     

26 Oct, 2019

1 commit

  • Simple reordering of __bi_remaining can reduce bio size by 8 bytes that
    are now wasted on padding (measured on x86_64):

    struct bio {
            struct bio *          bi_next;          /*     0     8 */
            struct gendisk *      bi_disk;          /*     8     8 */
            unsigned int          bi_opf;           /*    16     4 */
            short unsigned int    bi_flags;         /*    20     2 */
            short unsigned int    bi_ioprio;        /*    22     2 */
            short unsigned int    bi_write_hint;    /*    24     2 */
            blk_status_t          bi_status;        /*    26     1 */
            u8                    bi_partno;        /*    27     1 */

            /* XXX 4 bytes hole, try to pack */

            struct bvec_iter      bi_iter;          /*    32    24 */

            /* XXX last struct has 4 bytes of padding */

            atomic_t              __bi_remaining;   /*    56     4 */

            /* XXX 4 bytes hole, try to pack */
    [...]
            /* size: 104, cachelines: 2, members: 19 */
            /* sum members: 96, holes: 2, sum holes: 8 */
            /* paddings: 1, sum paddings: 4 */
            /* last cacheline: 40 bytes */
    };

    Now becomes:

    struct bio {
            struct bio *          bi_next;          /*     0     8 */
            struct gendisk *      bi_disk;          /*     8     8 */
            unsigned int          bi_opf;           /*    16     4 */
            short unsigned int    bi_flags;         /*    20     2 */
            short unsigned int    bi_ioprio;        /*    22     2 */
            short unsigned int    bi_write_hint;    /*    24     2 */
            blk_status_t          bi_status;        /*    26     1 */
            u8                    bi_partno;        /*    27     1 */
            atomic_t              __bi_remaining;   /*    28     4 */
            struct bvec_iter      bi_iter;          /*    32    24 */

            /* XXX last struct has 4 bytes of padding */
    [...]
            /* size: 96, cachelines: 2, members: 19 */
            /* paddings: 1, sum paddings: 4 */
            /* last cacheline: 32 bytes */
    };

    Signed-off-by: David Sterba
    Signed-off-by: Jens Axboe

    David Sterba
     

29 Aug, 2019

1 commit

  • This patchset implements an IO cost model based work-conserving
    proportional controller.

    While io.latency provides the capability to comprehensively prioritize
    and protect IOs depending on the cgroups, its protection is binary -
    the lowest latency target cgroup which is suffering is protected at
    the cost of all others. In many use cases including stacking multiple
    workload containers in a single system, it's necessary to distribute
    IO capacity with better granularity.

    One challenge of controlling IO resources is the lack of trivially
    observable cost metric. The most common metrics - bandwidth and iops
    - can be off by orders of magnitude depending on the device type and
    IO pattern. However, the cost isn't a complete mystery. Given
    several key attributes, we can make fairly reliable predictions on how
    expensive a given stream of IOs would be, at least compared to other
    IO patterns.

    The function which determines the cost of a given IO is the IO cost
    model for the device. This controller distributes IO capacity based
    on the costs estimated by such model. The more accurate the cost
    model the better but the controller adapts based on IO completion
    latency, and as long as the relative costs across different IO
    patterns are consistent and sensible, it'll adapt to the actual
    performance of the device.

    Currently, the only implemented cost model is a simple linear one with
    a few sets of default parameters for different classes of device.
    This covers most common devices reasonably well. All the
    infrastructure to tune and add different cost models is already in
    place and a later patch will also allow using bpf progs for cost
    models.

    Please see the top comment in blk-iocost.c and documentation for
    more details.

    v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
    for a divide-by-zero bug in current_hweight() triggered by zero
    inuse_sum.

    Signed-off-by: Tejun Heo
    Cc: Andy Newell
    Cc: Josef Bacik
    Cc: Rik van Riel
    Signed-off-by: Jens Axboe

    Tejun Heo
     

14 Aug, 2019

1 commit

  • psi tracks the time tasks wait for refaulting pages to become
    uptodate, but it does not track the time spent submitting the IO. The
    submission part can be significant if backing storage is contended or
    when cgroup throttling (io.latency) is in effect - a lot of time is
    spent in submit_bio(). In that case, we underreport memory pressure.

    Annotate submit_bio() to account submission time as memory stall when
    the bio is reading userspace workingset pages.

    Tested-by: Suren Baghdasaryan
    Signed-off-by: Johannes Weiner
    Signed-off-by: Jens Axboe

    Johannes Weiner
     

05 Aug, 2019

1 commit

  • This patch introduces a new request operation REQ_OP_ZONE_RESET_ALL.
    This is useful for applications like mkfs that need to reset all the
    zones present on the underlying block device. As part of this patch we
    also introduce the new QUEUE_FLAG_ZONE_RESETALL flag, which indicates
    the queue's zone reset all capability, and a corresponding helper macro.

    Reviewed-by: Damien Le Moal
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
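    A sketch of how a caller might use the new capability, assuming the
    helper macro is named blk_queue_zone_resetall(); the per-zone fallback
    loop is elided and the function name is illustrative:

        #include <linux/bio.h>
        #include <linux/blkdev.h>

        static int example_reset_all_zones(struct block_device *bdev)
        {
                struct bio bio;

                if (!blk_queue_zone_resetall(bdev_get_queue(bdev)))
                        return -EOPNOTSUPP; /* issue per-zone REQ_OP_ZONE_RESET instead */

                /* A single zero-length bio resets every zone on the device. */
                bio_init(&bio, NULL, 0);
                bio_set_dev(&bio, bdev);
                bio.bi_opf = REQ_OP_ZONE_RESET_ALL | REQ_SYNC;
                return submit_bio_wait(&bio);
        }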
     

22 Jul, 2019

1 commit

  • By default, if a caller sets REQ_NOWAIT and we need to block, we'll
    return -EAGAIN through the bio->bi_end_io() callback. For some use
    cases, this makes it hard to use.

    Allow a caller to ask for inline return of errors related to
    blocking by also setting REQ_NOWAIT_INLINE.

    Signed-off-by: Jens Axboe

    Jens Axboe
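    A sketch of the intended calling convention, assuming the blk_qc_t
    cookie returned by submit_bio() in this era and the BLK_QC_T_EAGAIN
    value (note that both the flag and the cookie were removed again by
    the later commits above); the wrapper name is illustrative:

        #include <linux/bio.h>

        static bool example_try_submit_nowait(struct bio *bio)
        {
                bio->bi_opf |= REQ_NOWAIT | REQ_NOWAIT_INLINE;
                /* Returns false if the submission would have blocked;
                 * the caller can retry from a context that may sleep. */
                return submit_bio(bio) != BLK_QC_T_EAGAIN;
        }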
     

10 Jul, 2019

1 commit

  • When a shared kthread needs to issue a bio for a cgroup, doing so
    synchronously can lead to priority inversions as the kthread can be
    trapped waiting for that cgroup. This patch implements
    REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing
    to a dedicated per-blkcg work item to avoid such priority inversions.

    This will be used to fix priority inversions in btrfs compression and
    should be generally useful as we grow filesystem support for
    comprehensive IO control.

    Cc: Chris Mason
    Reviewed-by: Josef Bacik
    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

21 Jun, 2019

1 commit

  • We only need the number of segments in the blk-mq submission path.
    Remove the field from struct bio, and return it from a variant of
    blk_queue_split instead so that it can be passed as an argument to
    those functions that need the value.

    This also means we stop recounting segments except for cloning
    and partial segments.

    To keep the number of arguments in this hot path down, remove
    pointless struct request_queue arguments from any of the functions
    that had one and grew a nr_segs argument.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 May, 2019

1 commit