10 Jul, 2021

1 commit

  • Pull more block updates from Jens Axboe:
    "A combination of changes that ended up depending on both the driver
    and core branch (and/or the IDE removal), and a few late-arriving
    fixes. In detail:

    - Fix io ticks wrap-around issue (Chunguang)

    - nvme-tcp sock locking fix (Maurizio)

    - s390-dasd fixes (Kees, Christoph)

    - blk_execute_rq polling support (Keith)

    - blk-cgroup RCU iteration fix (Yu)

    - nbd backend ID addition (Prasanna)

    - Partition deletion fix (Yufen)

    - Use blk_mq_alloc_disk for mmc, mtip32xx, ubd (Christoph)

    - Removal of now dead block request types due to IDE removal
    (Christoph)

    - Loop probing and control device cleanups (Christoph)

    - Device uevent fix (Christoph)

    - Misc cleanups/fixes (Tetsuo, Christoph)"

    * tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block: (34 commits)
    blk-cgroup: prevent rcu_sched detected stalls warnings while iterating blkgs
    block: fix the problem of io_ticks becoming smaller
    nvme-tcp: can't set sk_user_data without write_lock
    loop: remove unused variable in loop_set_status()
    block: remove the bdgrab in blk_drop_partitions
    block: grab a device refcount in disk_uevent
    s390/dasd: Avoid field over-reading memcpy()
    dasd: unexport dasd_set_target_state
    block: check disk exist before trying to add partition
    ubd: remove dead code in ubd_setup_common
    nvme: use return value from blk_execute_rq()
    block: return errors from blk_execute_rq()
    nvme: use blk_execute_rq() for passthrough commands
    block: support polling through blk_execute_rq
    block: remove REQ_OP_SCSI_{IN,OUT}
    block: mark blk_mq_init_queue_data static
    loop: rewrite loop_exit using idr_for_each_entry
    loop: split loop_lookup
    loop: don't allow deleting an unspecified loop device
    loop: move loop_ctl_mutex locking into loop_add
    ...

    Linus Torvalds
     

07 Jul, 2021

2 commits

  • We run a test that creates millions of cgroups and blkgs, and then
    triggers blkg_destroy_all(). blkg_destroy_all() holds the spin lock
    for a long time in such a situation, so release the lock after each
    batch of blkgs is destroyed.

    blkcg_activate_policy() and blkcg_deactivate_policy() might have the
    same problem; however, as they are basically only called from module
    init/exit paths, let's leave them alone for now.
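
    A minimal sketch of the batching pattern, assuming the batch-size
    constant and the 5.14-era field names (illustrative, not the verbatim
    patch):

    #define BLKG_DESTROY_BATCH_SIZE  64

    static void blkg_destroy_all(struct request_queue *q)
    {
        struct blkcg_gq *blkg, *n;
        int count = BLKG_DESTROY_BATCH_SIZE;

    restart:
        spin_lock_irq(&q->queue_lock);
        list_for_each_entry_safe(blkg, n, &q->blkg_list, q_node) {
            struct blkcg *blkcg = blkg->blkcg;

            spin_lock(&blkcg->lock);
            blkg_destroy(blkg);
            spin_unlock(&blkcg->lock);

            /* drop the queue lock periodically so others can take it */
            if (!(--count)) {
                count = BLKG_DESTROY_BATCH_SIZE;
                spin_unlock_irq(&q->queue_lock);
                cond_resched();
                goto restart;
            }
        }
        q->root_blkg = NULL;
        spin_unlock_irq(&q->queue_lock);
    }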

    Signed-off-by: Yu Kuai
    Acked-by: Tejun Heo
    Link: https://lore.kernel.org/r/20210707015649.1929797-1-yukuai3@huawei.com
    Signed-off-by: Jens Axboe

    Yu Kuai
     
  • On the I/O submission path, blk_account_io_start() may be interrupted
    by a system interrupt. When the interrupt returns, the value of
    part->stamp may have been updated by another core, so the time value
    sampled before the interrupt can be less than part->stamp. When this
    happens we should do nothing, so that io_ticks stays accurate. For
    kernels older than 5.0, this may cause io_ticks to become smaller,
    which in turn may cause abnormal ioutil values.
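
    A sketch of the guard, assuming the 5.13-era shape of
    update_io_ticks(); the essential change is comparing with time_after()
    instead of a plain inequality, so a stale 'now' sampled before an
    interrupt can never move the stamp backwards:

    static void update_io_ticks(struct block_device *part,
                                unsigned long now, bool end)
    {
        unsigned long stamp;
    again:
        stamp = READ_ONCE(part->bd_stamp);
        /* only move the stamp forward; an older 'now' is stale */
        if (unlikely(time_after(now, stamp))) {
            if (likely(cmpxchg(&part->bd_stamp, stamp, now) == stamp))
                __part_stat_add(part, io_ticks, end ? now - stamp : 1);
        }
        if (part->bd_partno) {
            part = bdev_whole(part);
            goto again;
        }
    }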

    Signed-off-by: Chunguang Xu
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/1625521646-1069-1-git-send-email-brookxu.cn@gmail.com
    Signed-off-by: Jens Axboe

    Chunguang Xu
     

05 Jul, 2021

1 commit

  • Commit d2bcbeab4200 ("scsi: blkcg: Add app identifier support for
    blkcg") introduced an FC_APPID config option under SCSI. However, the
    added config option is not used anywhere. Simply remove it.

    The block layer BLK_CGROUP_FC_APPID config option is what actually
    controls whether the application ID code should be built or not. Make
    this option dependent on NVMe over FC since that is currently the only
    transport which supports the capability.

    Fixes: d2bcbeab4200 ("scsi: blkcg: Add app identifier support for blkcg")
    Reported-by: Linus Torvalds
    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Martin K. Petersen
     

03 Jul, 2021

2 commits

  • Pull SCSI updates from James Bottomley:
    "This series consists of the usual driver updates (ufs, ibmvfc,
    megaraid_sas, lpfc, elx, mpi3mr, qedi, iscsi, storvsc, mpt3sas) with
    elx and mpi3mr being new drivers.

    The major core change is a rework to drop the status byte handling
    macros and the old bit shifted definitions and the rest of the updates
    are minor fixes"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (287 commits)
    scsi: aha1740: Avoid over-read of sense buffer
    scsi: arcmsr: Avoid over-read of sense buffer
    scsi: ips: Avoid over-read of sense buffer
    scsi: ufs: ufs-mediatek: Add missing of_node_put() in ufs_mtk_probe()
    scsi: elx: libefc: Fix IRQ restore in efc_domain_dispatch_frame()
    scsi: elx: libefc: Fix less than zero comparison of a unsigned int
    scsi: elx: efct: Fix pointer error checking in debugfs init
    scsi: elx: efct: Fix is_originator return code type
    scsi: elx: efct: Fix link error for _bad_cmpxchg
    scsi: elx: efct: Eliminate unnecessary boolean check in efct_hw_command_cancel()
    scsi: elx: efct: Do not use id uninitialized in efct_lio_setup_session()
    scsi: elx: efct: Fix error handling in efct_hw_init()
    scsi: elx: efct: Remove redundant initialization of variable lun
    scsi: elx: efct: Fix spelling mistake "Unexected" -> "Unexpected"
    scsi: lpfc: Fix build error in lpfc_scsi.c
    scsi: target: iscsi: Remove redundant continue statement
    scsi: qla4xxx: Remove redundant continue statement
    scsi: ppa: Switch to use module_parport_driver()
    scsi: imm: Switch to use module_parport_driver()
    scsi: mpt3sas: Fix error return value in _scsih_expander_add()
    ...

    Linus Torvalds
     
  • Pull asm/unaligned.h unification from Arnd Bergmann:
    "Unify asm/unaligned.h around struct helper

    The get_unaligned()/put_unaligned() helpers are traditionally
    architecture specific, with the two main variants being the
    "access-ok.h" version that assumes unaligned pointer accesses always
    work on a particular architecture, and the "le-struct.h" version that
    casts the data to a byte aligned type before dereferencing, for
    architectures that cannot always do unaligned accesses in hardware.

    Based on the discussion linked below, it appears that the access-ok
    version is not reliable on any architecture, while the struct version
    probably has no downsides. This series changes the code to use the
    same implementation on all architectures, addressing the few
    exceptions separately"
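
    The struct variant boils down to a byte-aligned packed-struct access;
    roughly, the unified helpers look like this (sketch of the resulting
    asm-generic/unaligned.h):

    #define __get_unaligned_t(type, ptr) ({                                \
        const struct { type x; } __packed *__pptr = (typeof(__pptr))(ptr); \
        __pptr->x;                                                         \
    })

    #define __put_unaligned_t(type, val, ptr) do {                         \
        struct { type x; } __packed *__pptr = (typeof(__pptr))(ptr);       \
        __pptr->x = (val);                                                 \
    } while (0)

    #define get_unaligned(ptr)      __get_unaligned_t(typeof(*(ptr)), (ptr))
    #define put_unaligned(val, ptr) __put_unaligned_t(typeof(*(ptr)), (val), (ptr))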

    Link: https://lore.kernel.org/lkml/75d07691-1e4f-741f-9852-38c0b4f520bc@synopsys.com/
    Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363
    Link: https://lore.kernel.org/lkml/20210507220813.365382-14-arnd@kernel.org/
    Link: git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git unaligned-rework-v2
    Link: https://lore.kernel.org/lkml/CAHk-=whGObOKruA_bU3aPGZfoDqZM1_9wBkwREp0H0FgR-90uQ@mail.gmail.com/

    * tag 'asm-generic-unaligned-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
    asm-generic: simplify asm/unaligned.h
    asm-generic: uaccess: 1-byte access is always aligned
    netpoll: avoid put_unaligned() on single character
    mwifiex: re-fix for unaligned accesses
    apparmor: use get_unaligned() only for multi-byte words
    partitions: msdos: fix one-byte get_unaligned()
    asm-generic: unaligned always use struct helpers
    asm-generic: unaligned: remove byteshift helpers
    powerpc: use linux/unaligned/le_struct.h on LE power7
    m68k: select CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
    sh: remove unaligned access for sh4a
    openrisc: always use unaligned-struct header
    asm-generic: use asm-generic/unaligned.h for most architectures

    Linus Torvalds
     

01 Jul, 2021

8 commits

  • If the disk has been deleted, we should return failure for the
    BLKPG_DEL_PARTITION ioctl. Otherwise, /sys/class/block may be left
    with invalid symlinks. The race is as follows:

    blkdev_open                 del_gendisk
                                  disk->flags &= ~GENHD_FL_UP;
                                  blk_drop_partitions
    blkpg_ioctl
      bdev_add_partition
        add_partition
          device_add
            device_add_class_symlinks

    The ioctl may call add_partition after del_gendisk() has already
    dropped the partitions; stale symlinks are then created.
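
    A sketch of the check closing the race, assuming the GENHD_FL_UP flag
    and disk->open_mutex of that kernel (the elided body stands for the
    existing overlap checks and add_partition() call):

    int bdev_add_partition(struct block_device *bdev, int partno,
                           sector_t start, sector_t length)
    {
        struct gendisk *disk = bdev->bd_disk;
        int ret;

        mutex_lock(&disk->open_mutex);
        if (!(disk->flags & GENHD_FL_UP)) {
            ret = -ENXIO;   /* disk deleted: refuse to create symlinks */
            goto out;
        }
        /* ... overlap checks and add_partition() ... */
        ret = 0;
    out:
        mutex_unlock(&disk->open_mutex);
        return ret;
    }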

    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Yufen Yu
    Link: https://lore.kernel.org/r/20210610023241.3646241-1-yuyufen@huawei.com
    Signed-off-by: Jens Axboe

    Yufen Yu
     
  • Pull device mapper updates from Mike Snitzer:

    - Various DM persistent-data library improvements and fixes that
    benefit both the DM thinp and cache targets.

    - A few small DM kcopyd efficiency improvements.

    - Significant zoned related block core, DM core and DM zoned target
    changes that culminate with adding zoned append emulation (which is
    required to properly fix DM crypt's zoned support).

    - Various DM writecache target changes that improve efficiency. Adds an
    optional "metadata_only" feature that only promotes bios flagged with
    REQ_META. But the most significant improvement is writecache's
    ability to pause writeback, for a configurable time, if/when the
    working set is larger than the cache (and the cache is full) -- this
    ensures performance is no worse than the slower origin device.

    * tag 'for-5.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (35 commits)
    dm writecache: make writeback pause configurable
    dm writecache: pause writeback if cache full and origin being written directly
    dm io tracker: factor out IO tracker
    dm btree remove: assign new_root only when removal succeeds
    dm zone: fix dm_revalidate_zones() memory allocation
    dm ps io affinity: remove redundant continue statement
    dm writecache: add optional "metadata_only" parameter
    dm writecache: add "cleaner" and "max_age" to Documentation
    dm writecache: write at least 4k when committing
    dm writecache: flush origin device when writing and cache is full
    dm writecache: have ssd writeback wait if the kcopyd workqueue is busy
    dm writecache: use list_move instead of list_del/list_add in writecache_writeback()
    dm writecache: commit just one block, not a full page
    dm writecache: remove unused gfp_t argument from wc_add_block()
    dm crypt: Fix zoned block device support
    dm: introduce zone append emulation
    dm: rearrange core declarations for extended use from dm-zone.c
    block: introduce BIO_ZONE_WRITE_LOCKED bio flag
    block: introduce bio zone helpers
    block: improve handling of all zones reset operation
    ...

    Linus Torvalds
     
  • The synchronous blk_execute_rq() has not provided a way for its
    callers to know whether its request was successful. Return the
    blk_status_t result of the request.
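
    A hedged usage sketch, assuming the 5.14 signature
    blk_execute_rq(disk, rq, at_head):

    blk_status_t status;

    status = blk_execute_rq(disk, rq, at_head);
    if (status != BLK_STS_OK)
        return blk_status_to_errno(status);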

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Keith Busch
    Reviewed-by: Chaitanya Kulkarni
    Link: https://lore.kernel.org/r/20210610214437.641245-4-kbusch@kernel.org
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • Poll for completions if the request's hctx is a polling type.
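
    A sketch of the polling path inside the synchronous wait, assuming
    helpers along these lines (names as in the 5.14 patch, best effort):

    static bool blk_rq_is_poll(struct request *rq)
    {
        return rq->mq_hctx && rq->mq_hctx->type == HCTX_TYPE_POLL;
    }

    static void blk_rq_poll_completion(struct request *rq,
                                       struct completion *wait)
    {
        do {
            blk_poll(rq->q, request_to_qc_t(rq->mq_hctx, rq), true);
            cond_resched();
        } while (!completion_done(wait));
    }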

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Keith Busch
    Reviewed-by: Chaitanya Kulkarni
    Link: https://lore.kernel.org/r/20210610214437.641245-2-kbusch@kernel.org
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • With the legacy IDE driver gone, drivers now use either REQ_OP_DRV_*
    or REQ_OP_SCSI_*, so unify the two concepts of passthrough requests
    into a single one.
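
    After the unification, passthrough detection reduces to the two
    driver-private ops; a sketch, assuming the 5.14 helpers in
    blk_types.h and blkdev.h:

    static inline bool blk_op_is_passthrough(unsigned int op)
    {
        op &= REQ_OP_MASK;
        return op == REQ_OP_DRV_IN || op == REQ_OP_DRV_OUT;
    }

    static inline bool blk_rq_is_passthrough(struct request *rq)
    {
        return blk_op_is_passthrough(req_op(rq));
    }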

    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • All driver uses are gone now.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Link: https://lore.kernel.org/r/20210624081012.256464-1-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Pull block driver updates from Jens Axboe:
    "Pretty calm round, mostly just NVMe and a bit of MD:

    - NVMe updates (via Christoph)
    - improve the APST configuration algorithm (Alexey Bogoslavsky)
    - look for StorageD3Enable on companion ACPI device
    (Mario Limonciello)
    - allow selecting the network interface for TCP connections
    (Martin Belanger)
    - misc cleanups (Amit Engel, Chaitanya Kulkarni, Colin Ian King,
    Christoph)
    - move the ACPI StorageD3 code to drivers/acpi/ and add quirks
    for certain AMD CPUs (Mario Limonciello)
    - zoned device support for nvmet (Chaitanya Kulkarni)
    - fix the rules for changing the serial number in nvmet
    (Noam Gottlieb)
    - various small fixes and cleanups (Dan Carpenter, JK Kim,
    Chaitanya Kulkarni, Hannes Reinecke, Wesley Sheng, Geert
    Uytterhoeven, Daniel Wagner)

    - MD updates (via Song)
    - iostats rewrite (Guoqing Jiang)
    - raid5 lock contention optimization (Gal Ofri)

    - Fall through warning fix (Gustavo)

    - Misc fixes (Gustavo, Jiapeng)"

    * tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block: (78 commits)
    nvmet: use NVMET_MAX_NAMESPACES to set nn value
    loop: Fix missing discard support when using LOOP_CONFIGURE
    nvme.h: add missing nvme_lba_range_type endianness annotations
    nvme: remove zeroout memset call for struct
    nvme-pci: remove zeroout memset call for struct
    nvmet: remove zeroout memset call for struct
    nvmet: add ZBD over ZNS backend support
    nvmet: add Command Set Identifier support
    nvmet: add nvmet_req_bio put helper for backends
    nvmet: add req cns error complete helper
    block: export blk_next_bio()
    nvmet: remove local variable
    nvmet: use nvme status value directly
    nvmet: use u32 type for the local variable nsid
    nvmet: use u32 for nvmet_subsys max_nsid
    nvmet: use req->cmd directly in file-ns fast path
    nvmet: use req->cmd directly in bdev-ns fast path
    nvmet: make ver stable once connection established
    nvmet: allow mn change if subsys not discovered
    nvmet: make sn stable once connection was established
    ...

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:

    - disk events cleanup (Christoph)

    - gendisk and request queue allocation simplifications (Christoph)

    - bdev_disk_changed cleanups (Christoph)

    - IO priority improvements (Bart)

    - Chained bio completion trace fix (Edward)

    - blk-wbt fixes (Jan)

    - blk-wbt enable/disable fix (Zhang)

    - Scheduler dispatch improvements (Jan, Ming)

    - Shared tagset scheduler improvements (John)

    - BFQ updates (Paolo, Luca, Pietro)

    - BFQ lock inversion fix (Jan)

    - Documentation improvements (Kir)

    - CLONE_IO block cgroup fix (Tejun)

    - Removal of the ancient and deprecated block dump feature (zhangyi)

    - Discard merge fix (Ming)

    - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas,
    Yang)

    * tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits)
    block: fix discard request merge
    block/mq-deadline: Remove a WARN_ON_ONCE() call
    blk-mq: update hctx->dispatch_busy in case of real scheduler
    blk: Fix lock inversion between ioc lock and bfqd lock
    bfq: Remove merged request already in bfq_requests_merged()
    block: pass a gendisk to bdev_disk_changed
    block: move bdev_disk_changed
    block: add the events* attributes to disk_attrs
    block: move the disk events code to a separate file
    block: fix trace completion for chained bio
    block/partitions/msdos: Fix typo inidicator -> indicator
    block, bfq: reset waker pointer with shared queues
    block, bfq: check waker only for queues with no in-flight I/O
    block, bfq: avoid delayed merge of async queues
    block, bfq: boost throughput by extending queue-merging times
    block, bfq: consider also creation time in delayed stable merge
    block, bfq: fix delayed stable merge check
    block, bfq: let also stably merged queues enjoy weight raising
    blk-wbt: make sure throttle is enabled properly
    blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
    ...

    Linus Torvalds
     

29 Jun, 2021

1 commit

  • ll_new_hw_segment() is reached only in the case of a single-range
    discard merge, and we don't actually have a max discard segment size
    limit, so it is wrong to run the following check:

    if (req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req))

    The condition always holds, since req->nr_phys_segments is initialized
    to one, the bio's segment count is still 1, and
    blk_rq_get_max_segments(req) is 1 too, so the merge is always
    rejected.

    Fix the issue by skipping the check and bypassing the calculation of
    the discard request's nr_phys_segments.
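
    A sketch of the resulting function, with the discard bypass placed
    before the segment check (assuming the 5.14 shape of
    ll_new_hw_segment()):

    static inline int ll_new_hw_segment(struct request *req, struct bio *bio,
                                        unsigned int nr_phys_segs)
    {
        if (!blk_integrity_merge_bio(req->q, req, bio))
            goto no_merge;

        /* a discard merge never adds a segment, so skip the check */
        if (req_op(req) == REQ_OP_DISCARD)
            return 1;

        if (req->nr_phys_segments + nr_phys_segs >
            blk_rq_get_max_segments(req))
            goto no_merge;
        /* ... */
    }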

    Based on analysis from Wang Shanker.

    Cc: Christoph Hellwig
    Reported-by: Wang Shanker
    Signed-off-by: Ming Lei
    Link: https://lore.kernel.org/r/20210628023312.1903255-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe

    Ming Lei
     

28 Jun, 2021

1 commit

  • The purpose of the WARN_ON_ONCE() statement in dd_insert_request() is to
    verify that dd_prepare_request() cleared rq->elv.priv[0]. Since
    dd_prepare_request() is called during request initialization but not if a
    request is requeued, a warning is triggered if a request is requeued. Fix
    this by removing the WARN_ON_ONCE() statement. This patch suppresses the
    following kernel warning:

    WARNING: CPU: 28 PID: 432 at block/mq-deadline-main.c:740 dd_insert_request+0x4d4/0x5b0
    Workqueue: kblockd blk_mq_requeue_work
    Call Trace:
    dd_insert_requests+0xfa/0x130
    blk_mq_sched_insert_request+0x22c/0x240
    blk_mq_requeue_work+0x21c/0x2d0
    process_one_work+0x4c2/0xa70
    worker_thread+0x2e5/0x6d0
    kthread+0x21c/0x250
    ret_from_fork+0x1f/0x30

    Reported-by: Sachin Sant
    Fixes: 08a9ad8bf607 ("block/mq-deadline: Add cgroup support")
    Signed-off-by: Bart Van Assche
    Link: https://lore.kernel.org/r/20210627211112.12720-1-bvanassche@acm.org
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

25 Jun, 2021

7 commits

  • Commit 6e6fcbc27e77 ("blk-mq: support batching dispatch in case of io")
    started to support batched I/O dispatch by using hctx->dispatch_busy.

    However, that commit did not change blk_mq_update_dispatch_busy() to
    update hctx->dispatch_busy when a real scheduler is attached, so fix
    the issue by updating hctx->dispatch_busy in that case as well.
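
    The fix essentially removes an early return; a sketch of the pre-fix
    function (assumed shape), with the removed lines marked '-':

    static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx,
                                            bool busy)
    {
        unsigned int ewma;

    -   if (hctx->queue->elevator)
    -       return;

        ewma = hctx->dispatch_busy;
        /* ... EWMA update of hctx->dispatch_busy ... */
    }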

    Reported-by: Jan Kara
    Reviewed-by: Jan Kara
    Fixes: 6e6fcbc27e77 ("blk-mq: support batching dispatch in case of io")
    Signed-off-by: Ming Lei
    Link: https://lore.kernel.org/r/20210625020248.1630497-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Lockdep complains about lock inversion between ioc->lock and bfqd->lock:

    bfqd -> ioc:
    put_io_context+0x33/0x90 -> ioc->lock grabbed
    blk_mq_free_request+0x51/0x140
    blk_put_request+0xe/0x10
    blk_attempt_req_merge+0x1d/0x30
    elv_attempt_insert_merge+0x56/0xa0
    blk_mq_sched_try_insert_merge+0x4b/0x60
    bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed
    blk_mq_sched_insert_requests+0xd6/0x2b0
    blk_mq_flush_plug_list+0x154/0x280
    blk_finish_plug+0x40/0x60
    ext4_writepages+0x696/0x1320
    do_writepages+0x1c/0x80
    __filemap_fdatawrite_range+0xd7/0x120
    sync_file_range+0xac/0xf0

    ioc -> bfqd:
    bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed
    put_io_context_active+0x78/0xb0 -> ioc->lock grabbed
    exit_io_context+0x48/0x50
    do_exit+0x7e9/0xdd0
    do_group_exit+0x54/0xc0

    To avoid this inversion, we change blk_mq_sched_try_insert_merge() to
    not free the merged request but rather leave that up to the caller,
    similarly to blk_mq_sched_try_merge(). And in bfq_insert_requests()
    we make sure to free all the merged requests after dropping
    bfqd->lock.
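
    A sketch of the resulting pattern on the bfq insert path, assuming
    the new list out-parameter and the blk_mq_free_requests() helper
    added by this change:

    LIST_HEAD(free);

    spin_lock_irq(&bfqd->lock);
    if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
        spin_unlock_irq(&bfqd->lock);
        /* freeing may take ioc->lock: only do it after bfqd->lock */
        blk_mq_free_requests(&free);
        return;
    }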

    Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
    Reviewed-by: Ming Lei
    Acked-by: Paolo Valente
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Currently, bfq does very little in bfq_requests_merged() and handles all
    the request cleanup in bfq_finish_requeue_request() called from
    blk_mq_free_request(). That is currently safe only because
    blk_mq_free_request() is called shortly after bfq_requests_merged()
    while bfqd->lock is still held. However to fix a lock inversion between
    bfqd->lock and ioc->lock, we need to call blk_mq_free_request() after
    dropping bfqd->lock. That would mean that already merged request could
    be seen by other processes inside bfq queues and possibly dispatched to
    the device which is wrong. So move cleanup of the request from
    bfq_finish_requeue_request() to bfq_requests_merged().

    Acked-by: Paolo Valente
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20210623093634.27879-2-jack@suse.cz
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • bdev_disk_changed can only operate on whole devices. Make that clear
    by passing a gendisk instead of the struct block_device.

    Signed-off-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20210624123240.441814-3-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Move bdev_disk_changed to block/partitions/core.c, together with the
    rest of the partition scanning code.

    Signed-off-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20210624123240.441814-2-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Add the events attributes to the disk_attrs array, which ensures they
    are created by the driver core when the device is created, rather
    than being added after the device has been registered, which is racy
    with respect to uevents and requires more boilerplate code.
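
    Schematically, the change just lists the event attributes in the
    static array (sketch; surrounding entries elided):

    static struct attribute *disk_attrs[] = {
        /* ... existing attributes ... */
        &dev_attr_events.attr,
        &dev_attr_events_async.attr,
        &dev_attr_events_poll_msecs.attr,
        NULL
    };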

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Link: https://lore.kernel.org/r/20210624073843.251178-3-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Move the code for handling disk events from genhd.c into a new file
    as it isn't very related to the rest of the file while at the same
    time requiring lots of forward declarations.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Link: https://lore.kernel.org/r/20210624073843.251178-2-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Jun, 2021

1 commit

  • For a chained bio, trace_block_bio_complete in bio_endio is currently
    called only once, for the parent bio, after all chained bios have
    completed. However, the sector and size of the parent bio are
    modified in bio_split, so the size and sector of the complete events
    might not match the queue events in blktrace.

    The original fix for bio completion tracing ("block: trace completion
    of all bios.") wanted multiple complete events to correspond to one
    queue event, but missed this case.

    The issue can be reproduced by an md/raid5 read with a bio crossing
    chunks.

    To fix this, move the completion trace into the loop so that it is
    called for every chained bio.
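
    Schematically, the trace emission moves inside the chain-walking loop
    of bio_endio() (elided sketch):

    void bio_endio(struct bio *bio)
    {
    again:
        if (!bio_remaining_done(bio))
            return;
        /* ... */

        /*
         * Trace the completion of every bio in the chain while its own
         * bi_iter still matches the corresponding queue event.
         */
        if (bio->bi_bdev && bio_flagged(bio, BIO_TRACE_COMPLETION)) {
            trace_block_bio_complete(bio->bi_bdev->bd_disk->queue, bio);
            bio_clear_flag(bio, BIO_TRACE_COMPLETION);
        }

        if (bio->bi_end_io == bio_chain_endio) {
            bio = __bio_chain_endio(bio);
            goto again;
        }
        /* ... */
    }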

    Fixes: fbbaf700e7b1 ("block: trace completion of all bios.")
    Reviewed-by: Wade Liang
    Reviewed-by: BingJing Chang
    Signed-off-by: Edward Hsieh
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20210624123030.27014-1-edwardh@synology.com
    Signed-off-by: Jens Axboe

    Edward Hsieh
     

22 Jun, 2021

14 commits

  • Just a fix for a small typo in msdos_partition().

    Signed-off-by: Thomas Bracht Laumann Jespersen
    Link: https://lore.kernel.org/r/20210619195130.19348-1-t@laumann.xyz
    Signed-off-by: Jens Axboe

    Thomas Bracht Laumann Jespersen
     
  • Commit 85686d0dc194 ("block, bfq: keep shared queues out of the waker
    mechanism") leaves shared bfq_queues out of the waker-detection
    mechanism. It attains this goal by not updating the pointer
    last_completed_rq_bfqq, if the last request completed belongs to a
    shared bfq_queue (so that the pointer will not point to the shared
    bfq_queue).

    Yet this has a side effect: the pointer last_completed_rq_bfqq keeps
    pointing, deceptively, to a bfq_queue that actually is not the last
    one to have had a request completed. As a consequence, such a
    bfq_queue may deceptively be considered as a waker of some bfq_queue,
    even of some shared bfq_queue.

    To address this issue, reset last_completed_rq_bfqq if the last
    request completed belongs to a shared queue.
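
    A sketch of the reset on request completion, assuming bfq's coop flag
    marks shared queues:

    /* in bfq_completed_request() */
    if (!bfq_bfqq_coop(bfqq))
        bfqd->last_completed_rq_bfqq = bfqq;
    else
        /* a shared queue completed last: don't let it pose as a waker */
        bfqd->last_completed_rq_bfqq = NULL;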

    Fixes: 85686d0dc194 ("block, bfq: keep shared queues out of the waker mechanism")
    Signed-off-by: Paolo Valente
    Link: https://lore.kernel.org/r/20210619140948.98712-8-paolo.valente@linaro.org
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Consider two bfq_queues, say Q1 and Q2, with Q2 empty. If a request of
    Q1 gets completed shortly before a new request arrives for Q2, then
    BFQ flags Q1 as a candidate waker for Q2. Yet, the arrival of this new
    request may have a different cause, in the following case. If also Q2
    has requests in flight while waiting for the arrival of a new request,
    then the completion of its own requests may be the actual cause of the
    awakening of the process that sends I/O to Q2. So Q1 may be flagged
    wrongly as a candidate waker.

    This commit avoids this deceptive flagging, by disabling
    candidate-waker flagging for Q2, if Q2 has in-flight I/O.

    Signed-off-by: Paolo Valente
    Link: https://lore.kernel.org/r/20210619140948.98712-7-paolo.valente@linaro.org
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Since commit 430a67f9d616 ("block, bfq: merge bursts of newly-created
    queues"), BFQ may schedule a merge between a newly created sync
    bfq_queue, say Q2, and the last sync bfq_queue created, say Q1. To this
    goal, BFQ stores the address of Q1 in the field bic->stable_merge_bfqq
    of the bic associated with Q2. So, when the time for the possible merge
    arrives, BFQ knows which bfq_queue to merge Q2 with. In particular,
    BFQ checks for possible merges on request arrivals.

    Yet the same bic may also be associated with an async bfq_queue, say
    Q3. So, if a request for Q3 arrives, then the above check may happen
    to be executed while the bfq_queue at hand is Q3, instead of Q2. In
    this case, Q1 happens to be merged with an async bfq_queue. This is
    not only a conceptual mistake, because async queues are to be kept out
    of queue merging, but also a bug that leads to inconsistent states.

    This commit simply filters async queues out of delayed merges.

    Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues")
    Tested-by: Holger Hoffstätte
    Signed-off-by: Paolo Valente
    Link: https://lore.kernel.org/r/20210619140948.98712-6-paolo.valente@linaro.org
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • One of the methods with which bfq boosts throughput is merging queues.
    One of the merging variants in bfq is the stable merge. This
    mechanism is activated between two queues only if they are created
    within a certain maximum time T1 of each other. Merging can happen
    soon or be delayed. In the latter case, before merging, bfq needs to
    evaluate a throughput-boost parameter that indicates whether the
    queue would reach a high throughput if served alone. Merging occurs
    when this throughput boost is not high enough. In particular, this
    parameter is evaluated, and late merging may occur, only after at
    least a time T2 has passed since the creation of the queue.

    Currently T1 and T2 are set to 180ms and 200ms, respectively. With
    these values the merge rarely happens, because the windows are too
    short. This results in a noticeable lowering of the overall
    throughput with some workloads (see the example below).

    This commit introduces two constants, bfq_activation_stable_merging
    and bfq_late_stable_merging, in order to increase the duration of T1
    and T2. Both the stable-merging activation time and the late-merging
    time are set to 600ms. This value has been experimentally evaluated
    using the sqlite benchmark in the Phoronix Test Suite on an HDD. The
    duration of the benchmark before this fix was 111.02s, while now it
    has reached 97.02s, a better result than that of all the other
    schedulers.
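
    A sketch of the two new constants (values in milliseconds, as
    described above):

    /* both the activation and the late-merging windows become 600 ms */
    static const unsigned long bfq_activation_stable_merging = 600;
    static const unsigned long bfq_late_stable_merging = 600;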

    Signed-off-by: Pietro Pedroni
    Signed-off-by: Paolo Valente
    Link: https://lore.kernel.org/r/20210619140948.98712-5-paolo.valente@linaro.org
    Signed-off-by: Jens Axboe

    Pietro Pedroni
     
  • Since commit 430a67f9d616 ("block, bfq: merge bursts of newly-created
    queues"), BFQ may schedule a merge between a newly created sync
    bfq_queue and the last sync bfq_queue created. Such a merging is not
    performed immediately, because BFQ needs first to find out whether the
    newly created queue actually reaches a higher throughput if not merged
    at all (and in that case BFQ will not perform any stable merging). To
    check that, a little time must be waited after the creation of the new
    queue, so that some I/O can flow in the queue, and statistics on such
    I/O can be computed.

    Yet, to evaluate the above waiting time, the last split time is
    considered as start time, instead of the creation time of the
    queue. This is a mistake, because considering the split time is
    correct only in the following scenario.

    The queue undergoes a non-stable merge on the arrival of its very
    first I/O request, due to close I/O with some other queue. While the
    queue is merged for close I/O, stable merging is not considered. Yet
    the queue may then happen to be split, if the close I/O finishes (or
    happens to be a false positive). From this time on, the queue can
    again be considered for stable merging. But, again, a little time must
    elapse, to let some new I/O flow in the queue and to get updated
    statistics. To wait for this time, the split time is to be taken into
    account.

    Yet, if the queue does not undergo a non-stable merge on the arrival
    of its very first request, then BFQ immediately checks whether the
    stable merge is to be performed. It happens because the split time for
    a queue is initialized to minus infinity when the queue is created.

    This commit fixes this mistake by adding the missing condition. Now
    the check for delayed stable merge is performed only after a little
    time has elapsed not only since the last queue split, but also since
    the creation of the queue.

    Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues")
    Signed-off-by: Paolo Valente
    Link: https://lore.kernel.org/r/20210619140948.98712-4-paolo.valente@linaro.org
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • When attempting to schedule a merge of a given bfq_queue with the currently
    in-service bfq_queue or with a cooperating bfq_queue among the scheduled
    bfq_queues, delayed stable merge is checked for rotational or non-queueing
    devs. For this stable merge to be performed, some conditions must be met.
    If the current bfq_queue underwent a split from some merged bfq_queue,
    one of these conditions is that at least two hundred milliseconds
    must have elapsed since the split; if no split ever occurred, the
    condition is trivially met.

    Unfortunately, by mistake, time_is_after_jiffies() was written instead of
    time_is_before_jiffies() for this check, verifying that less than two
    hundred milliseconds have elapsed instead of verifying that at least two
    hundred milliseconds have elapsed.

    Fix this issue by replacing time_is_after_jiffies() with
    time_is_before_jiffies().
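
    The change is a one-line polarity fix; a sketch, assuming the split
    timestamp lives in bfqq->split_time (removed line marked '-', added
    line marked '+'):

    -   if (time_is_after_jiffies(bfqq->split_time +
    -                             msecs_to_jiffies(200)))
    +   if (time_is_before_jiffies(bfqq->split_time +
    +                              msecs_to_jiffies(200)))
            /* ... consider the delayed stable merge ... */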

    Signed-off-by: Luca Mariotti
    Signed-off-by: Paolo Valente
    Signed-off-by: Pietro Pedroni
    Link: https://lore.kernel.org/r/20210619140948.98712-3-paolo.valente@linaro.org
    Signed-off-by: Jens Axboe

    Luca Mariotti
     
  • Merged bfq_queues are kept out of weight-raising (low-latency)
    mechanisms. The reason is that these queues are usually created for
    non-interactive and non-soft-real-time tasks. Yet this is not the case
    for stably-merged queues. These queues are merged just because they
    are created shortly after each other. So they may easily serve the I/O
    of an interactive or soft-real time application, if the application
    happens to spawn multiple processes.

    To address this issue, this commit lets stably-merged queues, too,
    enjoy weight raising.

    Signed-off-by: Paolo Valente
    Link: https://lore.kernel.org/r/20210619140948.98712-2-paolo.valente@linaro.org
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • After commit a79050434b45 ("blk-rq-qos: refactor out common elements of
    blk-wbt"), once throttling was disabled by wbt_disable_default() it
    could not be enabled again. Fix this by setting enable_state back to
    WBT_STATE_ON_DEFAULT.

    Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
    Signed-off-by: Zhang Yi
    Link: https://lore.kernel.org/r/20210619093700.920393-3-yi.zhang@huawei.com
    Signed-off-by: Jens Axboe

    Zhang Yi
     
  • Currently we disable wbt by simply zeroing out rwb->wb_normal in
    wbt_disable_default() when switching the elevator to bfq. But this is
    not safe, because the check becomes a false positive if the queue
    depth changes. If it becomes a false positive between wbt_wait() and
    wbt_track() while submitting a write request, rqw->inflight is
    dropped to -1 in wbt_done(), which ends up triggering an IO hang. Fix
    this issue by introducing a new state that means wbt was explicitly
    disabled.
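
    A sketch of the new state and the resulting check, assuming the 5.14
    blk-wbt names (illustrative):

    enum {
        WBT_STATE_ON_DEFAULT    = 1,    /* on by default */
        WBT_STATE_ON_MANUAL     = 2,    /* enabled manually via sysfs */
        WBT_STATE_OFF_DEFAULT,          /* disabled by default, e.g. bfq */
    };

    static inline bool rwb_enabled(struct rq_wb *rwb)
    {
        /* the explicit state is immune to queue-depth changes */
        return rwb && rwb->enable_state != WBT_STATE_OFF_DEFAULT &&
               rwb->wb_normal != 0;
    }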

    Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
    Signed-off-by: Zhang Yi
    Link: https://lore.kernel.org/r/20210619093700.920393-2-yi.zhang@huawei.com
    Signed-off-by: Jens Axboe

    Zhang Yi
     
  • While one or more requests with a certain I/O priority are pending, do
    not dispatch lower-priority requests. Dispatch lower-priority requests
    anyway after the "aging" time has expired.

    This patch has been tested as follows:

    modprobe scsi_debug ndelay=1000000 max_queue=16 &&
    sd='' &&
    while [ -z "$sd" ]; do
      sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
    done &&
    echo $((100*1000)) > /sys/block/$sd/queue/iosched/aging_expire &&
    cd /sys/fs/cgroup/blkio/ &&
    echo $$ >cgroup.procs &&
    echo restrict-to-be >blkio.prio.class &&
    mkdir -p hipri &&
    cd hipri &&
    echo none-to-rt >blkio.prio.class &&
    { max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/low-pri.txt & } &&
    echo $$ >cgroup.procs &&
    max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/hi-pri.txt

    Result:
    * 11000 IOPS for the high-priority job
    * 40 IOPS for the low-priority job

    If the aging expiry time is changed from 100s to 0, the IOPS results
    change to 6712 and 6796 IOPS.

    The max-iops script is a script that runs fio with the following arguments:
    --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
    --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
    --iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
    --iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
    --filename=${positional_argument_1}

    Reviewed-by: Damien Le Moal
    Cc: Hannes Reinecke
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Johannes Thumshirn
    Cc: Himanshu Madhani
    Signed-off-by: Bart Van Assche
    Link: https://lore.kernel.org/r/20210618004456.7280-17-bvanassche@acm.org
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Maintain statistics per cgroup and export these to user space. These
    statistics are essential for verifying whether the proper I/O priorities
    have been assigned to requests. An example of the statistics data with
    this patch applied:

    $ cat /sys/fs/cgroup/io.stat
    11:2 rbytes=0 wbytes=0 rios=3 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0
    8:32 rbytes=2142720 wbytes=0 rios=105 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0

    Cc: Damien Le Moal
    Cc: Hannes Reinecke
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Johannes Thumshirn
    Cc: Himanshu Madhani
    Signed-off-by: Bart Van Assche
    Link: https://lore.kernel.org/r/20210618004456.7280-16-bvanassche@acm.org
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Track I/O statistics per I/O priority and export these statistics to
    debugfs. These statistics help developers of the deadline scheduler.

    Cc: Damien Le Moal
    Cc: Hannes Reinecke
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Johannes Thumshirn
    Cc: Himanshu Madhani
    Signed-off-by: Bart Van Assche
    Link: https://lore.kernel.org/r/20210618004456.7280-15-bvanassche@acm.org
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Maintain one dispatch list and one FIFO list per I/O priority class: RT, BE
    and IDLE. Maintain statistics for each priority level. Split the debugfs
    attributes per priority level as follows:

    $ ls /sys/kernel/debug/block/.../sched/
    async_depth  dispatch2        read_next_rq      write2_fifo_list
    batching     read0_fifo_list  starved           write_next_rq
    dispatch0    read1_fifo_list  write0_fifo_list
    dispatch1    read2_fifo_list  write1_fifo_list

    Cc: Damien Le Moal
    Cc: Hannes Reinecke
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Johannes Thumshirn
    Cc: Himanshu Madhani
    Signed-off-by: Bart Van Assche
    Link: https://lore.kernel.org/r/20210618004456.7280-14-bvanassche@acm.org
    Signed-off-by: Jens Axboe

    Bart Van Assche