06 Oct, 2020

9 commits

  • Also move the definition from the public blkdev.h to the private
    block/blk.h header.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Also move the definition from the public blkdev.h to the private
    block/blk.h header.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The 'q_usage_counter' field is fetched in the fast path of every block
    driver, so move it to the front of 'request_queue' so that it lands in
    the first cacheline of the 'request_queue' instance.

    Signed-off-by: Ming Lei
    Tested-by: Veronika Kabatova
    Reviewed-by: Christoph Hellwig
    Cc: Sagi Grimberg
    Cc: Tejun Heo
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Cc: Bart Van Assche
    Signed-off-by: Jens Axboe

    Ming Lei
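
    A minimal layout sketch of the idea in this entry; only q_usage_counter is
    taken from the commit text, and the surrounding members are placeholders,
    not the actual kernel definition:

        struct request_queue {
                struct percpu_ref       q_usage_counter; /* moved first: touched in every fast path */
                /* ... the remaining, colder fields of request_queue follow ... */
        };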
     
  • 'struct percpu_ref' is often embedded in a user structure, and the
    instance is usually referenced in the fast path; however, only
    'percpu_count_ptr' is actually needed there.

    So move the other fields into a new 'percpu_ref_data' structure and
    allocate it dynamically via kzalloc(). The memory footprint of
    'percpu_ref' in the fast path is then reduced considerably, making it
    suitable for placement in a hot cacheline of the user structure.

    Signed-off-by: Ming Lei
    Tested-by: Veronika Kabatova
    Reviewed-by: Christoph Hellwig
    Acked-by: Tejun Heo
    Cc: Sagi Grimberg
    Cc: Tejun Heo
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Cc: Bart Van Assche
    Signed-off-by: Jens Axboe

    Ming Lei
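
    A sketch of the split described above, assuming field names along the
    lines of the upstream percpu-refcount code; the exact member list may
    differ:

        /* cold fields, allocated separately with kzalloc() */
        struct percpu_ref_data {
                atomic_long_t           count;
                percpu_ref_func_t       *release;
                percpu_ref_func_t       *confirm_switch;
                bool                    force_atomic:1;
                bool                    allow_reinit:1;
                struct rcu_head         rcu;
                struct percpu_ref       *ref;
        };

        /* only the fast-path word stays embedded in the user structure */
        struct percpu_ref {
                unsigned long           percpu_count_ptr;
                struct percpu_ref_data  *data;
        };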
     
  • bio_crypt_set_ctx() assumes its gfp_mask argument always includes
    __GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

    For now this assumption is still fine, since no callers violate it.
    Making bio_crypt_set_ctx() able to fail would add unneeded complexity.

    However, if a caller didn't use __GFP_DIRECT_RECLAIM, it would be very
    hard to notice the bug. Make it easier by adding a WARN_ON_ONCE().

    Signed-off-by: Eric Biggers
    Reviewed-by: Satya Tangirala
    Cc: Miaohe Lin
    Cc: Satya Tangirala
    Signed-off-by: Jens Axboe

    Eric Biggers
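
    A sketch of the added check, assuming the function signature and mempool
    name from block/blk-crypto.c of that era; the rest of the body is
    abbreviated:

        void bio_crypt_set_ctx(struct bio *bio, const struct blk_crypto_key *key,
                               const u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE], gfp_t gfp_mask)
        {
                struct bio_crypt_ctx *bc;

                /*
                 * The caller must pass a gfp_mask that includes __GFP_DIRECT_RECLAIM,
                 * since this function has no way to report a mempool_alloc() failure.
                 */
                WARN_ON_ONCE(!(gfp_mask & __GFP_DIRECT_RECLAIM));

                bc = mempool_alloc(bio_crypt_ctx_pool, gfp_mask);
                /* ... initialize bc from key/dun and attach it to the bio ... */
        }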
     
  • blk_crypto_rq_bio_prep() assumes its gfp_mask argument always includes
    __GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

    However, blk_crypto_rq_bio_prep() might be called with GFP_ATOMIC via
    setup_clone() in drivers/md/dm-rq.c.

    This case isn't currently reachable with a bio that actually has an
    encryption context. However, it's fragile to rely on this. Just make
    blk_crypto_rq_bio_prep() able to fail.

    Suggested-by: Satya Tangirala
    Signed-off-by: Eric Biggers
    Reviewed-by: Mike Snitzer
    Reviewed-by: Satya Tangirala
    Cc: Miaohe Lin
    Signed-off-by: Jens Axboe

    Eric Biggers
     
  • bio_crypt_clone() assumes its gfp_mask argument always includes
    __GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

    However, bio_crypt_clone() might be called with GFP_ATOMIC via
    setup_clone() in drivers/md/dm-rq.c, or with GFP_NOWAIT via
    kcryptd_io_read() in drivers/md/dm-crypt.c.

    Neither case is currently reachable with a bio that actually has an
    encryption context. However, it's fragile to rely on this. Just make
    bio_crypt_clone() able to fail, analogous to bio_integrity_clone().

    Reported-by: Miaohe Lin
    Signed-off-by: Eric Biggers
    Reviewed-by: Mike Snitzer
    Reviewed-by: Satya Tangirala
    Cc: Satya Tangirala
    Signed-off-by: Jens Axboe

    Eric Biggers
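
    A sketch of what a fallible clone looks like, analogous to
    bio_integrity_clone(); the helper, field, and pool names here follow the
    commit text but are assumptions, not the verified upstream diff:

        int __bio_crypt_clone(struct bio *dst, struct bio *src, gfp_t gfp_mask)
        {
                dst->bi_crypt_context = mempool_alloc(bio_crypt_ctx_pool, gfp_mask);
                if (!dst->bi_crypt_context)
                        return -ENOMEM; /* possible once GFP_ATOMIC/GFP_NOWAIT callers exist */
                *dst->bi_crypt_context = *src->bi_crypt_context;
                return 0;
        }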
     
  • All remaining callers of bdget() outside of fs/block_dev.c want to get a
    reference to the struct block_device for a given struct hd_struct. Add
    a helper just for that and then mark bdget static.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
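
    A sketch of such a helper; the name bdget_part() and the use of
    part_devt() are assumptions that follow the commit description:

        struct block_device *bdget_part(struct hd_struct *part)
        {
                return bdget(part_devt(part));
        }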
     
  • DRBD keeps a block device open just to get and set the capacity from
    it. Switch to primarily using the disk capacity as intended by the
    block layer, and sync it to the bdev using revalidate_disk_size.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
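
    An illustrative fragment of the pattern described above (not the actual
    DRBD diff; the wrapper name is hypothetical):

        static void drbd_update_capacity(struct drbd_device *device, sector_t size)
        {
                set_capacity(device->vdisk, size);          /* update the gendisk */
                revalidate_disk_size(device->vdisk, false); /* sync it to the bdev */
        }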
     

29 Sep, 2020

1 commit

  • Commit f5bbbbe4d635 ("blk-mq: sync the update nr_hw_queues with
    blk_mq_queue_tag_busy_iter") introduced a bug where we may sleep inside
    an RCU read-side critical section. Commit 530ca2c9bd69 ("blk-mq: Allow
    blocking queue tag iter callbacks") then fixed it by taking a reference
    on the request_queue, and commit a9a808084d6a ("block: Remove the
    synchronize_rcu() call from __blk_mq_update_nr_hw_queues()") removed the
    synchronize_rcu() from __blk_mq_update_nr_hw_queues(). Update the
    now-stale comments in blk_mq_queue_tag_busy_iter() accordingly.

    Signed-off-by: yangerkun
    Signed-off-by: Jens Axboe

    yangerkun
     

28 Sep, 2020

1 commit

  • We found that blk_mq_alloc_rq_maps() takes a long time in kernel space
    when testing NVMe device hot-plugging. The test and analysis are below.

    Debug code:

    1, blk_mq_alloc_rq_maps():

        u64 start, end;
        depth = set->queue_depth;
        start = ktime_get_ns();
        pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
               current->pid, current->comm, current->nvcsw, current->nivcsw,
               set->queue_depth, set->nr_hw_queues);
        do {
                err = __blk_mq_alloc_rq_maps(set);
                if (!err)
                        break;

                set->queue_depth >>= 1;
                if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
                        err = -ENOMEM;
                        break;
                }
        } while (set->queue_depth);
        end = ktime_get_ns();
        pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
               current->pid, current->comm,
               current->nvcsw, current->nivcsw, end - start);

    2, __blk_mq_alloc_rq_maps():

        u64 start, end;
        for (i = 0; i < set->nr_hw_queues; i++) {
                start = ktime_get_ns();
                if (!__blk_mq_alloc_rq_map(set, i))
                        goto out_unwind;
                end = ktime_get_ns();
                pr_err("hw queue %d init cost time %lld ns\n", i, end - start);
        }

    Testing NVMe hot-plugging with the above debug code, we found that it
    spends more than 3ms in kernel space without being scheduled out while
    allocating rqs for all 16 hw queues with depth 1023; each hw queue costs
    about 140-250us. The cost grows as the number of hw queues and the queue
    depth increase. In the extreme case where __blk_mq_alloc_rq_maps()
    returns -ENOMEM and retries with "queue_depth >>= 1", even more time is
    consumed.

    [ 428.428771] nvme nvme0: pci function 10000:01:00.0
    [ 428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
    [ 428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
    [ 428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
    [ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
    [ 432.593404] hw queue 0 init cost time 22883 ns
    [ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
    [ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues
    [ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
    [ 432.596203] hw queue 0 init cost time 242630 ns
    [ 432.596441] hw queue 1 init cost time 235913 ns
    [ 432.596659] hw queue 2 init cost time 216461 ns
    [ 432.596877] hw queue 3 init cost time 215851 ns
    [ 432.597107] hw queue 4 init cost time 228406 ns
    [ 432.597336] hw queue 5 init cost time 227298 ns
    [ 432.597564] hw queue 6 init cost time 224633 ns
    [ 432.597785] hw queue 7 init cost time 219954 ns
    [ 432.597937] hw queue 8 init cost time 150930 ns
    [ 432.598082] hw queue 9 init cost time 143496 ns
    [ 432.598231] hw queue 10 init cost time 147261 ns
    [ 432.598397] hw queue 11 init cost time 164522 ns
    [ 432.598542] hw queue 12 init cost time 143401 ns
    [ 432.598692] hw queue 13 init cost time 148934 ns
    [ 432.598841] hw queue 14 init cost time 147194 ns
    [ 432.598991] hw queue 15 init cost time 148942 ns
    [ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
    [ 432.602611] nvme0n1: p1

    So this patch triggers a reschedule between each hw queue init to avoid
    starving other threads (see the loop sketch after this entry).
    __blk_mq_alloc_rq_maps() does not run in atomic context, so it is safe
    to call cond_resched().

    Signed-off-by: Xianting Tian
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Xianting Tian
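
    A sketch of the resulting allocation loop with the reschedule point
    added; the unwind path and the rest of the function are abbreviated:

        for (i = 0; i < set->nr_hw_queues; i++) {
                if (!__blk_mq_alloc_rq_map(set, i))
                        goto out_unwind;
                cond_resched(); /* the fix: yield between per-hw-queue inits */
        }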
     

25 Sep, 2020

29 commits

  • An iocg may have 0 debt but non-zero delay. The current debt forgiveness
    logic doesn't act on such iocgs. This can lead to unexpected behaviors:
    an iocg with a little bit of debt will have its delay canceled through
    debt forgiveness, while one without any debt but with an active delay
    has to wait until the delay decays away.

    This patch updates the debt handling logic so that it treats delays the same
    as debts. If either debt or delay is active, debt forgiveness logic kicks in
    and acts on both the same way.

    Also, avoid turning the debt and delay directly to zero as that can confuse
    state transitions.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Debt forgiveness logic was counting the number of consecutive !busy periods
    as the trigger condition. While this usually works, it can easily be thrown
    off by temporary fluctuations, especially on configurations with short periods.

    This patch reimplements debt forgiveness so that:

    * Use the average usage over the forgiveness period instead of counting
    consecutive periods.

    * Debt is reduced at around the target rate (1/2 every 100ms) regardless of
    ioc period duration.

    * Usage threshold is raised to 50%. Combined with the preceding changes and
    the switch to average usage, this makes debt forgiveness a lot more
    effective at reducing the amount of unnecessary idleness.

    * Constants are renamed with DFGV_ prefix.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Debt sets the initial delay duration which is decayed over time. The current
    debt reduction halved the debt but didn't change the delay. It prevented
    future debts from increasing delay but didn't do anything to lower the
    existing delay, limiting the mechanism's ability to reduce unnecessary
    idling.

    Reset iocg->delay to 0 after debt reduction so that iocg_kick_waitq()
    recalculates new delay value based on the reduced debt amount.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Debt reduction was blocked if any iocg was short on budget in the past
    period to avoid reducing debts while some iocgs are saturated. However, this
    ends up unnecessarily blocking debt reduction due to temporary local
    imbalances when the device is generally being underutilized, while also
    failing to block when the underlying device is overwhelmed and the usage
    becomes low from high latency.

    Given that debt accumulation mostly happens with swapout bursts which can
    significantly deteriorate the underlying device's latency response, the
    current logic is not great.

    Let's replace it with ioc->busy_level based condition so that we block debt
    reduction when the underlying device is being saturated. ioc_forgive_debts()
    call is moved after busy_level determination.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Debt reduction logic is going to be improved and expanded. Factor it out
    into ioc_forgive_debts() and generalize the comment a bit. No functional
    change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Add the DM target feature flag DM_TARGET_NOWAIT, which advertises that
    the target works with REQ_NOWAIT bios.

    Add dm_table_supports_nowait() and update dm_table_set_restrictions()
    to set/clear QUEUE_FLAG_NOWAIT accordingly.

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
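
    A sketch of the wiring described above, assuming the usual dm-table
    iteration helpers; the details of the real dm_table_supports_nowait()
    may differ:

        static bool dm_table_supports_nowait(struct dm_table *t)
        {
                unsigned int i;

                for (i = 0; i < dm_table_get_num_targets(t); i++) {
                        struct dm_target *ti = dm_table_get_target(t, i);

                        if (!dm_target_supports_nowait(ti->type))
                                return false;
                }
                return true;
        }

        /* in dm_table_set_restrictions() */
        if (dm_table_supports_nowait(t))
                blk_queue_flag_set(QUEUE_FLAG_NOWAIT, q);
        else
                blk_queue_flag_clear(QUEUE_FLAG_NOWAIT, q);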
     
  • Add QUEUE_FLAG_NOWAIT to allow a block device to advertise support for
    REQ_NOWAIT. Bio-based devices may set QUEUE_FLAG_NOWAIT where
    applicable.

    Update QUEUE_FLAG_MQ_DEFAULT to include QUEUE_FLAG_NOWAIT. Also
    update submit_bio_checks() to verify it is set for REQ_NOWAIT bios.

    Reported-by: Konstantin Khlebnikov
    Suggested-by: Christoph Hellwig
    Signed-off-by: Mike Snitzer
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Mike Snitzer
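
    A minimal sketch of the submit_bio_checks() change, assuming a
    blk_queue_nowait() test macro for the new flag; the real error path is
    abbreviated here:

        if ((bio->bi_opf & REQ_NOWAIT) && !blk_queue_nowait(q))
                goto not_supported; /* REQ_NOWAIT only valid on nowait-capable queues */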
     
  • No need to go through the hd_struct to find the partition number.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • No need to go through the hd_struct to find the partition number.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • bd_contains is never NULL for an open block device. In addition, ibd_bd
    is always set to a block device that was exclusively opened by the
    target code, so the holder is guaranteed to be ib_dev as well.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The ->bd_contains field is set by __blkdev_get and drivers have no
    business manipulating it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • bd_disk is set on all block devices, including those for partitions.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • bd_disk is set on all block devices, including those for partitions.

    Signed-off-by: Christoph Hellwig
    Acked-by: Song Liu
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • To check for partitions of the same disk, bd_contains works as well, but
    bd_disk is way more obvious.

    Signed-off-by: Christoph Hellwig
    Acked-by: Song Liu
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Add a little helper to make the somewhat arcane bd_contains checks a
    little more obvious.

    Signed-off-by: Christoph Hellwig
    Acked-by: Ulf Hansson
    Signed-off-by: Jens Axboe

    Christoph Hellwig
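
    A sketch of what such a helper can look like; the name bdev_is_partition()
    and the bd_partno test are assumptions matching the description above:

        static inline bool bdev_is_partition(struct block_device *bdev)
        {
                return bdev->bd_partno != 0;
        }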
     
  • bd_contains is an implementation detail and should not be mentioned in
    userspace API documentation.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Replace the two negative flags that are always used together with a
    single positive flag that indicates the writeback capability instead
    of two related non-capabilities. Also remove the pointless wrappers
    to just check the flag.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to
    make the checks more obvious. Also remove the pointless
    bdi_cap_account_writeback wrapper that just obfuscates the check.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • BDI_CAP_STABLE_WRITES is one of the few bits of information in the
    backing_dev_info shared between the block drivers and the writeback code.
    To help untangle that dependency, replace it with a queue flag and a
    superblock flag derived from it. This also helps with the case of e.g.
    a file system requiring stable writes due to its own checksumming, but
    not forcing it on other users of the block device like the swap code.

    One downside is that we can't support the stable_pages_required bdi
    attribute in sysfs anymore. It is replaced with a queue attribute, which
    is also writable for easier testing.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
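
    A sketch of the two sides after the change; the flag and field names
    follow the description above but are not verified against the exact diff:

        /* driver side: request stable pages via the queue flag */
        blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);

        /* writeback side: act on the superblock flag derived from it */
        void wait_for_stable_page(struct page *page)
        {
                if (page->mapping->host->i_sb->s_iflags & SB_I_STABLE_WRITES)
                        wait_on_page_writeback(page);
        }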
     
  • There is no point in trying to call bdev_read_page if SWP_SYNCHRONOUS_IO
    is not set, as the device won't support it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • BDI_CAP_SYNCHRONOUS_IO is only checked in the swap code, and used to
    decide if ->rw_page can be used on a block device. Just check for the
    method instead. The only complication is that zram needs a second
    set of block_device_operations, as it can switch between modes that
    actually support ->rw_page and those that don't.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
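
    An illustrative fragment of "check for the method instead" on the swap
    side; the swap_info_struct usage here is an assumption, not the exact
    upstream hunk:

        if (si->bdev && si->bdev->bd_disk->fops->rw_page)
                si->flags |= SWP_SYNCHRONOUS_IO;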
     
  • Just checking SB_I_CGROUPWB for cgroup writeback support is enough.
    Either the file system allocates its own bdi (e.g. btrfs), in which case
    it is known to support cgroup writeback, or the bdi comes from the block
    layer, which always supports cgroup writeback.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
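
    A sketch of the simplified check, assuming the usual inode_cgwb_enabled()
    style helper; the cgroup-subsys tests are kept only as a reminder of the
    other conditions:

        static inline bool inode_cgwb_enabled(struct inode *inode)
        {
                return cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
                       cgroup_subsys_on_dfl(io_cgrp_subsys) &&
                       (inode->i_sb->s_iflags & SB_I_CGROUPWB);
        }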
     
  • Drivers shouldn't really mess with the readahead size, as that is a VM
    concept. Instead set it based on the optimal I/O size by lifting the
    algorithm from the md driver when registering the disk. Also set
    bdi->io_pages there by applying the same scheme based on max_sectors.
    To ensure the limits work well for stacking drivers, a new helper is
    added to update the readahead limits from the block limits; it is also
    called from disk_stack_limits.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Jan Kara
    Reviewed-by: Mike Snitzer
    Reviewed-by: Martin K. Petersen
    Acked-by: Coly Li
    Signed-off-by: Jens Axboe

    Christoph Hellwig
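
    A sketch of the helper described above; the name blk_queue_update_readahead
    and the exact constants are assumptions based on the commit text and the
    md scheme it lifts:

        void blk_queue_update_readahead(struct request_queue *q)
        {
                /* read-ahead window: twice the optimal I/O size, at least the VM default */
                q->backing_dev_info->ra_pages =
                        max(queue_io_opt(q) * 2 / PAGE_SIZE, VM_READAHEAD_PAGES);
                q->backing_dev_info->io_pages =
                        queue_max_sectors(q) >> (PAGE_SHIFT - 9);
        }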
     
  • The raid5 and raid10 drivers currently update the read-ahead size,
    but not the optimal I/O size on reshape. To prepare for deriving the
    read-ahead size from the optimal I/O size make sure it is updated
    as well.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Martin K. Petersen
    Acked-by: Song Liu
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Set up a readahead size by default, as very few users have a good
    reason to change it. This means coda, ecryptfs, and orangefs now
    set up the values where they were previously missing, while ubifs,
    mtd and vboxsf manually set it to 0 to avoid readahead.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Acked-by: David Sterba [btrfs]
    Acked-by: Richard Weinberger [ubifs, mtd]
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • aoe forces a larger readahead size, but any reason to do larger I/O
    is not limited to readahead. Also set the optimal I/O size, and
    remove the local constants in favor of just using SZ_2G.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Inherit the optimal I/O size setting just like the readahead window,
    as any reason to do larger I/O does not apply to just readahead.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Martin K. Petersen
    Acked-by: Coly Li
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Ever since the switch to blk-mq, a lower device not used for VM
    writeback will not be marked congested, so the check will never
    trigger.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig