12 Jan, 2020

1 commit

  • [ Upstream commit c58c1f83436b501d45d4050fd1296d71a9760bcb ]

    Non-mq devices do not honor REQ_NOWAIT, so give the caller a chance to
    retry the request gracefully on an -EAGAIN error.

    The problem is well reproduced using io_uring:

    mkfs.ext4 /dev/ram0
    mount /dev/ram0 /mnt

    # Preallocate a file
    dd if=/dev/zero of=/mnt/file bs=1M count=1

    # Start fio with io_uring and get -EIO
    fio --rw=write --ioengine=io_uring --size=1M --direct=1 --name=job --filename=/mnt/file

    Signed-off-by: Roman Penyaev
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Roman Penyaev
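
    A minimal sketch of the idea, assuming a small helper in the submission
    checks (the helper name and placement are illustrative, not the exact
    patch):

    static inline bool bio_check_nowait(struct request_queue *q, struct bio *bio)
    {
            /*
             * Non-mq queues do not honor REQ_NOWAIT, so complete the bio
             * with BLK_STS_AGAIN; the submitter (e.g. io_uring) then sees
             * -EAGAIN and can retry instead of getting -EIO.
             */
            if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_mq(q)) {
                    bio_wouldblock_error(bio);
                    return false;           /* caller must not submit the bio */
            }
            return true;
    }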
     

18 Sep, 2019

1 commit

  • Currently the t10_pi_prepare/t10_pi_complete functions are called during
    NVMe and SCSI command preparation/completion, but their proper place is
    the block layer, since T10-PI is a general data integrity feature that is
    used by block storage protocols. Introduce .prepare_fn and .complete_fn
    callbacks within the integrity profile that each type can implement
    according to its needs.

    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Suggested-by: Martin K. Petersen
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Max Gurtovoy

    Fixed to not call queue integrity functions if BLK_DEV_INTEGRITY
    isn't defined in the config.

    Signed-off-by: Jens Axboe

    Max Gurtovoy
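
    A rough sketch of the shape of the change, using the callback names from
    the commit text; the blk-mq call site shown is illustrative:

    struct blk_integrity_profile {
            integrity_processing_fn *generate_fn;
            integrity_processing_fn *verify_fn;
            integrity_prepare_fn    *prepare_fn;    /* newly introduced */
            integrity_complete_fn   *complete_fn;   /* newly introduced */
            const char              *name;
    };

    /* e.g. before dispatching a write request, remap the protection info */
    #ifdef CONFIG_BLK_DEV_INTEGRITY
            if (blk_integrity_rq(rq) && req_op(rq) == REQ_OP_WRITE)
                    rq->q->integrity.profile->prepare_fn(rq);
    #endif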
     

28 Aug, 2019

1 commit

  • The kernfs built-in lock of 'kn->count' is held in the sysfs .show/.store
    path. Meanwhile, inside block's .show/.store callbacks, q->sysfs_lock is
    required.

    However, when the mq & iosched kobjects are removed via
    blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
    too. This causes an AB-BA deadlock because the kernfs built-in lock of
    'kn->count' is also required inside kobject_del(), see the lockdep
    warning [1].

    On the other hand, it isn't necessary to acquire q->sysfs_lock for
    blk_mq_unregister_dev() & elv_unregister_queue(), because clearing the
    REGISTERED flag prevents stores to 'queue/scheduler' from happening.
    Also, a sysfs write (store) is exclusive, so there is no need to hold
    the lock for elv_unregister_queue() when it is called on the
    elevator-switching path.

    So split .sysfs_lock into two: one is still named .sysfs_lock and covers
    synchronous .store, the other is named .sysfs_dir_lock and covers
    kobjects and related status changes.

    sysfs itself can handle the race between adding/removing kobjects and
    showing/storing attributes under those kobjects. For switching the
    scheduler via a store to 'queue/scheduler', we use the queue flag
    QUEUE_FLAG_REGISTERED together with .sysfs_lock to avoid the race, so
    we can avoid holding .sysfs_lock while removing/adding kobjects.

    [1] lockdep warning
    ======================================================
    WARNING: possible circular locking dependency detected
    5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
    ------------------------------------------------------
    rmmod/777 is trying to acquire lock:
    00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72

    but task is already holding lock:
    00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&q->sysfs_lock){+.+.}:
    __lock_acquire+0x95f/0xa2f
    lock_acquire+0x1b4/0x1e8
    __mutex_lock+0x14a/0xa9b
    blk_mq_hw_sysfs_show+0x63/0xb6
    sysfs_kf_seq_show+0x11f/0x196
    seq_read+0x2cd/0x5f2
    vfs_read+0xc7/0x18c
    ksys_read+0xc4/0x13e
    do_syscall_64+0xa7/0x295
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (kn->count#202){++++}:
    check_prev_add+0x5d2/0xc45
    validate_chain+0xed3/0xf94
    __lock_acquire+0x95f/0xa2f
    lock_acquire+0x1b4/0x1e8
    __kernfs_remove+0x237/0x40b
    kernfs_remove_by_name_ns+0x59/0x72
    remove_files+0x61/0x96
    sysfs_remove_group+0x81/0xa4
    sysfs_remove_groups+0x3b/0x44
    kobject_del+0x44/0x94
    blk_mq_unregister_dev+0x83/0xdd
    blk_unregister_queue+0xa0/0x10b
    del_gendisk+0x259/0x3fa
    null_del_dev+0x8b/0x1c3 [null_blk]
    null_exit+0x5c/0x95 [null_blk]
    __se_sys_delete_module+0x204/0x337
    do_syscall_64+0xa7/0x295
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Possible unsafe locking scenario:

    CPU0                          CPU1
    ----                          ----
    lock(&q->sysfs_lock);
                                  lock(kn->count#202);
                                  lock(&q->sysfs_lock);
    lock(kn->count#202);

    *** DEADLOCK ***

    2 locks held by rmmod/777:
    #0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
    #1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

    stack backtrace:
    CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
    Call Trace:
    dump_stack+0x9a/0xe6
    check_noncircular+0x207/0x251
    ? print_circular_bug+0x32a/0x32a
    ? find_usage_backwards+0x84/0xb0
    check_prev_add+0x5d2/0xc45
    validate_chain+0xed3/0xf94
    ? check_prev_add+0xc45/0xc45
    ? mark_lock+0x11b/0x804
    ? check_usage_forwards+0x1ca/0x1ca
    __lock_acquire+0x95f/0xa2f
    lock_acquire+0x1b4/0x1e8
    ? kernfs_remove_by_name_ns+0x59/0x72
    __kernfs_remove+0x237/0x40b
    ? kernfs_remove_by_name_ns+0x59/0x72
    ? kernfs_next_descendant_post+0x7d/0x7d
    ? strlen+0x10/0x23
    ? strcmp+0x22/0x44
    kernfs_remove_by_name_ns+0x59/0x72
    remove_files+0x61/0x96
    sysfs_remove_group+0x81/0xa4
    sysfs_remove_groups+0x3b/0x44
    kobject_del+0x44/0x94
    blk_mq_unregister_dev+0x83/0xdd
    blk_unregister_queue+0xa0/0x10b
    del_gendisk+0x259/0x3fa
    ? disk_events_poll_msecs_store+0x12b/0x12b
    ? check_flags+0x1ea/0x204
    ? mark_held_locks+0x1f/0x7a
    null_del_dev+0x8b/0x1c3 [null_blk]
    null_exit+0x5c/0x95 [null_blk]
    __se_sys_delete_module+0x204/0x337
    ? free_module+0x39f/0x39f
    ? blkcg_maybe_throttle_current+0x8a/0x718
    ? rwlock_bug+0x62/0x62
    ? __blkcg_punt_bio_submit+0xd0/0xd0
    ? trace_hardirqs_on_thunk+0x1a/0x20
    ? mark_held_locks+0x1f/0x7a
    ? do_syscall_64+0x4c/0x295
    do_syscall_64+0xa7/0x295
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7fb696cdbe6b
    Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
    RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
    RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
    R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
    R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0

    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Greg KH
    Cc: Mike Snitzer
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
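
    A sketch of the resulting locking scheme; the field names follow the
    commit text, while the teardown path shown is an abbreviated
    illustration rather than the full patch:

    struct request_queue {
            /* ... */
            struct mutex    sysfs_lock;     /* still serializes sysfs .store handlers */
            struct mutex    sysfs_dir_lock; /* now covers kobject add/remove */
            /* ... */
    };

    void blk_unregister_queue(struct gendisk *disk)
    {
            struct request_queue *q = disk->queue;

            mutex_lock(&q->sysfs_lock);
            blk_queue_flag_clear(QUEUE_FLAG_REGISTERED, q); /* stops new 'scheduler' stores */
            mutex_unlock(&q->sysfs_lock);

            mutex_lock(&q->sysfs_dir_lock);
            blk_mq_unregister_dev(disk_to_dev(disk), q);    /* kobject_del() in here */
            kobject_del(&q->kobj);
            mutex_unlock(&q->sysfs_dir_lock);
    }

    Since kobject_del() is no longer called under .sysfs_lock, the kernfs
    'kn->count' lock never nests inside it, which breaks the AB-BA cycle
    shown in the lockdep report above.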
     

14 Aug, 2019

1 commit

  • psi tracks the time tasks wait for refaulting pages to become
    uptodate, but it does not track the time spent submitting the IO. The
    submission part can be significant if backing storage is contended or
    when cgroup throttling (io.latency) is in effect - a lot of time is
    spent in submit_bio(). In that case, we underreport memory pressure.

    Annotate submit_bio() to account submission time as memory stall when
    the bio is reading userspace workingset pages.

    Tested-by: Suren Baghdasaryan
    Signed-off-by: Johannes Weiner
    Signed-off-by: Jens Axboe

    Johannes Weiner
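
    The annotation amounts to wrapping the submission in a psi memstall
    section when the bio reads userspace workingset pages; a sketch of the
    shape of the change (the BIO_WORKINGSET flag is assumed to be set by the
    page cache on such bios):

    blk_qc_t submit_bio(struct bio *bio)
    {
            /* ... generic accounting elided ... */

            /*
             * Reading workingset data: count the submission time as a
             * memory stall, since it delays the refault just like the
             * device I/O itself does.
             */
            if (unlikely(bio_op(bio) == REQ_OP_READ &&
                         bio_flagged(bio, BIO_WORKINGSET))) {
                    unsigned long pflags;
                    blk_qc_t ret;

                    psi_memstall_enter(&pflags);
                    ret = generic_make_request(bio);
                    psi_memstall_leave(&pflags);
                    return ret;
            }

            return generic_make_request(bio);
    }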
     

05 Aug, 2019

2 commits

  • This implements REQ_OP_ZONE_RESET_ALL as a special case of the block
    device zone reset operations where we just simply issue bio with the
    newly introduced req op.

    We issue this req op when the number of sectors to reset equals the
    device's total number of sectors and the device has no partitions.

    We also add support so that blk_op_str() can print the new reset-all
    zone operation.

    This patch also adds a generic make request check for the newly
    introduced REQ_OP_ZONE_RESET_ALL req_opf. We simply return an error
    when the queue is zoned and the reset-all flag is not set for
    REQ_OP_ZONE_RESET_ALL.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Damien Le Moal
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
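
    A minimal sketch of the special case (field and helper usage are
    assumptions, not the exact patch): when the reset covers the whole,
    unpartitioned device, a single REQ_OP_ZONE_RESET_ALL bio replaces the
    per-zone REQ_OP_ZONE_RESET loop.

            if (sector == 0 && nr_sectors == get_capacity(bdev->bd_disk) &&
                bdev->bd_partno == 0) {
                    struct bio *bio = bio_alloc(GFP_KERNEL, 0);
                    int ret;

                    bio_set_dev(bio, bdev);
                    bio->bi_opf = REQ_OP_ZONE_RESET_ALL | REQ_SYNC;
                    ret = submit_bio_wait(bio);
                    bio_put(bio);
                    return ret;
            }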
     
  • Change a reference to the legacy block layer into a reference to blk-mq.

    Reviewed-by: Chaitanya Kulkarni
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: James Smart
    Cc: Ming Lei
    Cc: Jianchao Wang
    Cc: Dongli Zhang
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

11 Jul, 2019

1 commit

  • Simultaneously writing to a sequential zone of a zoned block device
    from multiple contexts requires mutual exclusion for BIO issuing to
    ensure that writes happen sequentially. However, even for a well-behaved
    user correctly implementing such synchronization, BIO plugging may
    interfere and result in BIOs from the different contexts being reordered
    if plugging is done outside of the mutual exclusion section, e.g. when
    the plug was started by a function higher in the call chain than the
    function issuing BIOs.

    Context A                                   Context B

    | blk_start_plug()
    | ...
    | seq_write_zone()
    |   mutex_lock(zone)
    |   bio-0->bi_iter.bi_sector = zone->wp
    |   zone->wp += bio_sectors(bio-0)
    |   submit_bio(bio-0)
    |   bio-1->bi_iter.bi_sector = zone->wp
    |   zone->wp += bio_sectors(bio-1)
    |   submit_bio(bio-1)
    |   mutex_unlock(zone)
    |   return
    | -----------------------> | seq_write_zone()
                                |   mutex_lock(zone)
                                |   bio-2->bi_iter.bi_sector = zone->wp
                                |   zone->wp += bio_sectors(bio-2)
                                |   submit_bio(bio-2)
                                |   mutex_unlock(zone)
                                |
    Signed-off-by: Jens Axboe

    Damien Le Moal
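
    One way to avoid the reordering above, sketched under the assumption of
    a small plug-lookup helper (the helper name is illustrative): have the
    submission path ignore the current plug for writes to zoned devices, so
    the BIOs are issued inside the caller's mutual-exclusion section.

    static inline struct blk_plug *blk_mq_plug(struct request_queue *q,
                                               struct bio *bio)
    {
            /*
             * Plugging a sequential-zone write could let another context's
             * write reach the device first; pretend there is no plug.
             */
            if (blk_queue_is_zoned(q) && op_is_write(bio_op(bio)))
                    return NULL;

            return current->plug;
    }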
     

10 Jul, 2019

2 commits

  • When a shared kthread needs to issue a bio for a cgroup, doing so
    synchronously can lead to priority inversions as the kthread can be
    trapped waiting for that cgroup. This patch implements
    REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing
    to a dedicated per-blkcg work item to avoid such priority inversions.

    This will be used to fix priority inversions in btrfs compression and
    should be generally useful as we grow filesystem support for
    comprehensive IO control.

    Cc: Chris Mason
    Reviewed-by: Josef Bacik
    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
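
    A sketch of what the punt decision looks like near the top of the
    submission path (the exact call site is an assumption):

    static inline bool blkcg_punt_bio_submit(struct bio *bio)
    {
            /* punt only when the submitter explicitly asked for it */
            if (bio->bi_opf & REQ_CGROUP_PUNT)
                    return __blkcg_punt_bio_submit(bio);    /* queued to the blkcg's work item */

            return false;   /* issue in the current context as usual */
    }

            /* caller, early in the submission path: */
            if (blkcg_punt_bio_submit(bio))
                    return BLK_QC_T_NONE;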
     
  • We discovered a problem in newer kernels where a disconnect of an NBD
    device while the flush request was pending would result in a hang. This
    is because the blk-mq timeout handler does

        if (!refcount_inc_not_zero(&rq->ref))
                return true;

    to determine if it's ok to run the timeout handler for the request.
    Flush requests don't have a ref count set, so we'd skip running the
    timeout handler for this request and it would just sit there in limbo
    forever.

    Fix this by always setting the refcount of any request going through
    blk_rq_init() to 1. I tested this with an nbd-server that dropped flush
    requests to verify that it hung, and then tested with this patch to
    verify I got the timeout as expected and the error handling kicked in.
    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
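
    A sketch of the fix, assuming it lands in the generic request
    initializer (most field setup elided):

    void blk_rq_init(struct request_queue *q, struct request *rq)
    {
            memset(rq, 0, sizeof(*rq));

            INIT_LIST_HEAD(&rq->queuelist);
            rq->q = q;
            rq->__sector = (sector_t) -1;
            rq->tag = -1;
            rq->internal_tag = -1;
            rq->start_time_ns = ktime_get_ns();
            /*
             * The fix: every request, including the per-queue flush_rq,
             * starts with one reference, so refcount_inc_not_zero() in the
             * blk-mq timeout handler no longer skips it.
             */
            refcount_set(&rq->ref, 1);
    }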
     

21 Jun, 2019

8 commits

  • Improve print_req_error() by printing additional request fields that are
    helpful for debugging. Use the newly introduced blk_op_str() to print
    the REQ_OP_XXX in string format.

    Reviewed-by: Chao Yu
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     
  • In order to centralize the REQ_OP_XXX to string conversion, which can be
    used in the block layer and in different places in the kernel like f2fs,
    this patch adds a new helper function along with an array similar to the
    one present in blk-mq-debugfs.c.

    We keep this helper functionality centralized in blk-core.c instead of
    blk-mq-debugfs.c, since blk-core.c is configured using CONFIG_BLOCK and
    will not depend on blk-mq-debugfs.c, which is configured using
    CONFIG_BLK_DEBUG_FS.

    The next patch adjusts the code in blk-mq-debugfs.c to use the newly
    introduced helper.

    Reviewed-by: Bart Van Assche
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
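
    The helper is roughly an op-indexed string table with a bounds check; a
    trimmed sketch:

    #define REQ_OP_NAME(name) [REQ_OP_##name] = #name
    static const char *const blk_op_name[] = {
            REQ_OP_NAME(READ),
            REQ_OP_NAME(WRITE),
            REQ_OP_NAME(FLUSH),
            REQ_OP_NAME(DISCARD),
            REQ_OP_NAME(ZONE_RESET),
            REQ_OP_NAME(WRITE_SAME),
            REQ_OP_NAME(WRITE_ZEROES),
            /* ... remaining ops elided ... */
    };
    #undef REQ_OP_NAME

    /* return a string for a REQ_OP_XXX value, or "UNKNOWN" if out of range */
    const char *blk_op_str(unsigned int op)
    {
            const char *op_str = "UNKNOWN";

            if (op < ARRAY_SIZE(blk_op_name) && blk_op_name[op])
                    op_str = blk_op_name[op];

            return op_str;
    }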
     
  • Print the calling function instead of print_req_error as a prefix, and
    print the operation and op_flags separately instead of the whole field.

    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This function just has a few trivial assignments and two callers, one of
    which is in the fast path.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Return the segment count and let the callers assign it, which makes the
    code a little more obvious. Also pass the request instead of q plus bio
    chain, allowing for the use of rq_for_each_bvec.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We only need the number of segments in the blk-mq submission path.
    Remove the field from struct bio, and return it from a variant of
    blk_queue_split instead, so that it can be passed as an argument to
    those functions that need the value.

    This also means we stop recounting segments except for cloning
    and partial segments.

    To keep the number of arguments in this hot path down, remove
    pointless struct request_queue arguments from any of the functions
    that had it and grew a nr_segs argument.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • lightnvm should have never used this function, as it is sending
    passthrough requests, so switch it to blk_rq_append_bio like all the
    other passthrough request users. Inline blk_init_request_from_bio into
    the only remaining caller.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Minwoo Im
    Reviewed-by: Javier González
    Reviewed-by: Matias Bjørling
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The priority field also makes sense for passthrough requests, so
    initialize it in blk_rq_bio_prep.

    Reviewed-by: Minwoo Im
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
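
    A sketch of what the helper looks like after the change (the exact set
    of assignments is an approximation):

    static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
                                       unsigned int nr_segs)
    {
            rq->nr_phys_segments = nr_segs;
            rq->__data_len = bio->bi_iter.bi_size;
            rq->bio = rq->biotail = bio;
            rq->ioprio = bio_prio(bio);     /* now also set for passthrough requests */

            if (bio->bi_disk)
                    rq->rq_disk = bio->bi_disk;
    }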
     

07 Jun, 2019

1 commit

  • In theory, the IO scheduler belongs to the request queue, and the request
    pool of sched tags belongs to the request queue too.

    However, the current tag allocation interfaces are re-used for both
    driver tags and sched tags, and driver tags are definitely host wide and
    don't belong to any request queue, same with their request pool. So we
    need the tagset instance for freeing requests of sched tags.

    Meantime, blk_mq_free_tag_set() often follows blk_cleanup_queue() in the
    non-BLK_MQ_F_TAG_SHARED case, which requires that the request pool of
    sched tags be freed before calling blk_mq_free_tag_set().

    Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
    moves blk_exit_queue into __blk_release_queue for simplifying the fast
    path in generic_make_request(), then causes an oops during freeing
    requests of sched tags in __blk_release_queue().

    Fix the above issue by moving the freeing of the sched tags request pool
    into blk_cleanup_queue(); this is safe because the queue has been frozen
    and there are no in-queue requests at that time. Freeing the sched tags
    themselves has to be kept in the queue's release handler because there
    might be un-completed dispatch activity which might refer to them.

    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
    Tested-by: Yi Zhang
    Reported-by: kernel test robot
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
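
    A sketch of the ordering this establishes in blk_cleanup_queue()
    (abbreviated; the surrounding steps and locking are approximations):

    void blk_cleanup_queue(struct request_queue *q)
    {
            /* ... mark the queue dying ... */

            blk_freeze_queue(q);    /* no request can be in flight after this */
            blk_sync_queue(q);

            if (queue_is_mq(q))
                    blk_mq_exit_queue(q);

            /*
             * Free the scheduler's request pool while the tag_set is still
             * alive; the sched tags themselves stay until the queue's
             * release handler, since dispatch may still reference them.
             */
            mutex_lock(&q->sysfs_lock);
            if (q->elevator)
                    blk_mq_sched_free_requests(q);
            mutex_unlock(&q->sysfs_lock);

            blk_put_queue(q);       /* final teardown in __blk_release_queue() */
    }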
     

01 Jun, 2019

1 commit

  • While troubleshooting issues where cloned request limits have been
    exceeded, it is often beneficial to know the actual values that
    have been breached. Print these values to help identify the root
    cause of the breach.

    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Ming Lei
    Signed-off-by: John Pittman
    Signed-off-by: Jens Axboe

    John Pittman
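
    A sketch of what the limit checks look like with the offending values
    printed (the message text is illustrative):

            if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, req_op(rq))) {
                    printk(KERN_ERR "%s: over max size limit. (%u > %u)\n",
                           __func__, blk_rq_sectors(rq),
                           blk_queue_get_max_sectors(q, req_op(rq)));
                    return -EIO;
            }

            /* ... recount segments against this queue's limits ... */
            if (rq->nr_phys_segments > queue_max_segments(q)) {
                    printk(KERN_ERR "%s: over max segments limit. (%hu > %hu)\n",
                           __func__, rq->nr_phys_segments, queue_max_segments(q));
                    return -EIO;
            }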
     

29 May, 2019

2 commits

  • Now a063057d7c73 ("block: Fix a race between request queue removal and
    the block cgroup controller") has been reverted, and blkcg_exit_queue()
    won't be called in blk_cleanup_queue() any more.

    So there is no need to protect generic_make_request_checks() with
    blk_queue_enter() any more, and the whole mess can be cleaned up.

    37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with device
    removal triggers a crash") is reverted.

    Cc: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Commit 498f6650aec8 ("block: Fix a race between the cgroup code and
    request queue initialization") moved what blk_exit_queue does into
    blk_cleanup_queue() to fix an issue caused by changing back the
    queue lock.

    However, after the legacy request IO path was killed, the driver queue
    lock isn't used at all, and there is no story about changing back the
    queue lock any more. So the issue addressed by commit 498f6650aec8
    doesn't exist any more.

    So move blk_exit_queue into __blk_release_queue.

    This patch basically reverts the following two commits:

    498f6650aec8 block: Fix a race between the cgroup code and request queue initialization
    24ecc3585348 block: Ensure that a request queue is dissociated from the cgroup controller

    Cc: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

24 May, 2019

1 commit

  • The following is a description of a hang in blk_mq_freeze_queue_wait().
    The hang happens on an attempt to freeze a queue while another task
    does a queue unfreeze.

    The root cause is an incorrect sequence of percpu_ref_resurrect() and
    percpu_ref_kill() and as a result those two can be swapped:

    CPU#0                               CPU#1
    ----------------                    -----------------
    q1 = blk_mq_init_queue(shared_tags)

                                        q2 = blk_mq_init_queue(shared_tags):
                                          blk_mq_add_queue_tag_set(shared_tags):
                                            blk_mq_update_tag_set_depth(shared_tags):
                                              list_for_each_entry()
                                                blk_mq_freeze_queue(q1)
                                                 > percpu_ref_kill()
                                                 > blk_mq_freeze_queue_wait()

    blk_cleanup_queue(q1)
      blk_mq_freeze_queue(q1)
       > percpu_ref_kill()
         ^^^^^^ freeze_depth can't guarantee the order

                                                blk_mq_unfreeze_queue()
                                                 > percpu_ref_resurrect()

       > blk_mq_freeze_queue_wait()
         ^^^^^^ Hang here!!!!

    This wrong sequence raises kernel warning:
    percpu_ref_kill_and_confirm called more than once on blk_queue_usage_counter_release!
    WARNING: CPU: 0 PID: 11854 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x99/0xb0

    But the most unpleasant effect is a hang of blk_mq_freeze_queue_wait(),
    which waits for q_usage_counter to reach zero, which never happens
    because the percpu-ref was reinitialized (instead of being killed) and
    stays in PERCPU state forever.

    How to reproduce:
    - "insmod null_blk.ko shared_tags=1 nr_devices=0 queue_mode=2"
    - cpu0: python Script.py 0; taskset the corresponding process running on cpu0
    - cpu1: python Script.py 1; taskset the corresponding process running on cpu1

    Script.py:
    ------
    #!/usr/bin/python3

    import os
    import sys

    while True:
        on = "echo 1 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
        off = "echo 0 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
        os.system(on)
        os.system(off)
    ------

    This bug was first reported and fixed by Roman, previous discussion:
    [1] Message id: 1443287365-4244-7-git-send-email-akinobu.mita@gmail.com
    [2] Message id: 1443563240-29306-6-git-send-email-tj@kernel.org
    [3] https://patchwork.kernel.org/patch/9268199/

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Roman Pen
    Signed-off-by: Bob Liu
    Signed-off-by: Jens Axboe

    Bob Liu
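
    The fix serializes the depth accounting with the kill/resurrect calls; a
    sketch under the assumption of a dedicated mutex in the queue
    (mq_freeze_lock here):

    void blk_freeze_queue_start(struct request_queue *q)
    {
            mutex_lock(&q->mq_freeze_lock);
            if (++q->mq_freeze_depth == 1) {
                    percpu_ref_kill(&q->q_usage_counter);
                    mutex_unlock(&q->mq_freeze_lock);
                    if (queue_is_mq(q))
                            blk_mq_run_hw_queues(q, false);
            } else {
                    mutex_unlock(&q->mq_freeze_lock);
            }
    }

    void blk_mq_unfreeze_queue(struct request_queue *q)
    {
            mutex_lock(&q->mq_freeze_lock);
            q->mq_freeze_depth--;
            WARN_ON_ONCE(q->mq_freeze_depth < 0);
            if (!q->mq_freeze_depth) {
                    percpu_ref_resurrect(&q->q_usage_counter);
                    wake_up_all(&q->mq_freeze_wq);
            }
            mutex_unlock(&q->mq_freeze_lock);
    }

    With both paths under the same mutex, percpu_ref_resurrect() can no
    longer slip in between a concurrent percpu_ref_kill() and its
    blk_mq_freeze_queue_wait().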
     

04 May, 2019

4 commits

  • Now that freeing hw queue resources has been moved to hctx's release
    handler, we don't need to worry about the race between blk_cleanup_queue
    and run queue any more.

    So don't drain in-progress dispatch in blk_cleanup_queue().

    This is basically a revert of c2856ae2f315 ("blk-mq: quiesce queue before
    freeing queue").

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • hctx is always released after the request queue is freed.

    With the queue's kobject refcount held, it is safe for the driver to run
    the queue, so a queue run might be scheduled after blk_sync_queue() is
    done.

    So move the cancel of hctx->run_work into blk_mq_hw_sysfs_release() to
    avoid running a released queue.

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Once blk_cleanup_queue() returns, tags shouldn't be used any more,
    because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
    ("blk-mq: Fix a use-after-free") fixes this issue exactly.

    However, that commit introduces another issue. Before 45a9c9d909b2,
    we were allowed to run the queue during queue cleanup if the queue's
    kobj refcount was held. After that commit, the queue can't be run during
    queue cleanup, otherwise an oops can be triggered easily because some
    fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

    We have invented ways to address this kind of issue before, such as:

    8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
    c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

    But those still can't cover all cases; recently James reported another
    such issue:

    https://marc.info/?l=linux-scsi&m=155389088124782&w=2

    This issue can be quite hard to address with the previous approaches,
    given scsi_run_queue() may run requeues for other LUNs.

    Fix the above issue by freeing hctx's resources in its release handler;
    this is safe because tags aren't needed for freeing such hctx resources.

    This approach follows the typical design pattern wrt. a kobject's
    release handler.

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reported-by: James Smart
    Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
    Cc: stable@vger.kernel.org
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
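
    A sketch of what the hctx kobject release handler ends up responsible
    for after this change (the exact set of resources is an approximation):

    static void blk_mq_hw_sysfs_release(struct kobject *kobj)
    {
            struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
                                                      kobj);

            /* resources that used to be freed in blk_mq_free_queue() */
            if (hctx->flags & BLK_MQ_F_BLOCKING)
                    cleanup_srcu_struct(hctx->srcu);
            blk_free_flush_queue(hctx->fq);
            sbitmap_free(&hctx->ctx_map);
            free_cpumask_var(hctx->cpumask);
            kfree(hctx->ctxs);
            kfree(hctx);
    }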
     
  • With the queue's kobject refcount held, it is safe for the driver
    to schedule a requeue. However, blk_mq_kick_requeue_list() may
    be called after blk_sync_queue() is done because of concurrent
    requeue activities, so the requeue work may not be completed when
    the queue is freed, and a kernel oops is triggered.

    So move the cancel of requeue_work into blk_mq_release() to avoid a
    race between requeue and freeing the queue.

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Bart Van Assche
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

05 Apr, 2019

1 commit

  • blk_mq_try_issue_directly() can return BLK_STS*_RESOURCE for requests that
    have been queued. If that happens when blk_mq_try_issue_directly() is called
    by the dm-mpath driver then dm-mpath will try to resubmit a request that is
    already queued and a kernel crash follows. Since it is nontrivial to fix
    blk_mq_request_issue_directly(), revert the blk_mq_request_issue_directly()
    changes that went into kernel v5.0.

    This patch reverts the following commits:
    * d6a51a97c0b2 ("blk-mq: replace and kill blk_mq_request_issue_directly") # v5.0.
    * 5b7a6f128aad ("blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests") # v5.0.
    * 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: James Smart
    Cc: Dongli Zhang
    Cc: Laurence Oberman
    Cc:
    Reported-by: Laurence Oberman
    Tested-by: Laurence Oberman
    Fixes: 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

13 Mar, 2019

1 commit

  • All users of VM_MAX_READAHEAD actually convert it to kbytes and then to
    pages. Define the macro explicitly as (SZ_128K / PAGE_SIZE). This
    simplifies the expression in every filesystem. Also rename the macro to
    VM_READAHEAD_PAGES to properly convey its meaning. Finally, remove the
    unused VM_MIN_READAHEAD.

    [akpm@linux-foundation.org: fix fs/io_uring.c, per Stephen]
    Link: http://lkml.kernel.org/r/20181221144053.24318-1-nborisov@suse.com
    Signed-off-by: Nikolay Borisov
    Reviewed-by: Matthew Wilcox
    Reviewed-by: David Hildenbrand
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Latchesar Ionkov
    Cc: Dominique Martinet
    Cc: David Howells
    Cc: Chris Mason
    Cc: Josef Bacik
    Cc: David Sterba
    Cc: Miklos Szeredi
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikolay Borisov
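
    The definition as described above, plus a typical block layer use site
    (the latter is illustrative):

    /* include/linux/mm.h */
    #define VM_READAHEAD_PAGES      (SZ_128K / PAGE_SIZE)

    /* e.g. when setting up a queue's backing_dev_info */
    q->backing_dev_info->ra_pages = VM_READAHEAD_PAGES;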
     

30 Jan, 2019

1 commit

  • syzbot is hitting flush_work() warning caused by commit 4d43d395fed12463
    ("workqueue: Try to catch flush_work() without INIT_WORK().") [1].
    Although that commit did not expect INIT_WORK(NULL) case, calling
    flush_work() without setting a valid callback should be avoided anyway.
    Fix this problem by setting a no-op callback instead of NULL.

    [1] https://syzkaller.appspot.com/bug?id=e390366bc48bc82a7c668326e0663be3b91cbd29

    Signed-off-by: Tetsuo Handa
    Reported-and-tested-by: syzbot
    Cc: Tejun Heo
    Signed-off-by: Jens Axboe

    Tetsuo Handa
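
    A sketch of the fix: register a harmless no-op callback instead of NULL
    so a later flush_work() always has a valid target (the function name is
    an assumption):

    static void blk_timeout_work(struct work_struct *work)
    {
            /* intentionally empty */
    }

            /* in queue allocation, instead of INIT_WORK(&q->timeout_work, NULL): */
            INIT_WORK(&q->timeout_work, blk_timeout_work);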
     

23 Jan, 2019

1 commit

  • Except for blk_queue_split(), bio_split() is used for splitting bios too,
    and the remaining bio is often resubmitted to the queue via
    generic_make_request(). So the same queue-enter recursion exists in this
    case too. Unfortunately commit cd4a4ae4683dc2 doesn't help this case.

    This patch covers the above case by setting BIO_QUEUE_ENTERED before
    calling q->make_request_fn.

    In theory, the per-bio flag is used to simulate a stack variable, so it
    is fine to clear it after q->make_request_fn returns; in particular, the
    same bio can't be submitted from another context.

    Fixes: cd4a4ae4683dc2 ("block: don't use blocking queue entered for recursive bio submits")
    Cc: Tetsuo Handa
    Cc: NeilBrown
    Reviewed-by: Mike Snitzer
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
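
    A sketch of the change around the driver call in generic_make_request()
    (placement follows the commit text and is illustrative):

            /*
             * Mark the bio before handing it to the driver so that a
             * recursive resubmission of a split remainder skips the
             * blocking queue enter; clear the flag afterwards, since it
             * only emulates a stack variable for this call.
             */
            bio_set_flag(bio, BIO_QUEUE_ENTERED);
            ret = q->make_request_fn(q, bio);
            bio_clear_flag(bio, BIO_QUEUE_ENTERED);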
     

10 Jan, 2019

1 commit

  • Commit 5f0ed774ed29 ("block: sum requests in the plug structure") removed
    the request_count parameter from blk_attempt_plug_merge(), but did not
    remove the associated kerneldoc comment, introducing this warning to the
    docs build:

    ./block/blk-core.c:685: warning: Excess function parameter 'request_count' description in 'blk_attempt_plug_merge'

    Remove the obsolete description and make things a little quieter.

    Signed-off-by: Jonathan Corbet
    Signed-off-by: Jens Axboe

    Jonathan Corbet
     

09 Jan, 2019

1 commit

  • There was some confusion about what these functions did. Make it clear
    that this is a hint for upper layers to pass to the block layer, and
    that it does not guarantee that I/O will not be submitted between a
    start and finish plug.

    Reported-by: "Darrick J. Wong"
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Ming Lei
    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

10 Dec, 2018

1 commit

  • We want to convert to per-cpu in_flight counters.

    The function part_round_stats needs the in_flight counter every jiffy; it
    would be too costly to sum all the per-cpu variables every jiffy, so it
    must be deleted. part_round_stats is used to calculate two counters:
    time_in_queue and io_ticks.

    time_in_queue can be calculated without part_round_stats, by adding the
    duration of the I/O when the I/O ends (the value is almost as exact as the
    previously calculated value, except that time for in-progress I/Os is not
    counted).

    io_ticks can be approximated by increasing the value when I/O is started
    or ended and the jiffies value has changed. If the I/Os take less than a
    jiffy, the value is as exact as the previously calculated value. If the
    I/Os take more than a jiffy, io_ticks can drift behind the previously
    calculated value.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mikulas Patocka
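
    A sketch of the io_ticks approximation described above (the helper name
    and the per-partition stamp field are assumptions):

    /* called when an I/O starts or ends; bumps io_ticks at most once per jiffy */
    static void update_io_ticks(struct hd_struct *part, unsigned long now)
    {
            unsigned long stamp = READ_ONCE(part->stamp);

            if (unlikely(stamp != now) &&
                likely(cmpxchg(&part->stamp, stamp, now) == stamp))
                    __part_stat_add(part, io_ticks, 1);
    }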