21 Nov, 2018

1 commit

  • commit 8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a upstream.

    c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") already
    fixed this race; however, the synchronize_rcu() implied by
    blk_mq_quiesce_queue() can slow down LUN probing a lot, which caused
    a performance regression.

    Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
    tried to avoid the unnecessary synchronize_rcu() by quiescing the
    queue only when queue initialization is done, because it is common
    to see lots of nonexistent LUNs that need to be probed.

    However, it turns out that it isn't safe to quiesce the queue only
    when queue initialization is done. When one SCSI command is
    completed, the submitter of the command can be woken up immediately;
    the SCSI device may then be removed while the run queue in
    scsi_end_request() is still in progress, so a kernel panic can be
    caused.

    In Red Hat QE lab, there are several reports about this kind of kernel
    panic triggered during kernel booting.

    This patch addresses the issue by grabbing one queue usage counter
    across freeing one request and the run queue that follows.
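
    A minimal sketch of the pattern described above (illustrative, not
    the verbatim upstream diff), assuming the q_usage_counter percpu_ref
    in struct request_queue:

    /* Pin the queue across freeing the request and the follow-on run,
     * so a concurrent device removal cannot tear the queue down. */
    percpu_ref_get(&q->q_usage_counter);
    blk_mq_free_request(rq);            /* may wake up the submitter */
    blk_mq_run_hw_queues(q, true);      /* queue is still pinned here */
    percpu_ref_put(&q->q_usage_counter);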

    Fixes: 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
    Cc: Andrew Jones
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E.J. Bottomley
    Cc: stable
    Cc: jianchao.wang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

22 Sep, 2018

1 commit

  • Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
    updating properly on 4.18. This is because we started using ktime to
    track elapsed time, and we convert nanoseconds to jiffies when we update
    the partition counter. However, this gets rounded down, so any I/Os that
    take less than a jiffy are not accounted for. Previously in this case,
    the value of jiffies would sometimes increment while we were doing I/O,
    so at least some I/Os were accounted for.

    Let's convert the stats to use nanoseconds internally. We still report
    milliseconds as before, now more accurately than ever. The value is
    still truncated to 32 bits for backwards compatibility.
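
    Roughly, the change looks like this (a sketch with illustrative
    variable names; part_stat_add()/part_stat_read() are the existing
    per-cpu stat helpers):

    /* Accounting side: accumulate the elapsed time in nanoseconds. */
    part_stat_add(cpu, part, nsecs[sgrp],
                  ktime_get_ns() - req->start_time_ns);

    /* Reporting side: convert to milliseconds, truncated to 32 bits
     * for backwards compatibility, only when printing the stats. */
    unsigned int ms = (unsigned int)div_u64(part_stat_read(part, nsecs[sgrp]),
                                            NSEC_PER_MSEC);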

    Fixes: 522a777566f5 ("block: consolidate struct request timestamp fields")
    Cc: stable@vger.kernel.org
    Reported-by: Klaus Kusche
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

06 Sep, 2018

1 commit

    It is possible to call fsync on a read-only handle (for example, fsck.ext2
    does it when doing a read-only check), and this call results in a
    kernel warning.

    The patch b089cfd95d32 ("block: don't warn for flush on read-only device")
    attempted to disable the warning, but it is buggy and doesn't work
    (op_is_flush() tests the flag bits, but bio_op() strips them off).
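
    Concretely, the difference is one argument (a sketch of the check in
    the read-only path):

    /* Buggy: bio_op() masks off REQ_PREFLUSH/REQ_FUA, so op_is_flush()
     * never sees them and the warning still fires. */
    if (op_is_flush(bio_op(bio)) && !bio_sectors(bio))
            return false;

    /* Fixed: test the unstripped flags. */
    if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
            return false;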

    Signed-off-by: Mikulas Patocka
    Fixes: 721c7fc701c7 ("block: fail op_is_write() requests to read-only partitions")
    Cc: stable@vger.kernel.org # 4.18
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     

23 Aug, 2018

1 commit

  • Pull more block updates from Jens Axboe:

    - Set of bcache fixes and changes (Coly)

    - The flush warn fix (me)

    - Small series of BFQ fixes (Paolo)

    - wbt hang fix (Ming)

    - blktrace fix (Steven)

    - blk-mq hardware queue count update fix (Jianchao)

    - Various little fixes

    * tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits)
    block/DAC960.c: make some arrays static const, shrinks object size
    blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
    blk-mq: init hctx sched after update ctx and hctx mapping
    block: remove duplicate initialization
    tracing/blktrace: Fix to allow setting same value
    pktcdvd: fix setting of 'ret' error return for a few cases
    block: change return type to bool
    block, bfq: return nbytes and not zero from struct cftype .write() method
    block, bfq: improve code of bfq_bfqq_charge_time
    block, bfq: reduce write overcharge
    block, bfq: always update the budget of an entity when needed
    block, bfq: readd missing reset of parent-entity service
    blk-wbt: fix IO hang in wbt_wait()
    block: don't warn for flush on read-only device
    bcache: add the missing comments for smp_mb()/smp_wmb()
    bcache: remove unnecessary space before ioctl function pointer arguments
    bcache: add missing SPDX header
    bcache: move open brace at end of function definitions to next line
    bcache: add static const prefix to char * array declarations
    bcache: fix code comments style
    ...

    Linus Torvalds
     

18 Aug, 2018

1 commit

    This patch removes the duplicate initialization of q->queue_head
    in blk_alloc_queue_node(). Removing the second initialization
    preserves an initialization order that matches the declaration
    order in struct request_queue.

    Reviewed-by: Omar Sandoval
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     

15 Aug, 2018

2 commits

  • Pull block updates from Jens Axboe:
    "First pull request for this merge window, there will also be a
    followup request with some stragglers.

    This pull request contains:

    - Fix for a thundering herd issue in the wbt block code (Anchal
    Agarwal)

    - A few NVMe pull requests:
    * Improved tracepoints (Keith)
    * Larger inline data support for RDMA (Steve Wise)
    * RDMA setup/teardown fixes (Sagi)
    * Effects log support for NVMe target (Chaitanya Kulkarni)
    * Buffered IO support for NVMe target (Chaitanya Kulkarni)
    * TP4004 (ANA) support (Christoph)
    * Various NVMe fixes

    - Block io-latency controller support. Much needed support for
    properly containing block devices. (Josef)

    - Series improving how we handle sense information on the stack
    (Kees)

    - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

    - Zoned device support for null_blk (Matias)

    - AIX partition fixes (Mauricio Faria de Oliveira)

    - DIF checksum code made generic (Max Gurtovoy)

    - Add support for discard in iostats (Michael Callahan / Tejun)

    - Set of updates for BFQ (Paolo)

    - Removal of async write support for bsg (Christoph)

    - Bio page dirtying and clone fixups (Christoph)

    - Set of bcache fix/changes (via Coly)

    - Series improving blk-mq queue setup/teardown speed (Ming)

    - Series improving merging performance on blk-mq (Ming)

    - Lots of other fixes and cleanups from a slew of folks"

    * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
    blkcg: Make blkg_root_lookup() work for queues in bypass mode
    bcache: fix error setting writeback_rate through sysfs interface
    null_blk: add lock drop/acquire annotation
    Blk-throttle: reduce tail io latency when iops limit is enforced
    block: paride: pd: mark expected switch fall-throughs
    block: Ensure that a request queue is dissociated from the cgroup controller
    block: Introduce blk_exit_queue()
    blkcg: Introduce blkg_root_lookup()
    block: Remove two superfluous #include directives
    blk-mq: count the hctx as active before allocating tag
    block: bvec_nr_vecs() returns value for wrong slab
    bcache: trivial - remove tailing backslash in macro BTREE_FLAG
    bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
    bcache: set max writeback rate when I/O request is idle
    bcache: add code comments for bset.c
    bcache: fix mistaken comments in request.c
    bcache: fix mistaken code comments in bcache.h
    bcache: add a comment in super.c
    bcache: avoid unncessary cache prefetch bch_btree_node_get()
    bcache: display rate debug parameters to 0 when writeback is not running
    ...

    Linus Torvalds
     
  • Don't warn for a flush issued to a read-only device. It's not strictly
    a writable command, as it doesn't change any on-media data by itself.

    Reported-by: Stefan Agner
    Fixes: 721c7fc701c7 ("block: fail op_is_write() requests to read-only partitions")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Aug, 2018

1 commit

  • It turns out that commit 721c7fc701c7 ("block: fail op_is_write()
    requests to read-only partitions"), while obviously correct, causes
    problems for some older lvm2 installations.

    The reason is that the lvm snapshotting will continue to write to the
    snapshot COW volume, even after the volume has been marked read-only.
    End result: snapshot failure.

    This has actually been fixed in newer versions of the lvm2 tools, but
    the old tools still exist, and the breakage was reported both in the
    kernel bugzilla and in the Debian bugzilla:

    https://bugzilla.kernel.org/show_bug.cgi?id=200439
    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900442

    The lvm2 fix is here:

    https://sourceware.org/git/?p=lvm2.git;a=commit;h=a6fdb9d9d70f51c49ad11a87ab4243344e6701a3

    but until everybody has updated to recent versions, we'll have to weaken
    the "never write to read-only partitions" check. It now allows the
    write to happen, but causes a warning, something like this:

    generic_make_request: Trying to write to read-only block-device dm-3 (partno X)
    Modules linked in: nf_tables xt_cgroup xt_owner kvm_intel iwlmvm kvm irqbypass iwlwifi
    CPU: 1 PID: 77 Comm: kworker/1:1 Not tainted 4.17.9-gentoo #3
    Hardware name: LENOVO 20B6A019RT/20B6A019RT, BIOS GJET91WW (2.41 ) 09/21/2016
    Workqueue: ksnaphd do_metadata
    RIP: 0010:generic_make_request_checks+0x4ac/0x600
    ...
    Call Trace:
    generic_make_request+0x64/0x400
    submit_bio+0x6c/0x140
    dispatch_io+0x287/0x430
    sync_io+0xc3/0x120
    dm_io+0x1f8/0x220
    do_metadata+0x1d/0x30
    process_one_work+0x1b9/0x3e0
    worker_thread+0x2b/0x3c0
    kthread+0x113/0x130
    ret_from_fork+0x35/0x40

    Note that this is a "revert" in behavior only. I'm leaving alone the
    actual code cleanups in commit 721c7fc701c7, but letting the previously
    uncaught request go through with a warning instead of stopping it.
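
    The weakened check looks roughly like this (a sketch of the
    bio_check_ro() logic, not the verbatim diff):

    if (part->policy && op_is_write(bio_op(bio))) {
            char b[BDEVNAME_SIZE];

            /* Zero-size flushes don't alter on-media data; no warning. */
            if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
                    return false;

            WARN_ONCE(1, "generic_make_request: Trying to write to "
                      "read-only block-device %s (partno %d)\n",
                      bio_devname(bio, b), part->partno);
            /* Older lvm-tools actually trigger this. Allow the write
             * through, with the warning, instead of failing the bio. */
            return false;
    }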

    Fixes: 721c7fc701c7 ("block: fail op_is_write() requests to read-only partitions")
    Reported-and-tested-by: WGH
    Acked-by: Mike Snitzer
    Cc: Sagi Grimberg
    Cc: Ilya Dryomov
    Cc: Jens Axboe
    Cc: Zdenek Kabelac
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Aug, 2018

1 commit

    Runtime PM isn't ready for blk-mq yet, and commit 765e40b675a9 ("block:
    disable runtime-pm for blk-mq") tried to disable it. Unfortunately,
    that change can't take effect, since user space can still switch
    runtime PM on via 'echo auto > /sys/block/sdN/device/power/control'.

    This patch really disables runtime PM for blk-mq via pm_runtime_disable()
    and fixes all kinds of PM-related kernel crashes.
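
    A sketch of the approach, assuming it lands in blk_pm_runtime_init()
    as the description implies:

    void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
    {
            /* Runtime PM isn't ready for blk-mq: disable it outright so
             * 'echo auto > .../power/control' cannot switch it back on. */
            if (q->mq_ops) {
                    pm_runtime_disable(dev);
                    return;
            }
            /* legacy (non-mq) runtime-PM setup continues below */
    }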

    Cc: Tomas Janousek
    Cc: Przemek Socha
    Cc: Alan Stern
    Cc:
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Tested-by: Patrick Steinhardt
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

30 Jul, 2018

1 commit

    We found a memory use-after-free issue in __blk_drain_queue()
    on kernel 4.14. After reading the latest 4.18-rc6 code, we
    believe it has the same problem.

    Memory is allocated for q->fq in blk_init_allocated_queue().
    If the elevator init function returns an error, the failure
    path frees q->fq.

    __blk_drain_queue() then uses that memory after q->fq has been
    freed, which leads to unpredictable behavior.

    This patch sets q->fq to NULL on the failure path of
    blk_init_allocated_queue().
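
    The fix is essentially a one-liner on that failure path (sketch of
    the error labels in blk_init_allocated_queue()):

    out_exit_flush_rq:
            if (q->exit_rq_fn)
                    q->exit_rq_fn(q, q->fq->flush_rq);
    out_free_flush_queue:
            blk_free_flush_queue(q->fq);
            q->fq = NULL;   /* don't leave a dangling pointer around for
                               __blk_drain_queue() to dereference */
            return -ENOMEM;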

    Fixes: 7c94e1c157a2 ("block: introduce blk_flush_queue to drive flush machinery")
    Cc:
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: xiao jin
    Signed-off-by: Jens Axboe

    xiao jin
     

18 Jul, 2018

1 commit

  • Add and use a new op_stat_group() function for indexing partition stat
    fields rather than indexing them by rq_data_dir() or bio_data_dir().
    This function works similarly to op_is_sync() in that it takes the
    request::cmd_flags or bio::bi_opf flags and determines which stats
    should get updated.

    In addition, the second parameter to generic_start_io_acct() and
    generic_end_io_acct() is now a REQ_OP rather than simply a read or
    write bit and it uses op_stat_group() on the parameter to determine
    the stat group.

    Note that the partition in_flight counts are not part of the per-cpu
    statistics and as such are not indexed via this function. They are
    now indexed by op_is_write().
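
    For reference, the new helper is roughly (the discard case was added
    later in the same series):

    static inline int op_stat_group(unsigned int op)
    {
            return op_is_write(op) ? STAT_WRITE : STAT_READ;
    }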

    tj: Refreshed on top of v4.17. Updated to pass around REQ_OP.

    Signed-off-by: Michael Callahan
    Signed-off-by: Tejun Heo
    Cc: Minchan Kim
    Cc: Dan Williams
    Cc: Joshua Morris
    Cc: Philipp Reisner
    Cc: Matias Bjorling
    Cc: Kent Overstreet
    Cc: Alasdair Kergon
    Signed-off-by: Jens Axboe

    Michael Callahan
     

09 Jul, 2018

5 commits

  • With gcc 4.9.0 and 7.3.0:

    block/blk-core.c: In function 'blk_pm_allow_request':
    block/blk-core.c:2747:2: warning: enumeration value 'RPM_ACTIVE' not handled in switch [-Wswitch]
    switch (rq->q->rpm_status) {
    ^

    Convert the return statement below the switch() block into a default
    case to fix this.
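
    The change is mechanical (a sketch of blk_pm_allow_request()):

    static struct request *blk_pm_allow_request(struct request *rq)
    {
            switch (rq->q->rpm_status) {
            case RPM_RESUMING:
            case RPM_SUSPENDING:
                    return rq->rq_flags & RQF_PM ? rq : NULL;
            case RPM_SUSPENDED:
                    return NULL;
            default:
                    /* was a bare return below the switch; folding it in
                     * lets -Wswitch see RPM_ACTIVE as handled */
                    return rq;
            }
    }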

    Fixes: e4f36b249b4d4e75 ("block: fix peeking requests during PM")
    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Geert Uytterhoeven
     
  • We don't really need to save this stuff in the core block code, we can
    just pass the bio back into the helpers later on to derive the same
    flags and update the rq->wbt_flags appropriately.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • blkcg-qos is going to do essentially what wbt does, only on a cgroup
    basis. Break out the common code that will be shared between blkcg-qos
    and wbt into blk-rq-qos.* so they can both utilize the same
    infrastructure.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • The payload of struct request is stored in the request.bio chain if
    the RQF_SPECIAL_PAYLOAD flag is not set and in request.special_vec if
    RQF_SPECIAL_PAYLOAD has been set. However, blk_update_request()
    iterates over req->bio whether or not RQF_SPECIAL_PAYLOAD has been
    set. Additionally, the RQF_SPECIAL_PAYLOAD flag is ignored by
    blk_rq_bytes() which means that the value returned by that function
    is incorrect if the RQF_SPECIAL_PAYLOAD flag has been set. It is not
    clear to me whether this is an oversight or whether this happened on
    purpose. Anyway, document that it is known that both functions ignore
    RQF_SPECIAL_PAYLOAD. See also commit f9d03f96b988 ("block: improve
    handling of the magic discard payload").
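
    For comparison, the payload-aware byte count that does honor the
    flag (upstream's blk_rq_payload_bytes(), roughly):

    static inline unsigned int blk_rq_payload_bytes(struct request *rq)
    {
            if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
                    return rq->special_vec.bv_len;
            return blk_rq_bytes(rq);
    }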

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • SCSI probing may synchronously create and destroy a lot of request_queues
    for non-existent devices. Any synchronize_rcu() in queue creation or
    destroy path may introduce long latency during booting, see detailed
    description in comment of blk_register_queue().

    This patch removes one synchronize_rcu() inside blk_cleanup_queue()
    for this case. Commit c2856ae2f315d75 ("blk-mq: quiesce queue before
    freeing queue") needs synchronize_rcu() to implement
    blk_mq_quiesce_queue(), but when the queue isn't initialized it isn't
    necessary to do that, since only pass-through requests are involved
    and the original issue in scsi_execute() can't occur at all.

    Without this patch and the previous one, it may take more than 20
    seconds for virtio-scsi to complete disk probing. With the two
    patches, the time becomes less than 100ms.
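
    A sketch of the resulting logic in blk_cleanup_queue():

    /* Quiesce (and pay for the implied synchronize_rcu()) only for
     * queues that finished initialization; uninitialized queues have
     * only ever seen pass-through requests. */
    if (q->mq_ops && blk_queue_init_done(q))
            blk_mq_quiesce_queue(q);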

    Fixes: c2856ae2f315d75 ("blk-mq: quiesce queue before freeing queue")
    Reported-by: Andrew Jones
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: "Martin K. Petersen"
    Cc: Christoph Hellwig
    Tested-by: Andrew Jones
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

28 Jun, 2018

1 commit

    This patch fixes the following kernel bug, triggered by removing a
    path controlled by the dm-mpath driver while mkfs is running:

    kernel BUG at block/blk-core.c:3347!
    invalid opcode: 0000 [#1] PREEMPT SMP KASAN
    CPU: 20 PID: 24369 Comm: mkfs.ext4 Not tainted 4.18.0-rc1-dbg+ #2
    RIP: 0010:blk_end_request_all+0x68/0x70
    Call Trace:

    dm_softirq_done+0x326/0x3d0 [dm_mod]
    blk_done_softirq+0x19b/0x1e0
    __do_softirq+0x128/0x60d
    irq_exit+0x100/0x110
    smp_call_function_single_interrupt+0x90/0x330
    call_function_single_interrupt+0xf/0x20
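
    A plausible shape of the fix, per the Fixes tag above: copy the
    special payload when dm clones a request, so the clone's byte count
    matches the original and completion no longer trips the BUG_ON
    (sketch of blk_rq_prep_clone()):

    if (rq_src->rq_flags & RQF_SPECIAL_PAYLOAD) {
            rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
            rq->special_vec = rq_src->special_vec;
    }
    rq->nr_phys_segments = rq_src->nr_phys_segments;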

    Fixes: f9d03f96b988 ("block: improve handling of the magic discard payload")
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Acked-by: Mike Snitzer
    Signed-off-by: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc:
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

20 Jun, 2018

1 commit

  • Commit 0ba99ca4838b ("block: Add warning for bi_next not NULL in
    bio_endio()") breaks the dm driver. end_clone_bio() detects whether
    or not a bio is the last bio associated with a request by checking
    the .bi_next field. Commit 0ba99ca4838b clears that field before
    end_clone_bio() has had a chance to inspect that field. Hence revert
    commit 0ba99ca4838b.

    This patch avoids the following KASAN complaint, reported when
    running the srp-test software (srp-test/run_tests -c -d -r 10 -t 02-mq):

    ==================================================================
    BUG: KASAN: use-after-free in bio_advance+0x11b/0x1d0
    Read of size 4 at addr ffff8801300e06d0 by task ksoftirqd/0/9

    CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 4.18.0-rc1-dbg+ #1
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
    Call Trace:
    dump_stack+0xa4/0xf5
    print_address_description+0x6f/0x270
    kasan_report+0x241/0x360
    __asan_load4+0x78/0x80
    bio_advance+0x11b/0x1d0
    blk_update_request+0xa7/0x5b0
    scsi_end_request+0x56/0x320 [scsi_mod]
    scsi_io_completion+0x7d6/0xb20 [scsi_mod]
    scsi_finish_command+0x1c0/0x280 [scsi_mod]
    scsi_softirq_done+0x19a/0x230 [scsi_mod]
    blk_mq_complete_request+0x160/0x240
    scsi_mq_done+0x50/0x1a0 [scsi_mod]
    srp_recv_done+0x515/0x1330 [ib_srp]
    __ib_process_cq+0xa0/0xf0 [ib_core]
    ib_poll_handler+0x38/0xa0 [ib_core]
    irq_poll_softirq+0xe8/0x1f0
    __do_softirq+0x128/0x60d
    run_ksoftirqd+0x3f/0x60
    smpboot_thread_fn+0x352/0x460
    kthread+0x1c1/0x1e0
    ret_from_fork+0x24/0x30

    Allocated by task 1918:
    save_stack+0x43/0xd0
    kasan_kmalloc+0xad/0xe0
    kasan_slab_alloc+0x11/0x20
    kmem_cache_alloc+0xfe/0x350
    mempool_alloc_slab+0x15/0x20
    mempool_alloc+0xfb/0x270
    bio_alloc_bioset+0x244/0x350
    submit_bh_wbc+0x9c/0x2f0
    __block_write_full_page+0x299/0x5a0
    block_write_full_page+0x16b/0x180
    blkdev_writepage+0x18/0x20
    __writepage+0x42/0x80
    write_cache_pages+0x376/0x8a0
    generic_writepages+0xbe/0x110
    blkdev_writepages+0xe/0x10
    do_writepages+0x9b/0x180
    __filemap_fdatawrite_range+0x178/0x1c0
    file_write_and_wait_range+0x59/0xc0
    blkdev_fsync+0x46/0x80
    vfs_fsync_range+0x66/0x100
    do_fsync+0x3d/0x70
    __x64_sys_fsync+0x21/0x30
    do_syscall_64+0x77/0x230
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Freed by task 9:
    save_stack+0x43/0xd0
    __kasan_slab_free+0x137/0x190
    kasan_slab_free+0xe/0x10
    kmem_cache_free+0xd3/0x380
    mempool_free_slab+0x17/0x20
    mempool_free+0x63/0x160
    bio_free+0x81/0xa0
    bio_put+0x59/0x60
    end_bio_bh_io_sync+0x5d/0x70
    bio_endio+0x1a7/0x360
    blk_update_request+0xd0/0x5b0
    end_clone_bio+0xa3/0xd0 [dm_mod]
    bio_endio+0x1a7/0x360
    blk_update_request+0xd0/0x5b0
    scsi_end_request+0x56/0x320 [scsi_mod]
    scsi_io_completion+0x7d6/0xb20 [scsi_mod]
    scsi_finish_command+0x1c0/0x280 [scsi_mod]
    scsi_softirq_done+0x19a/0x230 [scsi_mod]
    blk_mq_complete_request+0x160/0x240
    scsi_mq_done+0x50/0x1a0 [scsi_mod]
    srp_recv_done+0x515/0x1330 [ib_srp]
    __ib_process_cq+0xa0/0xf0 [ib_core]
    ib_poll_handler+0x38/0xa0 [ib_core]
    irq_poll_softirq+0xe8/0x1f0
    __do_softirq+0x128/0x60d

    The buggy address belongs to the object at ffff8801300e0640
    which belongs to the cache bio-0 of size 200
    The buggy address is located 144 bytes inside of
    200-byte region [ffff8801300e0640, ffff8801300e0708)
    The buggy address belongs to the page:
    page:ffffea0004c03800 count:1 mapcount:0 mapping:ffff88015a563a00 index:0x0 compound_mapcount: 0
    flags: 0x8000000000008100(slab|head)
    raw: 8000000000008100 dead000000000100 dead000000000200 ffff88015a563a00
    raw: 0000000000000000 0000000000330033 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff8801300e0580: fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc
    ffff8801300e0600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    >ffff8801300e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff8801300e0700: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff8801300e0780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ==================================================================

    Cc: Kent Overstreet
    Fixes: 0ba99ca4838b ("block: Add warning for bi_next not NULL in bio_endio()")
    Acked-by: Mike Snitzer
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

07 Jun, 2018

1 commit

    blk_partition_remap() will only clear bi_partno if an actual remapping
    has happened. But flush requests et al. don't have an actual size, so
    the remapping doesn't happen and bi_partno is never cleared.
    So for stacked devices blk_partition_remap() will be called on each
    level. If (as is the case for native NVMe multipathing) one of the
    lower-level devices does _not_ support partitioning, a spurious I/O
    error is generated.
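
    A sketch of the fix in blk_partition_remap(): clear bi_partno
    unconditionally once the bio has been pointed at the whole device,
    even when there is no data to remap:

    if (bio_sectors(bio))
            bio->bi_iter.bi_sector += p->start_sect;
    bio->bi_partno = 0;     /* now cleared even for zero-size flushes */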

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     

03 Jun, 2018

1 commit

  • If we end up splitting a bio and the queue goes away between
    the initial submission and the later split submission, then we
    can block forever in blk_queue_enter() waiting for the reference
    to drop to zero. This will never happen, since we already hold
    a reference.

    Mark a split bio as already having entered the queue, so we can
    just use the live non-blocking queue enter variant.

    Thanks to Tetsuo Handa for the analysis.
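
    A sketch of the two halves of the fix (BIO_QUEUE_ENTERED and
    blk_queue_enter_live() per the description above):

    /* When splitting: the parent bio already holds a queue reference. */
    bio_set_flag(split, BIO_QUEUE_ENTERED);

    /* When the split half is resubmitted: skip the blocking wait. */
    if (bio_flagged(bio, BIO_QUEUE_ENTERED))
            blk_queue_enter_live(q);
    else if (blk_queue_enter(q, flags) < 0)
            goto fail;      /* illustrative error handling */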

    Reported-by: syzbot+c4f9cebf9d651f6e54de@syzkaller.appspotmail.com
    Signed-off-by: Jens Axboe

    Jens Axboe
     

29 May, 2018

1 commit

    This patch simplifies the timeout handling by relying on the request
    reference counting to ensure the iterator is operating on an inflight
    and truly timed-out request. Since the reference counting prevents the
    tag from being reallocated, the block layer no longer needs to prevent
    drivers from completing their requests while the timeout handler is
    operating on them: a driver completing a request is allowed to proceed
    to the next state without additional synchronization with the block
    layer.

    This also removes any need for generation sequence numbers, since the
    request's tag is prevented from being reallocated to a new request
    while timeout handling is operating on it.

    To enable this, a refcount is added to struct request so that request
    users can be sure they're operating on the same request without it
    changing while they're processing it. The request's tag won't be
    released for reuse until both the timeout handler and the completion
    are done with it.
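
    A sketch of the pattern, using the kernel's refcount_t API:

    /* Timeout iterator: only examine requests we can pin. */
    if (!refcount_inc_not_zero(&rq->ref))
            return;         /* already freed; the tag may be reused */

    /* ... inspect the request and expire it if truly timed out ... */

    if (refcount_dec_and_test(&rq->ref))
            __blk_mq_free_request(rq);      /* last ref releases the tag */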

    Signed-off-by: Keith Busch
    [hch: slight cleanups, added back submission side hctx lock, use cmpxchg
    for completions]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     

09 May, 2018

3 commits

  • Currently, struct request has four timestamp fields:

    - A start time, set at get_request time, in jiffies, used for iostats
    - An I/O start time, set at start_request time, in ktime nanoseconds,
    used for blk-stats (i.e., wbt, kyber, hybrid polling)
    - Another start time and another I/O start time, used for cfq and bfq

    These can all be consolidated into one start time and one I/O start
    time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
    request depending on the kernel config.
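
    After consolidation, struct request carries just the two ktime-based
    stamps (sketch; field names per the description above):

    struct request {
            /* ... */
            u64 start_time_ns;      /* set at get_request time, iostats */
            u64 io_start_time_ns;   /* set at start_request time, blk-stats */
            /* ... */
    };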

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • struct blk_issue_stat squashes three things into one u64:

    - The time the driver started working on a request
    - The original size of the request (for the io.low controller)
    - Flags for writeback throttling

    It turns out that on x86_64, we have a 4 byte hole in struct request
    which we can fill with the non-timestamp fields from blk_issue_stat,
    simplifying things quite a bit.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • issue_stat is going to go away, so first make writeback throttling take
    the containing request, update the internal wbt helpers accordingly, and
    change rwb->sync_cookie to be the request pointer instead of the
    issue_stat pointer. No functional change.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

08 May, 2018

2 commits

    Commit 9c40cef2b799 ("sched: Move blk_schedule_flush_plug() out of
    __schedule()") moved the blk_schedule_flush_plug() call out of the
    interrupt/preempt disabled region in the scheduler. This allows
    replacing local_irq_save/restore(flags) with local_irq_disable/enable()
    in blk_flush_plug_list().

    But it makes more sense to disable interrupts explicitly when the
    request queue is locked and to reenable them when it is unlocked. This
    shortens the interrupt-disabled section, which is important when the
    plug list contains requests for more than one queue. The comment
    claiming that interrupts must be disabled around the loop is
    misleading, as the called functions can reenable interrupts
    unconditionally anyway, and the detached disabling obfuscates the
    scope badly:

    local_irq_save(flags);
    spin_lock(q->queue_lock);
    ...
    queue_unplugged(q...);
    scsi_request_fn();
    spin_unlock_irq(q->queue_lock);

    -------------------^^^ ????

    spin_lock_irq(q->queue_lock);
    spin_unlock(q->queue_lock);
    local_irq_restore(flags);

    Aside from that, the detached interrupt disabling is a constant pain
    for PREEMPT_RT, as it requires patching and special-casing when RT is
    enabled, while with the spin_*_irq() variants this happens
    automatically.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20110622174919.025446432@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Jens Axboe

    Thomas Gleixner
     
  • Commit 2fff8a924d4c ("block: Check locking assumptions at runtime") added a
    lockdep_assert_held(q->queue_lock) which makes the WARN_ON() redundant
    because lockdep will detect and warn about context violations.

    The unconditional WARN_ON() does not provide real additional value, so it
    can be removed.

    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Jens Axboe

    Anna-Maria Gleixner
     

17 Apr, 2018

1 commit

    rq->gstate and rq->aborted_gstate are both zero before rqs are
    allocated. If we have a small timeout, then when the timer fires
    there could be rqs that were never allocated, and also rqs that
    have been allocated but not yet initialized and started. At that
    moment, rq->gstate and rq->aborted_gstate are both 0, so
    blk_mq_terminate_expired will identify the rq as timed out and
    invoke .timeout early.

    For SCSI, this causes scsi_times_out to be invoked before the
    scsi_cmnd is initialized; scsi_cmnd->device is still NULL at that
    point, and we get a crash.

    Cc: Bart Van Assche
    Cc: Tejun Heo
    Cc: Ming Lei
    Cc: Martin Steigerwald
    Cc: stable@vger.kernel.org
    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

15 Apr, 2018

1 commit

    When blk_queue_enter() waits for a queue to unfreeze, or for the
    PREEMPT_ONLY flag to be unset, do not allow the wait to be
    interrupted by a signal.

    The PREEMPT_ONLY flag was introduced later in commit 3a0a529971ec
    ("block, scsi: Make SCSI quiesce and resume work reliably"). Note the SCSI
    device is resumed asynchronously, i.e. after un-freezing userspace tasks.

    So that commit exposed the bug as a regression in v4.15. A mysterious
    SIGBUS (or -EIO) sometimes happened during the time the device was being
    resumed. Most frequently, there was no kernel log message, and we saw Xorg
    or Xwayland killed by SIGBUS.[1]

    [1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979

    Without this fix, I get an IO error in this test:

    # dd if=/dev/sda of=/dev/null iflag=direct & \
    while killall -SIGUSR1 dd; do sleep 0.1; done & \
    echo mem > /sys/power/state ; \
    sleep 5; killall dd # stop after 5 seconds

    The interruptible wait was added to blk_queue_enter in
    commit 3ef28e83ab15 ("block: generic request_queue reference counting").
    Before then, the interruptible wait was only in blk-mq, but I don't think
    it could ever have been correct.
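
    The fix makes the wait uninterruptible (a sketch of blk_queue_enter();
    the condition follows the PREEMPT_ONLY/freeze description above):

    /* was: ret = wait_event_interruptible(...); if (ret) return ret; */
    wait_event(q->mq_freeze_wq,
               (atomic_read(&q->mq_freeze_depth) == 0 &&
                (preempt || !blk_queue_preempt_only(q))) ||
               blk_queue_dying(q));
    if (blk_queue_dying(q))
            return -ENODEV;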

    Reviewed-by: Bart Van Assche
    Cc: stable@vger.kernel.org
    Signed-off-by: Alan Jenkins
    Signed-off-by: Jens Axboe

    Alan Jenkins
     

11 Apr, 2018

1 commit

  • Because blkcg_exit_queue() is now called from inside blk_cleanup_queue()
    it is no longer safe to access cgroup information during or after the
    blk_cleanup_queue() call. Hence protect the generic_make_request_checks()
    call with blk_queue_enter() / blk_queue_exit().
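
    A sketch of the resulting shape of generic_make_request():

    /* Pin the queue before the checks that may touch cgroup state... */
    if (blk_queue_enter(q, flags) < 0)
            goto fail;      /* illustrative error handling */
    ret = generic_make_request_checks(bio);
    /* ...and drop the reference once the bio has been dealt with. */
    blk_queue_exit(q);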

    Reported-by: Ming Lei
    Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller")
    Signed-off-by: Bart Van Assche
    Cc: Ming Lei
    Cc: Joseph Qi
    Signed-off-by: Jens Axboe

    Bart Van Assche