01 Oct, 2018

1 commit

  • Merge -rc6 in, for two reasons:

    1) Resolve a trivial conflict in the blk-mq-tag.c documentation
    2) A few important regression fixes went into upstream directly, so
    they aren't in the 4.20 branch.

    Signed-off-by: Jens Axboe

    * tag 'v4.19-rc6': (780 commits)
    Linux 4.19-rc6
    MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
    cpufreq: qcom-kryo: Fix section annotations
    perf/core: Add sanity check to deal with pinned event failure
    xen/blkfront: correct purging of persistent grants
    Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
    selftests/powerpc: Fix Makefiles for headers_install change
    blk-mq: I/O and timer unplugs are inverted in blktrace
    dax: Fix deadlock in dax_lock_mapping_entry()
    x86/boot: Fix kexec booting failure in the SEV bit detection code
    bcache: add separate workqueue for journal_write to avoid deadlock
    drm/amd/display: Fix Edid emulation for linux
    drm/amd/display: Fix Vega10 lightup on S3 resume
    drm/amdgpu: Fix vce work queue was not cancelled when suspend
    Revert "drm/panel: Add device_link from panel device to DRM device"
    xen/blkfront: When purging persistent grants, keep them in the buffer
    clocksource/drivers/timer-atmel-pit: Properly handle error cases
    block: fix deadline elevator drain for zoned block devices
    ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
    drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
    ...

    Signed-off-by: Jens Axboe

    Jens Axboe
     

27 Sep, 2018

5 commits

  • Instead of allowing requests that are not power management requests
    to enter the queue in runtime suspended status (RPM_SUSPENDED), make
    the blk_get_request() caller block. This change fixes a starvation
    issue: it is now guaranteed that power management requests will be
    executed no matter how many blk_get_request() callers are waiting.
    For blk-mq, instead of maintaining the q->nr_pending counter, rely
    on q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a
    request finishes instead of only if the queue depth drops to zero.
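
    A rough sketch of the resulting blk_queue_enter() logic (helper and flag
    names reflect the interfaces of that era and are illustrative, not a
    verbatim copy of the patch):

    int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
    {
        const bool pm = flags & BLK_MQ_REQ_PREEMPT;

        while (true) {
            bool success = false;

            rcu_read_lock();
            if (percpu_ref_tryget_live(&q->q_usage_counter)) {
                /* PM requests may enter a pm-only (suspended) queue;
                 * everything else has to wait for the resume. */
                if (pm || !blk_queue_pm_only(q))
                    success = true;
                else
                    percpu_ref_put(&q->q_usage_counter);
            }
            rcu_read_unlock();

            if (success)
                return 0;

            if (flags & BLK_MQ_REQ_NOWAIT)
                return -EBUSY;

            wait_event(q->mq_freeze_wq,
                       (atomic_read(&q->mq_freeze_depth) == 0 &&
                        (pm || !blk_queue_pm_only(q))) ||
                       blk_queue_dying(q));
            if (blk_queue_dying(q))
                return -ENODEV;
        }
    }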

    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: Alan Stern
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Instead of scheduling runtime resume of a request queue after a
    request has been queued, schedule asynchronous resume during request
    allocation. The new pm_request_resume() calls occur after
    blk_queue_enter() has increased the q_usage_counter request queue
    member. This change is needed for a later patch that will make request
    allocation block while the queue status is not RPM_ACTIVE.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: Alan Stern
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Move the pm_request_resume() and pm_runtime_mark_last_busy() calls into
    two new functions and thereby separate legacy block layer code from code
    that works for both the legacy block layer and blk-mq. A later patch will
    add calls to the new functions in the blk-mq code.
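
    The two helpers end up roughly as below (sketch modelled on the
    block/blk-pm.h of that period; treat the exact bodies as illustrative):

    #ifdef CONFIG_PM
    static inline void blk_pm_request_resume(struct request_queue *q)
    {
        if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
                       q->rpm_status == RPM_SUSPENDING))
            pm_request_resume(q->dev);
    }

    static inline void blk_pm_mark_last_busy(struct request *rq)
    {
        if (rq->q->dev && !(rq->rq_flags & RQF_PM))
            pm_runtime_mark_last_busy(rq->q->dev);
    }
    #else
    static inline void blk_pm_request_resume(struct request_queue *q) {}
    static inline void blk_pm_mark_last_busy(struct request *rq) {}
    #endif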

    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Martin K. Petersen
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: Alan Stern
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • The RQF_PREEMPT flag is used for three purposes:
    - In the SCSI core, for making sure that power management requests
    are executed even if a device is in the "quiesced" state.
    - For domain validation by SCSI drivers that use the parallel port.
    - In the IDE driver, for IDE preempt requests.
    Rename "preempt-only" into "pm-only" because the primary purpose of
    this mode is power management. Since the power management core may
    but does not have to resume a runtime suspended device before
    performing system-wide suspend and since a later patch will set
    "pm-only" mode as long as a block device is runtime suspended, make
    it possible to set "pm-only" mode from more than one context. Since
    with this change scsi_device_quiesce() is no longer idempotent, make
    that function return early if it is called for a quiesced queue.
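
    A sketch of the counter-based interface that replaces the single flag
    (illustrative, not a verbatim copy of the patch):

    void blk_set_pm_only(struct request_queue *q)
    {
        atomic_inc(&q->pm_only);
    }
    EXPORT_SYMBOL_GPL(blk_set_pm_only);

    void blk_clear_pm_only(struct request_queue *q)
    {
        int pm_only;

        pm_only = atomic_dec_return(&q->pm_only);
        WARN_ON_ONCE(pm_only < 0);
        if (pm_only == 0)
            wake_up_all(&q->mq_freeze_wq);
    }
    EXPORT_SYMBOL_GPL(blk_clear_pm_only);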

    Signed-off-by: Bart Van Assche
    Acked-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Cc: Jianchao Wang
    Cc: Johannes Thumshirn
    Cc: Alan Stern
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Move the code for runtime power management from blk-core.c into the
    new source file blk-pm.c. Move the corresponding declarations from
    <linux/blkdev.h> into <linux/blk-pm.h>. For CONFIG_PM=n, leave out
    the declarations of the functions that are not used in that mode.
    This patch not only reduces the number of #ifdefs in the block layer
    core code but also reduces the size of the header file <linux/blkdev.h>,
    and hence should help to reduce the build time of the Linux kernel
    if CONFIG_PM is not defined.
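
    Illustratively, the new header ends up along these lines (abbreviated,
    not the complete header):

    /* include/linux/blk-pm.h */
    #ifdef CONFIG_PM
    extern void blk_pm_runtime_init(struct request_queue *q, struct device *dev);
    extern int blk_pre_runtime_suspend(struct request_queue *q);
    extern void blk_post_runtime_suspend(struct request_queue *q, int err);
    extern void blk_pre_runtime_resume(struct request_queue *q);
    extern void blk_post_runtime_resume(struct request_queue *q, int err);
    extern void blk_set_runtime_active(struct request_queue *q);
    #else
    static inline void blk_pm_runtime_init(struct request_queue *q,
                                           struct device *dev) {}
    #endif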

    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: Alan Stern
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

22 Sep, 2018

1 commit

  • Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
    updating properly on 4.18. This is because we started using ktime to
    track elapsed time, and we convert nanoseconds to jiffies when we update
    the partition counter. However, this gets rounded down, so any I/Os that
    take less than a jiffy are not accounted for. Previously in this case,
    the value of jiffies would sometimes increment while we were doing I/O,
    so at least some I/Os were accounted for.

    Let's convert the stats to use nanoseconds internally. We still report
    milliseconds as before, now more accurately than ever. The value is
    still truncated to 32 bits for backwards compatibility.
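
    The shape of the change, illustratively (field and helper names are from
    memory and may differ from the actual patch; sgrp is the stat group
    index):

    /* before: each I/O's elapsed time was converted to jiffies on the spot,
     * so anything shorter than one jiffy rounded down to zero */
    part_stat_add(cpu, part, ticks[sgrp], nsecs_to_jiffies(now_ns - start_ns));

    /* after: accumulate nanoseconds, convert only when reporting */
    part_stat_add(cpu, part, nsecs[sgrp], now_ns - start_ns);

    /* diskstats/sysfs still report milliseconds, truncated to 32 bits */
    msecs = (unsigned int)div_u64(part_stat_read(part, nsecs[sgrp]),
                                  NSEC_PER_MSEC);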

    Fixes: 522a777566f5 ("block: consolidate struct request timestamp fields")
    Cc: stable@vger.kernel.org
    Reported-by: Klaus Kusche
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

06 Sep, 2018

1 commit

  • It is possible to call fsync on a read-only handle (for example, fsck.ext2
    does it when doing a read-only check), and this call results in a kernel
    warning.

    The patch b089cfd95d32 ("block: don't warn for flush on read-only device")
    attempted to disable the warning, but it is buggy and does not work
    (op_is_flush() tests the flag bits, but bio_op() strips the flags off).
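
    The difference in a nutshell (illustrative sketch):

    /* buggy: bio_op() masks everything but the opcode, so REQ_PREFLUSH and
     * REQ_FUA are already gone by the time op_is_flush() looks at them */
    if (op_is_flush(bio_op(bio)) && !bio_sectors(bio))
        return false;

    /* working: test the flag bits on ->bi_opf itself */
    if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
        return false;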

    Signed-off-by: Mikulas Patocka
    Fixes: 721c7fc701c7 ("block: fail op_is_write() requests to read-only partitions")
    Cc: stable@vger.kernel.org # 4.18
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     

23 Aug, 2018

1 commit

  • Pull more block updates from Jens Axboe:

    - Set of bcache fixes and changes (Coly)

    - The flush warn fix (me)

    - Small series of BFQ fixes (Paolo)

    - wbt hang fix (Ming)

    - blktrace fix (Steven)

    - blk-mq hardware queue count update fix (Jianchao)

    - Various little fixes

    * tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits)
    block/DAC960.c: make some arrays static const, shrinks object size
    blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
    blk-mq: init hctx sched after update ctx and hctx mapping
    block: remove duplicate initialization
    tracing/blktrace: Fix to allow setting same value
    pktcdvd: fix setting of 'ret' error return for a few cases
    block: change return type to bool
    block, bfq: return nbytes and not zero from struct cftype .write() method
    block, bfq: improve code of bfq_bfqq_charge_time
    block, bfq: reduce write overcharge
    block, bfq: always update the budget of an entity when needed
    block, bfq: readd missing reset of parent-entity service
    blk-wbt: fix IO hang in wbt_wait()
    block: don't warn for flush on read-only device
    bcache: add the missing comments for smp_mb()/smp_wmb()
    bcache: remove unnecessary space before ioctl function pointer arguments
    bcache: add missing SPDX header
    bcache: move open brace at end of function definitions to next line
    bcache: add static const prefix to char * array declarations
    bcache: fix code comments style
    ...

    Linus Torvalds
     

18 Aug, 2018

1 commit

  • This patch removes the duplicate initialization of q->queue_head
    in blk_alloc_queue_node(). Dropping the second initialization keeps
    the initialization order the same as the declaration order of the
    members in struct request_queue.

    Reviewed-by: Omar Sandoval
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     

15 Aug, 2018

2 commits

  • Pull block updates from Jens Axboe:
    "First pull request for this merge window, there will also be a
    followup request with some stragglers.

    This pull request contains:

    - Fix for a thundering herd issue in the wbt block code (Anchal
    Agarwal)

    - A few NVMe pull requests:
    * Improved tracepoints (Keith)
    * Larger inline data support for RDMA (Steve Wise)
    * RDMA setup/teardown fixes (Sagi)
    * Effects log support for NVMe target (Chaitanya Kulkarni)
    * Buffered IO support for NVMe target (Chaitanya Kulkarni)
    * TP4004 (ANA) support (Christoph)
    * Various NVMe fixes

    - Block io-latency controller support. Much needed support for
    properly containing block devices. (Josef)

    - Series improving how we handle sense information on the stack
    (Kees)

    - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

    - Zoned device support for null_blk (Matias)

    - AIX partition fixes (Mauricio Faria de Oliveira)

    - DIF checksum code made generic (Max Gurtovoy)

    - Add support for discard in iostats (Michael Callahan / Tejun)

    - Set of updates for BFQ (Paolo)

    - Removal of async write support for bsg (Christoph)

    - Bio page dirtying and clone fixups (Christoph)

    - Set of bcache fix/changes (via Coly)

    - Series improving blk-mq queue setup/teardown speed (Ming)

    - Series improving merging performance on blk-mq (Ming)

    - Lots of other fixes and cleanups from a slew of folks"

    * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
    blkcg: Make blkg_root_lookup() work for queues in bypass mode
    bcache: fix error setting writeback_rate through sysfs interface
    null_blk: add lock drop/acquire annotation
    Blk-throttle: reduce tail io latency when iops limit is enforced
    block: paride: pd: mark expected switch fall-throughs
    block: Ensure that a request queue is dissociated from the cgroup controller
    block: Introduce blk_exit_queue()
    blkcg: Introduce blkg_root_lookup()
    block: Remove two superfluous #include directives
    blk-mq: count the hctx as active before allocating tag
    block: bvec_nr_vecs() returns value for wrong slab
    bcache: trivial - remove tailing backslash in macro BTREE_FLAG
    bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
    bcache: set max writeback rate when I/O request is idle
    bcache: add code comments for bset.c
    bcache: fix mistaken comments in request.c
    bcache: fix mistaken code comments in bcache.h
    bcache: add a comment in super.c
    bcache: avoid unncessary cache prefetch bch_btree_node_get()
    bcache: display rate debug parameters to 0 when writeback is not running
    ...

    Linus Torvalds
     
  • Don't warn for a flush issued to a read-only device. It's not strictly
    a write command, as it doesn't change any on-media data by itself.

    Reported-by: Stefan Agner
    Fixes: 721c7fc701c7 ("block: fail op_is_write() requests to read-only partitions")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Aug, 2018

1 commit

  • It turns out that commit 721c7fc701c7 ("block: fail op_is_write()
    requests to read-only partitions"), while obviously correct, causes
    problems for some older lvm2 installations.

    The reason is that the lvm snapshotting will continue to write to the
    snapshot COW volume, even after the volume has been marked read-only.
    End result: snapshot failure.

    This has actually been fixed in newer version of the lvm2 tool, but the
    old tools still exist, and the breakage was reported both in the kernel
    bugzilla and in the Debian bugzilla:

    https://bugzilla.kernel.org/show_bug.cgi?id=200439
    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900442

    The lvm2 fix is here

    https://sourceware.org/git/?p=lvm2.git;a=commit;h=a6fdb9d9d70f51c49ad11a87ab4243344e6701a3

    but until everybody has updated to recent versions, we'll have to weaken
    the "never write to read-only partitions" check. It now allows the
    write to happen, but causes a warning, something like this:

    generic_make_request: Trying to write to read-only block-device dm-3 (partno X)
    Modules linked in: nf_tables xt_cgroup xt_owner kvm_intel iwlmvm kvm irqbypass iwlwifi
    CPU: 1 PID: 77 Comm: kworker/1:1 Not tainted 4.17.9-gentoo #3
    Hardware name: LENOVO 20B6A019RT/20B6A019RT, BIOS GJET91WW (2.41 ) 09/21/2016
    Workqueue: ksnaphd do_metadata
    RIP: 0010:generic_make_request_checks+0x4ac/0x600
    ...
    Call Trace:
    generic_make_request+0x64/0x400
    submit_bio+0x6c/0x140
    dispatch_io+0x287/0x430
    sync_io+0xc3/0x120
    dm_io+0x1f8/0x220
    do_metadata+0x1d/0x30
    process_one_work+0x1b9/0x3e0
    worker_thread+0x2b/0x3c0
    kthread+0x113/0x130
    ret_from_fork+0x35/0x40

    Note that this is a "revert" in behavior only. I'm leaving alone the
    actual code cleanups in commit 721c7fc701c7, but letting the previously
    uncaught request go through with a warning instead of stopping it.

    Fixes: 721c7fc701c7 ("block: fail op_is_write() requests to read-only partitions")
    Reported-and-tested-by: WGH
    Acked-by: Mike Snitzer
    Cc: Sagi Grimberg
    Cc: Ilya Dryomov
    Cc: Jens Axboe
    Cc: Zdenek Kabelac
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Aug, 2018

1 commit

  • Runtime PM isn't ready for blk-mq yet, and commit 765e40b675a9 ("block:
    disable runtime-pm for blk-mq") tried to disable it. Unfortunately,
    that approach doesn't take effect, since user space can still switch
    it on via 'echo auto > /sys/block/sdN/device/power/control'.

    This patch disables runtime PM for blk-mq for real, via pm_runtime_disable(),
    and fixes all kinds of PM-related kernel crashes.
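
    The essence of the change, as a sketch (the legacy setup shown is as
    recalled and may not match the patch line for line):

    void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
    {
        /* blk-mq is not ready for runtime PM; keep it disabled even if
         * user space writes "auto" to .../power/control */
        if (q->mq_ops) {
            pm_runtime_disable(dev);
            return;
        }

        q->dev = dev;
        q->rpm_status = RPM_ACTIVE;
        pm_runtime_set_autosuspend_delay(q->dev, -1);
        pm_runtime_use_autosuspend(q->dev);
    }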

    Cc: Tomas Janousek
    Cc: Przemek Socha
    Cc: Alan Stern
    Cc:
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Tested-by: Patrick Steinhardt
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

30 Jul, 2018

1 commit

  • We found a memory use-after-free issue in __blk_drain_queue()
    on kernel 4.14. After reading the latest 4.18-rc6 code we
    believe it has the same problem.

    Memory is allocated for q->fq in blk_init_allocated_queue().
    If the elevator init function returns an error, the failure
    path frees q->fq.

    __blk_drain_queue() then uses that memory after it has been freed,
    which leads to unpredictable behaviour.

    The patch sets q->fq to NULL in the failure path of
    blk_init_allocated_queue().
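
    The fix itself is small; the error path ends up roughly like this
    (sketch):

    /* error path of blk_init_allocated_queue(), with the added line */
    out_exit_flush_rq:
        if (q->exit_rq_fn)
            q->exit_rq_fn(q, q->fq->flush_rq);
    out_free_flush_queue:
        blk_free_flush_queue(q->fq);
        q->fq = NULL;   /* added: __blk_drain_queue() must not see freed memory */
        return -ENOMEM;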

    Fixes: commit 7c94e1c157a2 ("block: introduce blk_flush_queue to drive flush machinery")
    Cc:
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: xiao jin
    Signed-off-by: Jens Axboe

    xiao jin
     

18 Jul, 2018

1 commit

  • Add and use a new op_stat_group() function for indexing partition stat
    fields rather than indexing them by rq_data_dir() or bio_data_dir().
    This function works similarly to op_is_sync() in that it takes the
    request::cmd_flags or bio::bi_opf flags and determines which stats
    should get updated.

    In addition, the second parameter to generic_start_io_acct() and
    generic_end_io_acct() is now a REQ_OP rather than simply a read or
    write bit and it uses op_stat_group() on the parameter to determine
    the stat group.

    Note that the partition in_flight counts are not part of the per-cpu
    statistics and as such are not indexed via this function; they are now
    indexed by op_is_write().
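
    A sketch of the helper and of how accounting code indexes the stat
    arrays with it (the account_io() wrapper is purely illustrative):

    static inline int op_stat_group(unsigned int op)
    {
        return op_is_write(op);
    }

    static void account_io(struct hd_struct *part, struct bio *bio, int cpu)
    {
        const int sgrp = op_stat_group(bio_op(bio));

        part_stat_inc(cpu, part, ios[sgrp]);
        part_stat_add(cpu, part, sectors[sgrp], bio_sectors(bio));
    }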

    tj: Refreshed on top of v4.17. Updated to pass around REQ_OP.

    Signed-off-by: Michael Callahan
    Signed-off-by: Tejun Heo
    Cc: Minchan Kim
    Cc: Dan Williams
    Cc: Joshua Morris
    Cc: Philipp Reisner
    Cc: Matias Bjorling
    Cc: Kent Overstreet
    Cc: Alasdair Kergon
    Signed-off-by: Jens Axboe

    Michael Callahan
     

09 Jul, 2018

5 commits

  • With gcc 4.9.0 and 7.3.0:

    block/blk-core.c: In function 'blk_pm_allow_request':
    block/blk-core.c:2747:2: warning: enumeration value 'RPM_ACTIVE' not handled in switch [-Wswitch]
    switch (rq->q->rpm_status) {
    ^

    Convert the return statement below the switch() block into a default
    case to fix this.
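
    After the conversion the switch handles every enum value, with
    RPM_ACTIVE (and any future states) falling through to the default case;
    roughly:

    static struct request *blk_pm_allow_request(struct request *rq)
    {
        switch (rq->q->rpm_status) {
        case RPM_RESUMING:
        case RPM_SUSPENDING:
            return rq->rq_flags & RQF_PM ? rq : NULL;
        case RPM_SUSPENDED:
            return NULL;
        default:
            return rq;
        }
    }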

    Fixes: e4f36b249b4d4e75 ("block: fix peeking requests during PM")
    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Geert Uytterhoeven
     
  • We don't really need to save this stuff in the core block code, we can
    just pass the bio back into the helpers later on to derive the same
    flags and update the rq->wbt_flags appropriately.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • blkcg-qos is going to do essentially what wbt does, only on a cgroup
    basis. Break out the common code that will be shared between blkcg-qos
    and wbt into blk-rq-qos.* so they can both utilize the same
    infrastructure.
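
    The shared infrastructure is essentially a per-queue chain of policies,
    each providing a set of hooks; very roughly (member names and hook
    signatures abbreviated from memory, illustrative only):

    struct rq_qos {
        struct rq_qos_ops *ops;
        struct request_queue *q;
        enum rq_qos_id id;
        struct rq_qos *next;    /* singly linked list per queue */
    };

    struct rq_qos_ops {
        void (*throttle)(struct rq_qos *rqos, struct bio *bio);
        void (*issue)(struct rq_qos *rqos, struct request *rq);
        void (*requeue)(struct rq_qos *rqos, struct request *rq);
        void (*done)(struct rq_qos *rqos, struct request *rq);
        void (*exit)(struct rq_qos *rqos);
    };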

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • The payload of struct request is stored in the request.bio chain if
    the RQF_SPECIAL_PAYLOAD flag is not set and in request.special_vec if
    RQF_SPECIAL_PAYLOAD has been set. However, blk_update_request()
    iterates over req->bio whether or not RQF_SPECIAL_PAYLOAD has been
    set. Additionally, the RQF_SPECIAL_PAYLOAD flag is ignored by
    blk_rq_bytes() which means that the value returned by that function
    is incorrect if the RQF_SPECIAL_PAYLOAD flag has been set. It is not
    clear to me whether this is an oversight or whether this happened on
    purpose. Anyway, document that it is known that both functions ignore
    RQF_SPECIAL_PAYLOAD. See also commit f9d03f96b988 ("block: improve
    handling of the magic discard payload").

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • SCSI probing may synchronously create and destroy a lot of request_queues
    for non-existent devices. Any synchronize_rcu() in the queue creation or
    destroy path may introduce long latencies during booting; see the
    detailed description in the comment of blk_register_queue().

    This patch removes one synchronize_rcu() inside blk_cleanup_queue()
    for this case. Commit c2856ae2f315d75 ("blk-mq: quiesce queue before
    freeing queue") needs synchronize_rcu() to implement
    blk_mq_quiesce_queue(), but when the queue isn't initialized that
    isn't necessary, since only pass-through requests are involved and
    the original scsi_execute() issue doesn't apply.

    Without this patch and the previous one, it may take more than 20
    seconds for virtio-scsi to complete disk probing. With the two
    patches, that time drops to less than 100ms.
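
    The gist of the change (sketch):

    void blk_cleanup_queue(struct request_queue *q)
    {
        /* ... mark the queue dying and drain it as before ... */

        /*
         * Quiescing (and its synchronize_rcu()) is only needed for queues
         * that completed initialization; the short-lived queues created
         * during SCSI probing never get that far and only carry
         * pass-through requests.
         */
        if (q->mq_ops && blk_queue_init_done(q))
            blk_mq_quiesce_queue(q);

        /* ... free resources and drop the final reference ... */
    }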

    Fixes: c2856ae2f315d75 ("blk-mq: quiesce queue before freeing queue")
    Reported-by: Andrew Jones
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: "Martin K. Petersen"
    Cc: Christoph Hellwig
    Tested-by: Andrew Jones
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

28 Jun, 2018

1 commit

  • This patch ensures that removing a path controlled by the dm-mpath driver
    while mkfs is running no longer triggers the following kernel bug:

    kernel BUG at block/blk-core.c:3347!
    invalid opcode: 0000 [#1] PREEMPT SMP KASAN
    CPU: 20 PID: 24369 Comm: mkfs.ext4 Not tainted 4.18.0-rc1-dbg+ #2
    RIP: 0010:blk_end_request_all+0x68/0x70
    Call Trace:

    dm_softirq_done+0x326/0x3d0 [dm_mod]
    blk_done_softirq+0x19b/0x1e0
    __do_softirq+0x128/0x60d
    irq_exit+0x100/0x110
    smp_call_function_single_interrupt+0x90/0x330
    call_function_single_interrupt+0xf/0x20

    Fixes: f9d03f96b988 ("block: improve handling of the magic discard payload")
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Acked-by: Mike Snitzer
    Signed-off-by: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc:
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

20 Jun, 2018

1 commit

  • Commit 0ba99ca4838b ("block: Add warning for bi_next not NULL in
    bio_endio()") breaks the dm driver. end_clone_bio() detects whether
    or not a bio is the last bio associated with a request by checking
    the .bi_next field. Commit 0ba99ca4838b clears that field before
    end_clone_bio() has had a chance to inspect that field. Hence revert
    commit 0ba99ca4838b.

    With this patch, KASAN no longer reports the following complaint when
    running the srp-test software (srp-test/run_tests -c -d -r 10 -t 02-mq):

    ==================================================================
    BUG: KASAN: use-after-free in bio_advance+0x11b/0x1d0
    Read of size 4 at addr ffff8801300e06d0 by task ksoftirqd/0/9

    CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 4.18.0-rc1-dbg+ #1
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
    Call Trace:
    dump_stack+0xa4/0xf5
    print_address_description+0x6f/0x270
    kasan_report+0x241/0x360
    __asan_load4+0x78/0x80
    bio_advance+0x11b/0x1d0
    blk_update_request+0xa7/0x5b0
    scsi_end_request+0x56/0x320 [scsi_mod]
    scsi_io_completion+0x7d6/0xb20 [scsi_mod]
    scsi_finish_command+0x1c0/0x280 [scsi_mod]
    scsi_softirq_done+0x19a/0x230 [scsi_mod]
    blk_mq_complete_request+0x160/0x240
    scsi_mq_done+0x50/0x1a0 [scsi_mod]
    srp_recv_done+0x515/0x1330 [ib_srp]
    __ib_process_cq+0xa0/0xf0 [ib_core]
    ib_poll_handler+0x38/0xa0 [ib_core]
    irq_poll_softirq+0xe8/0x1f0
    __do_softirq+0x128/0x60d
    run_ksoftirqd+0x3f/0x60
    smpboot_thread_fn+0x352/0x460
    kthread+0x1c1/0x1e0
    ret_from_fork+0x24/0x30

    Allocated by task 1918:
    save_stack+0x43/0xd0
    kasan_kmalloc+0xad/0xe0
    kasan_slab_alloc+0x11/0x20
    kmem_cache_alloc+0xfe/0x350
    mempool_alloc_slab+0x15/0x20
    mempool_alloc+0xfb/0x270
    bio_alloc_bioset+0x244/0x350
    submit_bh_wbc+0x9c/0x2f0
    __block_write_full_page+0x299/0x5a0
    block_write_full_page+0x16b/0x180
    blkdev_writepage+0x18/0x20
    __writepage+0x42/0x80
    write_cache_pages+0x376/0x8a0
    generic_writepages+0xbe/0x110
    blkdev_writepages+0xe/0x10
    do_writepages+0x9b/0x180
    __filemap_fdatawrite_range+0x178/0x1c0
    file_write_and_wait_range+0x59/0xc0
    blkdev_fsync+0x46/0x80
    vfs_fsync_range+0x66/0x100
    do_fsync+0x3d/0x70
    __x64_sys_fsync+0x21/0x30
    do_syscall_64+0x77/0x230
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Freed by task 9:
    save_stack+0x43/0xd0
    __kasan_slab_free+0x137/0x190
    kasan_slab_free+0xe/0x10
    kmem_cache_free+0xd3/0x380
    mempool_free_slab+0x17/0x20
    mempool_free+0x63/0x160
    bio_free+0x81/0xa0
    bio_put+0x59/0x60
    end_bio_bh_io_sync+0x5d/0x70
    bio_endio+0x1a7/0x360
    blk_update_request+0xd0/0x5b0
    end_clone_bio+0xa3/0xd0 [dm_mod]
    bio_endio+0x1a7/0x360
    blk_update_request+0xd0/0x5b0
    scsi_end_request+0x56/0x320 [scsi_mod]
    scsi_io_completion+0x7d6/0xb20 [scsi_mod]
    scsi_finish_command+0x1c0/0x280 [scsi_mod]
    scsi_softirq_done+0x19a/0x230 [scsi_mod]
    blk_mq_complete_request+0x160/0x240
    scsi_mq_done+0x50/0x1a0 [scsi_mod]
    srp_recv_done+0x515/0x1330 [ib_srp]
    __ib_process_cq+0xa0/0xf0 [ib_core]
    ib_poll_handler+0x38/0xa0 [ib_core]
    irq_poll_softirq+0xe8/0x1f0
    __do_softirq+0x128/0x60d

    The buggy address belongs to the object at ffff8801300e0640
    which belongs to the cache bio-0 of size 200
    The buggy address is located 144 bytes inside of
    200-byte region [ffff8801300e0640, ffff8801300e0708)
    The buggy address belongs to the page:
    page:ffffea0004c03800 count:1 mapcount:0 mapping:ffff88015a563a00 index:0x0 compound_mapcount: 0
    flags: 0x8000000000008100(slab|head)
    raw: 8000000000008100 dead000000000100 dead000000000200 ffff88015a563a00
    raw: 0000000000000000 0000000000330033 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff8801300e0580: fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc
    ffff8801300e0600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    >ffff8801300e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff8801300e0700: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff8801300e0780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ==================================================================

    Cc: Kent Overstreet
    Fixes: 0ba99ca4838b ("block: Add warning for bi_next not NULL in bio_endio()")
    Acked-by: Mike Snitzer
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

07 Jun, 2018

1 commit

  • blk_partition_remap() will only clear bi_partno if an actual remapping
    has happened. But flush requests et al don't have an actual size, so
    the remapping doesn't happen and bi_partno is never cleared.
    So for stacked devices blk_partition_remap() will be called on each level.
    If (as is the case for native nvme multipathing) one of the lower-level
    devices does _not_ support partitioning, a spurious I/O error is generated.
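
    The fix is to clear bi_partno unconditionally rather than only when
    sectors were remapped; very roughly, inside blk_partition_remap():

    /* remap the sector only when there is data to remap ... */
    if (bio_sectors(bio))
        bio->bi_iter.bi_sector += p->start_sect;

    /*
     * ... but always drop the partition number, even for bios without a
     * payload (e.g. flushes), so that stacked drivers don't try to remap
     * the same bio again on the next level.
     */
    bio->bi_partno = 0;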

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     

03 Jun, 2018

1 commit

  • If we end up splitting a bio and the queue goes away between
    the initial submission and the later split submission, then we
    can block forever in blk_queue_enter() waiting for the reference
    to drop to zero. This will never happen, since we already hold
    a reference.

    Mark a split bio as already having entered the queue, so we can
    just use the live non-blocking queue enter variant.
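
    The mechanism, as a sketch (flag and helper names as recalled from that
    kernel; treat as illustrative):

    /* in blk_queue_split(): the remainder of a split bio already holds a
     * queue reference taken by the original submission */
    bio_set_flag(bio, BIO_QUEUE_ENTERED);

    /* in generic_make_request(): */
    if (bio_flagged(bio, BIO_QUEUE_ENTERED))
        blk_queue_enter_live(q);         /* take the ref without blocking */
    else
        ret = blk_queue_enter(q, flags); /* may block or fail as before */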

    Thanks to Tetsuo Handa for the analysis.

    Reported-by: syzbot+c4f9cebf9d651f6e54de@syzkaller.appspotmail.com
    Signed-off-by: Jens Axboe

    Jens Axboe
     

29 May, 2018

1 commit

  • This patch simplifies the timeout handling by relying on the request
    reference counting to ensure the iterator is operating on an inflight
    and truly timed out request. Since the reference counting prevents the
    tag from being reallocated, the block layer no longer needs to prevent
    drivers from completing their requests while the timeout handler is
    operating on it: a driver completing a request is allowed to proceed to
    the next state without additional synchronization with the block layer.

    This also removes any need for generation sequence numbers, since the
    request cannot be reallocated as a new request while timeout handling
    is operating on it.

    To enable this, a refcount is added to struct request so that request
    users can be sure they're operating on the same request without it
    changing while they're processing it. The request's tag won't be
    released for reuse until both the timeout handler and the completion
    are done with it.
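
    In outline (simplified; the exact call sites and return types differ):

    struct request {
        /* ... */
        refcount_t ref;     /* the tag is not reused until this hits zero */
    };

    /* timeout iterator: only operate on requests we can pin */
    static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
                                     struct request *rq, void *priv,
                                     bool reserved)
    {
        if (!refcount_inc_not_zero(&rq->ref))
            return;                     /* already being freed */

        if (blk_mq_req_expired(rq, priv))
            blk_mq_rq_timed_out(rq, reserved);

        if (refcount_dec_and_test(&rq->ref))
            __blk_mq_free_request(rq);  /* last user releases the tag */
    }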

    Signed-off-by: Keith Busch
    [hch: slight cleanups, added back submission side hctx lock, use cmpxchg
    for completions]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     

09 May, 2018

3 commits

  • Currently, struct request has four timestamp fields:

    - A start time, set at get_request time, in jiffies, used for iostats
    - An I/O start time, set at start_request time, in ktime nanoseconds,
    used for blk-stats (i.e., wbt, kyber, hybrid polling)
    - Another start time and another I/O start time, used for cfq and bfq

    These can all be consolidated into one start time and one I/O start
    time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
    request depending on the kernel config.
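
    After the consolidation, struct request carries just two ktime-based
    timestamps (sketch):

    struct request {
        /* ... */
        u64 start_time_ns;      /* set at allocation time, drives iostats */
        u64 io_start_time_ns;   /* set when the driver starts the request,
                                   drives wbt/kyber/hybrid-polling stats */
        /* ... */
    };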

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • struct blk_issue_stat squashes three things into one u64:

    - The time the driver started working on a request
    - The original size of the request (for the io.low controller)
    - Flags for writeback throttling

    It turns out that on x86_64, we have a 4 byte hole in struct request
    which we can fill with the non-timestamp fields from blk_issue_stat,
    simplifying things quite a bit.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • issue_stat is going to go away, so first make writeback throttling take
    the containing request, update the internal wbt helpers accordingly, and
    change rwb->sync_cookie to be the request pointer instead of the
    issue_stat pointer. No functional change.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval