12 Feb, 2018

1 commit

  • This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

    with de-mangling cleanups yet to come.

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do. But the keyword here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
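
    As a rough illustration (the surrounding code is invented), the script
    rewrites kernel-internal uses of the POLL* names into the EPOLL* names
    and nothing else; the separate __poll_t annotation work is part of Al's
    series pulled below (31 Jan):

        /* before the script */
        if (demo_data_ready)
                mask |= POLLIN | POLLRDNORM;

        /* after the script */
        if (demo_data_ready)
                mask |= EPOLLIN | EPOLLRDNORM;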
     

08 Feb, 2018

1 commit

  • Commit a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
    RQF_FLUSH_SEQ") makes all non-flush re-prepared requests for a device
    be re-inserted into the active I/O scheduler for that device. As a
    consequence, I/O schedulers may get the same request inserted again,
    even several times, without a finish_request invoked on that request
    before each re-insertion.

    This fact is the cause of the failure reported in [1]. For an I/O
    scheduler, every re-insertion of the same re-prepared request is
    equivalent to the insertion of a new request. For schedulers like
    mq-deadline or kyber, this fact causes no harm. In contrast, it
    confuses a stateful scheduler like BFQ, which keeps state for an I/O
    request, until the finish_request hook is invoked on the request. In
    particular, BFQ may get stuck, waiting forever for the number of
    request dispatches, of the same request, to be balanced by an equal
    number of request completions (while there will be one completion for
    that request). In this state, BFQ may refuse to serve I/O requests
    from other bfq_queues. The hang reported in [1] then follows.

    However, the above re-prepared requests undergo a requeue, thus the
    requeue_request hook of the active elevator is invoked for these
    requests, if set. This commit then addresses the above issue by
    properly implementing the hook requeue_request in BFQ.

    [1] https://marc.info/?l=linux-block&m=151211117608676

    Reported-by: Ivan Kozik
    Reported-by: Alban Browaeys
    Tested-by: Mike Galbraith
    Signed-off-by: Paolo Valente
    Signed-off-by: Serena Ziviani
    Signed-off-by: Jens Axboe

    Paolo Valente
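
    A minimal sketch of the idea behind the fix, with made-up names rather
    than the actual BFQ code: a stateful scheduler has to treat a requeue
    like a completion for its dispatch accounting, otherwise the same
    request is counted twice once it is re-inserted.

        /* Hypothetical per-request scheduler state, kept from insertion
         * until finish_request (or, with this fix, until requeue_request). */
        struct sched_rq_state {
                bool dispatched;        /* already handed to the driver? */
        };

        /* Called from the elevator's requeue_request hook. */
        static void demo_requeue_request(struct sched_rq_state *st,
                                         unsigned int *in_flight)
        {
                if (st->dispatched) {
                        st->dispatched = false;
                        (*in_flight)--; /* balance the earlier dispatch */
                }
        }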
     

07 Feb, 2018

2 commits

  • The classic error injection mechanism, should_fail_request(), does not
    support use cases where more information is required (from the entire
    struct bio, for example).

    To that end, this patch introduces should_fail_bio(), which calls
    should_fail_request() under the hood but provides a convenient
    place for kprobes to hook into if they require the entire struct bio.
    This patch also replaces some existing calls to should_fail_request()
    with should_fail_bio() with no degradation in performance.

    Signed-off-by: Howard McLauchlan
    Signed-off-by: Jens Axboe

    Howard McLauchlan
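
    A hedged sketch of the shape of such a wrapper (details approximate,
    not necessarily the exact upstream code): should_fail_bio() takes the
    whole bio and simply forwards to the classic check, giving probes a
    single attach point with full context.

        static noinline int should_fail_bio(struct bio *bio)
        {
                if (should_fail_request(&bio->bi_disk->part0,
                                        bio->bi_iter.bi_size))
                        return -EIO;
                return 0;
        }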
     
  • Mikulas reported a workload that saw bad performance, and figured
    out that it was due to various other types of requests being
    accounted as reads. Flush requests, for instance. Due to the
    high latency of those, we heavily throttle the writes to keep
    the latencies in balance. But they really should be accounted
    as writes.

    Fix this by checking the exact type of the request. If it's a
    read, account as a read, if it's a write or a flush, account
    as a write. Any other request we disregard. Previously everything
    would have been mistakenly accounted as reads.

    Reported-by: Mikulas Patocka
    Cc: stable@vger.kernel.org # v4.12+
    Signed-off-by: Jens Axboe

    Jens Axboe
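
    The decision being described boils down to a small classification step;
    a hedged, self-contained sketch (not the actual throttling code):

        enum acct_dir { ACCT_READ, ACCT_WRITE, ACCT_NONE };

        /* Reads count as reads, writes and flushes count as writes,
         * anything else is ignored (previously everything was lumped
         * in with reads). */
        static enum acct_dir classify_for_accounting(bool is_read,
                                                     bool is_write,
                                                     bool is_flush)
        {
                if (is_read)
                        return ACCT_READ;
                if (is_write || is_flush)
                        return ACCT_WRITE;
                return ACCT_NONE;
        }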
     

05 Feb, 2018

1 commit

  • Pull more block updates from Jens Axboe:
    "Most of this is fixes and not new code/features:

    - skd fix from Arnd, fixing a build error dependent on slab allocator
    type.

    - blk-mq scheduler discard merging fixes, one from me and one from
    Keith. This fixes a segment miscalculation for blk-mq-sched, where
    we mistakenly think two segments are physically contiguous even
    though the request isn't carrying real data. Also fixes a bio-to-rq
    merge case.

    - Don't re-set a bit on the buffer_head flags, if it's already set.
    This can cause scalability concerns on bigger machines and
    workloads. From Kemi Wang.

    - Add BLK_STS_DEV_RESOURCE return value to blk-mq, allowing us to
    distinguish between a local (device related) resource starvation
    and a global one. The latter might happen without IO being in
    flight, so it has to be handled a bit differently. From Ming"

    * tag 'for-linus-20180204' of git://git.kernel.dk/linux-block:
    block: skd: fix incorrect linux/slab_def.h inclusion
    buffer: Avoid setting buffer bits that are already set
    blk-mq-sched: Enable merging discard bio into request
    blk-mq: fix discard merge with scheduler attached
    blk-mq: introduce BLK_STS_DEV_RESOURCE

    Linus Torvalds
     

02 Feb, 2018

2 commits

  • Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • I ran into an issue on my laptop that triggered a bug on the
    discard path:

    WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3/0x430
    Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd intel_gtt
    CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G U 4.15.0+ #176
    Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017
    RIP: 0010:nvme_setup_cmd+0x3d3/0x430
    RSP: 0018:ffff880423e9f838 EFLAGS: 00010217
    RAX: 0000000000000000 RBX: ffff880423e9f8c8 RCX: 0000000000010000
    RDX: ffff88022b200010 RSI: 0000000000000002 RDI: 00000000327f0000
    RBP: ffff880421251400 R08: ffff88022b200000 R09: 0000000000000009
    R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000ffff
    R13: ffff88042341e280 R14: 000000000000ffff R15: ffff880421251440
    FS: 0000000000000000(0000) GS:ffff880441500000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055b684795030 CR3: 0000000002e09006 CR4: 00000000001606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    nvme_queue_rq+0x40/0xa00
    ? __sbitmap_queue_get+0x24/0x90
    ? blk_mq_get_tag+0xa3/0x250
    ? wait_woken+0x80/0x80
    ? blk_mq_get_driver_tag+0x97/0xf0
    blk_mq_dispatch_rq_list+0x7b/0x4a0
    ? deadline_remove_request+0x49/0xb0
    blk_mq_do_dispatch_sched+0x4f/0xc0
    blk_mq_sched_dispatch_requests+0x106/0x170
    __blk_mq_run_hw_queue+0x53/0xa0
    __blk_mq_delay_run_hw_queue+0x83/0xa0
    blk_mq_run_hw_queue+0x6c/0xd0
    blk_mq_sched_insert_request+0x96/0x140
    __blk_mq_try_issue_directly+0x3d/0x190
    blk_mq_try_issue_directly+0x30/0x70
    blk_mq_make_request+0x1a4/0x6a0
    generic_make_request+0xfd/0x2f0
    ? submit_bio+0x5c/0x110
    submit_bio+0x5c/0x110
    ? __blkdev_issue_discard+0x152/0x200
    submit_bio_wait+0x43/0x60
    ext4_process_freed_data+0x1cd/0x440
    ? account_page_dirtied+0xe2/0x1a0
    ext4_journal_commit_callback+0x4a/0xc0
    jbd2_journal_commit_transaction+0x17e2/0x19e0
    ? kjournald2+0xb0/0x250
    kjournald2+0xb0/0x250
    ? wait_woken+0x80/0x80
    ? commit_timeout+0x10/0x10
    kthread+0x111/0x130
    ? kthread_create_worker_on_cpu+0x50/0x50
    ? do_group_exit+0x3a/0xa0
    ret_from_fork+0x1f/0x30
    Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff ff 4c 89 c7 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff
    ---[ end trace 50d361cc444506c8 ]---
    print_req_error: I/O error, dev nvme0n1, sector 847167488

    Decoding the assembly, the request claims to have 0xffff segments,
    while nvme counts two. This turns out to be because we don't check
    for a data carrying request on the mq scheduler path, and since
    blk_phys_contig_segment() returns true for a non-data request,
    we decrement the initial segment count of 0 and end up with
    0xffff in the unsigned short.

    There are a few issues here:

    1) We should initialize the segment count for a discard to 1.
    2) The discard merging is currently using the data limits for
    segments and sectors.

    Fix this up by having attempt_merge() correctly identify the
    request, and by initializing the segment count correctly
    for discards.

    This can only be triggered with mq-deadline on discard capable
    devices right now, which isn't a common configuration.

    Signed-off-by: Jens Axboe

    Jens Axboe
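
    The 0xffff value is plain unsigned wrap-around: decrementing a 16-bit
    segment count of 0 yields 65535. A tiny stand-alone demonstration:

        #include <stdio.h>

        int main(void)
        {
                unsigned short nr_phys_segments = 0; /* non-data request */

                nr_phys_segments--;                  /* the bogus decrement */
                printf("0x%x\n", nr_phys_segments);  /* prints 0xffff */
                return 0;
        }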
     

31 Jan, 2018

2 commits

  • This status is returned from driver to block layer if device related
    resource is unavailable, but driver can guarantee that IO dispatch
    will be triggered in future when the resource is available.

    Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if driver
    returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after
    a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls. BLK_MQ_DELAY_QUEUE is
    3 ms because both scsi-mq and nvmefc are using that magic value.

    If a driver can make sure there is in-flight IO, it is safe to return
    BLK_STS_DEV_RESOURCE because:

    1) If all in-flight IOs complete before examining SCHED_RESTART in
    blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
    is run immediately in this case by blk_mq_dispatch_rq_list();

    2) if there is any in-flight IO after/when examining SCHED_RESTART
    in blk_mq_dispatch_rq_list():
    - if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
    - otherwise, this request will be dispatched after any in-flight IO is
    completed via blk_mq_sched_restart()

    3) if SCHED_RESTART is set concurrently in context because of
    BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
    cases and make sure IO hang can be avoided.

    One invariant is that queue will be rerun if SCHED_RESTART is set.

    Suggested-by: Jens Axboe
    Tested-by: Laurence Oberman
    Signed-off-by: Ming Lei
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Ming Lei
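
    A hedged sketch of how a driver's queue_rq-style handler might choose
    between the two codes under the rules above (the demo_* helpers are
    made up; only the BLK_STS_* values are the real status codes):

        static blk_status_t demo_queue_rq(struct demo_dev *dev,
                                          struct request *rq)
        {
                if (!demo_has_free_slots(dev)) {
                        /* Device-local shortage with IO in flight: a
                         * completion is guaranteed to rerun the queue. */
                        if (demo_inflight_count(dev) > 0)
                                return BLK_STS_DEV_RESOURCE;
                        /* Nothing in flight to rerun the queue for us:
                         * let blk-mq rerun it after a delay. */
                        return BLK_STS_RESOURCE;
                }
                return demo_submit(dev, rq);    /* BLK_STS_OK on success */
        }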
     
  • Pull poll annotations from Al Viro:
    "This introduces a __bitwise type for POLL### bitmap, and propagates
    the annotations through the tree. Most of that stuff is as simple as
    'make ->poll() instances return __poll_t and do the same to local
    variables used to hold the future return value'.

    Some of the obvious brainos found in process are fixed (e.g. POLLIN
    misspelled as POLL_IN). At that point the amount of sparse warnings is
    low and most of them are for genuine bugs - e.g. ->poll() instance
    deciding to return -EINVAL instead of a bitmap. I hadn't touched those
    in this series - it's large enough as it is.

    Another problem it has caught was eventpoll() ABI mess; select.c and
    eventpoll.c assumed that corresponding POLL### and EPOLL### were
    equal. That's true for some, but not all of them - EPOLL### are
    arch-independent, but POLL### are not.

    The last commit in this series separates userland POLL### values from
    the (now arch-independent) kernel-side ones, converting between them
    in the few places where they are copied to/from userland. AFAICS, this
    is the least disruptive fix preserving poll(2) ABI and making epoll()
    work on all architectures.

    As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
    it will trigger only on what would've triggered EPOLLWRBAND on other
    architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
    at all on sparc. With this patch they should work consistently on all
    architectures"

    * 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
    make kernel-side POLL... arch-independent
    eventpoll: no need to mask the result of epi_item_poll() again
    eventpoll: constify struct epoll_event pointers
    debugging printk in sg_poll() uses %x to print POLL... bitmap
    annotate poll(2) guts
    9p: untangle ->poll() mess
    ->si_band gets POLL... bitmap stored into a user-visible long field
    ring_buffer_poll_wait() return value used as return value of ->poll()
    the rest of drivers/*: annotate ->poll() instances
    media: annotate ->poll() instances
    fs: annotate ->poll() instances
    ipc, kernel, mm: annotate ->poll() instances
    net: annotate ->poll() instances
    apparmor: annotate ->poll() instances
    tomoyo: annotate ->poll() instances
    sound: annotate ->poll() instances
    acpi: annotate ->poll() instances
    crypto: annotate ->poll() instances
    block: annotate ->poll() instances
    x86: annotate ->poll() instances
    ...

    Linus Torvalds
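
    For a concrete (invented) example of the annotation work described
    above: after the series a typical ->poll() instance returns __poll_t
    and builds its mask from the arch-independent EPOLL* bits, which is
    what lets sparse flag mixups such as POLL_IN vs POLLIN.

        static __poll_t demo_poll(struct file *file, poll_table *wait)
        {
                struct demo_dev *dev = file->private_data;
                __poll_t mask = 0;              /* was: unsigned int */

                poll_wait(file, &dev->waitq, wait);
                if (demo_readable(dev))
                        mask |= EPOLLIN | EPOLLRDNORM;
                if (demo_writable(dev))
                        mask |= EPOLLOUT | EPOLLWRNORM;
                return mask;
        }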
     

30 Jan, 2018

1 commit

  • Pull block updates from Jens Axboe:
    "This is the main pull request for block IO related changes for the
    4.16 kernel. Nothing major in this pull request, but a good amount of
    improvements and fixes all over the map. This contains:

    - BFQ improvements, fixes, and cleanups from Angelo, Chiara, and
    Paolo.

    - Support for SMR zones for deadline and mq-deadline from Damien and
    Christoph.

    - Set of fixes for bcache by way of Michael Lyle, including fixes
    from himself, Kent, Rui, Tang, and Coly.

    - Series from Matias for lightnvm with fixes from Hans Holmberg,
    Javier, and Matias. Mostly centered around pblk, and the removal of
    rrpc 1.2 in preparation for supporting 2.0.

    - A couple of NVMe pull requests from Christoph. Nothing major in
    here, just fixes and cleanups, and support for command tracing from
    Johannes.

    - Support for blk-throttle for tracking reads and writes separately.
    From Joseph Qi. A few cleanups/fixes also for blk-throttle from
    Weiping.

    - Series from Mike Snitzer that enables dm to register its queue more
    logically, something that's always been problematic on dm since
    it's a stacked device.

    - Series from Ming cleaning up some of the bio accessor use, in
    preparation for supporting multipage bvecs.

    - Various fixes from Ming closing up holes around queue mapping and
    quiescing.

    - BSD partition fix from Richard Narron, fixing a problem where we
    can't mount newer (10/11) FreeBSD partitions.

    - Series from Tejun reworking blk-mq timeout handling. The previous
    scheme relied on atomic bits, but it had races where we would think
    a request had timed out if it was reused at the wrong time.

    - null_blk now supports faking timeouts, to enable us to better
    exercise and test that functionality separately. From me.

    - Kill the separate atomic poll bit in the request struct. After
    this, we don't use the atomic bits on blk-mq anymore at all. From
    me.

    - sgl_alloc/free helpers from Bart.

    - Heavily contended tag case scalability improvement from me.

    - Various little fixes and cleanups from Arnd, Bart, Corentin,
    Douglas, Eryu, Goldwyn, and myself"

    * 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
    block: remove smart1,2.h
    nvme: add tracepoint for nvme_complete_rq
    nvme: add tracepoint for nvme_setup_cmd
    nvme-pci: introduce RECONNECTING state to mark initializing procedure
    nvme-rdma: remove redundant boolean for inline_data
    nvme: don't free uuid pointer before printing it
    nvme-pci: Suspend queues after deleting them
    bsg: use pr_debug instead of hand crafted macros
    blk-mq-debugfs: don't allow write on attributes with seq_operations set
    nvme-pci: Fix queue double allocations
    block: Set BIO_TRACE_COMPLETION on new bio during split
    blk-throttle: use queue_is_rq_based
    block: Remove kblockd_schedule_delayed_work{,_on}()
    blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
    blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
    lib/scatterlist: Fix chaining support in sgl_alloc_order()
    blk-throttle: track read and write request individually
    block: add bdev_read_only() checks to common helpers
    block: fail op_is_write() requests to read-only partitions
    blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
    ...

    Linus Torvalds
     

25 Jan, 2018

2 commits

  • Use pr_debug instead of hand crafted macros. This way it is not needed to
    re-compile the kernel to enable bsg debug outputs and it's possible to
    selectively enable specific prints.

    Cc: Joe Perches
    Reviewed-by: Bart Van Assche
    Signed-off-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
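
    The practical effect is that bsg's debug output is now controlled by
    the generic dynamic debug facility instead of a compile-time macro; an
    illustrative (invented) message:

        pr_debug("%s: command completed, result %d\n", __func__, result);

        /* With CONFIG_DYNAMIC_DEBUG such prints can be enabled at run
         * time and filtered per file or function, e.g.:
         *   echo 'file bsg.c +p' > /sys/kernel/debug/dynamic_debug/control
         */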
     
  • Attributes that only implement .seq_ops are read-only, any write to
    them should be rejected. But currently kernel would crash when
    writing to such debugfs entries, e.g.

    chmod +w /sys/kernel/debug/block/<dev>/requeue_list
    echo 0 > /sys/kernel/debug/block/<dev>/requeue_list
    chmod -w /sys/kernel/debug/block/<dev>/requeue_list

    Fix it by returning -EPERM in blk_mq_debugfs_write() when writing to
    such attributes.

    Cc: Ming Lei
    Signed-off-by: Eryu Guan
    Signed-off-by: Jens Axboe

    Eryu Guan
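
    A hedged sketch of the check (the attribute structure and its fields
    are illustrative): the write handler bails out when the attribute only
    provides a read-side seq interface and no write callback.

        static ssize_t demo_debugfs_write(struct demo_debugfs_attr *attr,
                                          const char __user *buf,
                                          size_t count)
        {
                /* Read-only attribute: nothing to call, reject the write
                 * instead of dereferencing a NULL pointer. */
                if (!attr->write)
                        return -EPERM;
                return attr->write(buf, count);
        }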
     

24 Jan, 2018

1 commit


20 Jan, 2018

4 commits


19 Jan, 2018

7 commits

  • In a mixed read/write workload on an SSD, write latency is much lower
    than read latency. But currently we only track and record read latency,
    and then use it as the threshold base for both read and write I/O
    latency accounting. As a result, write I/O latency will always be
    considered good and bad_bio_cnt is much smaller than 20% of bio_cnt.
    That means the tg being checked will be treated as idle most of the
    time, and will keep letting others dispatch more I/Os, even when it is
    truly running under its low limit and wants that limit to be
    guaranteed, which is not what we expect. So track read and write
    requests individually, which brings more precise latency control for
    low-limit idle detection.

    Signed-off-by: Joseph Qi
    Reviewed-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Joseph Qi
     
  • Similar to blkdev_write_iter(), return -EPERM if the partition is
    read-only. This covers ioctl(), fallocate() and most in-kernel users
    but isn't meant to be exhaustive -- everything else will be caught in
    generic_make_request_checks(), fail with -EIO and can be fixed later.

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe

    Ilya Dryomov
     
  • Regular block device writes go through blkdev_write_iter(), which does
    bdev_read_only(), while zeroout/discard/etc requests are never checked,
    both userspace- and kernel-triggered. Add a generic catch-all check to
    generic_make_request_checks() to actually enforce ioctl(BLKROSET) and
    set_disk_ro(), which is used by quite a few drivers for things like
    snapshots, read-only backing files/images, etc.

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe

    Ilya Dryomov
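
    Conceptually the new check is a single test in the common submission
    path; a hedged sketch (the read-only helper name is invented, the
    op_is_write()/bio_op() helpers are real):

        /* In generic_make_request_checks(), roughly: fail any write-type
         * bio aimed at a read-only partition before it reaches the
         * driver. */
        if (op_is_write(bio_op(bio)) && demo_part_read_only(part))
                goto not_supported;     /* bio completes with an error */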
     
  • export these two interface for cgroup-v1.

    Acked-by: Tejun Heo
    Signed-off-by: weiping zhang
    Signed-off-by: Jens Axboe

    weiping zhang
     
  • The __blk_mq_register_dev(), blk_mq_unregister_dev(),
    elv_register_queue() and elv_unregister_queue() calls need to be
    protected with sysfs_lock but other code in these functions not.
    Hence protect only this code with sysfs_lock. This patch fixes a
    locking inversion issue in blk_unregister_queue() and also in an
    error path of blk_register_queue(): it is not allowed to hold
    sysfs_lock around the kobject_del(&q->kobj) call.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • This patch does not change any functionality.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • These two functions are only called from inside the block layer so
    unexport them.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

18 Jan, 2018

9 commits

  • To maximise responsiveness, BFQ raises the weight, and performs device
    idling, for bfq_queues associated with processes deemed as
    interactive. In particular, weight raising has a maximum duration,
    equal to the time needed to start a large application. If a
    weight-raised process goes on doing I/O beyond this maximum duration,
    it loses weight-raising.

    This mechanism is evidently vulnerable to the following false
    positives: I/O-bound applications that will go on doing I/O for much
    longer than the duration of weight-raising. These applications have
    basically no benefit from being weight-raised at the beginning of
    their I/O. On the opposite end, while being weight-raised, these
    applications
    a) unjustly steal throughput from applications that may truly need
    low latency;
    b) make BFQ uselessly perform device idling; device idling results
    in loss of device throughput with most flash-based storage, and may
    increase latencies when used purposelessly.

    This commit adds a countermeasure to reduce both the above
    problems. To introduce this countermeasure, we provide the following
    extra piece of information (full details in the comments added by this
    commit). During the start-up of the large application used as a
    reference to set the duration of weight-raising, involved processes
    transfer at most ~110K sectors each. Accordingly, a process initially
    deemed as interactive has no right to be weight-raised any longer,
    once it has transferred 110K sectors or more.

    Based on this consideration, this commit ends weight-raising early
    for a bfq_queue if the latter happens to have received an amount of
    service at least equal to 110K sectors (actually, a little bit more,
    to keep a safety margin). I/O-bound applications that reach a high
    throughput, such as file copy, get to this threshold much before the
    allowed weight-raising period finishes. Thus this early ending of
    weight-raising reduces the amount of time during which these
    applications cause the problems described above.

    Tested-by: Oleksandr Natalenko
    Tested-by: Holger Hoffstätte
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
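
    A hedged, self-contained sketch of the countermeasure (all names and
    the exact margin are invented; only the ~110K-sector figure comes from
    the text above):

        #define DEMO_WR_MAX_SECTORS     (110 * 1024)    /* "~110K sectors" */

        struct demo_queue {
                unsigned long sectors_served_while_raised;
                unsigned int  weight_coeff;     /* > 1 while weight-raised */
        };

        static void demo_maybe_end_weight_raising(struct demo_queue *q)
        {
                /* allow a small safety margin beyond the threshold */
                unsigned long limit = DEMO_WR_MAX_SECTORS +
                                      DEMO_WR_MAX_SECTORS / 8;

                if (q->weight_coeff > 1 &&
                    q->sectors_served_while_raised > limit)
                        q->weight_coeff = 1;    /* back to normal weight */
        }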
     
  • Asynchronous I/O can easily starve synchronous I/O (both sync reads
    and sync writes), by consuming all request tags. Similarly, storms of
    synchronous writes, such as those that sync(2) may trigger, can starve
    synchronous reads. In their turn, these two problems may also cause
    BFQ to lose control of latency for interactive and soft real-time
    applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice
    Writer takes 0.6 seconds to start if the device is idle, but it takes
    more than 45 seconds (!) if there are sequential writes in the
    background.

    This commit addresses this issue by limiting the maximum percentage of
    tags that asynchronous I/O requests and synchronous write requests can
    consume. In particular, this commit grants a higher threshold to
    synchronous writes, to prevent the latter from being starved by
    asynchronous I/O.

    According to the above test, LibreOffice Writer now starts in about
    1.2 seconds on average, regardless of the background workload, and
    apart from some rare outlier. To check this improvement, run, e.g.,
    sudo ./comm_startup_lat.sh bfq 5 5 seq 10 "lowriter --terminate_after_init"
    for the comm_startup_lat benchmark in the S suite [1].

    [1] https://github.com/Algodev-github/S

    Tested-by: Oleksandr Natalenko
    Tested-by: Holger Hoffstätte
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
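
    A hedged sketch of the general idea (the fractions and names are
    placeholders, not BFQ's actual thresholds): cap how many of the
    device's tags async requests and sync writes may hold, with a higher
    cap for sync writes so they are not starved by async I/O.

        static unsigned int demo_depth_limit(unsigned int total_tags,
                                             bool is_sync, bool is_write)
        {
                if (!is_sync)
                        return total_tags / 4;  /* async I/O: lowest cap */
                if (is_write)
                        return total_tags / 2;  /* sync writes: higher cap */
                return total_tags;              /* sync reads: no cap */
        }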
     
  • If we run into blk_mq_request_direct_issue() while the queue is busy,
    we don't want to dispatch this request to hctx->dispatch_list; what we
    need to do instead is return the queue-busy information to the caller,
    so that the caller can deal with it properly.

    Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
    Reported-by: Laurence Oberman
    Reviewed-by: Mike Snitzer
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio")
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • After commit:

    923218f6166a ("blk-mq: don't allocate driver tag upfront for flush rq")

    we no longer use the 'can_block' argument in
    blk_mq_sched_insert_request(). Kill it.

    Signed-off-by: Mike Snitzer

    Added actual commit message as to why it's being removed.

    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • blk_insert_cloned_request() is called in the fast path of a dm-rq driver
    (e.g. blk-mq request-based DM mpath). blk_insert_cloned_request() uses
    blk_mq_request_bypass_insert() to directly append the request to the
    blk-mq hctx->dispatch_list of the underlying queue.

    1) This way isn't efficient enough because the hctx spinlock is always
    used.

    2) With blk_insert_cloned_request(), we completely bypass underlying
    queue's elevator and depend on the upper-level dm-rq driver's elevator
    to schedule IO. But dm-rq currently can't get the underlying queue's
    dispatch feedback at all. Without knowing whether a request was issued
    or not (e.g. due to underlying queue being busy) the dm-rq elevator will
    not be able to provide effective IO merging (as a side-effect of dm-rq
    currently blindly destaging a request from its elevator only to requeue
    it after a delay, which kills any opportunity for merging). This
    obviously causes very bad sequential IO performance.

    Fix this by updating blk_insert_cloned_request() to use
    blk_mq_request_direct_issue(). blk_mq_request_direct_issue() allows a
    request to be issued directly to the underlying queue and returns the
    dispatch feedback (blk_status_t). If blk_mq_request_direct_issue()
    returns BLK_STS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
    to _not_ destage the request, thereby preserving the opportunity to
    merge IO.

    With this, request-based DM's blk-mq sequential IO performance is vastly
    improved (as much as 3X in mpath/virtio-scsi testing).

    Signed-off-by: Ming Lei
    [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
    they were refactored to make them less fragile and easier to read/review]
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Ming Lei
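
    A hedged sketch of the feedback loop on the dm-rq side (simplified;
    BLK_STS_RESOURCE and DM_MAPIO_REQUEUE are the real identifiers): when
    the underlying queue reports that it is busy, the clone is kept in
    dm-rq's elevator instead of being destaged.

        /* Roughly, in dm-rq's request mapping path: */
        blk_status_t ret = blk_insert_cloned_request(clone->q, clone);

        switch (ret) {
        case BLK_STS_OK:
                break;                   /* issued to the underlying queue */
        case BLK_STS_RESOURCE:
                return DM_MAPIO_REQUEUE; /* keep it in dm-rq's elevator,
                                          * preserving merge opportunities */
        default:
                return DM_MAPIO_KILL;    /* hard error */
        }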
     
  • No functional change. Just makes code flow more logically.

    In following commit, __blk_mq_try_issue_directly() will be used to
    return the dispatch result (blk_status_t) to DM. DM needs this
    information to improve IO merging.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • We know this WARN_ON is harmless and in reality it may be triggered,
    so convert it to printk() and dump_stack() to avoid confusing
    people.

    Also add a comment about the two related races here.

    Cc: Christian Borntraeger
    Cc: Stefan Haberland
    Cc: Christoph Hellwig
    Cc: Thomas Gleixner
    Cc: "jianchao.wang"
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
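
    The conversion itself is mechanical; an illustrative (invented) shape:

        /* before: trips the full WARNING machinery for a harmless,
         * known race */
        WARN_ON(cond);

        /* after: same diagnostic value, less alarming */
        if (cond) {
                printk(KERN_INFO "demo: harmless race detected\n");
                dump_stack();
        }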
     
  • When hctx->next_cpu is set from possible online CPUs, there is a
    race in which hctx->next_cpu may be set to a value >= nr_cpu_ids,
    which eventually breaks the workqueue.

    The race can be triggered in the following two situations:

    1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
    to dispatch requests from the DEAD cpu context, but at that
    time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
    CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
    a bad value.

    2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
    should be run on the other CPU A, then CPU A may become offline at the
    same time and all CPUs in hctx->cpumask become offline.

    This patch deals with this issue by re-selecting next CPU, and making
    sure it is set correctly.

    Cc: Christian Borntraeger
    Cc: Stefan Haberland
    Cc: Christoph Hellwig
    Cc: Thomas Gleixner
    Reported-by: "jianchao.wang"
    Tested-by: "jianchao.wang"
    Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
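
    A hedged sketch of the re-selection (not the exact upstream fix): when
    the computed next CPU is out of range, fall back to an online CPU from
    the hctx mask rather than handing a bad CPU number to the workqueue.

        static int demo_pick_next_cpu(struct blk_mq_hw_ctx *hctx)
        {
                int cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
                                           cpu_online_mask);

                if (cpu >= nr_cpu_ids)
                        cpu = cpumask_first_and(hctx->cpumask,
                                                cpu_online_mask);
                if (cpu >= nr_cpu_ids)  /* whole mask went offline */
                        cpu = raw_smp_processor_id();
                return cpu;
        }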
     

15 Jan, 2018

5 commits

  • During stress tests by syzkaller on the sg driver, the block layer
    infrequently returns EINVAL. Closer inspection shows the block
    layer was trying to return ENOMEM (which is much more
    understandable) but for some reason overrode that useful error.

    Patch below does not show this (unchanged) line:
    ret = __blk_rq_map_user_iov(rq, map_data, &i, gfp_mask, copy);
    That 'ret' was being overridden when that function failed.

    Signed-off-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    Douglas Gilbert
     
  • For as long as I can remember, DM has forced the block layer to allow the
    allocation and initialization of the request_queue to be distinct
    operations. The reason for this is that block/genhd.c:add_disk() requires
    that the request_queue (and associated bdi) be tied to the gendisk
    before add_disk() is called -- because add_disk() also deals with
    exposing the request_queue via blk_register_queue().

    DM's dynamic creation of arbitrary device types (and associated
    request_queue types) requires the DM device's gendisk be available so
    that DM table loads can establish a master/slave relationship with
    subordinate devices that are referenced by loaded DM tables -- using
    bd_link_disk_holder(). But until these DM tables, and their associated
    subordinate devices, are known DM cannot know what type of request_queue
    it needs -- nor what its queue_limits should be.

    This chicken and egg scenario has created all manner of problems for DM
    and, at times, the block layer.

    Summary of changes:

    - Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
    that drivers may use to add a disk without also calling
    blk_register_queue(). Driver must call blk_register_queue() once its
    request_queue is fully initialized.

    - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
    is not set. It won't be set if driver used add_disk_no_queue_reg()
    but driver encounters an error and must del_gendisk() before calling
    blk_register_queue().

    - Export blk_register_queue().

    These changes allow DM to use add_disk_no_queue_reg() to anchor its
    gendisk as the "master" for master/slave relationships DM must establish
    with subordinate devices referenced in DM tables that get loaded. Once
    all "slave" devices for a DM device are known its request_queue can be
    properly initialized and then advertised via sysfs -- important
    improvement being that no request_queue resource initialization
    performed by blk_register_queue() is missed for DM devices anymore.

    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Mike Snitzer
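
    A hedged sketch of the driver-side flow the new variant enables (error
    handling elided; the function names are the ones introduced or
    exported by this change):

        /* 1) Make the gendisk visible without registering the not yet
         *    fully configured request_queue in sysfs. */
        device_add_disk_no_queue_reg(parent_dev, disk);

        /* 2) ... load tables, discover subordinate devices, finish
         *    setting up the request_queue and its queue_limits ... */

        /* 3) Only then advertise the queue via sysfs. */
        blk_register_queue(disk);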
     
  • The original commit e9a823fb34a8b (block: fix warning when I/O elevator
    is changed as request_queue is being removed) is pretty conflated.
    "conflated" because the resource being protected by q->sysfs_lock isn't
    the queue_flags (it is the 'queue' kobj).

    q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
    from racing with blk_unregister_queue():
    1) By holding q->sysfs_lock first, __elevator_change() can complete
    before a racing blk_unregister_queue().
    2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED
    in case elv_iosched_store() loses the race with blk_unregister_queue(),
    it needs a way to know the 'queue' kobj isn't there.

    Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
    held until after the 'queue' kobj is removed.

    To do so blk_mq_unregister_dev() must not also take q->sysfs_lock. So
    rename __blk_mq_unregister_dev() to blk_mq_unregister_dev().

    Also, blk_unregister_queue() should use q->queue_lock to protect against
    any concurrent writes to q->queue_flags -- even though chances are the
    queue is being cleaned up so no concurrent writes are likely.

    Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as request_queue is being removed")
    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • device_add_disk() will only call bdi_register_owner() if
    !GENHD_FL_HIDDEN, so it follows that del_gendisk() should only call
    bdi_unregister() if !GENHD_FL_HIDDEN.

    Found with code inspection. bdi_unregister() won't do any harm if
    bdi_register_owner() wasn't used but best to avoid the unnecessary
    call to bdi_unregister().

    Fixes: 8ddcd65325 ("block: introduce GENHD_FL_HIDDEN")
    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • A previous commit moved the clearing of rq->rq_flags later,
    but we may have already set RQF_MQ_INFLIGHT when that happens.
    Ensure that we correctly initialize rq->rq_flags to the
    right value.

    This is based on an original fix by Ming, just rewritten to not
    require a conditional.

    Fixes: 7c3fb70f0341 ("block: rearrange a few request fields for better cache layout")
    Reviewed-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Jens Axboe
     

13 Jan, 2018

2 commits

  • Looking at debug output, we see:

    ./000000009ddfa913/requeue_list:000000009646711c {.op=READ, .state=idle, gen=0x1
    18, abort_gen=0x0, .cmd_flags=, .rq_flags=SORTED|1|SOFTBARRIER|IO_STAT, complete
    =0, .tag=-1, .internal_tag=217}

    Note the '1' between SORTED and SOFTBARRIER - that's because no name
    was defined for RQF_STARTED. Fixed that.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The previous patch assigns interrupt vectors to all possible CPUs, so
    now hctx can be mapped to possible CPUs, this patch applies this fact
    to simplify queue mapping & schedule so that we don't need to handle
    CPU hotplug for dealing with physical CPU plug & unplug. With this
    simplification, we can work well on physical CPU plug & unplug, which
    is a normal use case for VM at least.

    Make sure we allocate blk_mq_ctx structures for all possible CPUs, and
    set hctx->numa_node for possible CPUs which are mapped to this hctx. And
    only choose the online CPUs for schedule.

    Reported-by: Christian Borntraeger
    Tested-by: Christian Borntraeger
    Tested-by: Stefan Haberland
    Cc: Thomas Gleixner
    Signed-off-by: Christoph Hellwig
    Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
    (merged the three into one because any single one may not work, and fix
    selecting online CPUs for scheduler)
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig