13 Apr, 2016

1 commit

  • Add an internal helper and flag for setting whether a queue uses
    write-back caching, write-through caching, or none. Add a sysfs file
    to show this as well, and make it changeable from user space.

    This will replace the (awkward) blk_queue_flush() interface that
    drivers currently use to inform the block layer of write cache state
    and capabilities.

    Signed-off-by: Jens Axboe
    Reviewed-by: Christoph Hellwig

    Jens Axboe
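
    For illustration: a minimal user-space sketch of poking the new knob,
    assuming it is exposed as /sys/block/<dev>/queue/write_cache and
    accepts the strings "write back" and "write through" (the device name
    below is a placeholder):

        /* toggle_write_cache.c -- illustrative sketch only */
        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            const char *path = "/sys/block/sda/queue/write_cache";
            char mode[32] = "";
            FILE *f = fopen(path, "r");

            if (!f || !fgets(mode, sizeof(mode), f)) {
                perror(path);
                return 1;
            }
            fclose(f);
            printf("current mode: %s", mode);

            /* Flip the mode; the kernel parses the string we write. */
            f = fopen(path, "w");
            if (!f)
                return 1;
            fputs(strncmp(mode, "write back", 10) == 0 ?
                  "write through" : "write back", f);
            fclose(f);
            return 0;
        }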
     

05 Apr, 2016

1 commit

  • The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether the
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Switching globally to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in the page cache are special. They
    are not.

    The changes are pretty straightforward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with Coccinelle using
    the script below. For some reason, Coccinelle doesn't patch header
    files, so I've run spatch on them manually.

    The only adjustment after Coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that Coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
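
    For illustration, the net effect of the substitutions above on a
    typical (hypothetical) filesystem snippet; since PAGE_CACHE_SIZE has
    always equaled PAGE_SIZE, behaviour is unchanged:

        /* Before the conversion: */
        pgoff_t index  = pos >> PAGE_CACHE_SHIFT;
        size_t  offset = pos & ~PAGE_CACHE_MASK;
        page_cache_get(page);
        page_cache_release(page);

        /* After the conversion: */
        pgoff_t index  = pos >> PAGE_SHIFT;
        size_t  offset = pos & ~PAGE_MASK;
        get_page(page);
        put_page(page);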
     

26 Nov, 2015

1 commit

  • Commit 4f258a46346c ("sd: Fix maximum I/O size for BLOCK_PC requests")
    had the unfortunate side-effect of removing an implicit clamp to
    BLK_DEF_MAX_SECTORS for REQ_TYPE_FS requests in the block layer
    code. This caused problems for some SMR drives.

    Debugging this issue revealed a few problems with the existing
    infrastructure since the block layer didn't know how to deal with
    device-imposed limits, only limits set by the I/O controller.

    - Introduce a new queue limit, max_dev_sectors, which is used by the
    ULD to signal the maximum sectors for a REQ_TYPE_FS request.

    - Ensure that max_dev_sectors is correctly stacked and taken into
    account when overriding max_sectors through sysfs.

    - Rework sd_read_block_limits() so it saves the max_xfer and opt_xfer
    values for later processing.

    - In sd_revalidate() set the queue's max_dev_sectors based on the
    MAXIMUM TRANSFER LENGTH value in the Block Limits VPD. If this value
    is not reported, fall back to a cap based on the CDB TRANSFER LENGTH
    field size.

    - In sd_revalidate(), use OPTIMAL TRANSFER LENGTH from the Block Limits
    VPD--if reported and sane--to signal the preferred device transfer
    size for FS requests. Otherwise use BLK_DEF_MAX_SECTORS.

    - blk_limits_max_hw_sectors() is no longer used and can be removed.

    Signed-off-by: Martin K. Petersen
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=93581
    Reviewed-by: Christoph Hellwig
    Tested-by: sweeneygj@gmx.com
    Tested-by: Arzeets
    Tested-by: David Eisner
    Tested-by: Mario Kicherer
    Signed-off-by: Martin K. Petersen

    Martin K. Petersen
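
    For illustration: the sysfs override mentioned above is the
    /sys/block/<dev>/queue/max_sectors_kb knob, which the kernel clamps
    against max_hw_sectors_kb (and, with this change, the device-imposed
    limit). A small sketch of reading the cap and raising the soft limit
    to it (the device name is a placeholder):

        /* raise_max_sectors.c -- illustrative sketch only */
        #include <stdio.h>

        int main(void)
        {
            unsigned long hw_kb = 0;
            FILE *f = fopen("/sys/block/sda/queue/max_hw_sectors_kb", "r");

            if (!f || fscanf(f, "%lu", &hw_kb) != 1)
                return 1;
            fclose(f);
            printf("hardware/device cap: %lu KB\n", hw_kb);

            /* Writes above the (stacked) cap are rejected by the kernel. */
            f = fopen("/sys/block/sda/queue/max_sectors_kb", "w");
            if (!f)
                return 1;
            fprintf(f, "%lu\n", hw_kb);
            fclose(f);
            return 0;
        }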
     

08 Nov, 2015

1 commit

  • Add basic support for polling for specific IO to complete. This uses
    the cookie that blk-mq passes back, which enables the block layer
    to pass this cookie to the driver to spin for a specific request.

    This will be combined with request latency tracking, so we can make
    qualified decisions about when to poll and when not to. For now, for
    benchmark purposes, we add a sysfs file that controls whether polling
    is enabled or not.

    Signed-off-by: Jens Axboe
    Acked-by: Christoph Hellwig
    Acked-by: Keith Busch

    Jens Axboe
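
    For illustration: a trivial sketch of enabling that control for
    benchmarking, assuming it is exposed as /sys/block/<dev>/queue/io_poll
    (the device name is a placeholder):

        /* enable_io_poll.c -- illustrative sketch only */
        #include <stdio.h>

        int main(void)
        {
            FILE *f = fopen("/sys/block/nvme0n1/queue/io_poll", "w");

            if (!f) {
                perror("io_poll");
                return 1;
            }
            fputs("1\n", f);    /* non-zero enables polled completions */
            fclose(f);
            return 0;
        }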
     

05 Nov, 2015

1 commit

  • Pull block integrity updates from Jens Axboe:
    ""This is the joint work of Dan and Martin, cleaning up and improving
    the support for block data integrity"

    * 'for-4.4/integrity' of git://git.kernel.dk/linux-block:
    block, libnvdimm, nvme: provide a built-in blk_integrity nop profile
    block: blk_flush_integrity() for bio-based drivers
    block: move blk_integrity to request_queue
    block: generic request_queue reference counting
    nvme: suspend i/o during runtime blk_integrity_unregister
    md: suspend i/o during runtime blk_integrity_unregister
    md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
    block: Inline blk_integrity in struct gendisk
    block: Export integrity data interval size in sysfs
    block: Reduce the size of struct blk_integrity
    block: Consolidate static integrity profile properties
    block: Move integrity kobject to struct gendisk

    Linus Torvalds
     

22 Oct, 2015

1 commit

  • Allow pmem, and other synchronous/bio-based block drivers, to fall back
    on a per-cpu reference count managed by the core for tracking queue
    live/dead state.

    The existing per-cpu reference count for the blk_mq case is promoted to
    be used in all block i/o scenarios. This involves initializing it by
    default, waiting for it to drop to zero at exit, and holding a live
    reference over the invocation of q->make_request_fn() in
    generic_make_request(). The blk_mq code continues to take its own
    reference per blk_mq request and retains the ability to freeze the
    queue, but the check that the queue is frozen is moved to
    generic_make_request().

    This fixes crash signatures like the following:

    BUG: unable to handle kernel paging request at ffff880140000000
    [..]
    Call Trace:
    [] ? copy_user_handle_tail+0x5f/0x70
    [] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
    [] pmem_make_request+0xd1/0x200 [nd_pmem]
    [] ? mempool_alloc+0x72/0x1a0
    [] generic_make_request+0xd6/0x110
    [] submit_bio+0x76/0x170
    [] submit_bh_wbc+0x12f/0x160
    [] submit_bh+0x12/0x20
    [] jbd2_write_superblock+0x8d/0x170
    [] jbd2_mark_journal_empty+0x5d/0x90
    [] jbd2_journal_destroy+0x24b/0x270
    [] ? put_pwq_unlocked+0x2a/0x30
    [] ? destroy_workqueue+0x225/0x250
    [] ext4_put_super+0x64/0x360
    [] generic_shutdown_super+0x6a/0xf0

    Cc: Jens Axboe
    Cc: Keith Busch
    Cc: Ross Zwisler
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Tested-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     

15 Oct, 2015

1 commit

  • bdi's are initialized in two steps, bdi_init() and bdi_register(), but
    destroyed in a single step by bdi_destroy() which, for a bdi embedded
    in a request_queue, is called during blk_cleanup_queue() which makes
    the queue invisible and starts the draining of remaining usages.

    A request_queue's user can access the congestion state of the embedded
    bdi as long as it holds a reference to the queue. As such, it may
    access the congested state of a queue which finished
    blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
    Because the congested state was embedded in backing_dev_info which in
    turn is embedded in request_queue, accessing the congested state after
    bdi_destroy() was called was fine. The bdi was destroyed but the
    memory region for the congested state remained accessible till the
    queue got released.

    a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in
    bdi_writeback") changed the situation. Now, the root congested state
    which is expected to be pinned while request_queue remains accessible
    is separately reference counted and the base ref is put during
    bdi_destroy(). This means that the root congested state may go away
    prematurely while the queue is between bdi_destroy() and
    blk_cleanup_queue(), which was detected by Andrey's KASAN tests.

    The root cause of this problem is that bdi doesn't distinguish the two
    steps of destruction, unregistration and release, and now the root
    congested state actually requires a separate release step. To fix the
    issue, this patch separates out bdi_unregister() and bdi_exit() from
    bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue()
    and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a
    simple wrapper calling the two steps back-to-back.

    While at it, the prototype of bdi_destroy() is moved right below
    bdi_setup_and_register() so that the counterpart operations are
    located together.

    Signed-off-by: Tejun Heo
    Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
    Cc: stable@vger.kernel.org # v4.2+
    Reported-and-tested-by: Andrey Konovalov
    Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com
    Reviewed-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Tejun Heo
     

14 Aug, 2015

1 commit

  • The way the block layer is currently written, it goes to great lengths
    to avoid having to split bios; upper layer code (such as bio_add_page())
    checks what the underlying device can handle and tries to always create
    bios that don't need to be split.

    But this approach becomes unwieldy and eventually breaks down with
    stacked devices and devices with dynamic limits, and it adds a lot of
    complexity. If the block layer could split bios as needed, we could
    eliminate a lot of complexity elsewhere - particularly in stacked
    drivers. Code that creates bios can then create whatever size bios are
    convenient, and more importantly stacked drivers don't have to deal with
    both their own bio size limitations and the limitations of the
    (potentially multiple) devices underneath them. In the future this will
    let us delete merge_bvec_fn and a bunch of other code.

    We do this by adding calls to blk_queue_split() to the various
    make_request functions that need it - a few can already handle arbitrary
    size bios. Note that we add the call _after_ any call to
    blk_queue_bounce(); this means that blk_queue_split() and
    blk_recalc_rq_segments() don't need to be concerned with bouncing
    affecting segment merging.

    Some make_request_fn() callbacks were simple enough to audit and verify
    they don't need blk_queue_split() calls. The skipped ones are:

    * nfhd_make_request (arch/m68k/emu/nfblock.c)
    * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
    * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
    * brd_make_request (ramdisk - drivers/block/brd.c)
    * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
    * loop_make_request
    * null_queue_bio
    * bcache's make_request fns

    Some others are almost certainly safe to remove now, but will be left
    for future patches.

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Ming Lei
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Lars Ellenberg
    Cc: drbd-user@lists.linbit.com
    Cc: Jiri Kosina
    Cc: Geoff Levand
    Cc: Jim Paris
    Cc: Philip Kelleher
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Oleg Drokin
    Cc: Andreas Dilger
    Acked-by: NeilBrown (for the 'md/md.c' bits)
    Acked-by: Mike Snitzer
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Kent Overstreet
    [dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
    Signed-off-by: Dongsu Park
    Signed-off-by: Ming Lin
    Signed-off-by: Jens Axboe

    Kent Overstreet
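
    As a sketch, the pattern added to the affected make_request functions
    looks roughly like this (the driver names are hypothetical; the
    signatures shown assume kernels of that era, where make_request_fn
    still returned void and blk_queue_split() took an explicit bio_set):

        static void mydrv_make_request(struct request_queue *q, struct bio *bio)
        {
            /* Bounce first, then split, so splitting never has to worry
             * about bouncing changing the segments afterwards. */
            blk_queue_bounce(q, &bio);
            blk_queue_split(q, &bio, q->bio_split);

            mydrv_submit(bio);    /* driver-specific submission path */
        }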
     

17 Jul, 2015

1 commit

  • Lots of devices support huge discard sizes these days. Depending
    on how the device handles them internally, huge discards can
    introduce massive latencies (hundreds of msec) on the device side.

    We have a sysfs file, discard_max_bytes, that advertises the max
    hardware-supported discard size. Make it writeable, and split the
    setting into a soft and a hard limit. The soft limit can be set to
    anything from 'discard_granularity' up to the hardware limit.

    Add a new sysfs file, 'discard_max_hw_bytes', that shows the hw
    set limit.

    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jens Axboe
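
    For illustration: with the split limits, an administrator can read the
    hardware cap and lower the soft cap to bound per-discard latency (the
    device name and the 64 MiB value are placeholders):

        /* cap_discard.c -- illustrative sketch only */
        #include <stdio.h>

        int main(void)
        {
            unsigned long long hw = 0, cap = 64ULL << 20;   /* 64 MiB */
            FILE *f = fopen("/sys/block/sda/queue/discard_max_hw_bytes", "r");

            if (!f || fscanf(f, "%llu", &hw) != 1)
                return 1;
            fclose(f);
            printf("hardware discard limit: %llu bytes\n", hw);

            f = fopen("/sys/block/sda/queue/discard_max_bytes", "w");
            if (!f)
                return 1;
            fprintf(f, "%llu\n", hw < cap ? hw : cap);
            fclose(f);
            return 0;
        }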
     

26 Jun, 2015

1 commit

  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade; finally there's a solution and code to go with it.

    Also see last week's writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds
     

02 Jun, 2015

2 commits

  • With the planned cgroup writeback support, backing-dev related
    declarations will be more widely used across block and cgroup;
    unfortunately, including backing-dev.h from include/linux/blkdev.h
    makes cyclic include dependency quite likely.

    This patch separates out backing-dev-defs.h which only has the
    essential definitions and updates blkdev.h to include it. C files
    which need access to more backing-dev details now include
    backing-dev.h directly. This takes backing-dev.h off the common
    include dependency chain, making it a lot easier to use across block
    and cgroup.

    v2: fs/fat build failure fixed.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cgroup aware writeback support will require exposing some of blkcg
    details. In preparation, move block/blk-cgroup.h to
    include/linux/blk-cgroup.h. This patch is pure file move.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

28 Apr, 2015

1 commit

  • Because of the peculiar way that md devices are created (automatically
    when the device node is opened), a new device can be created and
    registered immediately after the
    blk_unregister_region(disk_devt(disk), disk->minors);
    call in del_gendisk().

    Therefore it is important that all visible artifacts of the previous
    device are removed before this call. In particular, the 'bdi'.

    Since:
    commit c4db59d31e39ea067c32163ac961e9c80198fd37
    Author: Christoph Hellwig
    fs: don't reassign dirty inodes to default_backing_dev_info

    moved the
    device_unregister(bdi->dev);
    call from bdi_unregister() to bdi_destroy() it has been quite easy to
    lose a race and have a new (e.g.) "md127" be created after the
    blk_unregister_region() call and before bdi_destroy() is ultimately
    called by the final 'put_disk', which must come after del_gendisk().

    The new device finds that the bdi name is already registered in sysfs
    and complains

    > [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70()
    > [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127'

    We can fix this by moving the bdi_destroy() call out of
    blk_release_queue() (which can happen very late when a refcount
    reaches zero) and into blk_cleanup_queue() - which happens exactly when the md
    device driver calls it.

    Then it is only necessary for md to call blk_cleanup_queue() before
    del_gendisk(). As loop.c devices are also created on demand by
    opening the device node, we make the same change there.

    Fixes: c4db59d31e39ea067c32163ac961e9c80198fd37
    Reported-by: Azat Khuzhin
    Cc: Christoph Hellwig
    Cc: stable@vger.kernel.org (v4.0)
    Signed-off-by: NeilBrown
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    NeilBrown
     

30 Jan, 2015

1 commit

  • The kobject memory inside blk-mq hctx/ctx shouldn't have been freed
    before the kobject is released, because the driver core can access it
    freely before its release.

    We can't do that freeing in the ctx/hctx/mq_kobj release handlers,
    because they can be run before blk_cleanup_queue().

    Given mq_kobj shouldn't have been introduced, this patch simply moves
    mq's release into blk_release_queue().

    Reported-by: Sasha Levin
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

10 Dec, 2014

1 commit

  • blk-mq users are allowed to free the memory request_queue.tag_set
    points at after blk_cleanup_queue() has finished but before
    blk_release_queue() has started. This can happen e.g. in the SCSI
    core: the SCSI core embeds the tag_set structure in a SCSI host
    structure, and the SCSI host structure is freed by
    scsi_host_dev_release(). This function is called after
    blk_cleanup_queue() has finished but can be called before
    blk_release_queue().

    This means that it is not safe to access request_queue.tag_set from
    inside blk_release_queue(). Hence remove the blk_sync_queue() call
    from blk_release_queue(). This call is not necessary - outstanding
    requests must have finished before blk_release_queue() is
    called. Additionally, move the blk_mq_free_queue() call from
    blk_release_queue() to blk_cleanup_queue() to avoid that struct
    request_queue.tag_set gets accessed after it has been freed.

    This patch avoids that the following kernel oops can be triggered
    when deleting a SCSI host for which scsi-mq was enabled:

    Call Trace:
    [] lock_acquire+0xc4/0x270
    [] mutex_lock_nested+0x61/0x380
    [] blk_mq_free_queue+0x30/0x180
    [] blk_release_queue+0x84/0xd0
    [] kobject_cleanup+0x7b/0x1a0
    [] kobject_put+0x30/0x70
    [] blk_put_queue+0x15/0x20
    [] disk_release+0x99/0xd0
    [] device_release+0x36/0xb0
    [] kobject_cleanup+0x7b/0x1a0
    [] kobject_put+0x30/0x70
    [] put_disk+0x1a/0x20
    [] __blkdev_put+0x135/0x1b0
    [] blkdev_put+0x50/0x160
    [] kill_block_super+0x44/0x70
    [] deactivate_locked_super+0x44/0x60
    [] deactivate_super+0x4e/0x70
    [] cleanup_mnt+0x43/0x90
    [] __cleanup_mnt+0x12/0x20
    [] task_work_run+0xac/0xe0
    [] do_notify_resume+0x61/0xa0
    [] int_signal+0x12/0x17

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Robert Elliott
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Cc: # v3.13+
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

19 Oct, 2014

1 commit

  • Pull core block layer changes from Jens Axboe:
    "This is the core block IO pull request for 3.18. Apart from the new
    and improved flush machinery for blk-mq, this is all mostly bug fixes
    and cleanups.

    - blk-mq timeout updates and fixes from Christoph.

    - Removal of REQ_END, also from Christoph. We pass it through the
    ->queue_rq() hook for blk-mq instead, freeing up one of the request
    bits. The space was overly tight on 32-bit, so Martin also killed
    REQ_KERNEL since it's no longer used.

    - blk integrity updates and fixes from Martin and Gu Zheng.

    - Update to the flush machinery for blk-mq from Ming Lei. Now we
    have a per hardware context flush request, which both cleans up the
    code and should scale better for flush-intensive workloads on blk-mq.

    - Improve the error printing, from Rob Elliott.

    - Backing device improvements and cleanups from Tejun.

    - Fixup of a misplaced rq_complete() tracepoint from Hannes.

    - Make blk_get_request() return error pointers, fixing up issues
    where we NULL deref when a device goes bad or missing. From Joe
    Lawrence.

    - Prep work for drastically reducing the memory consumption of dm
    devices from Junichi Nomura. This allows creating clone bio sets
    without preallocating a lot of memory.

    - Fix a blk-mq hang on certain combinations of queue depths and
    hardware queues from me.

    - Limit memory consumption for blk-mq devices for crash dump
    scenarios and drivers that use crazy high depths (certain SCSI
    shared tag setups). We now just use a single queue and limited
    depth for that"

    * 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
    block: Remove REQ_KERNEL
    blk-mq: allocate cpumask on the home node
    bio-integrity: remove the needless fail handle of bip_slab creating
    block: include func name in __get_request prints
    block: make blk_update_request print prefix match ratelimited prefix
    blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
    block: fix alignment_offset math that assumes io_min is a power-of-2
    blk-mq: Make bt_clear_tag() easier to read
    blk-mq: fix potential hang if rolling wakeup depth is too high
    block: add bioset_create_nobvec()
    block: use bio_clone_fast() in blk_rq_prep_clone()
    block: misplaced rq_complete tracepoint
    sd: Honor block layer integrity handling flags
    block: Replace strnicmp with strncasecmp
    block: Add T10 Protection Information functions
    block: Don't merge requests if integrity flags differ
    block: Integrity checksum flag
    block: Relocate bio integrity flags
    block: Add a disk flag to block integrity profile
    block: Add prefix to block integrity profile flags
    ...

    Linus Torvalds
     

26 Sep, 2014

3 commits

  • This patch supports running a single flush machinery for each
    blk-mq dispatch queue, so that:

    - the current init_request and exit_request callbacks can
    cover flush requests too, so the buggy copying way of
    initializing a flush request's pdu can be fixed

    - flushing performance gets improved in the multi hw-queue case

    In an fio sync write test over virtio-blk (4 hw queues, ioengine=sync,
    iodepth=64, numjobs=4, bs=4K), throughput increases a lot in my test
    environment:
    - throughput: +70% in case of virtio-blk over null_blk
    - throughput: +30% in case of virtio-blk over SSD image

    The multi virtqueue feature isn't merged to QEMU yet, and patches for
    the feature can be found in below tree:

    git://kernel.ubuntu.com/ming/qemu.git v2.1.0-mq.4

    And simply passing 'num_queues=4 vectors=5' should be enough to
    enable the multi-queue (quad queue) feature for QEMU virtio-blk.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The mission of the two helpers is now over; just call
    blk_alloc_flush_queue() and blk_free_flush_queue() directly.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • These two temporary functions are introduced to hold flush
    initialization and de-initialization, so that we can introduce the
    'flush queue' more easily in the following patch. Once the 'flush
    queue' and its allocation/free functions are ready, these helpers will
    be removed for the sake of code readability.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

25 Sep, 2014

1 commit

  • blk-mq uses percpu_ref for its usage counter, which tracks the number
    of in-flight commands and is used to synchronously drain the queue on
    freeze. percpu_ref shutdown takes measurable wallclock time as it
    involves a sched RCU grace period. This means that draining a blk-mq
    queue takes measurable wallclock time. One would think that this
    shouldn't matter as queue shutdown should be a rare event which takes
    place asynchronously w.r.t. userland.

    Unfortunately, SCSI probing involves synchronously setting up and then
    tearing down a lot of request_queues back-to-back for non-existent
    LUNs. This means that SCSI probing may take above ten seconds when
    scsi-mq is used.

    [ 0.949892] scsi host0: Virtio SCSI HBA
    [ 1.007864] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
    [ 1.021299] scsi 0:0:1:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
    [ 1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz

    [ 16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0
    [ 16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0
    [ 16.194099] osd: LOADED open-osd 0.2.1
    [ 16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
    [ 16.208478] sd 0:0:0:0: [sda] Write Protect is off
    [ 16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    [ 16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
    [ 16.223264] sd 0:0:1:0: [sdb] Write Protect is off
    [ 16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

    This is also the reason why request_queues start in bypass mode which
    is ended on blk_register_queue() as shutting down a fully functional
    queue also involves a RCU grace period and the queues for non-existent
    SCSI devices never reach registration.

    blk-mq basically needs to do the same thing - start the mq in a
    degraded mode which is faster to shut down and then make it fully
    functional only after the queue reaches registration. percpu_ref
    recently grew facilities to force atomic operation until explicitly
    switched to percpu mode, which can be used for this purpose. This
    patch makes blk-mq initialize q->mq_usage_counter in atomic mode and
    switch it to percpu mode only once blk_register_queue() is reached.

    Note that this issue was previously worked around by 0a30288da1ae
    ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during
    probe") for v3.17. The temp fix was reverted in preparation of adding
    persistent atomic mode to percpu_ref by 9eca80461a45 ("Revert "blk-mq,
    percpu_ref: implement a kludge for SCSI blk-mq stall during probe"").
    This patch and the prerequisite percpu_ref changes will be merged
    during v3.18 devel cycle.

    Signed-off-by: Tejun Heo
    Reported-by: Christoph Hellwig
    Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
    Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count")
    Reviewed-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Johannes Weiner

    Tejun Heo
     

10 Sep, 2014

1 commit

  • When a queue is registered, the block layer turns off the bypass
    setting (because bypass is enabled when the queue is created). This
    doesn't work well for queues that are unregistered and then registered
    again; we get a WARNING because of the unbalanced calls to
    blk_queue_bypass_end().

    This patch fixes the problem by making blk_register_queue() call
    blk_queue_bypass_end() only the first time the queue is registered.

    Signed-off-by: Alan Stern
    Acked-by: Tejun Heo
    CC: James Bottomley
    CC: Jens Axboe
    Signed-off-by: Jens Axboe

    Alan Stern
     

02 Jul, 2014

1 commit

  • Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue()
    skip queue draining if bypass_depth was already above zero. The
    assumption is that the one which bumped the bypass_depth should have
    performed draining already; however, there's nothing which prevents a
    new instance of bypassing/freezing from starting before the previous
    one finishes draining. The current code may allow the later
    bypassing/freezing instances to complete while there still are
    in-flight requests which haven't finished draining.

    Fix it by draining regardless of bypass_depth. We still skip draining
    from blk_queue_bypass_start() while the queue is initializing to avoid
    introducing excessive delays during boot. INIT_DONE setting is moved
    above the initial blk_queue_bypass_end() so that bypassing attempts
    can't slip in between.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Nicholas A. Bellinger
    Signed-off-by: Jens Axboe

    Tejun Heo
     

21 May, 2014

1 commit

  • For request_fn based devices, the block layer exports a 'nr_requests'
    file through sysfs to allow adjusting the queue depth on the fly.
    Currently this returns -EINVAL for blk-mq, since it's not wired up.
    Wire it up for blk-mq, so that it now also allows dynamic adjustment
    of the allowed queue depth for any given block device managed by
    blk-mq.

    Signed-off-by: Jens Axboe

    Jens Axboe
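
    For illustration: with this wired up, the usual sysfs write works for
    blk-mq devices too (the device name and depth are placeholders):

        /* set_queue_depth.c -- illustrative sketch only */
        #include <stdio.h>

        int main(void)
        {
            FILE *f = fopen("/sys/block/sda/queue/nr_requests", "w");

            if (!f) {
                perror("nr_requests");
                return 1;
            }
            /* Previously this write failed with EINVAL on blk-mq queues. */
            fputs("256\n", f);
            fclose(f);
            return 0;
        }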
     

11 Feb, 2014

1 commit

  • Switch to using a preallocated flush_rq for blk-mq similar to what's done
    with the old request path. This allows us to set up the request properly
    with a tag from the actually allowed range and ->rq_disk as needed by
    some drivers. To make life easier we also switch to dynamic allocation
    of ->flush_rq for the old path.

    This effectively reverts most of

    "blk-mq: fix for flush deadlock"

    and

    "blk-mq: Don't reserve a tag for flush request"

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

25 Oct, 2013

1 commit

  • Linux currently has two models for block devices:

    - The classic request_fn based approach, where drivers use struct
    request units for IO. The block layer provides various helper
    functionalities to let drivers share code, things like tag
    management, timeout handling, queueing, etc.

    - The "stacked" approach, where a driver squeezes in between the
    block layer and IO submitter. Since this bypasses the IO stack,
    drivers generally have to manage everything themselves.

    With drivers being written for new high IOPS devices, the classic
    request_fn based driver doesn't work well enough. The design dates
    back to when both SMP and high IOPS were rare. It has problems with
    scaling to bigger machines, and runs into scaling issues even on
    smaller machines when you have IOPS in the hundreds of thousands
    per device.

    The stacked approach is then most often selected as the model
    for the driver. But this means that everybody has to re-invent
    everything, and along with that we get all the problems again
    that the shared approach solved.

    This commit introduces blk-mq, block multi queue support. The
    design is centered around per-cpu queues for queueing IO, which
    then funnel down into x number of hardware submission queues.
    We might have a 1:1 mapping between the two, or it might be
    an N:M mapping. That all depends on what the hardware supports.

    blk-mq provides various helper functions, which include:

    - Scalable support for request tagging. Most devices need to
    be able to uniquely identify a request both in the driver and
    to the hardware. The tagging uses per-cpu caches for freed
    tags, to enable cache hot reuse.

    - Timeout handling without tracking requests on a per-device
    basis. Basically the driver should be able to get a notification
    if a request happens to fail.

    - Optional support for non 1:1 mappings between issue and
    submission queues. blk-mq can redirect IO completions to the
    desired location.

    - Support for per-request payloads. Drivers almost always need
    to associate a request structure with some driver private
    command structure. Drivers can tell blk-mq this at init time,
    and then any request handed to the driver will have the
    required size of memory associated with it.

    - Support for merging of IO, and plugging. The stacked model
    gets neither of these. Even for high IOPS devices, merging
    sequential IO reduces per-command overhead and thus
    increases bandwidth.

    For now, this is provided as a potential 3rd queueing model, with
    the hope being that, as it matures, it can replace both the classic
    and stacked model. That would get us back to having just 1 real
    model for block devices, leaving the stacked approach to dm/md
    devices (as it was originally intended).

    Contributions in this patch from the following people:

    Shaohua Li
    Alexander Gordeev
    Christoph Hellwig
    Mike Christie
    Matias Bjorling
    Jeff Moyer

    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

04 Apr, 2013

1 commit

  • As found by gcc-4.8, the QUEUE_SYSFS_BIT_FNS macro creates functions
    that use a value generated by queue_var_store independent of whether
    that value was set or not.

    block/blk-sysfs.c: In function 'queue_store_nonrot':
    block/blk-sysfs.c:244:385: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized]

    Unlike most other such warnings, this one is not a false positive:
    writing any non-number string into these sysfs files indeed has an
    undefined result, rather than returning an error.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe

    Arnd Bergmann
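
    A sketch of the corrected pattern, simplified from the store side of
    that macro in block/blk-sysfs.c: check queue_var_store()'s return
    value before using the parsed value.

        static ssize_t queue_store_nonrot(struct request_queue *q,
                                          const char *page, size_t count)
        {
            unsigned long val;
            ssize_t ret = queue_var_store(&val, page, count);

            if (ret < 0)        /* non-numeric input: propagate the error */
                return ret;

            if (val)
                queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
            else
                queue_flag_clear_unlocked(QUEUE_FLAG_NONROT, q);
            return ret;
        }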
     

06 Dec, 2012

1 commit

  • QUEUE_FLAG_DEAD is used to indicate that queuing new requests must
    stop. After this flag has been set queue draining starts. However,
    during the queue draining phase it is still safe to invoke the
    queue's request_fn, so QUEUE_FLAG_DYING is a better name for this
    flag.

    This patch has been generated by running the following command
    over the kernel source tree:

    git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' |
    xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g' \
    -e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g'; \
    sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" \
    include/linux/blkdev.h; \
    sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' \
    -e 's/Dead queue/A dying queue/' block/blk-core.c

    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Jens Axboe
    Cc: Chanho Min
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

21 Sep, 2012

1 commit

  • …_init_allocated_queue()

    b82d4b197c ("blkcg: make request_queue bypassing on allocation") made
    request_queues bypassed on allocation to avoid switching on and off
    bypass mode on a queue being initialized. Some drivers allocate and
    then destroy a lot of queues without fully initializing them, and
    incurring bypass latency overhead on each of them could add up to
    significant overhead.

    Unfortunately, blk_init_allocated_queue() is never used by queues of
    bio-based drivers, which means that all bio-based driver queues are in
    bypass mode even after initialization and registration complete
    successfully.

    Due to the limited way request_queues are used by bio drivers, this
    problem is hidden pretty well but it shows up when blk-throttle is
    used in combination with a bio-based driver. Trying to configure
    (echoing to cgroupfs file) blk-throttle for a bio-based driver hangs
    indefinitely in blkg_conf_prep() waiting for bypass mode to end.

    This patch moves the initial blk_queue_bypass_end() call from
    blk_init_allocated_queue() to blk_register_queue() which is called for
    any userland-visible queues regardless of its type.

    I believe this is correct because I don't think there is any block
    driver which needs or wants working elevator and blk-cgroup on a queue
    which isn't visible to userland. If there are such users, we need a
    different solution.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Joseph Glanville <joseph.glanville@orionvm.com.au>
    Cc: stable@vger.kernel.org
    Acked-by: Vivek Goyal <vgoyal@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

    Tejun Heo
     

20 Sep, 2012

1 commit

  • The WRITE SAME command supported on some SCSI devices allows the same
    block to be efficiently replicated throughout a block range. Only a
    single logical block is transferred from the host and the storage device
    writes the same data to all blocks described by the I/O.

    This patch implements support for WRITE SAME in the block layer. The
    blkdev_issue_write_same() function can be used by filesystems and block
    drivers to replicate a buffer across a block range. This can be used to
    efficiently initialize software RAID devices, etc.

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
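
    A minimal sketch of an in-kernel caller of the new helper as described
    above; the wrapper name and its use of ZERO_PAGE are illustrative
    only:

        /* Replicate one logical block of zeroes across a whole range;
         * only that single block is transferred to the device. */
        static int zero_region(struct block_device *bdev, sector_t start,
                               sector_t nr_sects)
        {
            return blkdev_issue_write_same(bdev, start, nr_sects,
                                           GFP_KERNEL, ZERO_PAGE(0));
        }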
     

09 Sep, 2012

1 commit

  • Instead of using simple_strtoul which "converts" invalid numbers to 0,
    use strict_strtoul and perform error checking to ensure that userspace
    passes us a valid unsigned long. This addresses problems with functions
    such as writev, which might want to write a trailing newline -- the
    newline should rightfully be rejected, but the value preceding it
    should be preserved.

    Fixes BZ#46981.

    Signed-off-by: Dave Reisner
    Signed-off-by: Jens Axboe

    Dave Reisner
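
    A user-space analogue of the stricter parsing (the kernel side uses
    strict_strtoul): accept a decimal value with at most a trailing
    newline, and reject everything else instead of silently reading it
    as 0:

        #include <errno.h>
        #include <stdio.h>
        #include <stdlib.h>

        static int parse_ul(const char *s, unsigned long *out)
        {
            char *end;

            errno = 0;
            *out = strtoul(s, &end, 10);
            if (errno || end == s)
                return -EINVAL;
            if (*end == '\n')
                end++;
            return *end ? -EINVAL : 0;
        }

        int main(void)
        {
            unsigned long v;

            printf("%d\n", parse_ul("128\n", &v));   /* 0   -> accepted */
            printf("%d\n", parse_ul("cat\n", &v));   /* -22 -> rejected */
            return 0;
        }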
     

27 Jun, 2012

1 commit

  • Currently, request_queue has one request_list to allocate requests
    from regardless of blkcg of the IO being issued. When the unified
    request pool is used up, cfq proportional IO limits become meaningless
    - whoever grabs the next request being freed wins the race regardless
    of the configured weights.

    This can be easily demonstrated by creating a blkio cgroup w/ very low
    weight, put a program which can issue a lot of random direct IOs there
    and running a sequential IO from a different cgroup. As soon as the
    request pool is used up, the sequential IO bandwidth crashes.

    This patch implements per-blkg request_list. Each blkg has its own
    request_list and any IO allocates its request from the matching blkg
    making blkcgs completely isolated in terms of request allocation.

    * Root blkcg uses the request_list embedded in each request_queue,
    which was renamed to @q->root_rl from @q->rq. While making blkcg rl
    handling a bit hairier, this enables avoiding most overhead for root
    blkcg.

    * Queue fullness is properly per request_list but bdi isn't blkcg
    aware yet, so congestion state currently just follows the root
    blkcg. As writeback isn't aware of blkcg yet, this works okay for
    async congestion but readahead may get the wrong signals. It's
    better than blkcg completely collapsing with shared request_list but
    needs to be improved with future changes.

    * After this change, each block cgroup gets a full request pool making
    resource consumption of each cgroup higher. This makes allowing
    non-root users to create cgroups less desirable; however, note that
    allowing non-root users to directly manage cgroups is already
    severely broken regardless of this patch - each block cgroup
    consumes kernel memory and skews IO weight (IO weights are not
    hierarchical).

    v2: queue-sysfs.txt updated and patch description updated as suggested
    by Vivek.

    v3: blk_get_rl() wasn't checking error return from
    blkg_lookup_create() and may cause oops on lookup failure. Fix it
    by falling back to root_rl on blkg lookup failures. This problem
    was spotted by Rakesh Iyer.

    v4: Updated to accomodate 458f27a982 "block: Avoid missed wakeup in
    request waitqueue". blk_drain_queue() now wakes up waiters on all
    blkg->rl on the target queue.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Cc: Wu Fengguang
    Signed-off-by: Jens Axboe

    Tejun Heo