21 Jul, 2016

1 commit

  • The target SCSI passthrough backend is much better served by the low-level
    blk_rq_append_bio construct than the helpers built on top of it, so export
    it.

    Also take the opportunity to remove the pointless request_queue argument
    and make the code flow a little more readable.
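
    As a hedged sketch only (assuming the post-patch signature
    int blk_rq_append_bio(struct request *rq, struct bio *bio); the real SCSI
    target callers are not reproduced here), a passthrough backend might append
    a prepared bio like this:

    /* hedged sketch, not the actual target code */
    static int passthrough_add_bio(struct request *rq, struct bio *bio)
    {
            int ret;

            /* low-level append; no bounce/copy helpers involved */
            ret = blk_rq_append_bio(rq, bio);
            if (ret)
                    return ret;     /* e.g. the bio does not fit the request limits */

            return 0;
    }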

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

23 Dec, 2015

1 commit

  • Timer context is not very useful for drivers to perform any meaningful
    abort action from. So instead of calling the driver from this useless
    context, defer the timeout handling to a workqueue as soon as possible.

    Note that while a delayed_work item would seem the right thing here I didn't
    dare to use it due to the magic in blk_add_timer that pokes deep into timer
    internals. But maybe this encourages Tejun to add a sensible API for that to
    the workqueue API and we'll all be fine in the end :)

    Contains a major update from Keith Busch:

    "This patch removes synchronizing the timeout work so that the timer can
    start a freeze on its own queue. The timer enters the queue, so timer
    context can only start a freeze, but not wait for frozen."
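
    A minimal sketch of the deferral described above (names approximate; the
    timer handler just kicks a work item so the real timeout handling runs in
    process context):

    /* hedged sketch of timer -> workqueue deferral */
    static void blk_rq_timed_out_timer(unsigned long data)
    {
            struct request_queue *q = (struct request_queue *)data;

            /* defer the actual per-request timeout handling out of timer context */
            kblockd_schedule_work(&q->timeout_work);
    }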

    Signed-off-by: Christoph Hellwig
    Acked-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

20 Nov, 2015

1 commit

  • Fix use after free crashes like the following:

    general protection fault: 0000 [#1] SMP
    Call Trace:
    [] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem]
    [] pmem_rw_page+0x42/0x80 [nd_pmem]
    [] bdev_read_page+0x50/0x60
    [] do_mpage_readpage+0x510/0x770
    [] ? I_BDEV+0x20/0x20
    [] ? lru_cache_add+0x1c/0x50
    [] mpage_readpages+0x107/0x170
    [] ? I_BDEV+0x20/0x20
    [] ? I_BDEV+0x20/0x20
    [] blkdev_readpages+0x1d/0x20
    [] __do_page_cache_readahead+0x28f/0x310
    [] ? __do_page_cache_readahead+0x169/0x310
    [] ? pagecache_get_page+0x2d/0x1d0
    [] filemap_fault+0x396/0x530
    [] __do_fault+0x4e/0xf0
    [] handle_mm_fault+0x11bd/0x1b50

    Cc:
    Cc: Jens Axboe
    Cc: Alexander Viro
    Reported-by: kbuild test robot
    Acked-by: Matthew Wilcox
    [willy: symmetry fixups]
    Signed-off-by: Dan Williams

    Dan Williams
     

05 Nov, 2015

1 commit

  • Pull block integrity updates from Jens Axboe:
    ""This is the joint work of Dan and Martin, cleaning up and improving
    the support for block data integrity"

    * 'for-4.4/integrity' of git://git.kernel.dk/linux-block:
    block, libnvdimm, nvme: provide a built-in blk_integrity nop profile
    block: blk_flush_integrity() for bio-based drivers
    block: move blk_integrity to request_queue
    block: generic request_queue reference counting
    nvme: suspend i/o during runtime blk_integrity_unregister
    md: suspend i/o during runtime blk_integrity_unregister
    md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
    block: Inline blk_integrity in struct gendisk
    block: Export integrity data interval size in sysfs
    block: Reduce the size of struct blk_integrity
    block: Consolidate static integrity profile properties
    block: Move integrity kobject to struct gendisk

    Linus Torvalds
     

22 Oct, 2015

3 commits

  • Request queues with merging disabled will not flush the plug list after
    BLK_MAX_REQUEST_COUNT requests have been queued, since the code relies
    on blk_attempt_plug_merge to compute the request_count. Fix this by
    computing the number of queued requests even for nomerge queues.
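
    A hedged sketch of the counting described above (a simplified,
    non-mq-aware version; the plug-list walk is the point):

    /* hedged sketch: count requests already plugged for this queue */
    static unsigned int plug_queued_count(struct request_queue *q)
    {
            struct blk_plug *plug = current->plug;
            struct request *rq;
            unsigned int count = 0;

            if (!plug)
                    return 0;

            list_for_each_entry(rq, &plug->list, queuelist)
                    if (rq->q == q)
                            count++;

            return count;
    }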

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     
  • Since they lack requests to pin the request_queue active, synchronous
    bio-based drivers may have in-flight integrity work from
    bio_integrity_endio() that is not flushed by blk_freeze_queue(). Flush
    that work to prevent races to free the queue and the final usage of the
    blk_integrity profile.

    This is temporary unless/until bio-based drivers start to generically
    take a q_usage_counter reference while a bio is in-flight.
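
    The flush itself can be as small as draining the integrity workqueue; a
    hedged sketch, assuming the bio-integrity completion work is queued on the
    kintegrityd workqueue:

    /* hedged sketch */
    void blk_flush_integrity(void)
    {
            /* wait for any pending bio_integrity_endio() work to finish */
            flush_workqueue(kintegrityd_wq);
    }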

    Cc: Martin K. Petersen
    [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case]
    Tested-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     
  • Allow pmem, and other synchronous/bio-based block drivers, to fallback
    on a per-cpu reference count managed by the core for tracking queue
    live/dead state.

    The existing per-cpu reference count for the blk_mq case is promoted to
    be used in all block i/o scenarios. This involves initializing it by
    default, waiting for it to drop to zero at exit, and holding a live
    reference over the invocation of q->make_request_fn() in
    generic_make_request(). The blk_mq code continues to take its own
    reference per blk_mq request and retains the ability to freeze the
    queue, but the check that the queue is frozen is moved to
    generic_make_request().
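
    A hedged, simplified sketch of the live-reference pattern around
    ->make_request_fn() (the real generic_make_request() path can also wait for
    a frozen queue to thaw instead of failing the bio):

    /* hedged sketch */
    if (likely(percpu_ref_tryget_live(&q->q_usage_counter))) {
            q->make_request_fn(q, bio);
            percpu_ref_put(&q->q_usage_counter);
    } else {
            bio_io_error(bio);      /* queue is frozen or going away */
    }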

    This fixes crash signatures like the following:

    BUG: unable to handle kernel paging request at ffff880140000000
    [..]
    Call Trace:
    [] ? copy_user_handle_tail+0x5f/0x70
    [] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
    [] pmem_make_request+0xd1/0x200 [nd_pmem]
    [] ? mempool_alloc+0x72/0x1a0
    [] generic_make_request+0xd6/0x110
    [] submit_bio+0x76/0x170
    [] submit_bh_wbc+0x12f/0x160
    [] submit_bh+0x12/0x20
    [] jbd2_write_superblock+0x8d/0x170
    [] jbd2_mark_journal_empty+0x5d/0x90
    [] jbd2_journal_destroy+0x24b/0x270
    [] ? put_pwq_unlocked+0x2a/0x30
    [] ? destroy_workqueue+0x225/0x250
    [] ext4_put_super+0x64/0x360
    [] generic_shutdown_super+0x6a/0xf0

    Cc: Jens Axboe
    Cc: Keith Busch
    Cc: Ross Zwisler
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Tested-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     

11 Sep, 2015

1 commit

  • Pull blk-cg updates from Jens Axboe:
    "A bit later in the cycle, but this has been in the block tree for a a
    while. This is basically four patchsets from Tejun, that improve our
    buffered cgroup writeback. It was dependent on the other cgroup
    changes, but they went in earlier in this cycle.

    Series 1 is set of 5 patches that has cgroup writeback updates:

    - bdi_writeback iteration fix which could lead to some wb's being
    skipped or repeated during e.g. sync under memory pressure.

    - Simplification of wb work wait mechanism.

    - Writeback tracepoints updated to report cgroup.

    Series 2 is a set of updates for the CFQ cgroup writeback handling:

    cfq has always charged all async IOs to the root cgroup. It didn't
    have much choice as writeback didn't know about cgroups and there
    was no way to tell who to blame for a given writeback IO.
    writeback finally grew support for cgroups and now tags each
    writeback IO with the appropriate cgroup to charge it against.

    This patchset updates cfq so that it follows the blkcg each bio is
    tagged with. Async cfq_queues are now shared across cfq_group,
    which is per-cgroup, instead of per-request_queue cfq_data. This
    makes all IOs follow the weight based IO resource distribution
    implemented by cfq.

    - Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff.

    - Other misc review points addressed, acks added and rebased.

    Series 3 is the blkcg policy cleanup patches:

    This patchset contains assorted cleanups for blkcg_policy methods
    and blk[c]g_policy_data handling.

    - alloc/free added for blkg_policy_data. exit dropped.

    - alloc/free added for blkcg_policy_data.

    - blk-throttle's async percpu allocation is replaced with direct
    allocation.

    - all methods now take blk[c]g_policy_data instead of blkcg_gq or
    blkcg.

    And finally, series 4 is a set of patches cleaning up the blkcg stats
    handling:

    blkcg's stats have always been somewhat of a mess. This patchset
    tries to improve the situation a bit.

    - Patches were added to consolidate the blkcg entry point and blkg
    creation. This in itself is an improvement and helps with collecting
    common stats on bio issue.

    - per-blkg stats are now accounted on bio issue rather than request
    completion so that bio-based and request-based drivers can behave
    the same way. The issue was spotted by Vivek.

    - cfq-iosched implements custom recursive stats and blk-throttle
    implements custom per-cpu stats. This patchset makes blkcg core
    support both by default.

    - cfq-iosched and blk-throttle keep track of the same stats
    multiple times. Unify them"

    * 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits)
    blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
    blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
    blkcg: implement interface for the unified hierarchy
    blkcg: misc preparations for unified hierarchy interface
    blkcg: separate out tg_conf_updated() from tg_set_conf()
    blkcg: move body parsing from blkg_conf_prep() to its callers
    blkcg: mark existing cftypes as legacy
    blkcg: rename subsystem name from blkio to io
    blkcg: refine error codes returned during blkcg configuration
    blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
    blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
    blkcg: remove cfqg_stats->sectors
    blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
    blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
    blkcg: make blkcg_[rw]stat per-cpu
    blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
    blkcg: consolidate blkg creation in blkcg_bio_issue_check()
    blk-throttle: improve queue bypass handling
    blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
    blkcg: inline [__]blkg_lookup()
    ...

    Linus Torvalds
     

19 Aug, 2015

1 commit

  • blkg (blkcg_gq) is currently created by blkcg policies invoking
    blkg_lookup_create(), which ends up repeating about the same code in
    different policies. Theoretically, this can avoid the overhead of
    looking up and/or creating blkg's if blkcg is enabled but no policy is
    in use; however, the cost of blkg lookup / creation is very low,
    especially if only the root blkcg is in use, which is highly likely if
    no blkcg policy is in active use - it boils down to a single, very
    predictable conditional and the surrounding RCU protection.

    This patch consolidates blkg creation to a new function
    blkcg_bio_issue_check() which is called during bio issue from
    generic_make_request_checks(). blkcg_bio_issue_check() is now the
    only function which tries to create missing blkg's. The subsequent
    policy and request_list operations just perform blkg_lookup() and if
    missing falls back to the root.

    * blk_get_rl() no longer tries to create blkg. It uses blkg_lookup()
    instead of blkg_lookup_create().

    * blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
    read locked and blkg already looked up. Both throtl_lookup_tg() and
    throtl_lookup_create_tg() are dropped.

    * cfq is similarly updated. cfq_lookup_create_cfqg() is replaced with
    cfq_lookup_cfqg(), which uses blkg_lookup().

    This consolidates blkg handling and avoids unnecessary blkg creation
    retries under memory pressure. In addition, this provides a common
    bio entry point into blkcg where things like common accounting can be
    performed.

    v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
    !CONFIG_BLK_DEV_THROTTLING.
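
    A hedged sketch of the single entry point described above (the helper
    returns whether the bio should proceed; details simplified):

    /* hedged sketch of the call site in generic_make_request_checks() */
    if (!blkcg_bio_issue_check(q, bio))
            return false;   /* bio was throttled or otherwise consumed by blkcg */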

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Aug, 2015

1 commit

  • Inside the timeout handler, blk_mq_tag_to_rq() is called to retrieve the
    request for a given tag. This is obviously wrong because the request can
    be freed at any time and some fields of the request can't be trusted, so
    a kernel oops might be triggered[1].

    Currently, wrt. blk_mq_tag_to_rq(), the only special case is that the
    flush request can share the same tag as the request it was cloned from,
    and the two requests can't be active at the same time. So this patch
    fixes the above issue by updating tags->rqs[tag] with the tag's active
    request (either the flush rq or the request it was cloned from).

    Also, blk_mq_tag_to_rq() gets much simplified by this patch.

    Since blk_mq_tag_to_rq() is mainly for drivers and the caller must make
    sure the request can't be freed, the helper is replaced with
    tags->rqs[tag] in bt_for_each().
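
    With tags->rqs[tag] always pointing at the active request, the simplified
    lookup is roughly just (hedged sketch):

    /* hedged sketch */
    struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
    {
            return tags->rqs[tag];
    }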

    [1] kernel oops log
    [ 439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158^M
    [ 439.697162] IP: [] blk_mq_tag_to_rq+0x21/0x6e^M
    [ 439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0 ^M
    [ 439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M
    [ 439.700653] Dumping ftrace buffer:^M
    [ 439.700653] (ftrace buffer empty)^M
    [ 439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M
    [ 439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265^M
    [ 439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M
    [ 439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000^M
    [ 439.730500] RIP: 0010:[] [] blk_mq_tag_to_rq+0x21/0x6e^M
    [ 439.730500] RSP: 0018:ffff880819203da0 EFLAGS: 00010283^M
    [ 439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002^M
    [ 439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000^M
    [ 439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000^M
    [ 439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202^M
    [ 439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00^M
    [ 439.730500] FS: 00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000^M
    [ 439.730500] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M
    [ 439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0^M
    [ 439.730500] Stack:^M
    [ 439.730500] 0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104^M
    [ 439.755663] ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80^M
    [ 439.755663] Call Trace:^M
    [ 439.755663] ^M
    [ 439.755663] [] bt_for_each+0x6e/0xc8^M
    [ 439.755663] [] ? blk_mq_rq_timed_out+0x6a/0x6a^M
    [ 439.755663] [] ? blk_mq_rq_timed_out+0x6a/0x6a^M
    [ 439.755663] [] blk_mq_tag_busy_iter+0x55/0x5e^M
    [ 439.755663] [] ? blk_mq_bio_to_request+0x38/0x38^M
    [ 439.755663] [] blk_mq_rq_timer+0x5d/0xd4^M
    [ 439.755663] [] call_timer_fn+0xf7/0x284^M
    [ 439.755663] [] ? call_timer_fn+0x5/0x284^M
    [ 439.755663] [] ? blk_mq_bio_to_request+0x38/0x38^M
    [ 439.755663] [] run_timer_softirq+0x1ce/0x1f8^M
    [ 439.755663] [] __do_softirq+0x181/0x3a4^M
    [ 439.755663] [] irq_exit+0x40/0x94^M
    [ 439.755663] [] smp_apic_timer_interrupt+0x33/0x3e^M
    [ 439.755663] [] apic_timer_interrupt+0x84/0x90^M
    [ 439.755663] ^M
    [ 439.755663] [] ? _raw_spin_unlock_irq+0x32/0x4a^M
    [ 439.755663] [] finish_task_switch+0xe0/0x163^M
    [ 439.755663] [] ? finish_task_switch+0xa2/0x163^M
    [ 439.755663] [] __schedule+0x469/0x6cd^M
    [ 439.755663] [] schedule+0x82/0x9a^M
    [ 439.789267] [] signalfd_read+0x186/0x49a^M
    [ 439.790911] [] ? wake_up_q+0x47/0x47^M
    [ 439.790911] [] __vfs_read+0x28/0x9f^M
    [ 439.790911] [] ? __fget_light+0x4d/0x74^M
    [ 439.790911] [] vfs_read+0x7a/0xc6^M
    [ 439.790911] [] SyS_read+0x49/0x7f^M
    [ 439.790911] [] entry_SYSCALL_64_fastpath+0x12/0x6f^M
    [ 439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
    f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
    53 38 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
    ^M
    [ 439.790911] RIP [] blk_mq_tag_to_rq+0x21/0x6e^M
    [ 439.790911] RSP ^M
    [ 439.790911] CR2: 0000000000000158^M
    [ 439.790911] ---[ end trace d40af58949325661 ]---^M

    Cc:
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

09 May, 2015

1 commit

  • The last patch makes plugging work for the multiple-queue case. However,
    it only works for the single-disk case, because it assumes there is only
    one request in the plug list. If a task is accessing multiple disks, e.g.
    MD/DM, the assumption is wrong. Let blk_attempt_plug_merge() record a
    request from the same queue.

    V2: use a NULL parameter in the !mq case. Fix a bug. Add comments in
    blk_attempt_plug_merge to make it (hopefully) less confusing.

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

06 May, 2015

1 commit


26 Sep, 2014

5 commits

  • This patch supports running a single flush machinery for each blk-mq
    dispatch queue, so that:

    - the current init_request and exit_request callbacks can cover the
    flush request too, so the buggy copying way of initializing the flush
    request's pdu can be fixed

    - flushing performance is improved in the multi hw-queue case

    In an fio sync write test over virtio-blk (4 hw queues, ioengine=sync,
    iodepth=64, numjobs=4, bs=4K), throughput increases a lot in my test
    environment:
    - throughput: +70% in case of virtio-blk over null_blk
    - throughput: +30% in case of virtio-blk over an SSD image

    The multi virtqueue feature isn't merged into QEMU yet; patches for
    the feature can be found in the tree below:

    git://kernel.ubuntu.com/ming/qemu.git v2.1.0-mq.4

    Simply passing 'num_queues=4 vectors=5' should be enough to enable the
    multi-queue (quad queue) feature for QEMU virtio-blk.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This patch adds a 'blk_mq_ctx' parameter to blk_get_flush_queue(), so
    that this function can find the corresponding blk_flush_queue bound to
    the current mq context, since the flush queue will become per hw-queue.

    For legacy queue, the parameter can be simply 'NULL'.

    For multiqueue case, the parameter should be set as the context
    from which the related request is originated. With this context
    info, the hw queue and related flush queue can be found easily.
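
    A hedged sketch of the lookup (assuming per-hw-queue flush queues and the
    map_queue hook of that era's blk_mq_ops):

    /* hedged sketch */
    static struct blk_flush_queue *
    blk_get_flush_queue(struct request_queue *q, struct blk_mq_ctx *ctx)
    {
            struct blk_mq_hw_ctx *hctx;

            if (!q->mq_ops)
                    return q->fq;           /* legacy path: one flush queue per queue */

            hctx = q->mq_ops->map_queue(q, ctx->cpu);
            return hctx->fq;                /* per hw-queue flush queue */
    }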

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Now the mission of the two helpers is over, so just call
    blk_alloc_flush_queue() and blk_free_flush_queue() directly.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This patch introduces 'struct blk_flush_queue' and puts all
    flush machinery related fields into this structure, so that

    - flush implementation details aren't exposed to driver
    - it is easy to convert to per dispatch-queue flush machinery

    This patch is basically a mechanical replacement.
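
    For orientation, a hedged sketch of roughly what such a structure gathers
    (field list approximate, not authoritative):

    /* hedged sketch -- approximate layout */
    struct blk_flush_queue {
            unsigned int            flush_pending_idx:1;
            unsigned int            flush_running_idx:1;
            unsigned long           flush_pending_since;
            struct list_head        flush_queue[2];         /* pending / running */
            struct list_head        flush_data_in_flight;
            struct request          *flush_rq;              /* preallocated flush request */
            spinlock_t              mq_flush_lock;
    };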

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • These two temporary functions are introduced to hold flush initialization
    and de-initialization, so that we can introduce the 'flush queue' more
    easily in the following patch. Once the 'flush queue' and its
    allocation/free functions are ready, these helpers will be removed for
    the sake of code readability.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

23 Sep, 2014

1 commit


12 Jun, 2014

1 commit


21 May, 2014

1 commit

  • For request_fn based devices, the block layer exports a 'nr_requests'
    file through sysfs to allow adjusting the queue depth on the fly.
    Currently this returns -EINVAL for blk-mq, since it's not wired up.
    Wire this up for blk-mq, so that it now also allows dynamic adjustment
    of the allowed queue depth for any given block device managed by blk-mq.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

14 May, 2014

1 commit

  • This adds support for active queue tracking, meaning that the
    blk-mq tagging maintains a count of active users of a tag set.
    This allows us to maintain a notion of fairness between users,
    so that we can distribute the tag depth evenly without starving
    some users while allowing others to try unfair deep queues.

    If sharing of a tag set is detected, each hardware queue will
    track the depth of its own queue. And if this exceeds the total
    depth divided by the number of active queues, the user is actively
    throttled down.

    The active queue count is done lazily to avoid bouncing that data
    between submitter and completer. Each hardware queue gets marked
    active when it allocates its first tag, and gets marked inactive
    when 1) the last tag is cleared, and 2) the queue timeout grace
    period has passed.
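
    A hedged, simplified sketch of the fairness check described above (field
    names follow that era's blk-mq; the real code also special-cases tiny
    depths and reserved tags):

    /* hedged sketch: may this hw queue take another tag? */
    static bool hctx_may_queue(struct blk_mq_hw_ctx *hctx, struct blk_mq_tags *tags)
    {
            unsigned int depth, users;

            if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
                    return true;            /* queue not marked active on a shared set */

            users = atomic_read(&tags->active_queues);
            if (!users)
                    return true;

            /* allow at least one tag, distribute the rest evenly */
            depth = max(1U, tags->nr_tags / users);
            return atomic_read(&hctx->nr_active) < depth;
    }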

    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Apr, 2014

1 commit

  • If a requeue event races with a timeout, we can get into a situation
    where we attempt to complete a request from the timeout handler when it
    is no longer started. This causes a crash. So have the timeout handler
    check that REQ_ATOM_STARTED is still set on the request - if not, ignore
    the event. If this happens, the request has now been marked as complete.
    As a consequence, we need to ensure that REQ_ATOM_COMPLETE is cleared in
    blk_mq_start_request(), so as to maintain proper request state.
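
    A hedged sketch of the check added in the timeout path (names from that
    era's request atomic flags):

    /* hedged sketch inside the timeout handler */
    if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
            return;         /* raced with requeue; ignore the timeout event */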

    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Apr, 2014

1 commit

  • Martin reported that his test system would not boot with
    current git, it oopsed with this:

    BUG: unable to handle kernel paging request at ffff88046c6c9e80
    IP: [] blk_queue_start_tag+0x90/0x150
    PGD 1ddf067 PUD 1de2067 PMD 47fc7d067 PTE 800000046c6c9060
    Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
    Modules linked in: sd_mod lpfc(+) scsi_transport_fc scsi_tgt oracleasm
    rpcsec_gss_krb5 ipv6 igb dca i2c_algo_bit i2c_core hwmon
    CPU: 3 PID: 87 Comm: kworker/u17:1 Not tainted 3.14.0+ #246
    Hardware name: Supermicro X9DRX+-F/X9DRX+-F, BIOS 3.00 07/09/2013
    Workqueue: events_unbound async_run_entry_fn
    task: ffff8802743c2150 ti: ffff880273d02000 task.ti: ffff880273d02000
    RIP: 0010:[] []
    blk_queue_start_tag+0x90/0x150
    RSP: 0018:ffff880273d03a58 EFLAGS: 00010092
    RAX: ffff88046c6c9e78 RBX: ffff880077208e78 RCX: 00000000fffc8da6
    RDX: 00000000fffc186d RSI: 0000000000000009 RDI: 00000000fffc8d9d
    RBP: ffff880273d03a88 R08: 0000000000000001 R09: ffff8800021c2410
    R10: 0000000000000005 R11: 0000000000015b30 R12: ffff88046c5bb8a0
    R13: ffff88046c5c0890 R14: 000000000000001e R15: 000000000000001e
    FS: 0000000000000000(0000) GS:ffff880277b00000(0000)
    knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff88046c6c9e80 CR3: 00000000018f6000 CR4: 00000000000407e0
    Stack:
    ffff880273d03a98 ffff880474b18800 0000000000000000 ffff880474157000
    ffff88046c5c0890 ffff880077208e78 ffff880273d03ae8 ffffffff813b9e62
    ffff880200000010 ffff880474b18968 ffff880474b18848 ffff88046c5c0cd8
    Call Trace:
    [] scsi_request_fn+0xf2/0x510
    [] __blk_run_queue+0x37/0x50
    [] blk_execute_rq_nowait+0xb3/0x130
    [] blk_execute_rq+0x64/0xf0
    [] ? bit_waitqueue+0xd0/0xd0
    [] scsi_execute+0xe5/0x180
    [] scsi_execute_req_flags+0x9a/0x110
    [] sd_spinup_disk+0x94/0x460 [sd_mod]
    [] ? __unmap_hugepage_range+0x200/0x2f0
    [] sd_revalidate_disk+0xaa/0x3f0 [sd_mod]
    [] sd_probe_async+0xd8/0x200 [sd_mod]
    [] async_run_entry_fn+0x3f/0x140
    [] process_one_work+0x175/0x410
    [] worker_thread+0x123/0x400
    [] ? manage_workers+0x160/0x160
    [] kthread+0xce/0xf0
    [] ? kthread_freezable_should_stop+0x70/0x70
    [] ret_from_fork+0x7c/0xb0
    [] ? kthread_freezable_should_stop+0x70/0x70
    Code: 48 0f ab 11 72 db 48 81 4b 40 00 00 10 00 89 83 08 01 00 00 48 89
    df 49 8b 04 24 48 89 1c d0 e8 f7 a8 ff ff 49 8b 85 28 05 00 00 89
    58 08 48 89 03 49 8d 85 28 05 00 00 48 89 43 08 49 89 9d
    RIP [] blk_queue_start_tag+0x90/0x150
    RSP
    CR2: ffff88046c6c9e80

    Martin bisected and found this to be the problem patch;

    commit 6d113398dcf4dfcd9787a4ead738b186f7b7ff0f
    Author: Jan Kara
    Date: Mon Feb 24 16:39:54 2014 +0100

    block: Stop abusing rq->csd.list in blk-softirq

    and the problem was immediately apparent. The patch states that
    it is safe to reuse queuelist at completion time, since it is
    no longer used. However, that is not true if a device is using
    block enabled tagging. If that is the case, then the queuelist
    is reused to keep track of busy tags. If a device also ended
    up using softirq completions, we'd reuse ->queuelist for the
    IPI handling while block tagging was still using it. Boom.

    Fix this by adding a new ipi_list list head, and sharing the
    memory used with the request hash table. The hash table is
    never used after the request is moved to the dispatch list,
    which happens long before any potential completion of the
    request. Add a new request bit for this, so we don't have
    cases that check rq->hash while it could potentially have
    been reused for the IPI completion.
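
    A hedged sketch of the memory sharing described above, as it would look in
    struct request (neighbouring fields omitted):

    /* hedged sketch */
    union {
            struct hlist_node hash;         /* merge hash; unused once on the dispatch list */
            struct list_head  ipi_list;     /* softirq/IPI completion list */
    };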

    Reported-by: Martin K. Petersen
    Tested-by: Benjamin Herrenschmidt
    Signed-off-by: Jens Axboe

    Jens Axboe
     

31 Jan, 2014

1 commit

  • request_queue bypassing is used to suppress the higher-level functions of
    a request_queue so that they can be switched, reconfigured and shut down.
    A request_queue does the following while bypassing:

    * bypasses elevator and io_cq association and queues requests directly
    to the FIFO dispatch queue.

    * bypasses block cgroup request_list lookup and always uses the root
    request_list.

    Once confirmed to be bypassing, specific elevator and block cgroup
    policy implementations can assume that nothing is in flight for them
    and perform various operations which would be dangerous otherwise.

    Such confirmation is achieved by short-circuiting all new requests
    directly to the dispatch queue and waiting for all the requests which
    were issued before to finish. Unfortunately, while the request
    allocating and draining sides were properly handled, we forgot to
    actually plug the request dispatch path. Even after bypassing mode is
    confirmed, if the attached driver tries to fetch a request and the
    dispatch queue is empty, __elv_next_request() would invoke the current
    elevator's elevator_dispatch_fn() callback. As all in-flight requests
    were drained, the elevator wouldn't contain any request but once
    bypass is confirmed we don't even know whether the elevator is even
    there. It might be in the process of being switched and half torn
    down.

    Frank Mayhar reports that this actually happened while switching
    elevators, leading to an oops.

    Let's fix it by making __elv_next_request() avoid invoking the
    elevator_dispatch_fn() callback if the queue is bypassing. It already
    avoids invoking the callback if the queue is dying. As a dying queue
    is guaranteed to be bypassing, we can simply replace blk_queue_dying()
    check with blk_queue_bypass().
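
    A hedged sketch of the dispatch-side check (the dying test is subsumed by
    the broader bypass test):

    /* hedged sketch from __elv_next_request() */
    if (unlikely(blk_queue_bypass(q)) ||
        !q->elevator->type->ops.elevator_dispatch_fn(q, 0))
            return NULL;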

    Reported-by: Frank Mayhar
    References: http://lkml.kernel.org/g/1390319905.20232.38.camel@bobble.lax.corp.google.com
    Cc: stable@vger.kernel.org
    Tested-by: Frank Mayhar

    Signed-off-by: Jens Axboe

    Tejun Heo
     

25 Oct, 2013

1 commit

  • Linux currently has two models for block devices:

    - The classic request_fn based approach, where drivers use struct
    request units for IO. The block layer provides various helper
    functionalities to let drivers share code, things like tag
    management, timeout handling, queueing, etc.

    - The "stacked" approach, where a driver squeezes in between the
    block layer and IO submitter. Since this bypasses the IO stack,
    drivers generally have to manage everything themselves.

    With drivers being written for new high IOPS devices, the classic
    request_fn based driver doesn't work well enough. The design dates
    back to when both SMP and high IOPS were rare. It has problems with
    scaling to bigger machines, and runs into scaling issues even on
    smaller machines when you have IOPS in the hundreds of thousands
    per device.

    The stacked approach is then most often selected as the model
    for the driver. But this means that everybody has to re-invent
    everything, and along with that we get all the problems again
    that the shared approach solved.

    This commit introduces blk-mq, block multi queue support. The
    design is centered around per-cpu queues for queueing IO, which
    then funnel down into x number of hardware submission queues.
    We might have a 1:1 mapping between the two, or it might be
    an N:M mapping. That all depends on what the hardware supports.

    blk-mq provides various helper functions, which include:

    - Scalable support for request tagging. Most devices need to
    be able to uniquely identify a request both in the driver and
    to the hardware. The tagging uses per-cpu caches for freed
    tags, to enable cache hot reuse.

    - Timeout handling without tracking requests on a per-device
    basis. Basically the driver should be able to get a notification
    if a request happens to fail.

    - Optional support for non 1:1 mappings between issue and
    submission queues. blk-mq can redirect IO completions to the
    desired location.

    - Support for per-request payloads. Drivers almost always need
    to associate a request structure with some driver private
    command structure. Drivers can tell blk-mq this at init time,
    and then any request handed to the driver will have the
    required size of memory associated with it.

    - Support for merging of IO, and plugging. The stacked model
    gets neither of these. Even for high IOPS devices, merging
    sequential IO reduces per-command overhead and thus
    increases bandwidth.

    For now, this is provided as a potential 3rd queueing model, with
    the hope being that, as it matures, it can replace both the classic
    and stacked model. That would get us back to having just 1 real
    model for block devices, leaving the stacked approach to dm/md
    devices (as it was originally intended).

    Contributions in this patch from the following people:

    Shaohua Li
    Alexander Gordeev
    Christoph Hellwig
    Mike Christie
    Matias Bjorling
    Jeff Moyer

    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Jan, 2013

1 commit

  • Switch elevator to use the new hashtable implementation. This reduces the
    amount of generic unrelated code in the elevator.

    This also removes the dynamic allocation of the hash table. The size of the table is
    constant so there's no point in paying the price of an extra dereference when accessing
    it.

    This patch depends on d9b482c ("hashtable: introduce a small and naive
    hashtable") which was merged in v3.6.

    Signed-off-by: Sasha Levin
    Signed-off-by: Jens Axboe

    Sasha Levin
     

06 Dec, 2012

2 commits

  • A block driver may start cleaning up resources needed by its
    request_fn as soon as blk_cleanup_queue() has finished, so request_fn
    must not be invoked after draining has finished. This is important
    when blk_run_queue() is invoked without any requests in progress.
    As an example, if blk_drain_queue() and scsi_run_queue() run in
    parallel, blk_drain_queue() may have finished all requests after
    scsi_run_queue() has taken a SCSI device off the starved list but
    before that last function has had a chance to run the queue.
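
    A hedged, simplified sketch of the resulting rule (never call into the
    driver once the queue has been marked dead):

    /* hedged sketch */
    if (unlikely(blk_queue_dead(q)))
            return;
    q->request_fn(q);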

    Signed-off-by: Bart Van Assche
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Chanho Min
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • QUEUE_FLAG_DEAD is used to indicate that queuing new requests must
    stop. After this flag has been set queue draining starts. However,
    during the queue draining phase it is still safe to invoke the
    queue's request_fn, so QUEUE_FLAG_DYING is a better name for this
    flag.

    This patch has been generated by running the following command
    over the kernel source tree:

    git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' |
    xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g' \
    -e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g'; \
    sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" \
    include/linux/blkdev.h; \
    sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' \
    -e 's/Dead queue/A dying queue/' block/blk-core.c

    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Jens Axboe
    Cc: Chanho Min
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

20 Sep, 2012

1 commit

  • Remove special-casing of non-rw fs style requests (discard). The nomerge
    flags are consolidated in blk_types.h, and rq_mergeable() and
    bio_mergeable() have been modified to use them.

    bio_is_rw() is used in place of bio_has_data() in a few places. This is
    done to distinguish true reads and writes from other fs type requests
    that carry a payload (e.g. write same).
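
    A hedged sketch of the consolidated check (the nomerge flags being the
    REQ_NOMERGE_FLAGS mask from blk_types.h):

    /* hedged sketch */
    static inline bool rq_mergeable(struct request *rq)
    {
            if (rq->cmd_type != REQ_TYPE_FS)
                    return false;           /* only fs-style requests may merge */

            if (rq->cmd_flags & REQ_NOMERGE_FLAGS)
                    return false;

            return true;
    }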

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

01 Aug, 2012

1 commit

  • The __generic_unplug_device() function was removed by commit
    7eaceaccab5f40bbfda044629a6298616aeaed50, which forgot to remove the
    declaration at the same time. Remove it here.

    Signed-off-by: Yuanhan Liu
    Signed-off-by: Jens Axboe

    Yuanhan Liu
     

25 Jun, 2012

1 commit

  • Request allocation is about to be made per-blkg meaning that there'll
    be multiple request lists.

    * Make queue full state per request_list. blk_*queue_full() functions
    are renamed to blk_*rl_full() and take @rl instead of @q.

    * Rename blk_init_free_list() to blk_init_rl() and make it take @rl
    instead of @q. Also add @gfp_mask parameter.

    * Add blk_exit_rl() instead of destroying rl directly from
    blk_release_queue().

    * Add request_list->q and make request alloc/free functions -
    blk_free_request(), [__]freed_request(), __get_request() - take @rl
    instead of @q.

    This patch doesn't introduce any functional difference.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

02 Apr, 2012

1 commit

  • cgroup/for-3.5 contains the following changes which blk-cgroup needs
    to proceed with the on-going cleanup.

    * Dynamic addition and removal of cftypes to make config/stat file
    handling modular for policies.

    * cgroup removal update to not wait for css references to drain to fix
    blkcg removal hang caused by cfq caching cfqgs.

    Pull in cgroup/for-3.5 into block/for-3.5/core. This causes the
    following conflicts in block/blk-cgroup.c.

    * 761b3ef50e "cgroup: remove cgroup_subsys argument from callbacks"
    conflicts with blkiocg_pre_destroy() addition and blkiocg_attach()
    removal. Resolved by removing @subsys from all subsys methods.

    * 676f7c8f84 "cgroup: relocate cftype and cgroup_subsys definitions in
    controllers" conflicts with ->pre_destroy() and ->attach() updates
    and removal of modular config. Resolved by dropping forward
    declarations of the methods and applying updates to the relocated
    blkio_subsys.

    * 4baf6e3325 "cgroup: convert all non-memcg controllers to the new
    cftype interface" builds upon the previous item. Resolved by adding
    ->base_cftypes to the relocated blkio_subsys.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

07 Mar, 2012

3 commits

  • Make the following interface updates to prepare for future ioc related
    changes.

    * create_io_context() returning ioc only works for %current because it
    doesn't increment ref on the ioc. Drop @task parameter from it and
    always assume %current.

    * Make create_io_context_slowpath() return 0 or -errno and rename it
    to create_task_io_context().

    * Make ioc_create_icq() take @ioc as parameter instead of assuming
    that of %current. The caller, get_request(), is updated to create
    ioc explicitly and then pass it into ioc_create_icq().

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently block core calls directly into blk-throttle for init, drain
    and exit. This patch adds blkcg_{init|drain|exit}_queue() which wraps
    the blk-throttle functions. This is to give more control and
    visibility to the blkcg core layer for proper layering. Further patches
    will add logic common to blkcg policies to the functions.

    While at it, collapse blk_throtl_release() into blk_throtl_exit().
    There's no reason to keep them separate.
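
    A hedged sketch of the wrappers described above (for now they simply
    forward to the blk-throttle entry points):

    /* hedged sketch */
    int blkcg_init_queue(struct request_queue *q)
    {
            return blk_throtl_init(q);
    }

    void blkcg_drain_queue(struct request_queue *q)
    {
            blk_throtl_drain(q);
    }

    void blkcg_exit_queue(struct request_queue *q)
    {
            blk_throtl_exit(q);
    }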

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Rename and extend elv_quiesce_start/end() to
    blk_queue_bypass_start/end(), which are exported and support nesting
    via @q->bypass_depth. Also add blk_queue_bypass() to test the bypass
    state.

    This will be further extended and used for blkio_group management.
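
    A hedged sketch of the nesting described above (locking and drain details
    simplified):

    /* hedged sketch */
    void blk_queue_bypass_start(struct request_queue *q)
    {
            spin_lock_irq(q->queue_lock);
            q->bypass_depth++;
            queue_flag_set(QUEUE_FLAG_BYPASS, q);
            spin_unlock_irq(q->queue_lock);

            blk_drain_queue(q, false);      /* wait for in-flight requests */
    }

    void blk_queue_bypass_end(struct request_queue *q)
    {
            spin_lock_irq(q->queue_lock);
            if (!--q->bypass_depth)
                    queue_flag_clear(QUEUE_FLAG_BYPASS, q);
            WARN_ON_ONCE(q->bypass_depth < 0);
            spin_unlock_irq(q->queue_lock);
    }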

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

01 Mar, 2012

1 commit


08 Feb, 2012

1 commit

  • blk_rq_merge_ok() is the elevator-neutral part of the merge eligibility
    test. blk_try_merge() determines merge direction and expects the
    caller to have tested elv_rq_merge_ok() previously.

    elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
    elv_iosched_allow_merge(). elv_try_merge() is removed and the two
    callers are updated to call elv_rq_merge_ok() explicitly followed by
    blk_try_merge(). While at it, make rq_merge_ok() functions return
    bool.

    This is to prepare for plug merge update and doesn't introduce any
    behavior change.

    This is based on Jens' patch to skip elevator_allow_merge_fn() from
    plug merge.
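
    A hedged sketch of the resulting split (elevator-neutral test first, then
    the elevator hook):

    /* hedged sketch */
    bool elv_rq_merge_ok(struct request *rq, struct bio *bio)
    {
            if (!blk_rq_merge_ok(rq, bio))
                    return false;           /* generic eligibility */

            if (!elv_iosched_allow_merge(rq, bio))
                    return false;           /* elevator veto */

            return true;
    }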

    Signed-off-by: Tejun Heo
    LKML-Reference:
    Original-patch-by: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     

27 Jan, 2012

1 commit

  • The block layer has some code trying to determine if two CPUs share a
    cache, the scheduler has a similar function. Expose the function used
    by the scheduler and make the block layer use it, thereby removing the
    block layer's usage of CONFIG_SCHED* and topology bits.

    Signed-off-by: Peter Zijlstra
    Acked-by: Jens Axboe
    Link: http://lkml.kernel.org/r/1327579450.2446.95.camel@twins

    Peter Zijlstra
     

14 Dec, 2011

2 commits

  • Now block layer knows everything necessary to create and associate
    icq's with requests. Move ioc_create_icq() to blk-ioc.c and update
    get_request() such that, if elevator_type->icq_size is set, requests
    are automatically associated with their matching icq's before
    elv_set_request(). io_context reference is also managed by block core
    on request alloc/free.

    * Only ioprio/cgroup changed handling remains from cfq_get_cic().
    Collapsed into cfq_set_request().

    * This removes queue kicking on icq allocation failure (for now). As
    icq allocation failure is rare and the only effect of queue kicking
    achieved was possibly accelerating queue processing, this change
    shouldn't be noticeable.

    There is a larger underlying problem. Unlike request allocation,
    icq allocation is not guaranteed to succeed eventually after
    retries. The number of icq is unbound and thus mempool can't be the
    solution either. This effectively adds allocation dependency on
    memory free path and thus possibility of deadlock.

    This usually wouldn't happen because icq allocation is not a hot
    path and, even when the condition triggers, it's highly unlikely
    that none of the writeback workers already has icq.

    However, this is still possible especially if elevator is being
    switched under high memory pressure, so we better get it fixed.
    Probably the only solution is just bypassing elevator and appending
    to dispatch queue on any elevator allocation failure.

    * Comment added to explain how icq's are managed and synchronized.

    This completes cleanup of io_context interface.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
    blk-ioc too. The odd ->io_cq->exit/release() callbacks are replaced
    with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc
    and q, and freeing automatically handled by blk-ioc. The elevator
    operation only needs to perform the exit operation specific to the elevator
    - in cfq's case, exiting the cfqq's.

    Also, clearing of io_cq's on q detach is moved to block core and
    automatically performed on elevator switch and q release.

    Because the q an io_cq points to might be freed before the RCU callback
    for the io_cq runs, the blk-ioc code should remember to which cache the io_cq
    needs to be freed when the io_cq is released. New field
    io_cq->__rcu_icq_cache is added for this purpose. As both the new
    field and rcu_head are used only after io_cq is released and the
    q/ioc_node fields aren't, they are put into unions.
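
    A hedged sketch of the unions described above, as they would appear in
    struct io_cq (other fields omitted):

    /* hedged sketch */
    union {
            struct list_head        q_node;                 /* used while the icq is live */
            struct kmem_cache       *__rcu_icq_cache;       /* which cache to free it to */
    };
    union {
            struct hlist_node       ioc_node;               /* used while the icq is live */
            struct rcu_head         __rcu_head;             /* used for deferred freeing */
    };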

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo