21 Nov, 2015

1 commit

  • Liu reported that running certain parts of xfstests threw the
    following error:

    BUG: sleeping function called from invalid context at mm/page_alloc.c:3190
    in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u16:0
    3 locks held by kworker/u16:0/6:
    #0: ("writeback"){++++.+}, at: [] process_one_work+0x173/0x730
    #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [] process_one_work+0x173/0x730
    #2: (&type->s_umount_key#44){+++++.}, at: [] trylock_super+0x25/0x60
    CPU: 5 PID: 6 Comm: kworker/u16:0 Tainted: G OE 4.3.0+ #3
    Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
    Workqueue: writeback wb_workfn (flush-btrfs-108)
    ffffffff81a3abab ffff88042e282ba8 ffffffff8130191b ffffffff81a3abab
    0000000000000c76 ffff88042e282ba8 ffff88042e27c180 ffff88042e282bd8
    ffffffff8108ed95 ffff880400000004 0000000000000000 0000000000000c76
    Call Trace:
    [] dump_stack+0x4f/0x74
    [] ___might_sleep+0x185/0x240
    [] __might_sleep+0x52/0x90
    [] __alloc_pages_nodemask+0x268/0x410
    [] ? sched_clock_local+0x1c/0x90
    [] ? local_clock+0x21/0x40
    [] ? __lock_release+0x420/0x510
    [] ? __lock_acquired+0x16c/0x3c0
    [] alloc_pages_current+0xc5/0x210
    [] ? rbio_is_full+0x55/0x70 [btrfs]
    [] ? mark_held_locks+0x78/0xa0
    [] ? _raw_spin_unlock_irqrestore+0x40/0x60
    [] full_stripe_write+0x5a/0xc0 [btrfs]
    [] __raid56_parity_write+0x39/0x60 [btrfs]
    [] run_plug+0x11b/0x140 [btrfs]
    [] btrfs_raid_unplug+0x23/0x70 [btrfs]
    [] blk_flush_plug_list+0x82/0x1f0
    [] blk_sq_make_request+0x1f9/0x740
    [] ? generic_make_request_checks+0x222/0x7c0
    [] ? blk_queue_enter+0x124/0x310
    [] ? blk_queue_enter+0x92/0x310
    [] generic_make_request+0x172/0x2c0
    [] ? generic_make_request+0x164/0x2c0
    [] submit_bio+0x70/0x140
    [] ? rbio_add_io_page+0x99/0x150 [btrfs]
    [] finish_rmw+0x4d9/0x600 [btrfs]
    [] full_stripe_write+0x9c/0xc0 [btrfs]
    [] raid56_parity_write+0xef/0x160 [btrfs]
    [] btrfs_map_bio+0xe3/0x2d0 [btrfs]
    [] btrfs_submit_bio_hook+0x8d/0x1d0 [btrfs]
    [] submit_one_bio+0x74/0xb0 [btrfs]
    [] submit_extent_page+0xe5/0x1c0 [btrfs]
    [] __extent_writepage_io+0x408/0x4c0 [btrfs]
    [] ? alloc_dummy_extent_buffer+0x140/0x140 [btrfs]
    [] __extent_writepage+0x218/0x3a0 [btrfs]
    [] ? mark_held_locks+0x78/0xa0
    [] extent_write_cache_pages.clone.0+0x2f9/0x400 [btrfs]
    [] extent_writepages+0x52/0x70 [btrfs]
    [] ? btrfs_set_inode_index+0x70/0x70 [btrfs]
    [] btrfs_writepages+0x27/0x30 [btrfs]
    [] do_writepages+0x23/0x40
    [] __writeback_single_inode+0x89/0x4d0
    [] ? writeback_sb_inodes+0x260/0x480
    [] ? writeback_sb_inodes+0x260/0x480
    [] ? writeback_sb_inodes+0x15f/0x480
    [] writeback_sb_inodes+0x2d2/0x480
    [] ? down_read_trylock+0x57/0x60
    [] ? trylock_super+0x25/0x60
    [] ? rcu_read_lock_sched_held+0x4f/0x90
    [] __writeback_inodes_wb+0x8c/0xc0
    [] wb_writeback+0x2b5/0x500
    [] ? mark_held_locks+0x78/0xa0
    [] ? __local_bh_enable_ip+0x68/0xc0
    [] ? wb_do_writeback+0x62/0x310
    [] wb_do_writeback+0xc1/0x310
    [] ? set_worker_desc+0x79/0x90
    [] wb_workfn+0x92/0x330
    [] process_one_work+0x223/0x730
    [] ? process_one_work+0x173/0x730
    [] ? worker_thread+0x18f/0x430
    [] worker_thread+0x11d/0x430
    [] ? maybe_create_worker+0xf0/0xf0
    [] ? maybe_create_worker+0xf0/0xf0
    [] kthread+0xef/0x110
    [] ? schedule_tail+0x1e/0xd0
    [] ? __init_kthread_worker+0x70/0x70
    [] ret_from_fork+0x3f/0x70
    [] ? __init_kthread_worker+0x70/0x70

    The issue is that we've got the software queue context pinned (i.e.
    preemption disabled) while calling blk_flush_plug_list(), which flushes
    callbacks that are allowed to sleep; btrfs and raid have such callbacks.

    Flip the checks around a bit, so we can enable preemption a bit earlier
    and flush plugs without preemption disabled.

    This only affects blk-mq driven devices, and only those that
    register a single queue.
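
    A rough sketch of the reordering described above (not the literal diff;
    the names are those of the 4.3-era blk_sq_make_request(), used purely
    for illustration):

        blk_mq_bio_to_request(rq, bio);

        /*
         * Drop the software queue context first: blk_mq_put_ctx() ends the
         * preempt-disabled section opened by blk_mq_get_ctx(), so the unplug
         * callbacks triggered below (btrfs raid56, md) are free to sleep.
         */
        blk_mq_put_ctx(data.ctx);

        if (!request_count)
                trace_block_plug(q);
        else if (request_count >= BLK_MAX_REQUEST_COUNT) {
                blk_flush_plug_list(plug, false);   /* callbacks may sleep now */
                trace_block_plug(q);
        }
        list_add_tail(&rq->queuelist, &plug->mq_list);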

    Reported-by: Liu Bo
    Tested-by: Liu Bo
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Nov, 2015

1 commit

  • Pull block IO poll support from Jens Axboe:
    "Various groups have been doing experimentation around IO polling for
    (really) fast devices. The code has been reviewed and has been
    sitting on the side for a few releases, but this is now good enough
    for coordinated benchmarking and further experimentation.

    Currently O_DIRECT sync read/write are supported. A framework is in
    the works that allows scalable stats tracking so we can auto-tune
    this. And we'll add libaio support as well soon. For now, it's an
    opt-in feature for test purposes"

    * 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
    direct-io: be sure to assign dio->bio_bdev for both paths
    directio: add block polling support
    NVMe: add blk polling support
    block: add block polling support
    blk-mq: return tag/queue combo in the make_request_fn handlers
    block: change ->make_request_fn() and users to return a queue cookie

    Linus Torvalds
     

07 Nov, 2015

2 commits

  • __GFP_WAIT was used to signal that the caller was in atomic context and
    could not sleep. Now it is possible to distinguish between true atomic
    context and callers that are not willing to sleep. The latter should
    clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
    __GFP_WAIT behaves differently, there is a risk that people will clear the
    wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
    indicate what it does -- setting it allows all reclaim activity, clearing
    it prevents it.
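
    For reference, a minimal sketch of the resulting flag relationships
    (roughly what include/linux/gfp.h ends up with; treat the exact
    definitions as an approximation):

        #define __GFP_RECLAIM  (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)
        #define GFP_KERNEL     (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
        #define GFP_NOWAIT     (__GFP_KSWAPD_RECLAIM)  /* no sleeping, kswapd still woken */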

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min" which can be referred
    to as the "atomic reserve". __GFP_HIGH users get access to the first
    lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites:

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT, as was done historically, can now trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.
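
    A hedged sketch of the caller-side pattern described above; the flag
    arithmetic and gfpflags_allow_blocking() are the interfaces named in this
    series, while the surrounding snippet is hypothetical:

        /* caller with its own mempool for forward progress: no direct
         * reclaim (never sleeps), but kswapd is still woken */
        gfp_t gfp = (GFP_KERNEL & ~__GFP_DIRECT_RECLAIM) | __GFP_KSWAPD_RECLAIM;

        if (gfpflags_allow_blocking(gfp))
                pr_debug("allocation may enter direct reclaim\n");
        else
                pr_debug("allocation will not sleep; kswapd still woken\n");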

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

05 Nov, 2015

2 commits

  • Pull block integrity updates from Jens Axboe:
    ""This is the joint work of Dan and Martin, cleaning up and improving
    the support for block data integrity"

    * 'for-4.4/integrity' of git://git.kernel.dk/linux-block:
    block, libnvdimm, nvme: provide a built-in blk_integrity nop profile
    block: blk_flush_integrity() for bio-based drivers
    block: move blk_integrity to request_queue
    block: generic request_queue reference counting
    nvme: suspend i/o during runtime blk_integrity_unregister
    md: suspend i/o during runtime blk_integrity_unregister
    md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
    block: Inline blk_integrity in struct gendisk
    block: Export integrity data interval size in sysfs
    block: Reduce the size of struct blk_integrity
    block: Consolidate static integrity profile properties
    block: Move integrity kobject to struct gendisk

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:
    "This is the core block pull request for 4.4. I've got a few more
    topic branches this time around; some of them will layer on top of the
    core+drivers changes and will come in a separate round. So not a huge
    chunk of changes in this round.

    This pull request contains:

    - Enable blk-mq page allocation tracking with kmemleak, from Catalin.

    - Unused prototype removal in blk-mq from Christoph.

    - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two
    xchg()'s, from Davidlohr.

    - A plug flush fix from Jeff.

    - Also from Jeff, a fix that means we don't have to update shared tag
    sets at init time unless we do a state change. This cuts down boot
    times a lot on systems with thousands of devices using scsi/blk-mq.

    - blk-mq waitqueue barrier fix from Kosuke.

    - Various fixes from Ming:

    - Fixes for segment merging and splitting, and checks, for
    the old core and blk-mq.

    - Potential blk-mq speedup by marking ctx pending at the end
    of a plug insertion batch in blk-mq.

    - direct-io no page dirty on kernel direct reads.

    - A WRITE_SYNC fix for mpage from Roman"

    * 'for-4.4/core' of git://git.kernel.dk/linux-block:
    blk-mq: avoid excessive boot delays with large lun counts
    blktrace: re-write setting q->blk_trace
    blk-mq: mark ctx as pending at batch in flush plug path
    blk-mq: fix for trace_block_plug()
    block: check bio_mergeable() early before merging
    blk-mq: check bio_mergeable() early before merging
    block: avoid to merge splitted bio
    block: setup bi_phys_segments after splitting
    block: fix plug list flushing for nomerge queues
    blk-mq: remove unused blk_mq_clone_flush_request prototype
    blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c
    fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
    fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
    block: kmemleak: Track the page allocations for struct request

    Linus Torvalds
     

03 Nov, 2015

1 commit

  • Hi,

    Zhangqing Luo reported long boot times on a system with thousands of
    LUNs when scsi-mq was enabled. He narrowed the problem down to
    blk_mq_add_queue_tag_set, where every queue is frozen in order to set
    the BLK_MQ_F_TAG_SHARED flag. Each added device will freeze all queues
    added before it in sequence, which involves waiting for an RCU grace
    period for each one. We don't need to do this. After the second queue
    is added, only new queues need to be initialized with the shared tag.
    We can do that by percolating the flag up to the blk_mq_tag_set, and
    updating the newly added queue's hctxs if the flag is set.

    This problem was introduced by commit 0d2602ca30e41 (blk-mq: improve
    support for shared tags maps).
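
    A rough sketch of the shape of the fix (the helper names below are my
    recollection of the patch and should be treated as illustrative):

        static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
                                             struct request_queue *q)
        {
                mutex_lock(&set->tag_list_lock);

                /* transition from one to two queues: mark the whole set
                 * shared and update the already-added queues exactly once */
                if (!list_empty(&set->tag_list) &&
                    !(set->flags & BLK_MQ_F_TAG_SHARED)) {
                        set->flags |= BLK_MQ_F_TAG_SHARED;
                        blk_mq_update_tag_set_depth(set);
                }

                /* every later queue just inherits the flag; no freeze */
                if (set->flags & BLK_MQ_F_TAG_SHARED)
                        queue_set_hctx_shared(q, true);

                list_add_tail(&q->tag_set_list, &set->tag_list);
                mutex_unlock(&set->tag_list_lock);
        }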

    Reported-and-tested-by: Jason Luo
    Reviewed-by: Ming Lei
    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

22 Oct, 2015

5 commits

  • Most of the time, the plug flush path should be the hottest I/O path,
    so mark the ctx as pending only after all requests in the list have
    been inserted.

    Reviewed-by: Jeff Moyer
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The trace point is for tracing the plug event of each request
    queue rather than of each task, so we should check the request
    count in the plug list for the current queue instead of the
    current task.

    Signed-off-by: Ming Lei
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • It isn't necessary to try to merge a bio that is marked as
    NOMERGE.

    Reviewed-by: Jeff Moyer
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Request queues with merging disabled will not flush the plug list after
    BLK_MAX_REQUEST_COUNT requests have been queued, since the code relies
    on blk_attempt_plug_merge to compute the request_count. Fix this by
    computing the number of queued requests even for nomerge queues.
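
    A hedged sketch of the resulting logic in the bio submission path (the
    helper name blk_plug_queued_count() is my recollection of the patch;
    treat it as an assumption):

        if (!blk_queue_nomerges(q)) {
                if (blk_attempt_plug_merge(q, bio, &request_count, NULL))
                        return;
        } else {
                /* merging is off, but we still need an accurate count so
                 * the plug is flushed after BLK_MAX_REQUEST_COUNT requests */
                request_count = blk_plug_queued_count(q);
        }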

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     
  • Allow pmem, and other synchronous/bio-based block drivers, to fall back
    on a per-cpu reference count managed by the core for tracking queue
    live/dead state.

    The existing per-cpu reference count for the blk_mq case is promoted to
    be used in all block i/o scenarios. This involves initializing it by
    default, waiting for it to drop to zero at exit, and holding a live
    reference over the invocation of q->make_request_fn() in
    generic_make_request(). The blk_mq code continues to take its own
    reference per blk_mq request and retains the ability to freeze the
    queue, but the check that the queue is frozen is moved to
    generic_make_request().
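
    The pattern described above, sketched very roughly (simplified; the gfp
    argument and error handling are illustrative, not the literal code):

        void generic_make_request(struct bio *bio)
        {
                struct request_queue *q = bdev_get_queue(bio->bi_bdev);

                /* hold a live reference across ->make_request_fn(); this
                 * fails once the queue has been marked dying */
                if (likely(blk_queue_enter(q, GFP_NOIO) == 0)) {
                        q->make_request_fn(q, bio);
                        blk_queue_exit(q);
                } else {
                        bio_io_error(bio);
                }
        }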

    This fixes crash signatures like the following:

    BUG: unable to handle kernel paging request at ffff880140000000
    [..]
    Call Trace:
    [] ? copy_user_handle_tail+0x5f/0x70
    [] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
    [] pmem_make_request+0xd1/0x200 [nd_pmem]
    [] ? mempool_alloc+0x72/0x1a0
    [] generic_make_request+0xd6/0x110
    [] submit_bio+0x76/0x170
    [] submit_bh_wbc+0x12f/0x160
    [] submit_bh+0x12/0x20
    [] jbd2_write_superblock+0x8d/0x170
    [] jbd2_mark_journal_empty+0x5d/0x90
    [] jbd2_journal_destroy+0x24b/0x270
    [] ? put_pwq_unlocked+0x2a/0x30
    [] ? destroy_workqueue+0x225/0x250
    [] ext4_put_super+0x64/0x360
    [] generic_shutdown_super+0x6a/0xf0

    Cc: Jens Axboe
    Cc: Keith Busch
    Cc: Ross Zwisler
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Tested-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     

15 Oct, 2015

1 commit

  • tags is freed in blk_mq_free_rq_map() and should not be used after that.
    The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because
    free_cpumask_var() is a no-op.

    tags->cpumask is allocated in blk_mq_init_tags() so it's natural to
    free cpumask in its counter part, blk_mq_free_tags().

    Fixes: f26cdc8536ad ("blk-mq: Shared tag enhancements")
    Signed-off-by: Jun'ichi Nomura
    Cc: Keith Busch
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Junichi Nomura
     

01 Oct, 2015

2 commits

  • And replace blk_mq_tag_busy_iter with it - the driver use was replaced
    with a new helper a while ago, and internal to the block layer we only
    need the new version.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • blk_mq_complete_request may be a no-op if the request has already
    been completed by other means (e.g. a timeout or cancellation), but
    currently drivers have to set rq->errors before calling
    blk_mq_complete_request, which might leave us with the wrong error value.

    Add an error parameter to blk_mq_complete_request so that we can
    defer setting rq->errors until we know we won the race to complete the
    request.
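
    A minimal sketch of the calling-convention change, using a hypothetical
    driver completion helper:

        static void my_drv_complete(struct request *rq, int error)
        {
                /*
                 * Old style:
                 *      rq->errors = error;
                 *      blk_mq_complete_request(rq);
                 * which could clobber rq->errors even when the request had
                 * already been completed by a timeout or cancellation.
                 *
                 * New style: pass the error in and let the core set
                 * rq->errors only if this caller wins the completion race.
                 */
                blk_mq_complete_request(rq, error);
        }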

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

30 Sep, 2015

6 commits

  • CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
    all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
    entries by blk_mq_sysfs_unregister(). Removing a sysfs entry has to
    be blocked until the active reference of the kernfs_node drops to zero.

    On the other hand, reading blk_mq_hw_sysfs_cpu sysfs entry (e.g.
    /sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
    blk_mq_hw_sysfs_cpus_show().

    If these happen at the same time, a deadlock can occur, because one
    side waits for the active reference to drop to zero while holding
    all_q_mutex, and the other tries to acquire all_q_mutex while holding
    the active reference.

    The reason that all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show()
    is to avoid reading an incomplete hctx->cpumask. Since reading a sysfs
    entry for blk-mq needs to acquire q->sysfs_lock, we can avoid the
    deadlock and avoid reading an incomplete hctx->cpumask by holding
    q->sysfs_lock while hctx->cpumask is being updated.

    Signed-off-by: Akinobu Mita
    Reviewed-by: Ming Lei
    Cc: Ming Lei
    Cc: Wanpeng Li
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
     
  • Notifier callbacks for the CPU_ONLINE action can run on a CPU other
    than the CPU which was just onlined. So it is possible for a process
    running on the just-onlined CPU to insert a request and run the hw
    queue before the new mapping is established by
    blk_mq_queue_reinit_notify().

    This can cause a problem when the CPU has just been onlined for the
    first time since the request queue was initialized. At this time
    ctx->index_hw for the CPU, which is the index in hctx->ctxs[] for this
    ctx, is still zero before blk_mq_queue_reinit_notify() is called by the
    notifier callbacks for the CPU_ONLINE action.

    For example, there is a single hw queue (hctx) and two CPU queues
    (ctx0 for CPU0, and ctx1 for CPU1). Now CPU1 has just been onlined and
    a request is inserted into ctx1->rq_list, and bit 0 is set in the
    pending bitmap as ctx1->index_hw is still zero.

    And then while running the hw queue, flush_busy_ctxs() finds bit 0 set
    in the pending bitmap and tries to retrieve requests from
    hctx->ctxs[0]->rq_list. But hctx->ctxs[0] is a pointer to ctx0, so the
    request in ctx1->rq_list is ignored.

    Fix it by ensuring that the new mapping is established before the
    onlined CPU starts running.

    Signed-off-by: Akinobu Mita
    Reviewed-by: Ming Lei
    Cc: Jens Axboe
    Cc: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
     
  • CPU hotplug handling for blk-mq (blk_mq_queue_reinit) accesses
    q->mq_usage_counter while freezing all request queues in all_q_list.
    On the other hand, q->mq_usage_counter is deinitialized in
    blk_mq_free_queue() before deleting the queue from all_q_list.

    So if a CPU hotplug event occurs in that window, percpu_ref_kill() is
    called on a q->mq_usage_counter which has already been marked dead,
    which triggers a warning. Fix it by deleting the queue from all_q_list
    before destroying q->mq_usage_counter.

    Signed-off-by: Akinobu Mita
    Reviewed-by: Ming Lei
    Cc: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
     
  • CPU hotplug handling for blk-mq (blk_mq_queue_reinit) updates
    q->mq_map by blk_mq_update_queue_map() for all request queues in
    all_q_list. On the other hand, q->mq_map is released before deleting
    the queue from all_q_list.

    So if a CPU hotplug event occurs in that window, an invalid memory
    access can happen. Fix it by releasing q->mq_map in blk_mq_release() so
    that it happens later than the removal from all_q_list.

    Signed-off-by: Akinobu Mita
    Suggested-by: Ming Lei
    Reviewed-by: Ming Lei
    Cc: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
     
  • There is a race between cpu hotplug handling and adding/deleting
    gendisk for blk-mq, where both are trying to register and unregister
    the same sysfs entries.

    null_add_dev
    --> blk_mq_init_queue
    --> blk_mq_init_allocated_queue
    --> add to 'all_q_list' (*)
    --> add_disk
    --> blk_register_queue
    --> blk_mq_register_disk (++)

    null_del_dev
    --> del_gendisk
    --> blk_unregister_queue
    --> blk_mq_unregister_disk (--)
    --> blk_cleanup_queue
    --> blk_mq_free_queue
    --> del from 'all_q_list' (*)

    blk_mq_queue_reinit
    --> blk_mq_sysfs_unregister (-)
    --> blk_mq_sysfs_register (+)

    While the request queue is added to 'all_q_list' (*),
    blk_mq_queue_reinit() can be called for the queue anytime by the CPU
    hotplug callback. But blk_mq_sysfs_unregister (-) and
    blk_mq_sysfs_register (+) in blk_mq_queue_reinit must not be called
    before blk_mq_register_disk (++) or after blk_mq_unregister_disk (--)
    has finished, because '/sys/block/*/mq/' does not exist then.

    There is already a BLK_MQ_F_SYSFS_UP flag in hctx->flags which can
    be used to track this sysfs state, but it only fixes the issue
    partially.

    In order to fix it completely, we need a per-queue flag instead of a
    per-hctx flag, with appropriate locking. So this introduces
    q->mq_sysfs_init_done, which is properly protected with all_q_mutex.

    Also, we need to ensure that blk_mq_map_swqueue() is called with
    all_q_mutex held. Since hctx->nr_ctx is reset temporarily and then
    updated in blk_mq_map_swqueue(), we should avoid blk_mq_register_hctx()
    seeing the temporary hctx->nr_ctx value during CPU hotplug handling or
    while adding/deleting a gendisk.

    Signed-off-by: Akinobu Mita
    Reviewed-by: Ming Lei
    Cc: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
     
  • When an unmapped hw queue is remapped after the CPU topology is changed,
    hctx->tags->cpumask has to be set after hctx->tags is set up in
    blk_mq_map_swqueue(); otherwise it causes a null pointer dereference.

    Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements")
    Signed-off-by: Akinobu Mita
    Cc: Keith Busch
    Cc: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
     

24 Sep, 2015

1 commit

  • The pages allocated for struct request contain pointers to other slab
    allocations (via ops->init_request). Since kmemleak does not track/scan
    page allocations, the slab objects will be reported as leaks (false
    positives). This patch adds kmemleak callbacks to allow tracking of such
    pages.
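
    Roughly, the idea looks like this (a sketch; the exact call sites inside
    blk-mq are an assumption):

        /* when the per-request pages for a tag map are allocated ... */
        page = alloc_pages_node(node, GFP_NOIO, order);
        kmemleak_alloc(page_address(page), PAGE_SIZE << order, 1, GFP_NOIO);

        /* ... and the matching hint when they are freed */
        kmemleak_free(page_address(page));
        __free_pages(page, order);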

    Signed-off-by: Catalin Marinas
    Reported-by: Bart Van Assche
    Tested-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Catalin Marinas
     

15 Aug, 2015

1 commit

  • Inside the timeout handler, blk_mq_tag_to_rq() is called
    to retrieve the request from a tag. This is obviously
    wrong because the request can be freed at any time and some
    fields of the request can't be trusted, so a kernel oops
    might be triggered[1].

    Currently, wrt. blk_mq_tag_to_rq(), the only special case is
    that the flush request can share the same tag as the request it
    was cloned from, and the two requests can't be active at the same
    time, so this patch fixes the above issue by updating tags->rqs[tag]
    with the active request (either the flush rq or the request it was
    cloned from) for that tag.

    Also blk_mq_tag_to_rq() gets much simplified with this patch.

    Given that blk_mq_tag_to_rq() is mainly for drivers and the caller must
    make sure the request can't be freed, this helper is replaced with
    tags->rqs[tag] in bt_for_each().
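
    After the change, the lookup reduces to roughly the following (sketch):

        struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
        {
                /* tags->rqs[tag] now always points at the active request for
                 * this tag (the flush rq or the rq it was cloned from) */
                return tags->rqs[tag];
        }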

    [1] kernel oops log
    [ 439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158^M
    [ 439.697162] IP: [] blk_mq_tag_to_rq+0x21/0x6e^M
    [ 439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0 ^M
    [ 439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M
    [ 439.700653] Dumping ftrace buffer:^M
    [ 439.700653] (ftrace buffer empty)^M
    [ 439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M
    [ 439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265^M
    [ 439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M
    [ 439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000^M
    [ 439.730500] RIP: 0010:[] [] blk_mq_tag_to_rq+0x21/0x6e^M
    [ 439.730500] RSP: 0018:ffff880819203da0 EFLAGS: 00010283^M
    [ 439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002^M
    [ 439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000^M
    [ 439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000^M
    [ 439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202^M
    [ 439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00^M
    [ 439.730500] FS: 00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000^M
    [ 439.730500] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M
    [ 439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0^M
    [ 439.730500] Stack:^M
    [ 439.730500] 0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104^M
    [ 439.755663] ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80^M
    [ 439.755663] Call Trace:^M
    [ 439.755663] ^M
    [ 439.755663] [] bt_for_each+0x6e/0xc8^M
    [ 439.755663] [] ? blk_mq_rq_timed_out+0x6a/0x6a^M
    [ 439.755663] [] ? blk_mq_rq_timed_out+0x6a/0x6a^M
    [ 439.755663] [] blk_mq_tag_busy_iter+0x55/0x5e^M
    [ 439.755663] [] ? blk_mq_bio_to_request+0x38/0x38^M
    [ 439.755663] [] blk_mq_rq_timer+0x5d/0xd4^M
    [ 439.755663] [] call_timer_fn+0xf7/0x284^M
    [ 439.755663] [] ? call_timer_fn+0x5/0x284^M
    [ 439.755663] [] ? blk_mq_bio_to_request+0x38/0x38^M
    [ 439.755663] [] run_timer_softirq+0x1ce/0x1f8^M
    [ 439.755663] [] __do_softirq+0x181/0x3a4^M
    [ 439.755663] [] irq_exit+0x40/0x94^M
    [ 439.755663] [] smp_apic_timer_interrupt+0x33/0x3e^M
    [ 439.755663] [] apic_timer_interrupt+0x84/0x90^M
    [ 439.755663] ^M
    [ 439.755663] [] ? _raw_spin_unlock_irq+0x32/0x4a^M
    [ 439.755663] [] finish_task_switch+0xe0/0x163^M
    [ 439.755663] [] ? finish_task_switch+0xa2/0x163^M
    [ 439.755663] [] __schedule+0x469/0x6cd^M
    [ 439.755663] [] schedule+0x82/0x9a^M
    [ 439.789267] [] signalfd_read+0x186/0x49a^M
    [ 439.790911] [] ? wake_up_q+0x47/0x47^M
    [ 439.790911] [] __vfs_read+0x28/0x9f^M
    [ 439.790911] [] ? __fget_light+0x4d/0x74^M
    [ 439.790911] [] vfs_read+0x7a/0xc6^M
    [ 439.790911] [] SyS_read+0x49/0x7f^M
    [ 439.790911] [] entry_SYSCALL_64_fastpath+0x12/0x6f^M
    [ 439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
    f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
    53 38 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
    ^M
    [ 439.790911] RIP [] blk_mq_tag_to_rq+0x21/0x6e^M
    [ 439.790911] RSP ^M
    [ 439.790911] CR2: 0000000000000158^M
    [ 439.790911] ---[ end trace d40af58949325661 ]---^M

    Cc:
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

14 Aug, 2015

1 commit

  • The way the block layer is currently written, it goes to great lengths
    to avoid having to split bios; upper layer code (such as bio_add_page())
    checks what the underlying device can handle and tries to always create
    bios that don't need to be split.

    But this approach becomes unwieldy and eventually breaks down with
    stacked devices and devices with dynamic limits, and it adds a lot of
    complexity. If the block layer could split bios as needed, we could
    eliminate a lot of complexity elsewhere - particularly in stacked
    drivers. Code that creates bios can then create whatever size bios are
    convenient, and more importantly stacked drivers don't have to deal with
    both their own bio size limitations and the limitations of the
    (potentially multiple) devices underneath them. In the future this will
    let us delete merge_bvec_fn and a bunch of other code.

    We do this by adding calls to blk_queue_split() to the various
    make_request functions that need it - a few can already handle arbitrary
    size bios. Note that we add the call _after_ any call to
    blk_queue_bounce(); this means that blk_queue_split() and
    blk_recalc_rq_segments() don't need to be concerned with bouncing
    affecting segment merging.
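
    As a hedged sketch, the pattern added to the affected make_request_fn()
    implementations looks roughly like this (the driver function name is
    hypothetical):

        static void my_make_request(struct request_queue *q, struct bio *bio)
        {
                /* bounce first, so splitting and segment counting never see
                 * pages that bouncing may replace afterwards */
                blk_queue_bounce(q, &bio);

                /* let the core split the bio to fit this queue's limits */
                blk_queue_split(q, &bio, q->bio_split);

                /* ... normal handling of the (possibly split) bio ... */
        }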

    Some make_request_fn() callbacks were simple enough to audit and verify
    they don't need blk_queue_split() calls. The skipped ones are:

    * nfhd_make_request (arch/m68k/emu/nfblock.c)
    * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
    * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
    * brd_make_request (ramdisk - drivers/block/brd.c)
    * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
    * loop_make_request
    * null_queue_bio
    * bcache's make_request fns

    Some others are almost certainly safe to remove now, but will be left
    for future patches.

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Ming Lei
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Lars Ellenberg
    Cc: drbd-user@lists.linbit.com
    Cc: Jiri Kosina
    Cc: Geoff Levand
    Cc: Jim Paris
    Cc: Philip Kelleher
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Oleg Drokin
    Cc: Andreas Dilger
    Acked-by: NeilBrown (for the 'md/md.c' bits)
    Acked-by: Mike Snitzer
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Kent Overstreet
    [dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
    Signed-off-by: Dongsu Park
    Signed-off-by: Ming Lin
    Signed-off-by: Jens Axboe

    Kent Overstreet
     

29 Jul, 2015

1 commit

  • Currently we have two different ways to signal an I/O error on a BIO:

    (1) by clearing the BIO_UPTODATE flag
    (2) by returning a Linux errno value to the bi_end_io callback

    The first one has the drawback of only communicating a single possible
    error (-EIO), and the second one has the drawback of not being persistent
    when bios are queued up, and are not passed along from child to parent
    bio in the ever more popular chaining scenario. Having both mechanisms
    available has the additional drawback of utterly confusing driver authors
    and introducing bugs where various I/O submitters only deal with one of
    them, and the others have to add boilerplate code to deal with both kinds
    of error returns.

    So add a new bi_error field to store an errno value directly in struct
    bio and remove the existing mechanisms to clean all this up.
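
    A sketch of the endio convention before and after this change (the
    callback names are hypothetical):

        /* old world: the error is a callback argument (and/or BIO_UPTODATE) */
        static void my_old_endio(struct bio *bio, int error)
        {
                if (error)
                        pr_err("I/O error %d\n", error);
                bio_put(bio);
        }

        /* new world: bi_end_io takes only the bio; the error travels in it */
        static void my_new_endio(struct bio *bio)
        {
                if (bio->bi_error)
                        pr_err("I/O error %d\n", bio->bi_error);
                bio_put(bio);
        }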

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: NeilBrown
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

16 Jul, 2015

1 commit

  • It is reasonable to set the default request timeout to 30 seconds instead
    of 30000 ticks, which may be 300 seconds if HZ is 100; for example, some
    arm64 based systems may choose HZ=100.
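
    The fix, roughly, is to express the default in seconds scaled by HZ
    rather than as a raw tick count (sketch; the exact call site is an
    assumption):

        /* in blk_mq_init_allocated_queue(): 30 seconds regardless of HZ */
        blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);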

    Signed-off-by: Ming Lei
    Fixes: c76cbbcf4044 ("blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()")
    Signed-off-by: Jens Axboe

    Ming Lei
     

26 Jun, 2015

1 commit

  • Pull core block IO update from Jens Axboe:
    "Nothing really major in here, mostly a collection of smaller
    optimizations and cleanups, mixed with various fixes. In more detail,
    this contains:

    - Addition of policy specific data to blkcg for block cgroups. From
    Arianna Avanzini.

    - Various cleanups around command types from Christoph.

    - Cleanup of the suspend block I/O path from Christoph.

    - Plugging updates from Shaohua and Jeff Moyer, for blk-mq.

    - Eliminating atomic inc/dec of both remaining IO count and reference
    count in a bio. From me.

    - Fixes for SG gap and chunk size support for data-less (discards)
    IO, so we can merge these better. From me.

    - Small restructuring of blk-mq shared tag support, freeing drivers
    from iterating hardware queues. From Keith Busch.

    - A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the
    IOPS mode the default for non-rotational storage"

    * 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
    cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
    cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
    cfq-iosched: move group scheduling functions under ifdef
    cfq-iosched: fix the setting of IOPS mode on SSDs
    blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
    block, cgroup: implement policy-specific per-blkcg data
    block: Make CFQ default to IOPS mode on SSDs
    block: add blk_set_queue_dying() to blkdev.h
    blk-mq: Shared tag enhancements
    block: don't honor chunk sizes for data-less IO
    block: only honor SG gap prevention for merges that contain data
    block: fix returnvar.cocci warnings
    block, dm: don't copy bios for request clones
    block: remove management of bi_remaining when restoring original bi_end_io
    block: replace trylock with mutex_lock in blkdev_reread_part()
    block: export blkdev_reread_part() and __blkdev_reread_part()
    suspend: simplify block I/O handling
    block: collapse bio bit space
    block: remove unused BIO_RW_BLOCK and BIO_EOF flags
    block: remove BIO_EOPNOTSUPP
    ...

    Linus Torvalds
     

10 Jun, 2015

1 commit

  • Now blk_cleanup_queue() can be called before del_gendisk()[1].
    Inside del_gendisk(), hctx->ctxs is touched from
    blk_mq_unregister_hctx(), but the variable has already been freed
    by blk_cleanup_queue() at that time.

    So this patch moves the freeing of hctx->ctxs into the queue's
    release handler to fix the oops reported by Stefan.

    [1], 6cd18e711dd8075 (block: destroy bdi before blockdev is
    unregistered)

    Reported-by: Stefan Seyfried
    Cc: NeilBrown
    Cc: Christoph Hellwig
    Cc: stable@vger.kernel.org (v4.0)
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

02 Jun, 2015

1 commit

  • Storage controllers may expose multiple block devices that share hardware
    resources managed by blk-mq. This patch enhances the shared tags so a
    low-level driver can access the shared resources not tied to the unshared
    h/w contexts. This way the LLD can dynamically add and delete disks and
    request queues without having to track all the request_queue hctx's to
    iterate outstanding tags.

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

09 May, 2015

4 commits

  • The last patch makes plugging work for the multiple queue case. However
    it only works for the single disk case, because it assumes there is only
    one request in the plug list. If a task is accessing multiple disks,
    e.g. MD/DM, the assumption is wrong. Let blk_attempt_plug_merge() record
    the request from the same queue.

    V2: use a NULL parameter in the !mq case. Fix a bug. Add comments in
    blk_attempt_plug_merge to make it (hopefully) less confusing.

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Plugging is still helpful for workloads with IO merging, but it can be
    harmful otherwise, especially with multiple hardware queues, as there is
    (supposedly) no lock contention in this case and plugging can introduce
    latency. For multiple queues, we do limited plugging, e.g. plug only if
    there is a request merge. If a request doesn't merge with the following
    request, the request will be dispatched immediately.

    V2: check blk_queue_nomerges() as suggested by Jeff.

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • If we directly issue a request and it fails, we use
    blk_mq_merge_queue_io(). But we already assigned the bio to a request in
    blk_mq_bio_to_request(), so blk_mq_merge_queue_io() shouldn't run
    blk_mq_bio_to_request() again.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • The following appears in blk_sq_make_request:

    /*
    * If we have multiple hardware queues, just go directly to
    * one of those for sync IO.
    */

    We clearly don't have multiple hardware queues, here! This comment was
    introduced with commit 07068d5b8e (blk-mq: split make request
    handler for multi and single queue):

    We want slightly different behavior from them:

    - On single queue devices, we currently use the per-process plug
    for deferred IO and for merging.

    - On multi queue devices, we don't use the per-process plug, but
    we want to go straight to hardware for SYNC IO.

    The old code had this:

    use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

    and that was converted to:

    use_plug = !is_flush_fua && !is_sync;

    which is not equivalent. For the single queue case, that second half of
    the && expression is always true. So, what I think was actually intended
    follows (and this more closely matches what is done in blk_queue_bio).
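
    Presumably the intended condition for this single-queue path, then, is
    simply (my reading of the reasoning above, not a quote from the patch):

        use_plug = !is_flush_fua;   /* nr_hw_queues == 1 here, so plug sync IO too */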

    V2: delete the 'likely', which should not be a big deal

    Signed-off-by: Jeff Moyer
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

05 May, 2015

1 commit

  • Normally if the driver is too busy to dispatch a request, the logic is
    like below:

    block layer:                        driver:
        __blk_mq_run_hw_queue
                                        a. blk_mq_stop_hw_queue
        b. rq add to ctx->dispatch

    later:
                                        1. blk_mq_start_hw_queue
        2. __blk_mq_run_hw_queue

    But it's possible steps 1-2 run between a and b. And since the rq isn't
    in ctx->dispatch yet, step 2 will not run the rq. The rq might get lost
    if no subsequent requests kick in.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

24 Apr, 2015

1 commit

  • hctx->tags has to be set to NULL when the hctx is to be unmapped, no
    matter whether set->tags[hctx->queue_num] is NULL or not in
    blk_mq_map_swqueue(), because the shared tags may already have been
    freed from another request queue.

    The same situation has to be considered when handling CPU online too.
    An unmapped hw queue can be remapped after the CPU topology is changed,
    so we need to allocate tags for the hw queue in blk_mq_map_swqueue().
    Then the tags allocation for the hw queue can be removed from the hctx
    CPU online notifier, and it is reasonable to do that after the mapping
    is updated.

    Cc:
    Reported-by: Dongsu Park
    Tested-by: Dongsu Park
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei