15 Jun, 2012

2 commits

  • Commit 777eb1bf15b8532c396821774bf6451e563438f5 disconnects the
    externally supplied queue_lock before blk_drain_queue(). Switching the
    lock there would introduce lock unbalance because threads which have
    taken the external lock might unlock the internal lock during the
    queue drain. This patch mitigates this by disconnecting the lock after
    the queue draining, since queue draining makes a lot of request_queue
    users go away (see the sketch at the end of this entry).

    However, please note that this patch only makes the problem less
    likely to happen. Anyone who still holds a ref might try to issue a
    new request on a dead queue after blk_cleanup_queue() finishes
    draining, and the lock unbalance can still happen in that case.

    =====================================
    [ BUG: bad unlock balance detected! ]
    3.4.0+ #288 Not tainted
    -------------------------------------
    fio/17706 is trying to release lock (&(&q->__queue_lock)->rlock) at:
    [] blk_queue_bio+0x2a2/0x380
    but there are no more locks to release!

    other info that might help us debug this:
    1 lock held by fio/17706:
    #0: (&(&vblk->lock)->rlock){......}, at: []
    get_request_wait+0x19a/0x250

    stack backtrace:
    Pid: 17706, comm: fio Not tainted 3.4.0+ #288
    Call Trace:
    [] ? blk_queue_bio+0x2a2/0x380
    [] print_unlock_inbalance_bug+0xf9/0x100
    [] lock_release_non_nested+0x1df/0x330
    [] ? dio_bio_end_aio+0x34/0xc0
    [] ? bio_check_pages_dirty+0x85/0xe0
    [] ? dio_bio_end_aio+0xb1/0xc0
    [] ? blk_queue_bio+0x2a2/0x380
    [] ? blk_queue_bio+0x2a2/0x380
    [] lock_release+0xd9/0x250
    [] _raw_spin_unlock_irq+0x23/0x40
    [] blk_queue_bio+0x2a2/0x380
    [] generic_make_request+0xca/0x100
    [] submit_bio+0x76/0xf0
    [] ? set_page_dirty_lock+0x3c/0x60
    [] ? bio_set_pages_dirty+0x51/0x70
    [] do_blockdev_direct_IO+0xbf8/0xee0
    [] ? blkdev_get_block+0x80/0x80
    [] __blockdev_direct_IO+0x55/0x60
    [] ? blkdev_get_block+0x80/0x80
    [] blkdev_direct_IO+0x57/0x60
    [] ? blkdev_get_block+0x80/0x80
    [] generic_file_aio_read+0x70e/0x760
    [] ? __lock_acquire+0x215/0x5a0
    [] ? aio_run_iocb+0x54/0x1a0
    [] ? grab_cache_page_nowait+0xc0/0xc0
    [] aio_rw_vect_retry+0x7c/0x1e0
    [] ? aio_fsync+0x30/0x30
    [] aio_run_iocb+0x66/0x1a0
    [] do_io_submit+0x6f0/0xb80
    [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [] sys_io_submit+0x10/0x20
    [] system_call_fastpath+0x16/0x1b

    Changes since v2: Update commit log to explain how the code is still
    broken even if we delay the lock switching after the drain.
    Changes since v1: Update commit log as Tejun suggested.

    Acked-by: Tejun Heo
    Signed-off-by: Asias He
    Signed-off-by: Jens Axboe
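
    A minimal sketch of the resulting ordering in blk_cleanup_queue(),
    assuming the 3.4-era field names (simplified; error handling and
    unrelated steps omitted):

      spin_lock_irq(lock);                      /* may be the driver-supplied lock */
      queue_flag_set(QUEUE_FLAG_DEAD, q);
      spin_unlock_irq(lock);

      blk_drain_queue(q, true);                 /* most request_queue users go away here */

      spin_lock_irq(lock);
      if (q->queue_lock != &q->__queue_lock)    /* only now switch back to the internal lock */
          q->queue_lock = &q->__queue_lock;
      spin_unlock_irq(lock);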

    Asias He
     
  • After hot-unplugging a stressed disk, I found that rl->wait[] is not
    empty while rl->count[] is empty and there are threads still sleeping
    on get_request after the queue cleanup. With simple debug code, I
    found there are exactly nr_sleep - nr_wakeup threads in D state. So
    there are missed wakeups.

    $ dmesg | grep nr_sleep
    [ 52.917115] ---> nr_sleep=1046, nr_wakeup=873, delta=173
    $ vmstat 1
    1 173 0 712640 24292 96172 0 0 0 0 419 757 0 0 0 100 0

    To quote Tejun:

    Ah, okay, freed_request() wakes up single waiter with the assumption
    that after the wakeup there will at least be one successful allocation
    which in turn will continue the wakeup chain until the wait list is
    empty - ie. waiter wakeup is dependent on successful request
    allocation happening after each wakeup. With queue marked dead, any
    woken up waiter fails the allocation path, so the wakeup chaining is
    lost and we're left with hung waiters. What we need is wake_up_all()
    after drain completion.

    This patch fixes the missed wakeups by waking up all the threads which
    are sleeping on the wait queue after the queue drain (see the sketch
    at the end of this entry).

    Changes in v2: Drop waitqueue_active() optimization

    Acked-by: Tejun Heo
    Signed-off-by: Asias He

    Fixed a bug by me, where stacked devices would oops on calling
    blk_drain_queue() since ->rq.wait[] does not get initialized unless
    it's a full queue setup.

    Signed-off-by: Jens Axboe
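
    A minimal sketch of the fix at the end of blk_drain_queue(), assuming
    the 3.4-era field names; the q->request_fn test reflects the
    stacked-device fix noted above:

      if (q->request_fn) {                      /* rl->wait[] is only set up for full queues */
          struct request_list *rl = &q->rq;
          int i;

          for (i = 0; i < ARRAY_SIZE(rl->wait); i++)
              wake_up_all(&rl->wait[i]);        /* restart the broken wakeup chain */
      }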

    Asias He
     

01 May, 2012

1 commit


20 Apr, 2012

4 commits

  • Request allocation is mempool backed to guarantee forward progress
    under memory pressure; unfortunately, this property got broken while
    adding elvpriv data. Failures during elvpriv allocation, including
    ioc and icq creation failures, currently make get_request() fail as
    whole. There's no forward progress guarantee for these allocations -
    they may fail indefinitely under memory pressure stalling IO and
    deadlocking the system.

    This patch updates get_request() such that elvpriv allocation failure
    doesn't make the whole function fail. If elvpriv allocation fails,
    the allocation is degraded into !ELVPRIV. This will force the request
    to ELEVATOR_INSERT_BACK, disturbing scheduling, but elvpriv alloc
    failures should be rare (nothing is per-request) and anything is
    better than deadlocking; see the sketch at the end of this entry.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe
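
    A rough sketch of the degradation path in get_request(); the failure
    condition name is made up and the real bookkeeping is more involved:

      if (elvpriv_alloc_failed) {               /* hypothetical label for ioc/icq/elvpriv failure */
          rq->cmd_flags &= ~REQ_ELVPRIV;        /* degrade the allocation to !ELVPRIV */
          /* such a request is later inserted with ELEVATOR_INSERT_BACK */
      }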

    Tejun Heo
     
  • Allocation failure handling in get_request() is about to be updated.
    To ease the update, collapse blk_alloc_request() into get_request().

    This patch doesn't introduce any functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With the previous change to guarantee bypass visibility for RCU read
    lock regions, entering bypass mode involves non-trivial overhead, and
    future changes are scheduled to make use of bypass mode during the
    init path. Combined, they may end up adding noticeable delay during
    boot.

    This patch makes request_queue start its life in bypass mode, which is
    ended on queue init completion at the end of
    blk_init_allocated_queue(), and updates blk_queue_bypass_start() such
    that draining and RCU synchronization are performed only when the
    queue actually enters bypass mode.

    This avoids unnecessarily switching in and out of bypass mode during
    init, avoiding the overhead and any nasty surprises which may stem
    from leaving bypass mode on half-initialized queues (see the sketch at
    the end of this entry).

    The boot time overhead was pointed out by Vivek.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe
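
    A sketch of blk_queue_bypass_start() after this change, assuming the
    upstream layout of the time (simplified):

      void blk_queue_bypass_start(struct request_queue *q)
      {
          bool drain;

          spin_lock_irq(q->queue_lock);
          drain = !q->bypass_depth++;           /* only the first bypasser drains */
          queue_flag_set(QUEUE_FLAG_BYPASS, q);
          spin_unlock_irq(q->queue_lock);

          if (drain) {
              blk_drain_queue(q, false);
              /* make blk_queue_bypass() visible to RCU read-lock regions */
              synchronize_rcu();
          }
      }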

    Tejun Heo
     
  • Currently, blkg_lookup() doesn't check @q bypass state. This patch
    updates blk_queue_bypass_start() to do synchronize_rcu() before
    returning and updates blkg_lookup() to check blk_queue_bypass() and
    return %NULL if bypassing. This ensures blkg_lookup() returns %NULL
    if @q is bypassing.

    This is to guarantee that nobody is accessing policy data while @q is
    bypassing, which is necessary to allow replacing blkio_cgroup->pd[] in
    place on policy [de]activation.

    v2: Added more comments explaining bypass guarantees as suggested by
    Vivek.

    v3: Added more comments explaining why there's no synchronize_rcu() in
    blk_cleanup_queue() as suggested by Vivek.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe
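
    Roughly, the lookup side becomes (a sketch, using the blkio naming of
    the time):

      struct blkio_group *blkg_lookup(struct blkio_cgroup *blkcg,
                                      struct request_queue *q)
      {
          WARN_ON_ONCE(!rcu_read_lock_held());

          if (unlikely(blk_queue_bypass(q)))    /* bypassing queues never return a blkg */
              return NULL;
          return __blkg_lookup(blkcg, q);
      }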

    Tejun Heo
     

07 Apr, 2012

1 commit

  • We do an automatic block plug flush to reduce latency; the threshold
    is 16 requests. This works well if the task is accessing one or two
    drives. The problem is that if the task is accessing a raid 0 device
    with a big raid disk number, say 8 or 16, each disk gets only
    16/8 = 2 or 16/16 = 1 requests per flush and we will have heavy lock
    contention.

    This patch makes the threshold per-disk based. The latency should
    still be ok when accessing one or two drives. A setup with an
    application accessing a lot of drives at the same time is usually a
    big machine, where avoiding lock contention is more important, because
    any contention will actually increase latency (see the sketch at the
    end of this entry).

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe
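
    A minimal sketch of the per-disk accounting (simplified; in the actual
    patch the counting is folded into the plug-merge walk):

      unsigned int request_count = 0;
      struct request *rq;

      list_for_each_entry(rq, &plug->list, queuelist)
          if (rq->q == q)                       /* only count requests for this disk's queue */
              request_count++;

      if (request_count >= BLK_MAX_REQUEST_COUNT)
          blk_flush_plug_list(plug, false);     /* flush only when this disk has enough */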

    Shaohua Li
     

23 Mar, 2012

1 commit


07 Mar, 2012

10 commits

  • IO scheduling and cgroup are tied to the issuing task via the
    io_context and cgroup of %current. Unfortunately, there are cases
    where IOs need to be routed via a different task, which makes
    scheduling and cgroup limit enforcement apply completely incorrectly.

    For example, all bios delayed by blk-throttle end up being issued by a
    delayed work item, get assigned the io_context of the worker task
    which happens to serve the work item, and are dumped into the default
    block cgroup. This is doubly confusing, as bios which aren't delayed
    end up in the correct cgroup, and it makes using blk-throttle and cfq
    propio together impossible.

    Any code which punts IO issuing to another task is affected, and this
    is getting more and more common (e.g. btrfs). As both io_context and
    cgroup are firmly tied to the task, including the userland-visible
    APIs to manipulate them, it makes a lot of sense to match up tasks to
    bios.

    This patch implements bio_associate_current() which associates the
    specified bio with %current. The bio will record the associated ioc
    and blkcg at that point and the block layer will use the recorded ones
    regardless of which task actually ends up issuing the bio. bio
    release puts the associated ioc and blkcg (see the usage sketch at the
    end of this entry).

    It grabs and remembers the ioc and blkcg instead of the task itself
    because the task may already be dead by the time the bio is issued,
    making the ioc and blkcg inaccessible, and those are all the block
    layer cares about.

    elevator_set_req_fn() is updated such that the bio for which elvdata
    is being allocated is available to the elevator.

    This doesn't update block cgroup policies yet. Further patches will
    implement the support.

    -v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in
    rq_ioc() to fix build breakage.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Kent Overstreet
    Signed-off-by: Jens Axboe
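
    A hypothetical usage sketch (the list, workqueue and function names
    are made up): a driver that punts bios to a worker records the
    submitter's context first, so IO scheduling and cgroup limits follow
    the original task rather than the worker.

      static void punt_bio_to_worker(struct bio *bio)
      {
          bio_associate_current(bio);           /* record %current's ioc and blkcg */
          bio_list_add(&punt_bios, bio);        /* hypothetical deferral list */
          queue_work(punt_wq, &punt_work);      /* actual submission happens in a worker */
      }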

    Tejun Heo
     
  • Make the following interface updates to prepare for future ioc related
    changes.

    * create_io_context() returning ioc only works for %current because it
    doesn't increment ref on the ioc. Drop @task parameter from it and
    always assume %current.

    * Make create_io_context_slowpath() return 0 or -errno and rename it
    to create_task_io_context().

    * Make ioc_create_icq() take @ioc as parameter instead of assuming
    that of %current. The caller, get_request(), is updated to create
    ioc explicitly and then pass it into ioc_create_icq().

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe
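
    A sketch of the resulting prototypes (argument lists are a best guess
    from the description above):

      struct io_context *create_io_context(gfp_t gfp_mask, int node);   /* always %current */
      int create_task_io_context(struct task_struct *task, gfp_t gfp_mask, int node);
      struct io_cq *ioc_create_icq(struct io_context *ioc, struct request_queue *q,
                                   gfp_t gfp_mask);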

    Tejun Heo
     
  • get_request() is structured a bit unusually in that failure path is
    inlined in the usual flow with goto labels atop and inside it.
    Relocate the error path to the end of the function.

    This is to prepare for icq handling changes in get_request() and
    doesn't introduce any behavior change.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blkg is per cgroup-queue-policy combination. This is
    unnatural and leads to various convolutions in partially used
    duplicate fields in blkg, config / stat access, and general management
    of blkgs.

    This patch makes blkg's per cgroup-queue and lets them serve all
    policies. blkgs are now created and destroyed by blkcg core proper.
    This will allow further consolidation of common management logic into
    blkcg core and API with better defined semantics and layering.

    As a transitional step to untangle blkg management, elvswitch and
    policy [de]registration, all blkgs except the root blkg are being shot
    down during elvswitch and bypass. This patch adds blkg_root_update()
    to update root blkg in place on policy change. This is hacky and racy
    but should be good enough as interim step until we get locking
    simplified and switch over to proper in-place update for all blkgs.

    -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc()
    comment wasn't updated according to the function change. Fixed.
    Both pointed out by Vivek.

    -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for
    all policies. This freed root pd during elvswitch before the
    last queue finished exiting and led to oops. Directly invoke
    update_root_blkg_pd() only on BLKIO_POLICY_PROP from
    cfq_exit_queue(). This also is closer to what will be done with
    proper in-place blkg update. Reported by Vivek.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, specific policy implementations are responsible for
    maintaining list and number of blkgs. This duplicates code
    unnecessarily, and hinders factoring common code and providing blkcg
    API with better defined semantics.

    After this patch, request_queue hosts list heads and counters and blkg
    has list nodes for both policies. This patch only relocates the
    necessary fields and the next patch will actually move management code
    into blkcg core.

    Note that request_queue->blkg_list[] and ->nr_blkgs[] are hardcoded to
    have 2 elements. This is to avoid include dependency and will be
    removed by the next patch.

    This patch doesn't introduce any behavior change.

    -v2: Now unnecessary conditional on CONFIG_BLK_CGROUP_MODULE removed
    as pointed out by Vivek.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently block core calls directly into blk-throttle for init, drain
    and exit. This patch adds blkcg_{init|drain|exit}_queue() which wrap
    the blk-throttle functions. This is to give more control and
    visibility to the blkcg core layer for proper layering. Further
    patches will add logic common to blkcg policies to these functions
    (see the sketch at the end of this entry).

    While at it, collapse blk_throtl_release() into blk_throtl_exit().
    There's no reason to keep them separate.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe
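
    The wrappers are thin for now; a sketch of the shape (later patches
    add blkcg-wide logic here):

      int blkcg_init_queue(struct request_queue *q)
      {
          return blk_throtl_init(q);
      }

      void blkcg_drain_queue(struct request_queue *q)
      {
          blk_throtl_drain(q);
      }

      void blkcg_exit_queue(struct request_queue *q)
      {
          blk_throtl_exit(q);                   /* blk_throtl_release() is folded in here */
      }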

    Tejun Heo
     
  • For root blkg, blk_throtl_init() was using throtl_alloc_tg()
    explicitly and cfq_init_queue() was manually initializing embedded
    cfqd->root_group, adding unnecessarily different code paths to blkg
    handling.

    Make both use the usual blkio_group get functions - throtl_get_tg()
    and cfq_get_cfqg() - for the root blkio_group too. Note that
    blk_throtl_init() callsite is pushed downwards in
    blk_alloc_queue_node() so that @q is sufficiently initialized for
    throtl_get_tg().

    This simplifies root blkg handling noticeably for cfq and will allow
    further modularization of blkcg API.

    -v2: Vivek pointed out that using cfq_get_cfqg() won't work if
    CONFIG_CFQ_GROUP_IOSCHED is disabled. Fix it by factoring out
    initialization of base part of cfqg into cfq_init_cfqg_base() and
    alloc/init/free explicitly if !CONFIG_CFQ_GROUP_IOSCHED.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Extend queue bypassing such that dying queue is always bypassing and
    blk-throttle is drained on bypass. With blkcg policies updated to
    test blk_queue_bypass() instead of blk_queue_dead(), this ensures that
    no bio or request is held by or going through blkcg policies on a
    bypassing queue.

    This will be used to implement blkg cleanup on elevator switches and
    policy changes.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Rename and extend elv_quiesce_start/end() to
    blk_queue_bypass_start/end(), which are exported and support nesting
    via @q->bypass_depth. Also add blk_queue_bypass() to test the bypass
    state.

    This will be further extended and used for blkio_group management.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blk_cleanup_queue() doesn't call blk_drain_queue() if
    q->elevator doesn't exist; however, bio based drivers don't have an
    elevator initialized but can still use blk-throttle. This patch moves
    the q->elevator test inside blk_drain_queue() such that only
    elv_drain_elevator() is skipped if !q->elevator (see the sketch at the
    end of this entry).

    -v2: loop can have registered queue which has NULL request_fn. Make
    sure we don't call into __blk_run_queue() in such cases.

    Signed-off-by: Tejun Heo
    Reported-by: Vivek Goyal

    Fold in bug fix from Vivek.

    Signed-off-by: Jens Axboe
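
    A sketch of the drain loop body after this change (simplified, drain
    bookkeeping omitted; names as in the block layer of the time):

      spin_lock_irq(q->queue_lock);

      if (q->elevator)                          /* only the elevator drain is conditional */
          elv_drain_elevator(q);

      blk_throtl_drain(q);                      /* bio based drivers still need this */

      if (q->request_fn)                        /* v2: loop can register a queue w/o request_fn */
          __blk_run_queue(q);

      spin_unlock_irq(q->queue_lock);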

    Tejun Heo
     

08 Feb, 2012

2 commits

  • Plug merge calls two elevator callbacks outside queue lock -
    elevator_allow_merge_fn() and elevator_bio_merged_fn(). Although
    attempt_plug_merge() suggests that elevator is guaranteed to be there
    through the existing request on the plug list, nothing prevents plug
    merge from calling into dying or initializing elevator.

    For regular merges, bypassing ensures that the elvpriv count reaches
    zero, which in turn prevents merges as all !ELVPRIV requests get
    REQ_SOFTBARRIER from forced back insertion. Plug merge doesn't check
    ELVPRIV, and, as the requests haven't gone through elevator insertion
    yet, they don't have SOFTBARRIER set, allowing merges on a bypassed
    queue.

    This, for example, leads to the following crash during elevator
    switch.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: [] cfq_allow_merge+0x49/0xa0
    PGD 112cbc067 PUD 115d5c067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP
    CPU 1
    Modules linked in: deadline_iosched

    Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs
    RIP: 0010:[] [] cfq_allow_merge+0x49/0xa0
    RSP: 0018:ffff8801143a38f8 EFLAGS: 00010297
    RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0
    RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8
    RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708
    R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708
    FS: 00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0)
    Stack:
    ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba
    ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1
    ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e
    Call Trace:
    [] elv_rq_merge_ok+0x4a/0x60
    [] elv_try_merge+0x21/0x40
    [] blk_queue_bio+0x8e/0x390
    [] generic_make_request+0xca/0x100
    [] submit_bio+0x74/0x100
    [] __blockdev_direct_IO+0x1ce2/0x3450
    [] blkdev_direct_IO+0x57/0x60
    [] generic_file_aio_read+0x6d5/0x760
    [] do_sync_read+0xe2/0x120
    [] vfs_read+0xc5/0x180
    [] sys_read+0x51/0x90
    [] system_call_fastpath+0x16/0x1b

    There are multiple ways to fix this including making plug merge check
    ELVPRIV; however,

    * Calling into elevator outside queue lock is confusing and
    error-prone.

    * Requests on plug list aren't known to the elevator. They aren't on
    the elevator yet, so there's no elevator specific state to update.

    * Given the nature of plug merges - collecting bio's for the same
    purpose from the same issuer - elevator specific restrictions aren't
    applicable.

    So, simply don't call into elevator methods from plug merge by moving
    elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and
    using blk_try_merge() in attempt_plug_merge().

    This is based on Jens' patch to skip elevator_allow_merge_fn() from
    plug merge.

    Note that this makes per-cgroup merged stats skip plug merging.

    Signed-off-by: Tejun Heo
    LKML-Reference:
    Original-patch-by: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
    test. blk_try_merge() determines merge direction and expects the
    caller to have tested elv_rq_merge_ok() previously.

    elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
    elv_iosched_allow_merge(). elv_try_merge() is removed and the two
    callers are updated to call elv_rq_merge_ok() explicitly followed by
    blk_try_merge(). While at it, make rq_merge_ok() functions return
    bool.

    This is to prepare for plug merge update and doesn't introduce any
    behavior change.

    This is based on Jens' patch to skip elevator_allow_merge_fn() from
    plug merge.

    Signed-off-by: Tejun Heo
    LKML-Reference:
    Original-patch-by: Jens Axboe
    Signed-off-by: Jens Axboe
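
    The resulting split looks roughly like this; callers first test
    elv_rq_merge_ok() and then call blk_try_merge() for the direction
    (a sketch):

      bool elv_rq_merge_ok(struct request *rq, struct bio *bio)
      {
          if (!blk_rq_merge_ok(rq, bio))         /* elevator-neutral checks */
              return false;

          if (!elv_iosched_allow_merge(rq, bio)) /* elevator veto */
              return false;

          return true;
      }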

    Tejun Heo
     

07 Feb, 2012

1 commit

  • put_io_context() performed a complex trylock dance to avoid deferring
    ioc release to a workqueue. It was also broken on UP because the
    trylock was always assumed to succeed, which resulted in an unbalanced
    preemption count.

    While there are ways to fix the UP breakage, even the most
    pathological microbench (forced ioc allocation and tight fork/exit
    loop) fails to show any appreciable performance benefit of the
    optimization. Strip it out. If there turns out to be workloads which
    are affected by this change, simpler optimization from the discussion
    thread can be applied later.

    Signed-off-by: Tejun Heo
    LKML-Reference:
    Signed-off-by: Jens Axboe

    Tejun Heo
     

19 Jan, 2012

1 commit

  • Vivek reported a kernel crash:
    [ 94.217015] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
    [ 94.218004] IP: [] kmem_cache_free+0x5e/0x200
    [ 94.218004] PGD 13abda067 PUD 137d52067 PMD 0
    [ 94.218004] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
    [ 94.218004] CPU 0
    [ 94.218004] Modules linked in: [last unloaded: scsi_wait_scan]
    [ 94.218004]
    [ 94.218004] Pid: 0, comm: swapper/0 Not tainted 3.2.0+ #16 Hewlett-Packard HP xw6600 Workstation/0A9Ch
    [ 94.218004] RIP: 0010:[] [] kmem_cache_free+0x5e/0x200
    [ 94.218004] RSP: 0018:ffff88013fc03de0 EFLAGS: 00010006
    [ 94.218004] RAX: ffffffff81e0d020 RBX: ffff880138b3c680 RCX: 00000001801c001b
    [ 94.218004] RDX: 00000000003aac1d RSI: ffff880138b3c680 RDI: ffffffff81142fae
    [ 94.218004] RBP: ffff88013fc03e10 R08: ffff880137830238 R09: 0000000000000001
    [ 94.218004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    [ 94.218004] R13: ffffea0004e2cf00 R14: ffffffff812f6eb6 R15: 0000000000000246
    [ 94.218004] FS: 0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
    [ 94.218004] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 94.218004] CR2: 000000000000001c CR3: 00000001395ab000 CR4: 00000000000006f0
    [ 94.218004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 94.218004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [ 94.218004] Process swapper/0 (pid: 0, threadinfo ffffffff81e00000, task ffffffff81e0d020)
    [ 94.218004] Stack:
    [ 94.218004] 0000000000000102 ffff88013fc0db20 ffffffff81e22700 ffff880139500f00
    [ 94.218004] 0000000000000001 000000000000000a ffff88013fc03e20 ffffffff812f6eb6
    [ 94.218004] ffff88013fc03e90 ffffffff810c8da2 ffffffff81e01fd8 ffff880137830240
    [ 94.218004] Call Trace:
    [ 94.218004]
    [ 94.218004] [] icq_free_icq_rcu+0x16/0x20
    [ 94.218004] [] __rcu_process_callbacks+0x1c2/0x420
    [ 94.218004] [] rcu_process_callbacks+0x38/0x250
    [ 94.218004] [] __do_softirq+0xce/0x3e0
    [ 94.218004] [] ? clockevents_program_event+0x74/0x100
    [ 94.218004] [] ? tick_program_event+0x24/0x30
    [ 94.218004] [] call_softirq+0x1c/0x30
    [ 94.218004] [] do_softirq+0x8d/0xc0
    [ 94.218004] [] irq_exit+0xae/0xe0
    [ 94.218004] [] smp_apic_timer_interrupt+0x6e/0x99
    [ 94.218004] [] apic_timer_interrupt+0x70/0x80

    Once a queue is quiesced, it's not supposed to have any elvpriv data
    or icq's, and elevator switching depends on that. The request alloc
    path followed the rule for elvpriv data but forgot to apply it to
    icq's, leading to the crash above during an elevator switch. Fix it by
    not allocating icq's if ELVPRIV is not set for the request (see the
    sketch at the end of this entry).

    Reported-by: Vivek Goyal
    Tested-by: Vivek Goyal
    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe
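
    A sketch of the resulting allocation path in get_request()
    (simplified; the icq interface shown is the pre-3.4 one that operates
    on %current's ioc):

      if (rw_flags & REQ_ELVPRIV) {             /* quiesced queues never set ELVPRIV */
          icq = ioc_lookup_icq(ioc, q);
          if (!icq)
              icq = ioc_create_icq(q, gfp_mask);/* failure is handled as before */
      }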

    Shaohua Li
     

16 Jan, 2012

1 commit

  • * 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
    Revert "block: recursive merge requests"
    block: Stop using macro stubs for the bio data integrity calls
    blockdev: convert some macros to static inlines
    fs: remove unneeded plug in mpage_readpages()
    block: Add BLKROTATIONAL ioctl
    block: Introduce blk_set_stacking_limits function
    block: remove WARN_ON_ONCE() in exit_io_context()
    block: an exiting task should be allowed to create io_context
    block: ioc_cgroup_changed() needs to be exported
    block: recursive merge requests
    block, cfq: fix empty queue crash caused by request merge
    block, cfq: move icq creation and rq->elv.icq association to block core
    block, cfq: restructure io_cq creation path for io_context interface cleanup
    block, cfq: move io_cq exit/release to blk-ioc.c
    block, cfq: move icq cache management to block core
    block, cfq: move io_cq lookup to blk-ioc.c
    block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
    block, cfq: reorganize cfq_io_context into generic and cfq specific parts
    block: remove elevator_queue->ops
    block: reorder elevator switch sequence
    ...

    Fix up conflicts in:
    - block/blk-cgroup.c
    Switch from can_attach_task to can_attach
    - block/cfq-iosched.c
    conflict with now removed cic index changes (we now use q->id instead)

    Linus Torvalds
     

16 Dec, 2011

1 commit

  • While probing, fd sets up the queue, probes the hardware and tears
    down the queue if probing fails. In the process, blk_drain_queue()
    kicks the queue which failed to finish initialization, and fd is
    unhappy about that.

    floppy0: no floppy controllers found
    ------------[ cut here ]------------
    WARNING: at drivers/block/floppy.c:2929 do_fd_request+0xbf/0xd0()
    Hardware name: To Be Filled By O.E.M.
    VFS: do_fd_request called on non-open device
    Modules linked in:
    Pid: 1, comm: swapper Not tainted 3.2.0-rc4-00077-g5983fe2 #2
    Call Trace:
    [] warn_slowpath_common+0x7a/0xb0
    [] warn_slowpath_fmt+0x41/0x50
    [] do_fd_request+0xbf/0xd0
    [] blk_drain_queue+0x65/0x80
    [] blk_cleanup_queue+0xe3/0x1a0
    [] floppy_init+0xdeb/0xe28
    [] ? daring+0x6b/0x6b
    [] do_one_initcall+0x3f/0x170
    [] kernel_init+0x9d/0x11e
    [] ? schedule_tail+0x22/0xa0
    [] kernel_thread_helper+0x4/0x10
    [] ? start_kernel+0x2be/0x2be
    [] ? gs_change+0xb/0xb

    Avoid it by making blk_drain_queue() kick the queue only if the
    dispatch queue has something on it (see the sketch at the end of this
    entry).

    Signed-off-by: Tejun Heo
    Reported-by: Ralf Hildebrandt
    Reported-by: Wu Fengguang
    Tested-by: Sergei Trofimovich
    Signed-off-by: Jens Axboe
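
    The fix is a one-line guard in blk_drain_queue() (sketch):

      if (!list_empty(&q->queue_head))          /* don't kick a queue with nothing to dispatch */
          __blk_run_queue(q);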

    Tejun Heo
     

14 Dec, 2011

9 commits

  • Now block layer knows everything necessary to create and associate
    icq's with requests. Move ioc_create_icq() to blk-ioc.c and update
    get_request() such that, if elevator_type->icq_size is set, requests
    are automatically associated with their matching icq's before
    elv_set_request(). io_context reference is also managed by block core
    on request alloc/free.

    * Only ioprio/cgroup changed handling remains from cfq_get_cic().
    Collapsed into cfq_set_request().

    * This removes queue kicking on icq allocation failure (for now). As
    icq allocation failure is rare and the only effect of queue kicking
    achieved was possibly accelerating queue processing, this change
    shouldn't be noticeable.

    There is a larger underlying problem. Unlike request allocation,
    icq allocation is not guaranteed to succeed eventually after
    retries. The number of icqs is unbounded and thus a mempool can't be
    the solution either. This effectively adds an allocation dependency on
    the memory free path and thus the possibility of deadlock.

    This usually wouldn't happen because icq allocation is not a hot
    path and, even when the condition triggers, it's highly unlikely
    that none of the writeback workers already has icq.

    However, this is still possible especially if elevator is being
    switched under high memory pressure, so we better get it fixed.
    Probably the only solution is just bypassing elevator and appending
    to dispatch queue on any elevator allocation failure.

    * Comment added to explain how icq's are managed and synchronized.

    This completes cleanup of io_context interface.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Most of icq management is about to be moved out of cfq into blk-ioc.
    This patch prepares for it.

    * Move cfqd->icq_list to request_queue->icq_list

    * Make request explicitly point to icq instead of through elevator
    private data. ->elevator_private[3] is replaced with sub struct elv
    which contains icq pointer and priv[2]. cfq is updated accordingly.

    * Meaningless clearing of ->elevator_private[0] removed from
    elv_set_request(). At that point in code, the field was guaranteed
    to be %NULL anyway.

    This patch doesn't introduce any functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe
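
    The new sub struct in struct request looks roughly like this (only the
    fields named above):

      struct {
          struct io_cq  *icq;                   /* icq this request is associated with */
          void          *priv[2];               /* remaining elevator private pointers */
      } elv;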

    Tejun Heo
     
  • When called under queue_lock, current_io_context() triggers lockdep
    warning if it hits allocation path. This is because io_context
    installation is protected by task_lock which is not IRQ safe, so it
    triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock
    deadlock warning.

    Given the restriction, accessor + creator rolled into one doesn't work
    too well. Drop current_io_context() and let the users access
    task->io_context directly inside queue_lock combined with explicit
    creation using create_io_context().

    Future ioc updates will further consolidate ioc access and the create
    interface will be unexported.

    While at it, relocate ioc internal interface declarations in blk.h and
    add section comments before and after.

    This patch does not introduce functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • * blk_get_queue() is peculiar in that it returns 0 on success and 1 on
    failure instead of 0 / -errno or boolean. Update it such that it
    returns %true on success and %false on failure.

    * Make sure the caller checks for the return value.

    * Separate out __blk_get_queue() which doesn't check whether @q is
    dead and put it in blk.h. This will be used later.

    This patch doesn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe
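
    A sketch of the updated pair (assuming the DEAD-flag test of the
    time):

      static inline void __blk_get_queue(struct request_queue *q)
      {
          kobject_get(&q->kobj);                /* no dead check; for internal users */
      }

      bool blk_get_queue(struct request_queue *q)
      {
          if (likely(!blk_queue_dead(q))) {
              __blk_get_queue(q);
              return true;                      /* %true on success now */
          }
          return false;
      }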

    Tejun Heo
     
  • cfq allocates per-queue id using ida and uses it to index cic radix
    tree from io_context. Move it to q->id and allocate on queue init and
    free on queue release. This simplifies cfq a bit and will allow for
    further improvements of io context life-cycle management.

    This patch doesn't introduce any functional difference.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe
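
    Allocation and release follow the usual ida pattern (a sketch; the
    error label is illustrative):

      q->id = ida_simple_get(&blk_queue_ida, 0, 0, GFP_KERNEL);
      if (q->id < 0)
          goto fail_q;                          /* illustrative error label */

      /* and on queue release: */
      ida_simple_remove(&blk_queue_ida, q->id);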

    Tejun Heo
     
  • blk_insert_cloned_request(), blk_execute_rq_nowait() and
    blk_flush_plug_list() either didn't check whether the queue was dead
    or did it without holding queue_lock. Update them so that dead state
    is checked while holding queue_lock.

    AFAICS, this plugs all holes (requeue doesn't matter as the request is
    transitioning atomically from in_flight to queued).

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • When trying to drain all requests, blk_drain_queue() checked only
    q->rq.count[]; however, this only tracks REQ_ALLOCED requests. This
    patch updates blk_drain_queue() such that it looks at all the counters
    and queues so that request_queue is actually empty on completion.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe
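
    A sketch of the widened drain condition, assuming the 3.2-era
    request_queue fields:

      drain |= q->rq.elvpriv;

      if (drain_all) {
          drain |= !list_empty(&q->queue_head); /* dispatch queue */
          for (i = 0; i < 2; i++) {
              drain |= q->rq.count[i];          /* allocated requests */
              drain |= q->in_flight[i];         /* requests at the driver */
              drain |= !list_empty(&q->flush_queue[i]);
          }
      }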

    Tejun Heo
     
  • There are a number of QUEUE_FLAG_DEAD tests. Add blk_queue_dead()
    macro and use it.

    This patch doesn't introduce any functional difference.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe
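
    The macro itself is the usual queue-flag test (sketch):

      #define blk_queue_dead(q)  test_bit(QUEUE_FLAG_DEAD, &(q)->queue_flags)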

    Tejun Heo
     
  • The only user left for blk_insert_request() is sx8 and it can be
    trivially switched to use blk_execute_rq_nowait() - special requests
    aren't included in io stat and sx8 doesn't use block layer tagging.
    Switch sx8 and kill blk_insert_requeset().

    This patch doesn't introduce any functional difference.

    Only compile tested.

    Signed-off-by: Tejun Heo
    Acked-by: Jeff Garzik
    Signed-off-by: Jens Axboe

    Tejun Heo
     

23 Nov, 2011

1 commit

  • struct request_queue is allocated with __GFP_ZERO so its "node" field is
    zero before initialization. This causes an oops if node 0 is offline in
    the page allocator because its zonelists are not initialized. From Dave
    Young's dmesg:

    SRAT: Node 1 PXM 2 0-d0000000
    SRAT: Node 1 PXM 2 100000000-330000000
    SRAT: Node 0 PXM 1 330000000-630000000
    Initmem setup node 1 0000000000000000-000000000affb000
    ...
    Built 1 zonelists in Node order, mobility grouping on.
    ...
    BUG: unable to handle kernel paging request at 0000000000001c08
    IP: [] __alloc_pages_nodemask+0xb5/0x870

    and __alloc_pages_nodemask+0xb5 translates to a NULL pointer on
    zonelist->_zonerefs.

    The fix is to initialize q->node at the time of allocation so the correct
    node is passed to the slab allocator later.

    Since blk_init_allocated_queue_node() is no longer needed, merge it with
    blk_init_allocated_queue().

    [rientjes@google.com: changelog, initializing q->node]
    Cc: stable@vger.kernel.org [2.6.37+]
    Reported-by: Dave Young
    Signed-off-by: Mike Snitzer
    Signed-off-by: David Rientjes
    Tested-by: Dave Young
    Signed-off-by: Jens Axboe
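
    The fix amounts to recording the node right after allocation (a sketch
    of blk_alloc_queue_node()):

      q = kmem_cache_alloc_node(blk_requestq_cachep,
                                gfp_mask | __GFP_ZERO, node_id);
      if (!q)
          return NULL;

      q->node = node_id;                        /* previously left at 0 until blk_init_* */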

    Mike Snitzer
     

16 Nov, 2011

2 commits


04 Nov, 2011

1 commit

  • blk_cleanup_queue() may be called before the elevator is set up on a
    queue, which triggers the following oops.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] elv_drain_elevator+0x1c/0x70
    ...
    Pid: 830, comm: kworker/0:2 Not tainted 3.1.0-next-20111025_64+ #1590
    Bochs Bochs
    RIP: 0010:[] [] elv_drain_elevator+0x1c/0x70
    ...
    Call Trace:
    [] blk_drain_queue+0x42/0x70
    [] blk_cleanup_queue+0xd0/0x1c0
    [] md_free+0x50/0x70
    [] kobject_release+0x8b/0x1d0
    [] kref_put+0x36/0xa0
    [] kobject_put+0x27/0x60
    [] mddev_delayed_delete+0x2f/0x40
    [] process_one_work+0x100/0x3b0
    [] worker_thread+0x15f/0x3a0
    [] kthread+0x87/0x90
    [] kernel_thread_helper+0x4/0x10

    Fix it by making blk_cleanup_queue() check whether q->elevator is set
    up before invoking blk_drain_queue() (see the sketch at the end of
    this entry).

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Jiri Slaby
    Signed-off-by: Jens Axboe
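
    The check itself is small (sketch of blk_cleanup_queue()):

      if (q->elevator)                          /* elevator may not be set up yet */
          blk_drain_queue(q, true);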

    Tejun Heo
     

24 Oct, 2011

2 commits

  • Jens Axboe
     
  • A dm-multipath user reported[1] a problem when trying to boot
    a kernel with commit 4853abaae7e4a2af938115ce9071ef8684fb7af4
    (block: fix flush machinery for stacking drivers with differring
    flush flags) applied. It turns out that an empty flush request
    can be sent into blk_insert_flush. When the BUG_ON was fixed
    to allow for this, I/O on the underlying device would stall. The
    reason is that blk_insert_cloned_request does not kick the queue.
    In the aforementioned commit, I had added a special case to
    kick the queue if data was sent down but the queue flags did
    not require a flush. A better solution is to push the queue
    kick up into blk_insert_cloned_request.

    This patch, along with a follow-on which fixes the BUG_ON, fixes
    the issue reported.

    [1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html

    Reported-by: Christophe Saout
    Signed-off-by: Jeff Moyer
    Acked-by: Tejun Heo

    Stable note: 3.1
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
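
    The kick moves into blk_insert_cloned_request() itself; a sketch, with
    the surrounding locking as in the block core of that era:

      spin_lock_irqsave(q->queue_lock, flags);
      add_acct_request(q, rq, ELEVATOR_INSERT_BACK);
      __blk_run_queue(q);                       /* kick here instead of in blk_insert_flush() */
      spin_unlock_irqrestore(q->queue_lock, flags);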

    Jeff Moyer