14 Dec, 2011

1 commit

  • There are a number of QUEUE_FLAG_DEAD tests. Add blk_queue_dead()
    macro and use it.

    This patch doesn't introduce any functional difference.
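
    A minimal sketch of such a helper, following the existing
    blk_queue_*() test_bit() pattern (its exact placement in
    include/linux/blkdev.h is an assumption):

    #define blk_queue_dead(q)  test_bit(QUEUE_FLAG_DEAD, &(q)->queue_flags)

    Callers then replace open-coded
    test_bit(QUEUE_FLAG_DEAD, &q->queue_flags) checks with blk_queue_dead(q).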

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

25 Oct, 2011

1 commit


19 Oct, 2011

4 commits

  • request_queue is refcounted but actually depends on lifetime
    management from the queue owner - on blk_cleanup_queue(), the block
    layer expects that no request is passing through the request_queue
    and no new one will be issued.

    This is fundamentally broken. The queue owner (e.g. SCSI layer)
    doesn't have a way to know whether there are other active users before
    calling blk_cleanup_queue(), and other users (e.g. bsg) don't have any
    guarantee that the queue is and will stay valid while they're holding a
    reference.

    With a delay added in blk_queue_bio() before queue_lock is grabbed, the
    following oops can easily be triggered when a device is removed with
    in-flight IOs.

    sd 0:0:1:0: [sdb] Stopping disk
    ata1.01: disabled
    general protection fault: 0000 [#1] PREEMPT SMP
    CPU 2
    Modules linked in:

    Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
    RIP: 0010:[] [] elv_rqhash_find+0x61/0x100
    ...
    Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
    ...
    Call Trace:
    [] elv_merge+0x84/0xe0
    [] blk_queue_bio+0xf4/0x400
    [] generic_make_request+0xca/0x100
    [] submit_bio+0x74/0x100
    [] dio_bio_submit+0xbc/0xc0
    [] __blockdev_direct_IO+0x92e/0xb40
    [] blkdev_direct_IO+0x57/0x60
    [] generic_file_aio_read+0x6d5/0x760
    [] do_sync_read+0xda/0x120
    [] vfs_read+0xc5/0x180
    [] sys_pread64+0x9a/0xb0
    [] system_call_fastpath+0x16/0x1b

    This happens because blk_cleanup_queue() destroys the queue and
    elevator whether IOs are in progress or not, and DEAD tests are
    sprinkled in the request processing path without proper
    synchronization.

    A similar problem exists for blk-throtl. On queue cleanup, blk-throtl
    is shut down whether it has requests in it or not. Depending on
    timing, it either oopses or throttled bios are lost, putting tasks
    which are waiting for bio completion into an eternal D state.

    The way it should work is to have the usual clear distinction between
    shutdown and release. Shutdown drains all currently pending requests,
    marks the queue dead, and performs partial teardown of the now
    unnecessary parts of the queue. Even after shutdown is complete,
    reference holders are still allowed to issue requests to the queue,
    although they will be immediately failed. The rest of the teardown
    happens on release.
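
    A minimal sketch of that shutdown sequence, simplified for
    illustration and using the lock names from the change list below
    (not the exact final code):

    /* shutdown: mark the queue dead under the locks, then drain it */
    void blk_cleanup_queue(struct request_queue *q)
    {
            mutex_lock(&q->exit_mutex);
            spin_lock_irq(q->queue_lock);
            queue_flag_set(QUEUE_FLAG_DEAD, q);   /* new requests now fail */
            spin_unlock_irq(q->queue_lock);
            mutex_unlock(&q->exit_mutex);

            blk_drain_queue(q, true);             /* wait out in-flight requests */
            /* partial teardown here; the rest happens on queue release */
    }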

    This patch makes the following changes to make blk_cleanup_queue()
    behave as a proper shutdown.

    * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
    queue_lock.

    * Unsynchronized DEAD check in generic_make_request_checks() removed.
    This couldn't make any meaningful difference as the queue could die
    after the check.

    * blk_drain_queue() updated such that it can drain all requests and is
    now called during cleanup.

    * blk_throtl updated such that it checks DEAD on grabbing queue_lock,
    drains all throttled bios during cleanup and frees td when the queue
    is released.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk_throtl_bio() and throtl_get_tg() have rather unusual interfaces.

    * throtl_get_tg() returns pointer to a valid tg or ERR_PTR(-ENODEV),
    and drops queue_lock in the latter case. Different locking context
    depending on return value is error-prone and DEAD state is scheduled
    to be protected by queue_lock anyway. Move DEAD check inside
    queue_lock and return valid tg or NULL.

    * blk_throtl_bio() indicates return status both with its return value
    and in/out param **@bio. The former is used to indicate whether
    queue is found to be dead during throtl processing. The latter
    whether the bio is throttled.

    There's no point in returning DEAD check result from
    blk_throtl_bio(). The queue can die after blk_throtl_bio() is
    finished but before make_request_fn() grabs queue lock.

    Make it take *@bio instead and return boolean result indicating
    whether the request is throttled or not.

    This patch doesn't cause any visible functional difference.
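
    A sketch of what the resulting call site would look like (simplified;
    the surrounding caller context is assumed):

    if (blk_throtl_bio(q, bio))
            return;         /* throttled and queued; dispatched later */
    /* not throttled: continue issuing the bio as usual */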

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk_get/put_queue() in scsi_cmd_ioctl() and throtl_get_tg() are
    completely bogus. The caller must have a reference to the queue on
    entry and taking an extra reference doesn't change anything.

    For scsi_cmd_ioctl(), the only effect is that it ends up checking
    QUEUE_FLAG_DEAD on entry; however, this is bogus as the queue can die
    right after blk_get_queue(). A dead queue should be, and is, handled
    in the request issue path (it's somewhat broken now but that's a
    separate problem and doesn't affect this one much).

    throtl_get_tg() incorrectly assumes that q is rcu freed. Also, it
    doesn't check return value of blk_get_queue(). If the queue is
    already dead, it ends up doing an extra put.

    Drop them.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The blk_throtl interfaces are block internal and there's no reason to
    have them in linux/blkdev.h. Move them to block/blk.h.

    This patch doesn't introduce any functional change.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

01 Aug, 2011

1 commit


14 Jun, 2011

2 commits


21 May, 2011

8 commits


16 May, 2011

1 commit

  • Currently we first map the task to a cgroup and then the cgroup to the
    blkio_cgroup. There is a more direct way to get to the blkio_cgroup
    from the task using task_subsys_state(). Use that.
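
    A minimal sketch of the more direct lookup (the helper name here is
    hypothetical; the point is the container_of() on the css returned by
    task_subsys_state()):

    static inline struct blkio_cgroup *task_to_blkio_cgroup(struct task_struct *tsk)
    {
            /* go straight from the task to its blkio_cgroup, without
             * dereferencing cgrp->subsys[] */
            return container_of(task_subsys_state(tsk, blkio_subsys_id),
                                struct blkio_cgroup, css);
    }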

    The real reason for the fix is that it also avoids a race in generic
    cgroup code. During remount/umount rebind_subsystems() is called and
    it can do the following without any RCU protection:

    cgrp->subsys[i] = NULL;

    That means that if somebody got hold of the cgroup under RCU and then
    tried to use cgroup->subsys[] to get to the blkio_cgroup, it would get
    NULL, which is wrong. I was running into this race condition with LTP
    running on an upstream-derived kernel and that led to a crash.

    So ideally we should also fix the generic cgroup code to wait for an
    rcu grace period before setting the pointer to NULL. Li Zefan is not
    very keen on introducing synchronize_rcu() there as he thinks it will
    slow down mount/remount/umount operations.

    So for the time being, at least fix the kernel crash by taking a more
    direct route to the blkio_cgroup.

    One tester had reported a crash while running LTP on a derived kernel,
    and with this fix the crash is no longer seen; the test has been
    running for over 6 days.

    Signed-off-by: Vivek Goyal
    Reviewed-by: Li Zefan
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

08 Apr, 2011

1 commit


06 Apr, 2011

1 commit


31 Mar, 2011

1 commit


23 Mar, 2011

1 commit

  • Lina reported that if throttle limits are initially very high and then
    dropped, no new bio might be dispatched for a long time. The reason is
    that after dropping the limits we don't reset the existing slice, so
    the rate calculation is done with the new low rate while still
    accounting the bios dispatched at the old high rate. To fix it, reset
    the slice upon rate change.

    https://lkml.org/lkml/2011/3/10/298

    Another problem with a very high limit is that we never queue the
    bio on the throtl service tree. That means we keep extending the
    group slice but never trim it. Fix that too by regularly
    trimming the slice even if no bio is being queued up.
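
    A sketch of the two fixes, with helper names assumed to follow the
    blk-throttle.c conventions rather than taken from the patch itself:

    /* on a limit change, start a fresh slice so the old high-rate
     * accounting cannot defer dispatch under the new low rate */
    throtl_start_new_slice(td, tg, READ);
    throtl_start_new_slice(td, tg, WRITE);

    /* and trim the slice regularly even when nothing is queued, so the
     * group slice cannot keep growing unchecked */
    throtl_trim_slice(td, tg, rw);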

    Reported-by: Lina Lu
    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

10 Mar, 2011

3 commits


08 Mar, 2011

2 commits

  • When throttle group limits are updated through cgroups, a thread is
    woken up to process these updates. While reviewing that code, Oleg
    noted that a couple of race conditions existed in the code, and he
    also suggested that the code could be simplified.

    This patch fixes the races and simplifies the code based on Oleg's
    suggestions (a short sketch follows this list):

    - Use xchg().
    - Introduced a common function throtl_update_blkio_group_common()
    which is now shared by all iops/bps update functions.
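
    A minimal sketch of the xchg()-based flag handling (simplified; the
    exact call sites are assumptions):

    /* update side, e.g. throtl_update_blkio_group_common(): publish the
     * change and kick the worker */
    tg->limits_changed = true;
    xchg(&td->limits_changed, true);
    throtl_schedule_delayed_work(td, 0);

    /* worker side: atomically consume the flag so an update posted
     * between the test and the clear cannot be lost */
    if (!xchg(&td->limits_changed, false))
            return;         /* nothing changed since the last pass */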

    Reviewed-by: Oleg Nesterov
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Vivek Goyal

    Fixed a merge issue, throtl_schedule_delayed_work() takes throtl_data
    as the argument now, not the queue.

    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • With the help of the cgroup interface one can update the bps/iops
    limits of an existing group. Once the limits are updated, a thread is
    woken up to see if some blocked group needs recalculation based on the
    new limits and needs to be requeued.

    There was also a piece of code where I was checking for a group limit
    update when a fresh bio comes in. This patch gets rid of that piece of
    code and keeps the limit change processing in one place,
    throtl_process_limit_change(). It just keeps the code simple and easy
    to understand.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

05 Mar, 2011

1 commit

  • This merge creates two sets of conflicts. One is simple context
    conflicts caused by the removal of throtl_schedule_delayed_work() in
    for-linus and the removal of throtl_shutdown_timer_wq() in
    for-2.6.39/core.

    The other is caused by commit 255bb490c8 (block: blk-flush shouldn't
    call directly into q->request_fn() __blk_run_queue()) in for-linus
    clashing with the FLUSH reimplementation in for-2.6.39/core. The
    conflict isn't trivial but the resolution is straightforward.

    * __blk_run_queue() calls in flush_end_io() and flush_data_end_io()
    should be called with @force_kblockd set to %true.

    * elv_insert() in blk_kick_flush() should use
    %ELEVATOR_INSERT_REQUEUE.

    Both changes are to avoid invoking ->request_fn() directly from
    request completion path and closely match the changes in the commit
    255bb490c8.
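
    For reference, a sketch of the resolved call forms described above
    (simplified):

    __blk_run_queue(q, true);                    /* @force_kblockd set to true */
    elv_insert(q, rq, ELEVATOR_INSERT_REQUEUE);  /* requeue, don't run the queue */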

    Signed-off-by: Tejun Heo

    Tejun Heo
     

03 Mar, 2011

1 commit

  • Move blk_throtl_exit() into blk_cleanup_queue(), as blk_throtl_exit()
    is written in such a way that it needs the queue lock. In
    blk_release_queue() there is no guarantee that ->queue_lock is still
    around.

    Initially blk_throtl_exit() was in blk_cleanup_queue() but Ingo reported
    one problem.

    https://lkml.org/lkml/2010/10/23/86

    And a quick fix moved blk_throtl_exit() to blk_release_queue().

    commit 7ad58c028652753814054f4e3ac58f925e7343f4
    Author: Jens Axboe
    Date: Sat Oct 23 20:40:26 2010 +0200

    block: fix use-after-free bug in blk throttle code

    This patch reverts the above change and does not try to shut down the
    throtl work in blk_sync_queue(). By avoiding the call to
    throtl_shutdown_timer_wq() from blk_sync_queue(), we should also avoid
    the problem reported by Ingo.

    blk_sync_queue() seems to be used only by the md driver, which uses it
    to make sure q->unplug_fn is not called, as md registers its own
    unplug functions and is about to free up the data structures used by
    unplug_fn(). Block throttle does not call back into unplug_fn() or
    into md, so there is no need to cancel blk throttle work.

    In fact I think cancelling block throttle work is bad, because it
    might happen that some bios are throttled and scheduled to be
    dispatched later by the pending work, and if that work is cancelled,
    these bios might never be dispatched.

    The block layer also uses blk_sync_queue() at blk_cleanup_queue() and
    blk_release_queue() time. That should be safe, as we are also calling
    blk_throtl_exit(), which makes sure all the throttling-related
    data structures are cleaned up.
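
    A simplified sketch of the resulting ordering (only the calls relevant
    to this change are shown):

    void blk_cleanup_queue(struct request_queue *q)
    {
            /* ... earlier teardown steps ... */
            blk_sync_queue(q);
            blk_throtl_exit(q);     /* needs q->queue_lock, still valid here */
            /* ... remaining teardown; the final put happens on release ... */
    }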

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

02 Mar, 2011

1 commit

  • o Dominik Klein reported a system hang issue while doing some blkio
    throttling testing.

    https://lkml.org/lkml/2011/2/24/173

    o Some tracing revealed that CFQ was not dispatching any more jobs
    because queue unplug was not happening. And queue unplug was not
    happening because the unplug work was not being run, as there was one
    throttling work item on the same cpu which had not finished yet. And
    the throttling work had not finished because it was trying to dispatch
    a bio to CFQ but all the request descriptors were consumed, so it was
    put to sleep.

    o So basically it is a cyclic dependency between the CFQ unplug work
    and the throtl dispatch work. Tejun suggested using a separate
    workqueue for such cases.

    o This patch uses a separate workqueue for throttle-related work and
    does not rely on the kblockd workqueue anymore.
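
    A minimal sketch of the pattern (the workqueue and field names are
    assumptions, not taken from the patch):

    static struct workqueue_struct *kthrotld_workqueue;

    /* init time: a dedicated workqueue, so throttle work can never block
     * behind kblockd's unplug work on the same cpu */
    kthrotld_workqueue = alloc_workqueue("kthrotld", WQ_MEM_RECLAIM, 0);

    /* when scheduling throttle work, use it instead of kblockd */
    queue_delayed_work(kthrotld_workqueue, &td->throtl_work, delay);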

    Cc: stable@kernel.org
    Reported-by: Dominik Klein
    Signed-off-by: Vivek Goyal
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

19 Jan, 2011

1 commit

  • o Jeff Moyer was doing some testing on a RAM-backed disk and
    blkiocg_lookup_group() showed up as high overhead, right after
    memcpy(). Similarly, somebody else reported that blkiocg_lookup_group()
    is eating 6% extra cpu. Looking at the code, though, I can't think why
    the overhead of this function is so high. One thing is that it is
    called with very high frequency (once for every IO).

    o For a lot of folks the blkio controller will be compiled in, but
    they might not have actually created cgroups. Hence optimize the root
    cgroup case, where we can avoid calling blkiocg_lookup_group() if the
    IO is happening in the root group (the common case).
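
    A sketch of the fast path (the structure and helper names here are
    assumptions used for illustration):

    /* common case: IO from the root cgroup, skip the group hash lookup */
    if (blkcg == &blkio_root_cgroup)
            tg = &td->root_tg;
    else
            tg = throtl_find_alloc_tg(td, blkcg);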

    Reported-by: Jeff Moyer
    Signed-off-by: Vivek Goyal
    Acked-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

02 Dec, 2010

2 commits

  • o I was discussing which variables are being updated without a spin
    lock and why we need barriers, and Oleg pointed out that the smp_rmb()
    should be placed between the read of td->limits_changed and the read
    of tg->limits_changed. This patch fixes it.

    o The following is one possible sequence of events. Say cpu0 is
    executing throtl_update_blkio_group_read_bps() and cpu1 is executing
    throtl_process_limit_change().

    cpu0                                  cpu1
    ----                                  ----
    tg->limits_changed = true;
    smp_mb__before_atomic_inc();
    atomic_inc(&td->limits_changed);
                                          if (!atomic_read(&td->limits_changed))
                                                  return;

                                          if (tg->limits_changed)
                                                  do_something;

    If cpu0 has updated tg->limits_changed and td->limits_changed, we want
    to make sure that if the update to td->limits_changed is visible on
    cpu1, then the update to tg->limits_changed is also visible.

    Oleg pointed out that to ensure this we need to insert an smp_rmb()
    between the td->limits_changed read and the tg->limits_changed read.

    o I had erroneously put smp_rmb() before atomic_read(&td->limits_changed).
    This patch fixes it.
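
    A sketch of the corrected reader-side ordering (matching the cpu1
    column in the sequence above):

    if (!atomic_read(&td->limits_changed))
            return;

    smp_rmb();      /* if the td update is visible, the tg update is too */

    if (tg->limits_changed)
            do_something;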

    Reported-by: Oleg Nesterov
    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • o During some testing I did the following and noticed that throttling
    stops working.

    - Put a very low limit on a cgroup, say 1 byte per second.
    - Start some reads; this will set slice_end to a very high value.
    - Change the limit to a higher value, say 1MB/s.
    - Now IO unthrottles and finishes as expected.
    - Try to do the read again, but IO is not limited to 1MB/s as expected.

    o What is happening:
    - Initially the low limit sets slice_end to a very high value.
    - When the limit is updated, slice_end is not truncated.
    - The very high value of slice_end keeps the existing slice
    valid for a very long time and a new slice does not start.
    - tg_may_dispatch() is called in blk_throtl_bio(), and trim_slice()
    is not called in this path. So slice_start is some old value and
    practically we are able to do a huge amount of IO.

    o There are many ways this could be fixed. I have fixed it by
    adjusting/cleaning up slice_end in trim_slice(). Generally we extend
    slices if a bio is big and can't be dispatched in one slice. After
    dispatching a bio, readjust slice_end to make sure we don't end up
    with huge values.
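
    A minimal sketch of the adjustment in trim_slice() (the field names
    and the throtl_slice window are assumptions):

    /* don't let a slice_end inherited from an earlier, much lower limit
     * stretch far into the future; clamp it to one throttle window */
    if (time_after(tg->slice_end[rw], jiffies + throtl_slice))
            tg->slice_end[rw] = jiffies + throtl_slice;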

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

16 Nov, 2010

1 commit


02 Oct, 2010

2 commits


01 Oct, 2010

3 commits

  • o Randy Dunlap reported the following linux-next failure. This patch
    fixes it.

    on i386:

    blk-throttle.c:(.text+0x1abb8): undefined reference to `__udivdi3'
    blk-throttle.c:(.text+0x1b1dc): undefined reference to `__udivdi3'

    o The bytes_per_second interface is 64-bit and I was doing 64-bit
    division even on 32-bit platforms without the help of the special
    macros/functions, hence the failure.
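
    A sketch of the usual fix pattern for this link error (the variables
    here are illustrative, not the exact expressions from blk-throttle.c):

    /* instead of: bytes_allowed = (bps * elapsed) / HZ;  (links __udivdi3 on i386) */
    u64 bytes_allowed = bps * elapsed;
    do_div(bytes_allowed, HZ);          /* 64-by-32 division helper, safe on 32-bit */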

    Signed-off-by: Vivek Goyal
    Reported-by: Randy Dunlap
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • o Currently any cgroup throttle limit changes are processed
    asynchronously and the change does not take effect till a new bio is
    dispatched from the same group.

    o It might happen that a user sets a ridiculously low limit on
    throttling, say 1 byte per second on reads. In such cases simple
    operations like mounting a disk can wait for a very long time.

    o Once a bio is throttled, there is no easy way to come out of that
    wait even if the user increases the read limit later.

    o This patch fixes it. Now if a user changes the cgroup limits, we
    recalculate the bio dispatch time according to the new limits.

    o We can't take the queue lock under blkcg_lock, hence after the
    change I wake up the dispatch thread again, which recalculates the
    time. So there are some variables being synchronized across two
    threads without a lock and I had to make use of barriers. I hope I
    have used the barriers correctly; a review of the memory barrier code
    especially will help.
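
    A sketch of the update side described here (the flag names match the
    cpu0/cpu1 sequence shown in the 02 Dec, 2010 entry above; the exact
    call form is an assumption):

    /* limit-update side: publish the per-group flag, then the global one,
     * then wake the dispatch thread to recompute dispatch times */
    tg->limits_changed = true;
    smp_mb__before_atomic_inc();
    atomic_inc(&td->limits_changed);
    smp_mb__after_atomic_inc();
    throtl_schedule_delayed_work(td->queue, 0);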

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • o Currently all the dynamically allocated groups, except the root
    group, are added to td->tg_list. This was not a problem so far, but in
    the next patch I will traverse td->tg_list to process any updates of
    limits on the groups. If the root group is not in tg_list, then the
    root group's updates are not processed.

    o It is better to add the root group to tg_list as well, instead of
    doing special processing for it during limit updates.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

16 Sep, 2010

1 commit