27 Jun, 2011

2 commits


13 Jun, 2011

1 commit


06 Jun, 2011

1 commit

  • Since we are modifying this RCU pointer, we need to hold
    the lock protecting it while doing so.

    This fixes a potential reuse and double free of a cfq
    io_context structure. The bug has been in CFQ for a long
    time; it hit very few people, but those it did hit seemed
    to see it a lot.

    Tracked in RH bugzilla here:

    https://bugzilla.redhat.com/show_bug.cgi?id=577968

    Credit goes to Paul Bolle for figuring out that the issue
    was around the one-hit ioc->ioc_data cache. Thanks to his
    hard work the issue is now fixed.

    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
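
    Roughly, the fixed pattern looks like the sketch below (the helper name
    is illustrative, not the exact cfq code): the ioc->ioc_data one-hit
    cache is only reset while holding ioc->lock, so two exit paths cannot
    race and free the same cfq_io_context twice, while readers keep doing a
    plain RCU dereference under rcu_read_lock().

        #include <linux/spinlock.h>
        #include <linux/rcupdate.h>
        #include <linux/iocontext.h>

        /* Writer side: clear the cached pointer only under the lock
         * protecting it. */
        static void cic_clear_cached(struct io_context *ioc)
        {
                unsigned long flags;

                spin_lock_irqsave(&ioc->lock, flags);
                rcu_assign_pointer(ioc->ioc_data, NULL);
                spin_unlock_irqrestore(&ioc->lock, flags);
        }

        /* Reader side stays a plain RCU dereference (under rcu_read_lock()):
         *      cic = rcu_dereference(ioc->ioc_data);
         */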
     

02 Jun, 2011

1 commit


01 Jun, 2011

1 commit


24 May, 2011

4 commits


23 May, 2011

1 commit


21 May, 2011

4 commits

  • Currently we take the blkg stats lock even just to update the stats. So
    even if a group has no throttling rules (the common case for the root
    group), we end up taking blkg_lock to update the stats.

    Make the dispatch stats per cpu so that they can be updated without
    taking the blkg lock.

    If a cpu goes offline, these stats simply disappear. No protection has
    been provided for that yet. Do we really need anything for that?

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
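
    A minimal sketch of the per-cpu approach described above (the structure
    and names are illustrative, not the actual blk-cgroup types): each CPU
    updates only its own counters without blkg_lock, and a reader sums over
    all possible CPUs.

        #include <linux/types.h>
        #include <linux/percpu.h>
        #include <linux/cpumask.h>

        /* Hypothetical per-cpu dispatch stats; allocated once with
         * alloc_percpu(struct disp_stats), which can sleep. */
        struct disp_stats {
                u64 serviced;
                u64 service_bytes;
        };
        static struct disp_stats __percpu *disp_stats;

        /* Update path: touches only this CPU's counters, no blkg_lock. */
        static void disp_stats_add(u64 bytes)
        {
                struct disp_stats *s = get_cpu_ptr(disp_stats);

                s->serviced++;
                s->service_bytes += bytes;
                put_cpu_ptr(disp_stats);
        }

        /* Read path: sum over all possible CPUs; counters of CPUs that went
         * offline simply stop changing, matching the caveat above. */
        static u64 disp_stats_sum_serviced(void)
        {
                u64 sum = 0;
                int cpu;

                for_each_possible_cpu(cpu)
                        sum += per_cpu_ptr(disp_stats, cpu)->serviced;
                return sum;
        }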
     
  • Currently, all the cfq_group or throtl_group allocations happen while
    we are holding ->queue_lock, and sleeping is not allowed.

    Soon we will move to per cpu stats and will also need to allocate the
    per group stats. Since alloc_percpu() can sleep and cannot be called
    from atomic context, we need to drop ->queue_lock, allocate the
    group, retake the lock and continue processing.

    In the throttling code, I check the queue DEAD flag again to make sure
    that the driver did not call blk_cleanup_queue() in the meantime.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
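
    The drop-and-retake pattern described above, as a rough sketch (the
    group type and helper name stand in for the real throtl_grp/cfq_group
    code paths):

        #include <linux/blkdev.h>
        #include <linux/slab.h>

        /* Allocate a group outside ->queue_lock because GFP_KERNEL
         * allocations and alloc_percpu() may sleep. */
        static struct throtl_grp *tg_alloc_unlocked(struct request_queue *q)
        {
                struct throtl_grp *tg;

                spin_unlock_irq(q->queue_lock);

                tg = kzalloc_node(sizeof(*tg), GFP_KERNEL, q->node);
                /* per-cpu stats would be allocated here as well */

                spin_lock_irq(q->queue_lock);

                /* The driver may have called blk_cleanup_queue() while the
                 * lock was dropped; recheck the DEAD flag before going on. */
                if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
                        kfree(tg);
                        return NULL;
                }
                return tg;
        }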
     
  • blkg->key = cfqd is an rcu protected pointer and hence we used to do
    call_rcu(cfqd->rcu_head) to free up cfqd after one rcu grace period.

    The problem here is that even though cfqd is around, there are no
    guarantees that the associated request queue (td->queue) or
    q->queue_lock is still around. A driver might have called
    blk_cleanup_queue() and released the lock.

    It might happen that after freeing up the lock we call
    blkg->key->queue->queue_lock and crash. This is possible in the
    following path.

    blkiocg_destroy()
    blkio_unlink_group_fn()
    cfq_unlink_blkio_group()

    Hence, wait for an rcu period if there are groups which have not
    been unlinked from blkcg->blkg_list. That way, any group which is
    still taking the cfq_unlink_blkio_group() path can safely take the
    queue lock.

    This is how we have taken care of the race in the throttling logic
    as well.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
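
    A simplified sketch of the resulting teardown ordering (the counter
    name is taken from the description above and should be treated as
    illustrative):

        #include <linux/rcupdate.h>
        #include <linux/slab.h>

        static void cfq_exit_queue_sketch(struct cfq_data *cfqd)
        {
                bool wait = false;

                spin_lock_irq(cfqd->queue->queue_lock);
                /* unlink groups; remember if any are still reachable via
                 * blkcg->blkg_list */
                if (cfqd->nr_blkcg_linked_grps)
                        wait = true;
                spin_unlock_irq(cfqd->queue->queue_lock);

                /*
                 * Groups still linked may enter cfq_unlink_blkio_group() and
                 * take the queue lock; wait one RCU grace period so that
                 * path still sees a valid cfqd and queue_lock.
                 */
                if (wait)
                        synchronize_rcu();

                kfree(cfqd);
        }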
     
  • Nobody seems to be using the cfq_find_alloc_cfqg() function parameter
    "create". Get rid of it.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

16 May, 2011

1 commit

  • Currently we first map the task to a cgroup and then the cgroup to the
    blkio_cgroup. There is a more direct way to get to the blkio_cgroup
    from the task, using task_subsys_state(). Use that.

    The real reason for the fix is that it also avoids a race in generic
    cgroup code. During remount/umount rebind_subsystems() is called and
    it can do the following without any rcu protection.

    cgrp->subsys[i] = NULL;

    That means if somebody got hold of a cgroup under rcu and then tried
    to do cgroup->subsys[] to get to the blkio_cgroup, it would get NULL,
    which is wrong. I was running into this race condition with ltp
    running on an upstream-derived kernel and that led to a crash.

    So ideally we should also fix the generic cgroup code to wait for an
    rcu grace period before setting the pointer to NULL. Li Zefan is not
    very keen on introducing synchronize_wait() as he thinks it will slow
    down mount/remount/umount operations.

    So for the time being, at least fix the kernel crash by taking a more
    direct route to the blkio_cgroup.

    One tester had reported a crash while running LTP on a derived kernel;
    with this fix the crash is no longer seen and the test has been
    running for over 6 days.

    Signed-off-by: Vivek Goyal
    Reviewed-by: Li Zefan
    Signed-off-by: Jens Axboe

    Vivek Goyal
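
    The direct route amounts to something like the helper below (a sketch
    assuming the blk-cgroup definitions in block/blk-cgroup.h):

        #include <linux/cgroup.h>
        #include "blk-cgroup.h"     /* struct blkio_cgroup, css member */

        /*
         * Go straight from the task to its blkio_cgroup via the subsystem
         * state, instead of task -> cgroup -> cgroup->subsys[], which can
         * observe the NULL written by rebind_subsystems() during
         * remount/umount. Caller must hold rcu_read_lock().
         */
        struct blkio_cgroup *task_blkio_cgroup(struct task_struct *tsk)
        {
                return container_of(task_subsys_state(tsk, blkio_subsys_id),
                                    struct blkio_cgroup, css);
        }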
     

19 Apr, 2011

1 commit


18 Apr, 2011

1 commit

  • Instead of overloading __blk_run_queue to force an offload to kblockd,
    add a new blk_run_queue_async helper to do it explicitly. I've kept
    the blk_queue_stopped check for now, but I suspect it's not needed,
    as the check we do when the workqueue item runs should be enough.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
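
    What such a helper boils down to, roughly (a sketch of the idea, not
    necessarily the exact upstream body; it assumes kblockd_workqueue as
    declared in block/blk.h and the queue's delay_work item):

        #include <linux/blkdev.h>
        #include <linux/workqueue.h>

        /* Run the queue asynchronously: never call ->request_fn() directly,
         * always go through kblockd. The stopped check is kept for now, as
         * noted above. */
        void blk_run_queue_async(struct request_queue *q)
        {
                if (likely(!blk_queue_stopped(q)))
                        queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
        }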
     

31 Mar, 2011

1 commit


23 Mar, 2011

3 commits

  • Remove the think time checking. A high-thinktime queue might mean the
    queue dispatches several requests and then goes away. Limiting such a
    queue seems meaningless, and removing the check also simplifies the
    code. This was suggested by Vivek.

    Signed-off-by: Shaohua Li
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Li, Shaohua
     
  • For v2, I added back lines to cfq_preempt_queue() that were removed
    during updates for accounting unaccounted_time. Thanks for pointing out
    that I'd missed these, Vivek.

    The previous commit, "cfq-iosched: Don't set active queue in preempt",
    wrongly cleared stats for preempting queues: when we choose a queue to
    preempt, it still isn't necessarily scheduled next, so its stats should
    not have been cleared.

    Thanks to Vivek Goyal for figuring this out and understanding how the
    preemption code works.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     
  • Commit "Add unaccounted time to timeslice_used" changed the behavior of
    cfq_preempt_queue to set cfqq active. Vivek pointed out that other
    preemption rules might get involved, so we shouldn't manually set which
    queue is active.

    This cleans up the code to just clear the queue stats at preemption
    time.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

17 Mar, 2011

1 commit

  • Version 3 is updated to apply to for-2.6.39/core.

    For version 2, I took Vivek's advice and made sure we update the group
    weight from cfq_group_service_tree_add().

    If a weight is updated while a group is on the service tree, the
    calculation of the total weight of the service tree can be adjusted
    improperly, which either leads to bad service tree weights or
    potentially crashes (if total_weight becomes 0).

    This patch defers updates to the weight until a group is off the service
    tree.

    Signed-off-by: Justin TerAvest
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Justin TerAvest
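
    The deferral amounts to something like the sketch below: the weight
    written through the cgroup interface is parked in a new_weight field
    and only folded into cfqg->weight when the group is (re)added to the
    service tree, i.e. while it is off the tree. Field names follow the
    description above and may differ from the actual cfq-iosched.c.

        static void cfq_update_group_weight(struct cfq_group *cfqg)
        {
                /* Only safe because the group is not on the service tree. */
                BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));

                if (cfqg->needs_update) {
                        cfqg->weight = cfqg->new_weight;
                        cfqg->needs_update = false;
                }
        }

        static void cfq_group_service_tree_add(struct cfq_rb_root *st,
                                               struct cfq_group *cfqg)
        {
                cfq_update_group_weight(cfqg);  /* apply any deferred change */
                /* ... rb-tree insertion; st->total_weight += cfqg->weight ... */
        }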
     

12 Mar, 2011

1 commit

  • There are two kinds of time that tasks are not charged for: the first
    seek and the extra time slice used over the allocated timeslice. Both
    of these are exported as a new unaccounted_time stat.

    I think it would be good to have this reported in 'time' as well, but
    that is probably a separate discussion.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

10 Mar, 2011

2 commits


07 Mar, 2011

4 commits


05 Mar, 2011

1 commit

  • This merge creates two sets of conflicts. One is simple context
    conflicts caused by the removal of throtl_scheduled_delayed_work() in
    for-linus and the removal of throtl_shutdown_timer_wq() in
    for-2.6.39/core.

    The other is caused by commit 255bb490c8 (block: blk-flush shouldn't
    call directly into q->request_fn() __blk_run_queue()) in for-linus
    clashing with the FLUSH reimplementation in for-2.6.39/core. The
    conflict isn't trivial but the resolution is straightforward.

    * __blk_run_queue() calls in flush_end_io() and flush_data_end_io()
    should be called with @force_kblockd set to %true.

    * elv_insert() in blk_kick_flush() should use
    %ELEVATOR_INSERT_REQUEUE.

    Both changes are to avoid invoking ->request_fn() directly from
    request completion path and closely match the changes in the commit
    255bb490c8.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

02 Mar, 2011

3 commits

  • __blk_run_queue() automatically either calls q->request_fn() directly
    or schedules kblockd depending on whether the function is recursed.
    blk-flush implementation needs to be able to explicitly choose
    kblockd. Add @force_kblockd.

    All the current users are converted to specify %false for the
    parameter and this patch doesn't introduce any behavior change.

    stable: This is prerequisite for fixing ide oops caused by the new
    blk-flush implementation.

    Signed-off-by: Tejun Heo
    Cc: Jan Beulich
    Cc: James Bottomley
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
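
    In simplified form, the new parameter just selects between the two
    existing paths; converted callers pass false, e.g. __blk_run_queue(q,
    false), so behavior is unchanged. The sketch below elides the plug and
    recursion handling of the real 2.6.38-era function.

        #include <linux/blkdev.h>

        void __blk_run_queue(struct request_queue *q, bool force_kblockd)
        {
                if (unlikely(blk_queue_stopped(q)))
                        return;

                if (force_kblockd) {
                        /* punt to kblockd instead of recursing into the
                         * driver's ->request_fn() */
                        kblockd_schedule_work(q, &q->unplug_work);
                } else {
                        q->request_fn(q);       /* recursion guard elided */
                }
        }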
     
  • Effectively, make group_isolation=1 the default and remove the tunable.
    The reason for the group_isolation=0 default was that we idle on the
    sync-noidle tree by default, and on fast devices this can be very
    harmful for throughput.

    However, this problem can also be addressed by tuning slice_idle and
    possibly group_idle on faster storage devices.

    This change simplifies the CFQ code by removing the feature entirely.

    Signed-off-by: Justin TerAvest
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Justin TerAvest
     
  • Conflicts:
    block/cfq-iosched.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Feb, 2011

1 commit

  • Flush requests are never put on the IO scheduler. Convert request
    structure's elevator_private* into an array and have the flush fields
    share a union with it.

    Reclaim the space lost in 'struct request' by moving 'completion_data'
    back in the union with 'rb_node'.

    Signed-off-by: Mike Snitzer
    Acked-by: Vivek Goyal
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Mike Snitzer
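
    The resulting layout is roughly the following (an abridged sketch of
    struct request, not the complete definition):

        struct request {
                /* ... */
                union {
                        struct rb_node rb_node;   /* sort/lookup */
                        void *completion_data;    /* reclaimed space */
                };

                /*
                 * Flush requests never reach the IO scheduler, so the flush
                 * machinery can share space with the elevator's private data.
                 */
                union {
                        void *elevator_private[3];
                        struct {
                                unsigned int seq;
                                struct list_head list;
                        } flush;
                };
                /* ... */
        };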
     

09 Feb, 2011

1 commit

  • Commit 7667aa0630407bc07dc38dcc79d29cc0a65553c1 added logic to wait for
    the last queue of the group to become busy (have at least one request),
    so that the group does not lose out for not being continuously
    backlogged. The commit did not check for the condition that the last
    queue already has some requests. As a result, if the queue already has
    requests, wait_busy is set. Later on, cfq_select_queue() checks the
    flag, and decides that since the queue has a request now and wait_busy
    is set, the queue is expired. This results in early expiration of the
    queue.

    This patch fixes the problem by adding a check to see if queue already
    has requests. If it does, wait_busy is not set. As a result, time slices
    do not expire early.

    The queues with more than one request are usually buffered writers.
    Testing shows improvement in isolation between buffered writers.

    Cc: stable@kernel.org
    Signed-off-by: Justin TerAvest
    Reviewed-by: Gui Jianfeng
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Justin TerAvest
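
    The fix amounts to an early-out along these lines in the wait-busy
    decision (a sketch; the surrounding checks of the real
    cfq_should_wait_busy() are omitted):

        static bool cfq_should_wait_busy(struct cfq_data *cfqd,
                                         struct cfq_queue *cfqq)
        {
                /* If the queue already has requests, there is nothing to
                 * wait for; letting wait_busy stand would only expire the
                 * slice early, as described above. */
                if (!RB_EMPTY_ROOT(&cfqq->sort_list))
                        return false;

                /* ... existing last-queue-in-group checks follow ... */
                return true;
        }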
     

19 Jan, 2011

1 commit


14 Jan, 2011

3 commits

  • If a queue is preempted before it gets a slice assigned, the queue
    doesn't get any compensation, which looks unfair. For such a queue, we
    compensate it with a whole slice.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
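
    In the slice-expiry path, the compensation amounts to something like
    this sketch (helper names are illustrative and may differ from the
    actual cfq-iosched.c):

        #include <linux/jiffies.h>

        static void cfq_slice_expired_sketch(struct cfq_data *cfqd,
                                             struct cfq_queue *cfqq,
                                             bool timed_out)
        {
                if (timed_out) {
                        if (cfq_cfqq_slice_new(cfqq))
                                /* preempted before a slice was assigned:
                                 * credit a whole slice */
                                cfqq->slice_resid = cfq_prio_to_slice(cfqd, cfqq);
                        else
                                cfqq->slice_resid = cfqq->slice_end - jiffies;
                }
                /* ... */
        }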
     
  • I got this:
    fio-874 [007] 2157.724514: 8,32 m N cfq874 preempt
    fio-874 [007] 2157.724519: 8,32 m N cfq830 slice expired t=1
    fio-874 [007] 2157.724520: 8,32 m N cfq830 sl_used=1 disp=0 charge=1 iops=0 sect=0
    fio-874 [007] 2157.724521: 8,32 m N cfq830 set_active wl_prio:0 wl_type:0
    fio-874 [007] 2157.724522: 8,32 m N cfq830 Not idling. st->count:1

    cfq830 is an async queue, and it is preempted by a sync queue, cfq874.
    But since we have the cfqg->saved_workload_slice mechanism, the
    preemption is a nop. It looks like our preemption is currently totally
    broken if the two queues are not from the same workload type.
    The patch below fixes it. This might cause async queue starvation, but
    it's what our old code did before cgroup support was added.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
    block: ensure that completion error gets properly traced
    blktrace: add missing probe argument to block_bio_complete
    block cfq: don't use atomic_t for cfq_group
    block cfq: don't use atomic_t for cfq_queue
    block: trace event block fix unassigned field
    block: add internal hd part table references
    block: fix accounting bug on cross partition merges
    kref: add kref_test_and_get
    bio-integrity: mark kintegrityd_wq highpri and CPU intensive
    block: make kblockd_workqueue smarter
    Revert "sd: implement sd_check_events()"
    block: Clean up exit_io_context() source code.
    Fix compile warnings due to missing removal of a 'ret' variable
    fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
    block: convert !IS_ERR(p) && p to !IS_ERR_OR_NULL(p)
    cfq-iosched: don't check cfqg in choose_service_tree()
    fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
    cdrom: export cdrom_check_events()
    sd: implement sd_check_events()
    sr: implement sr_check_events()
    ...

    Linus Torvalds