07 Jan, 2012

2 commits

  • commit 5eb46851de3904cd1be9192fdacb8d34deadc1fc upstream.

    cfq_cic_link() has a race condition. When processes that share an ioc issue
    I/O to the same block device simultaneously, cfq_cic_link() sometimes returns
    -EEXIST. The race can stall I/O through the following steps:

    step 1: Process A: Issue an I/O to /dev/sda
    step 2: Process A: Get an ioc (iocA here) in get_io_context(), which is not
    yet linked with a cic for the device
    step 3: Process A: Get a new cic for the device (cicA here) in
    cfq_alloc_io_context()

    step 4: Process B: Issue an I/O to /dev/sda
    step 5: Process B: Get iocA in get_io_context() since process A and B share the
    same ioc
    step 6: Process B: Get a new cic for the device (cicB here) in
    cfq_alloc_io_context() since iocA has not been linked with a
    cic for the device yet

    step 7: Process A: Link cicA to iocA in cfq_cic_link()
    step 8: Process A: Dispatch I/O to driver and finish it

    step 9: Process B: Try to link cicB to iocA in cfq_cic_link()
    This fails with the kernel message "cfq: cic link failed!",
    since iocA was already linked with cicA at step 7.
    step 10: Process B: Wait for the I/O to finish in get_request_wait()
    The wait never completes, because there is no outstanding
    I/O to the device to wake it up.

    When cfq_cic_link() returns -EEXIST, it means the ioc has already been linked
    with a cic. So when cfq_cic_link() returns -EEXIST, retry cfq_cic_lookup().
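
    The retry loop below is a minimal sketch of that idea, not the verbatim
    patch; cfq_cic_lookup(), cfq_alloc_io_context() and cfq_cic_link() are named
    in the description above, while cfq_cic_free() and the surrounding error
    handling are assumptions for illustration.

        retry:
                cic = cfq_cic_lookup(cfqd, ioc);
                if (!cic) {
                        cic = cfq_alloc_io_context(cfqd, gfp_mask);
                        if (cic && cfq_cic_link(cfqd, ioc, cic, gfp_mask) == -EEXIST) {
                                /* Lost the race from step 9: drop our cic and
                                 * pick up the one that won instead. */
                                cfq_cic_free(cic);
                                goto retry;
                        }
                }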

    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Yasuaki Ishimatsu
     
  • commit 2984ff38ccf6cbc02a7a996a36c7d6f69f3c6146 upstream.

    If we fail allocating the blkg stats, we free cfqd and cfqg.
    But we need to release the IDA index cfqd->cic_index as well.
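
    A rough sketch of such an error path (blkio_alloc_blkg_stats, cic_index_lock
    and cic_index_ida follow the cfq-iosched code of that era, but the exact
    shape here is illustrative, not the literal patch):

        if (blkio_alloc_blkg_stats(&cfqg->blkg)) {
                /* Also give back the IDA slot taken for cic_index. */
                spin_lock(&cic_index_lock);
                ida_remove(&cic_index_ida, cfqd->cic_index);
                spin_unlock(&cic_index_lock);
                kfree(cfqg);
                kfree(cfqd);
                return NULL;
        }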

    Signed-off-by: majianpeng
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    majianpeng
     

27 Jun, 2011

2 commits

  • ioc->ioc_data is RCU protected, so use the correct API to access it.
    This doesn't change any behavior; it just makes the code consistent.
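
    In practice that means pairing reads and writes with the RCU accessors,
    roughly like this (illustrative only):

        /* Read side: annotate the access for the RCU checkers. */
        cic = rcu_dereference(ioc->ioc_data);

        /* Update side: publish the new pointer with the matching API. */
        rcu_assign_pointer(ioc->ioc_data, cic);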

    Signed-off-by: Shaohua Li
    Cc: stable@kernel.org # after ab4bd22d
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • I got an RCU warning at boot: ioc->ioc_data is rcu_dereference()d, but
    without holding rcu_read_lock().
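
    The fix amounts to wrapping the lookup in a read-side critical section,
    along these lines (a sketch; the cic->key check stands in for whatever the
    caller does with the cached pointer):

        rcu_read_lock();
        cic = rcu_dereference(ioc->ioc_data);
        if (cic && cic->key == cfqd) {
                /* use the cached cic while still inside the RCU section */
        }
        rcu_read_unlock();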

    Signed-off-by: Shaohua Li
    Cc: stable@kernel.org # after ab4bd22d
    Signed-off-by: Jens Axboe

    Shaohua Li
     

13 Jun, 2011

1 commit


06 Jun, 2011

1 commit

  • Since we are modifying this RCU pointer, we need to hold
    the lock protecting it while doing so.

    This fixes a potential reuse and double free of a cfq
    io_context structure. The bug has been in CFQ for a long
    time; it hit very few people, but those it did hit seemed
    to see it a lot.

    Tracked in RH bugzilla here:

    https://bugzilla.redhat.com/show_bug.cgi?id=577968

    Credit goes to Paul Bolle for figuring out that the issue
    was around the one-hit ioc->ioc_data cache. Thanks to his
    hard work the issue is now fixed.
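
    Assuming the pointer in question is the one-hit ioc->ioc_data cache guarded
    by ioc->lock, the change boils down to a pattern like this (a sketch, not
    the exact diff):

        unsigned long flags;

        /* Clear the cache only while holding the lock that guards it. */
        spin_lock_irqsave(&ioc->lock, flags);
        rcu_assign_pointer(ioc->ioc_data, NULL);
        spin_unlock_irqrestore(&ioc->lock, flags);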

    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Jun, 2011

1 commit


01 Jun, 2011

1 commit


24 May, 2011

4 commits


23 May, 2011

1 commit


21 May, 2011

4 commits

  • Currently we take the blkg_stat lock even just to update the stats. So even
    if a group has no throttling rules (the common case for the root group), we
    end up taking blkg_lock merely to update the stats.

    Make the dispatch stats per-cpu so that they can be updated without taking
    the blkg lock.

    If a CPU goes offline, its stats simply disappear; no protection has been
    provided for that yet. Do we really need anything for that?
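
    The hot-path update then looks roughly like the following (a sketch; the
    struct layout and field names are placeholders, not the real blk-cgroup
    definitions):

        struct blkio_group_stats_cpu *s;

        /* blkg->stats_cpu was allocated once per group with alloc_percpu(). */
        s = get_cpu_ptr(blkg->stats_cpu);       /* disables preemption */
        s->serviced++;
        s->sectors += nr_sectors;
        put_cpu_ptr(blkg->stats_cpu);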

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Currently, all the cfq_group or throtl_group allocations happen while
    we are holding ->queue_lock, where sleeping is not allowed.

    Soon we will move to per-cpu stats and will also need to allocate the
    per-group stats. Since alloc_percpu() can sleep and therefore cannot be
    called from atomic context, we need to drop ->queue_lock, allocate the
    group, retake the lock and continue processing.

    In the throttling code, I check the queue DEAD flag again to make sure
    that the driver did not call blk_cleanup_queue() in the meantime.
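
    The drop/allocate/retake pattern is roughly the following (a sketch;
    alloc_group() and free_group() are stand-ins for the real allocation
    helpers):

        spin_unlock_irq(q->queue_lock);

        /* May sleep: allocate the group and its per-cpu stats here. */
        tg = alloc_group(GFP_KERNEL);

        spin_lock_irq(q->queue_lock);

        /* The queue may have been torn down while we slept. */
        if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
                free_group(tg);
                return NULL;
        }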

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • blkg->key = cfqd is an RCU-protected pointer, and hence we used to do
    call_rcu(cfqd->rcu_head) to free up cfqd after one RCU grace period.

    The problem here is that even though cfqd is around, there is no
    guarantee that the associated request queue (td->queue) or q->queue_lock
    is still around. A driver might have called blk_cleanup_queue() and
    released the lock.

    It might happen that after the lock has been released we dereference
    blkg->key->queue->queue_lock and crash. This is possible in the following
    path:

    blkiocg_destroy()
    blkio_unlink_group_fn()
    cfq_unlink_blkio_group()

    Hence, wait for an RCU grace period if there are groups which have not
    been unlinked from blkcg->blkg_list. That way, any groups taking the
    cfq_unlink_blkio_group() path can safely take the queue lock.

    This is also how the race has been taken care of in the throttling logic.
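
    Conceptually the exit path then becomes something like this (a sketch;
    cfq_release_cfq_groups() is assumed here to report whether every group
    could be unlinked synchronously):

        /* True only if all groups were unlinked from blkcg->blkg_list. */
        bool all_gone = cfq_release_cfq_groups(cfqd);

        /*
         * If some group is still visible to blkcg, a concurrent
         * cfq_unlink_blkio_group() may still dereference the queue lock,
         * so wait a grace period before tearing it down.
         */
        if (!all_gone)
                synchronize_rcu();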

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Nobody seems to be using the cfq_find_alloc_cfqg() parameter "create".
    Get rid of it.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

16 May, 2011

1 commit

  • Currently we first map the task to a cgroup and then the cgroup to the
    blkio_cgroup. There is a more direct way to get to the blkio_cgroup from
    the task, using task_subsys_state(). Use that.

    The real reason for the fix is that it also avoids a race in generic
    cgroup code. During remount/umount rebind_subsystems() is called, and it
    can do the following without waiting for an RCU grace period:

    cgrp->subsys[i] = NULL;

    That means that if somebody got hold of the cgroup under RCU and then
    tried to go through cgroup->subsys[] to get to the blkio_cgroup, it would
    get NULL, which is wrong. I was running into this race condition with LTP
    running on an upstream-derived kernel, and it led to a crash.

    So ideally we should also fix the generic cgroup code to wait for an RCU
    grace period before setting the pointer to NULL. Li Zefan is not very keen
    on introducing synchronize_rcu() there, as he thinks it will slow down
    mount/remount/umount operations.

    So for the time being, at least fix the kernel crash by taking a more
    direct route to the blkio_cgroup.

    One tester had reported a crash while running LTP on a derived kernel;
    with this fix the crash is no longer seen, with the test having run for
    over 6 days.
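
    The more direct route is essentially a helper of this shape (names are
    approximate; the exact helper lives in the blk-cgroup code):

        static inline struct blkio_cgroup *task_to_blkio_cgroup(struct task_struct *tsk)
        {
                /* One step from the task to its blkio subsystem state. */
                return container_of(task_subsys_state(tsk, blkio_subsys_id),
                                    struct blkio_cgroup, css);
        }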

    Signed-off-by: Vivek Goyal
    Reviewed-by: Li Zefan
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

19 Apr, 2011

1 commit

  • For some configurations of CONFIG_PREEMPT that is not true. So
    get rid of __call_for_each_cic() and always use the explicitly
    rcu_read_lock() protected call_for_each_cic() instead.

    This fixes a potential bug related to IO scheduler removal or
    online switching.

    Thanks to Paul McKenney for clarifying this.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Apr, 2011

1 commit

  • Instead of overloading __blk_run_queue to force an offload to kblockd,
    add a new blk_run_queue_async helper to do it explicitly. I've kept
    the blk_queue_stopped check for now, but I suspect it's not needed,
    as the check we do when the workqueue item runs should be enough.
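
    The rough shape of such a helper, as a sketch rather than the exact
    upstream implementation (it assumes the queue carries a delayed work item,
    here q->delay_work, that kblockd runs to invoke the request_fn):

        void blk_run_queue_async(struct request_queue *q)
        {
                /* Defer to kblockd instead of recursing into ->request_fn(). */
                if (likely(!blk_queue_stopped(q)))
                        kblockd_schedule_delayed_work(q, &q->delay_work, 0);
        }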

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

31 Mar, 2011

1 commit


23 Mar, 2011

3 commits

  • Remove the think time checking. A queue with a high think time might mean
    that it dispatches several requests and then goes away. Limiting such a
    queue seems meaningless. This also simplifies the code. This was suggested
    by Vivek.

    Signed-off-by: Shaohua Li
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Li, Shaohua
     
  • For v2, I added back lines to cfq_preempt_queue() that were removed
    during updates for accounting unaccounted_time. Thanks for pointing out
    that I'd missed these, Vivek.

    Previous commit "cfq-iosched: Don't set active queue in preempt" wrongly
    cleared stats for preempting queues when it shouldn't have, because when
    we choose a queue to preempt, it still isn't necessarily scheduled next.

    Thanks to Vivek Goyal for figuring this out and understanding how the
    preemption code works.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     
  • Commit "Add unaccounted time to timeslice_used" changed the behavior of
    cfq_preempt_queue to set cfqq active. Vivek pointed out that other
    preemption rules might get involved, so we shouldn't manually set which
    queue is active.

    This cleans up the code to just clear the queue stats at preemption
    time.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

17 Mar, 2011

1 commit

  • Version 3 is updated to apply to for-2.6.39/core.

    For version 2, I took Vivek's advice and made sure we update the group
    weight from cfq_group_service_tree_add().

    If a weight is updated while a group is on the service tree, the
    calculation of the service tree's total weight can be adjusted
    improperly, which either leads to bad service tree weights or
    potentially crashes (if total_weight becomes 0).

    This patch defers updates to the weight until a group is off the service
    tree.
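
    The deferred update can be sketched like this (field names such as
    new_weight and needs_update are illustrative placeholders):

        /* cgroup write path: remember the value, don't apply it yet. */
        cfqg->new_weight = val;
        cfqg->needs_update = true;

        /* In cfq_group_service_tree_add(), before the group is inserted: */
        if (cfqg->needs_update) {
                cfqg->weight = cfqg->new_weight;
                cfqg->needs_update = false;
        }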

    Signed-off-by: Justin TerAvest
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

12 Mar, 2011

1 commit

  • There are two kinds of time that tasks are not charged for: the first
    seek and the extra time used beyond the allocated timeslice. Both of
    these are exported as a new unaccounted_time stat.

    I think it would be good to have this reported in 'time' as well, but
    that is probably a separate discussion.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

10 Mar, 2011

2 commits


07 Mar, 2011

4 commits


05 Mar, 2011

1 commit

  • This merge creates two sets of conflicts. One is simple context
    conflicts caused by the removal of throtl_scheduled_delayed_work() in
    for-linus and the removal of throtl_shutdown_timer_wq() in
    for-2.6.39/core.

    The other is caused by commit 255bb490c8 (block: blk-flush shouldn't
    call directly into q->request_fn() __blk_run_queue()) in for-linus
    clashing with the FLUSH reimplementation in for-2.6.39/core. The conflict
    isn't trivial, but the resolution is straightforward.

    * __blk_run_queue() calls in flush_end_io() and flush_data_end_io()
    should be called with @force_kblockd set to %true.

    * elv_insert() in blk_kick_flush() should use
    %ELEVATOR_INSERT_REQUEUE.

    Both changes are to avoid invoking ->request_fn() directly from the
    request completion path, and they closely match the changes in commit
    255bb490c8.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

02 Mar, 2011

3 commits

  • __blk_run_queue() automatically either calls q->request_fn() directly
    or schedules kblockd, depending on whether the function is being entered
    recursively. The blk-flush implementation needs to be able to choose
    kblockd explicitly. Add @force_kblockd.

    All the current users are converted to specify %false for the
    parameter, so this patch doesn't introduce any behavior change.

    stable: This is a prerequisite for fixing an ide oops caused by the new
    blk-flush implementation.
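
    In other words, callers now pick the dispatch mode explicitly, along these
    lines (illustrative call sites only):

        /* Existing callers keep today's behaviour: */
        __blk_run_queue(q, false);

        /* blk-flush completion paths force the kblockd offload: */
        __blk_run_queue(q, true);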

    Signed-off-by: Tejun Heo
    Cc: Jan Beulich
    Cc: James Bottomley
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Effectively, make group_isolation=1 the default and remove the tunable.
    The group_isolation=0 setting existed because, by default, we idle on the
    sync-noidle tree, and on fast devices this can be very harmful for
    throughput.

    However, this problem can also be addressed by tuning slice_idle and
    possibly group_idle on faster storage devices.

    This change simplifies the CFQ code by removing the feature entirely.

    Signed-off-by: Justin TerAvest
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Justin TerAvest
     
  • Conflicts:
    block/cfq-iosched.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Feb, 2011

1 commit

  • Flush requests are never put on the IO scheduler. Convert request
    structure's elevator_private* into an array and have the flush fields
    share a union with it.

    Reclaim the space lost in 'struct request' by moving 'completion_data'
    back into the union with 'rb_node'.
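
    The resulting layout is roughly of this shape (a sketch; the fields inside
    the flush struct are approximate, not the exact definition):

        union {
                /* Used by the IO scheduler while the request is queued. */
                void *elevator_private[3];

                /* Used only by flush requests, which never reach the elevator. */
                struct {
                        unsigned int seq;
                        struct list_head list;
                } flush;
        };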

    Signed-off-by: Mike Snitzer
    Acked-by: Vivek Goyal
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

09 Feb, 2011

1 commit

  • Commit 7667aa0630407bc07dc38dcc79d29cc0a65553c1 added logic to wait for
    the last queue of the group to become busy (have at least one request),
    so that the group does not lose out for not being continuously
    backlogged. The commit did not check for the condition that the last
    queue already has some requests. As a result, if the queue already has
    requests, wait_busy is set. Later on, cfq_select_queue() checks the
    flag, and decides that since the queue has a request now and wait_busy
    is set, the queue is expired. This results in early expiration of the
    queue.

    This patch fixes the problem by adding a check to see if the queue already
    has requests. If it does, wait_busy is not set, and as a result time
    slices do not expire early.

    Queues with more than one request are usually buffered writers.
    Testing shows an improvement in isolation between buffered writers.
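
    The added condition is essentially of this form (an illustrative check,
    not the literal patch; cfqq->sort_list is the queue's pending-request
    tree):

        /* The last queue of the group already has work queued, so there is
         * no reason to arm wait_busy and hold the slice open for it. */
        if (!RB_EMPTY_ROOT(&cfqq->sort_list))
                return false;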

    Cc: stable@kernel.org
    Signed-off-by: Justin TerAvest
    Reviewed-by: Gui Jianfeng
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

19 Jan, 2011

1 commit


14 Jan, 2011

1 commit