02 Dec, 2011

1 commit

  • cfq_cic_link() has a race condition. When several processes that share an ioc
    issue I/O to the same block device simultaneously, cfq_cic_link() sometimes
    returns -EEXIST. The race condition can stall I/O through the following steps:

    step 1: Process A: Issue an I/O to /dev/sda
    step 2: Process A: Get an ioc (iocA here) in get_io_context() which is not
    yet linked with a cic for the device
    step 3: Process A: Get a new cic for the device (cicA here) in
    cfq_alloc_io_context()

    step 4: Process B: Issue an I/O to /dev/sda
    step 5: Process B: Get iocA in get_io_context() since process A and B share the
    same ioc
    step 6: Process B: Get a new cic for the device (cicB here) in
    cfq_alloc_io_context() since iocA has not been linked with a
    cic for the device yet

    step 7: Process A: Link cicA to iocA in cfq_cic_link()
    step 8: Process A: Dispatch I/O to driver and finish it

    step 9: Process B: Try to link cicB to iocA in cfq_cic_link()
    But it fails, printing the "cfq: cic link failed!" kernel
    message, since iocA was already linked with cicA at step 7.
    step 10: Process B: Wait for the I/O to finish in get_request_wait()
    The function never wakes up, because there is no outstanding
    I/O to the device.

    When cfq_cic_link() returns -EEXIST, it means the ioc has already been linked
    with a cic for the device. So when cfq_cic_link() returns -EEXIST, retry
    cfq_cic_lookup() (see the sketch below).

    Signed-off-by: Yasuaki Ishimatsu
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Yasuaki Ishimatsu
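    Below is a minimal, self-contained sketch of the retry pattern this fix
    describes. The types and helpers (ioc, cic, cic_lookup(), cic_link(),
    get_cic()) are hypothetical stand-ins, not the real cfq functions; the point
    is only the control flow: treat -EEXIST as "lost the race" and go back to
    the lookup.

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-ins for struct io_context / struct cfq_io_context; hypothetical. */
    struct cic { int unused; };
    struct ioc { struct cic *cic; };

    static struct cic *cic_lookup(struct ioc *ioc) { return ioc->cic; }

    static int cic_link(struct ioc *ioc, struct cic *c)
    {
        if (ioc->cic)
            return -EEXIST;   /* another process linked its cic first (step 9 above) */
        ioc->cic = c;
        return 0;
    }

    /* Find or create the cic for this ioc; on -EEXIST, retry the lookup. */
    static struct cic *get_cic(struct ioc *ioc)
    {
        for (;;) {
            struct cic *c = cic_lookup(ioc);
            if (c)
                return c;                    /* reuse the winner's cic */

            c = calloc(1, sizeof(*c));       /* stands in for cfq_alloc_io_context() */
            if (!c)
                return NULL;

            int ret = cic_link(ioc, c);
            if (ret == 0)
                return c;

            free(c);
            if (ret != -EEXIST)
                return NULL;                 /* genuine failure */
            /* lost the race: loop back and look up the cic that won */
        }
    }

    int main(void)
    {
        struct ioc ioc = { 0 };
        printf("got cic %p\n", (void *)get_cic(&ioc));
        return 0;
    }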
     

30 Nov, 2011

1 commit


23 Aug, 2011

1 commit

  • Add a new REQ_PRIO flag to let requests preempt others in the cfq I/O scheduler,
    and leave REQ_META purely for marking requests as metadata in blktrace.

    All existing callers of REQ_META except for XFS are updated to also
    set REQ_PRIO for now.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Namhyung Kim
    Signed-off-by: Jens Axboe

    Christoph Hellwig
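    A hedged illustration of the split: the flag values and the submit() helper
    below are placeholders rather than the real block-layer definitions. The idea
    is that callers wanting the old scheduling boost now set REQ_PRIO alongside
    REQ_META, while a filesystem that only wants the trace annotation (as XFS
    chose here) keeps passing REQ_META alone.

    #include <stdio.h>

    /* Placeholder flag values; the real definitions live in the block layer headers. */
    #define REQ_META (1u << 0)    /* marks the request as metadata for blktrace */
    #define REQ_PRIO (1u << 1)    /* asks cfq to let the request preempt others */

    static void submit(unsigned int flags)
    {
        printf("meta=%d prio=%d\n", !!(flags & REQ_META), !!(flags & REQ_PRIO));
    }

    int main(void)
    {
        /* Before: metadata reads carried only REQ_META and cfq keyed off it. */
        submit(REQ_META);

        /* After: callers that want the scheduling boost set both flags; a
         * filesystem that only wants the trace annotation keeps REQ_META alone. */
        submit(REQ_META | REQ_PRIO);
        return 0;
    }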
     

19 Aug, 2011

1 commit

  • We have a kernel build regression since 3.1-rc1 of about 10%. The kernel
    source is in an ext3 filesystem.
    Alex Shi bisected it to commit:
    commit a07405b7802691d29ab3b23bdc76ee6d006aad0b
    Author: Justin TerAvest
    Date: Sun Jul 10 22:09:19 2011 +0200

    cfq: Remove special treatment for metadata rqs.

    Apparently this is caused by the lack of metadata preemption, where ext3/ext4
    do use READ_META. I didn't see a way to fix the issue, so I suggest
    reverting the patch.

    This reverts commit a07405b7802691d29ab3b23bdc76ee6d006aad0b.

    Reported-by: Alex Shi
    Reported-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Aug, 2011

1 commit

  • CFQ keeps track of the number of groups which are linked on blkcg->blkg_list.
    This is useful to avoid races between the queue exit and cgroup exit code
    paths. So if the linked group count is not zero at request queue exit time,
    it means there are some groups out there which have yet to be deleted under
    the RCU read period, and the queue exit code should wait for one RCU grace
    period.

    In my previous patch I forgot to decrement the group count. So in its current
    form, nr_blkcg_linked_grps is always non-zero and we will always wait one RCU
    grace period (if BLK_CGROUP=y). The side effect of this is that it can
    increase boot time. I am surprised nobody has complained so far.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
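    A small, self-contained model of the counting scheme described above; the
    counter name mirrors the text (nr_blkcg_linked_grps), but the code is purely
    illustrative, not the actual cfq functions:

    #include <stdio.h>

    struct cfq_data { int nr_blkcg_linked_grps; };

    /* Stand-in for waiting one RCU grace period. */
    static void wait_one_rcu_period(void) { puts("waiting one RCU grace period"); }

    static void link_group(struct cfq_data *cfqd)
    {
        /* group added to blkcg->blkg_list ... */
        cfqd->nr_blkcg_linked_grps++;
    }

    static void unlink_group(struct cfq_data *cfqd)
    {
        /* group removed from blkcg->blkg_list ... */
        cfqd->nr_blkcg_linked_grps--;   /* the decrement this fix adds */
    }

    static void queue_exit(struct cfq_data *cfqd)
    {
        /* Only pay for a grace period if some group is still linked and might
         * be racing with us on the cgroup-removal path. */
        if (cfqd->nr_blkcg_linked_grps)
            wait_one_rcu_period();
    }

    int main(void)
    {
        struct cfq_data cfqd = { 0 };
        link_group(&cfqd);
        unlink_group(&cfqd);   /* without this, queue_exit() always waits */
        queue_exit(&cfqd);
        return 0;
    }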
     

12 Jul, 2011

4 commits

  • Currently, when the last queue of a group has no request, we don't expire
    the queue, in the hope that a request from the group comes soon, so the group
    doesn't miss its share. But if the think time is big, the assumption isn't
    correct and we just waste bandwidth. In such a case, don't idle.

    [global]
    runtime=30
    direct=1

    [test1]
    cgroup=test1
    cgroup_weight=1000
    rw=randread
    ioengine=libaio
    size=500m
    runtime=30
    directory=/mnt
    filename=file1
    thinktime=9000

    [test2]
    cgroup=test2
    cgroup_weight=1000
    rw=randread
    ioengine=libaio
    size=500m
    runtime=30
    directory=/mnt
    filename=file2

              patched    base
    test1     64k        39k
    test2     548k       540k
    total     604k       578k

    The first group (test1) gets much better throughput because it waits less time.

    To check whether the patch changes the behavior of queues without think time,
    I also tried giving test1 a 2ms think time or no think time. The test result
    is stable; the throughput doesn't change with or without the patch.

    Signed-off-by: Shaohua Li
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Shaohua Li
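    The heuristic boils down to a comparison like the one in this self-contained
    sketch; the function and parameter names are illustrative, not the actual
    cfq code:

    #include <stdbool.h>
    #include <stdio.h>

    /* Mean think time of the group vs. the time we are willing to idle
     * waiting for its next request (cfq's group idle window). */
    static bool worth_idling(unsigned long ttime_mean_us, unsigned long idle_window_us)
    {
        /* If the group typically "thinks" longer than we would idle, its next
         * request will not arrive in time anyway; idling just leaves the disk
         * unused, so expire the queue instead. */
        return ttime_mean_us <= idle_window_us;
    }

    int main(void)
    {
        /* e.g. a 9 ms think time (as in the fio job above) vs. an 8 ms idle window */
        printf("%s\n", worth_idling(9000, 8000) ? "idle" : "expire");
        return 0;
    }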
     
  • Currently, when the last queue of a service tree has no request, we don't
    expire the queue, in the hope that a request from the service tree comes soon,
    so the service tree doesn't miss its share. But if the think time is big, the
    assumption isn't correct and we just waste bandwidth. In such a case, don't
    idle.

    [global]
    runtime=10
    direct=1

    [test1]
    rw=randread
    ioengine=libaio
    size=500m
    directory=/mnt
    filename=file1
    thinktime=9000

    [test2]
    rw=read
    ioengine=libaio
    size=1G
    directory=/mnt
    filename=file2

              patched      base
    test1     41k/s        33k/s
    test2     15868k/s     15789k/s
    total     15902k/s     15817k/s

    Slightly better overall.

    To check whether the patch changes the behavior of queues without think time,
    I also tried giving test1 a 2ms think time or no think time. The test has
    variation even without the patch, but the average throughput doesn't change
    with or without the patch.

    Signed-off-by: Shaohua Li
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Move the variables used for the think time check into a separate struct. This
    prepares for adding think time checks for the service tree and group. No
    functional change.

    Signed-off-by: Shaohua Li
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Shaohua Li
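    A sketch of what such a separate think-time struct and its running-mean
    update might look like; the field names and the decay weighting are an
    approximation for illustration, not the literal cfq definitions:

    #include <stdio.h>

    /* Consolidated think-time state (per queue, and after the follow-up
     * patches also per service tree and per group). */
    struct ttime_stats {
        unsigned long last_end_request;  /* when this entity's last request completed */
        unsigned long ttime_total;       /* accumulated think-time samples */
        unsigned long ttime_samples;     /* number of samples (decayed) */
        unsigned long ttime_mean;        /* running mean used by the idle heuristic */
    };

    /* Fold one new sample into the running mean, decaying the history so the
     * mean tracks recent behaviour (the weighting here is illustrative). */
    static void ttime_update(struct ttime_stats *tt, unsigned long now)
    {
        unsigned long elapsed = now - tt->last_end_request;

        tt->ttime_samples = (7 * tt->ttime_samples + 256) / 8;
        tt->ttime_total   = (7 * tt->ttime_total + 256 * elapsed) / 8;
        tt->ttime_mean    = tt->ttime_total / tt->ttime_samples;
    }

    int main(void)
    {
        struct ttime_stats tt = { .last_end_request = 0, .ttime_samples = 1 };
        ttime_update(&tt, 9000);
        printf("mean think time: %lu us\n", tt.ttime_mean);
        return 0;
    }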
     
  • fs_excl is a poor man's priority inheritance for filesystems to hint to
    the block layer that an operation is important. It was never clearly
    specified, not widely adopted, and will not prevent starvation in many
    cases (like across cgroups).

    fs_excl was introduced with the time sliced CFQ IO scheduler, to
    indicate when a process held FS exclusive resources and thus needed
    a boost.

    It doesn't cover all filesystems, and it was never fully completed.
    Let's kill it.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

11 Jul, 2011

1 commit

  • There is no consistency among filesystems from what bios (or requests)
    are marked as being metadata. It's interesting to expose this in traces,
    but we shouldn't schedule the requests differently based on whether or
    not they're marked as being metadata.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

01 Jul, 2011

1 commit


27 Jun, 2011

2 commits

  • ioc->ioc_data is RCU protected, so use the correct API to access it.
    This doesn't change any behavior, it just makes the code consistent.

    Signed-off-by: Shaohua Li
    Cc: stable@kernel.org # after ab4bd22d
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • I got an RCU warning at boot: ioc->ioc_data is rcu_dereference()d without
    holding rcu_read_lock().

    Signed-off-by: Shaohua Li
    Cc: stable@kernel.org # after ab4bd22d
    Signed-off-by: Jens Axboe

    Shaohua Li
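    Both entries above amount to the standard RCU read-side rule. A schematic
    kernel-style fragment (not the literal cfq code, and not standalone code)
    showing the intended access pattern for ioc->ioc_data:

    /*
     * Read side: the dereference of an RCU-protected pointer such as
     * ioc->ioc_data must use rcu_dereference() and must sit inside an
     * rcu_read_lock()/rcu_read_unlock() section, otherwise the object can be
     * freed under us (and lockdep/sparse will warn, as in the entry above).
     */
    rcu_read_lock();
    cic = rcu_dereference(ioc->ioc_data);
    if (cic && cic->key == cfqd) {
        /* ... use cic only while still inside the read-side section ... */
    }
    rcu_read_unlock();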
     

14 Jun, 2011

1 commit


13 Jun, 2011

1 commit


06 Jun, 2011

3 commits

  • Correctly suggested by sparse.

    Signed-off-by: Paul Bolle
    Signed-off-by: Jens Axboe

    Paul Bolle
     
  • Since we are modifying this RCU pointer, we need to hold
    the lock protecting it around it.

    This fixes a potential reuse and double free of a cfq
    io_context structure. The bug has been in CFQ for a long
    time, it hit very few people but those it did hit seemed
    to see it a lot.

    Tracked in RH bugzilla here:

    https://bugzilla.redhat.com/show_bug.cgi?id=577968

    Credit goes to Paul Bolle for figuring out that the issue
    was around the one-hit ioc->ioc_data cache. Thanks to his
    hard work the issue is now fixed.

    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Since we are modifying this RCU pointer, we need to hold
    the lock protecting it around it.

    This fixes a potential reuse and double free of a cfq
    io_context structure. The bug has been in CFQ for a long
    time, it hit very few people but those it did hit seemed
    to see it a lot.

    Tracked in RH bugzilla here:

    https://bugzilla.redhat.com/show_bug.cgi?id=577968

    Credit goes to Paul Bolle for figuring out that the issue
    was around the one-hit ioc->ioc_data cache. Thanks to his
    hard work the issue is now fixed.

    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
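    The two entries above are the same fix. The pattern is the usual RCU
    update-side rule, shown here as a schematic kernel-style fragment (not the
    literal cfq code): publish the new value of the RCU-protected pointer with
    rcu_assign_pointer() only while holding the lock that serializes updaters.

    /*
     * Update side: take the lock that protects ioc->ioc_data before publishing
     * the new pointer, so two updaters cannot race and the old object cannot
     * be reused or freed twice.
     */
    spin_lock_irqsave(&ioc->lock, flags);
    rcu_assign_pointer(ioc->ioc_data, cic);
    spin_unlock_irqrestore(&ioc->lock, flags);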
     

03 Jun, 2011

1 commit

  • Hi, Jens,

    If you recall, I posted an RFC patch for this back in July of last year:
    http://lkml.org/lkml/2010/7/13/279

    The basic problem is that a process can issue a never-ending stream of
    async direct I/Os to the same sector on a device, thus starving out
    other I/O in the system (due to the way the alias handling works in both
    cfq and deadline). The solution I proposed back then was to start
    dispatching from the fifo after a certain number of aliases had been
    dispatched. Vivek asked why we had to treat aliases differently at all,
    and I never had a good answer. So, I put together a simple patch which
    allows aliases to be added to the rb tree (it adds them to the right,
    though that doesn't matter as the order isn't guaranteed anyway). I
    think this is the preferred solution, as it doesn't break up time slices
    in CFQ or batches in deadline. I've tested it, and it does solve the
    starvation issue. Let me know what you think.

    Cheers,
    Jeff

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
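    The change essentially lets requests with the same sector key coexist in the
    elevator's rb tree by always walking right on an equal key. A schematic
    fragment, roughly the shape of the insertion loop (not the exact elevator
    code):

    struct rb_node **p = &root->rb_node, *parent = NULL;

    while (*p) {
        struct request *__rq;

        parent = *p;
        __rq = rb_entry(parent, struct request, rb_node);

        if (blk_rq_pos(rq) < blk_rq_pos(__rq))
            p = &(*p)->rb_left;
        else
            p = &(*p)->rb_right;    /* equal keys (aliases) simply go right */
    }

    rb_link_node(&rq->rb_node, parent, p);
    rb_insert_color(&rq->rb_node, root);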
     

02 Jun, 2011

1 commit


01 Jun, 2011

1 commit


24 May, 2011

4 commits


23 May, 2011

1 commit


21 May, 2011

4 commits

  • Currently we take the blkg_stat lock even just to update the stats. So even
    if a group has no throttling rules (the common case for the root group), we
    end up taking blkg_lock just to update the stats.

    Make dispatch stats per cpu so that these can be updated without taking
    blkg lock.

    If a cpu goes offline, these stats simply disappear. No protection has
    been provided for that yet. Do we really need anything for that?

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
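    Schematically, per-cpu dispatch stats mean each CPU bumps its own copy with
    no shared lock, and a reader sums over all CPUs. A rough kernel-style
    fragment, with a hypothetical stats struct standing in for the real
    blk-cgroup one:

    /* Hypothetical per-group stats struct, for illustration only. */
    struct grp_stats_cpu { u64 sectors; u64 serviced; };
    struct grp_stats_cpu __percpu *stats;

    stats = alloc_percpu(struct grp_stats_cpu);    /* may sleep: see the next entry */

    /* Fast path: bump only this CPU's counters, no blkg lock needed. */
    preempt_disable();
    this_cpu_ptr(stats)->sectors += nr_sectors;
    this_cpu_ptr(stats)->serviced++;
    preempt_enable();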
     
  • Currently, all the cfq_group or throtl_group allocations happen while
    we are holding ->queue_lock and sleeping is not allowed.

    Soon, we will move to per cpu stats and will also need to allocate the
    per group stats. Since alloc_percpu() can sleep and cannot be called from
    atomic context, we need to drop ->queue_lock, allocate the group, retake
    the lock and continue processing.

    In the throttling code, I check the queue DEAD flag again to make sure
    that the driver did not call blk_cleanup_queue() in the meantime.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
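    The resulting allocation path has the classic drop-lock/allocate/retake/
    re-check shape, sketched below as an illustrative kernel-style fragment
    (alloc_group() is a hypothetical helper, not the real function):

    /* We hold ->queue_lock (a spinlock, so atomic context) and found no group. */
    spin_unlock_irq(q->queue_lock);

    /* Sleeping is now allowed: allocate the group and its per-cpu stats. */
    new_group = alloc_group();    /* hypothetical helper that calls alloc_percpu() */

    spin_lock_irq(q->queue_lock);

    /* Re-check anything that may have changed while the lock was dropped. */
    if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
        /* blk_cleanup_queue() ran in the meantime: free new_group and bail out */
    }
    /* Another task may also have allocated the group meanwhile: look it up
     * again before inserting ours, and free our copy if we lost that race. */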
     
  • blkg->key = cfqd is an rcu protected pointer and hence we used to do
    call_rcu(cfqd->rcu_head) to free up cfqd after one rcu grace period.

    The problem here is that even though cfqd is around, there are no
    guarantees that the associated request queue (td->queue) or q->queue_lock
    is still around. A driver might have called blk_cleanup_queue() and
    released the lock.

    It might happen that after the lock has been freed we dereference
    blkg->key->queue->queue_lock and crash. This is possible in the following
    path.

    blkiocg_destroy()
    blkio_unlink_group_fn()
    cfq_unlink_blkio_group()

    Hence, wait for an RCU period if there are groups which have not
    been unlinked from blkcg->blkg_list. That way, any groups which are taking
    the cfq_unlink_blkio_group() path can safely take the queue lock.

    This is how we have taken care of race in throttling logic also.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Nobody seems to be using the cfq_find_alloc_cfqg() function parameter
    "create". Get rid of it.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

16 May, 2011

1 commit

  • Currently we first map the task to a cgroup and then the cgroup to the
    blkio_cgroup. There is a more direct way to get to the blkio_cgroup
    from a task using task_subsys_state(). Use that.

    The real reason for the fix is that it also avoids a race in the generic
    cgroup code. During remount/umount rebind_subsystems() is called and
    it can do the following without any RCU protection:

    cgrp->subsys[i] = NULL;

    That means if somebody got hold of a cgroup under rcu and then tried
    to use cgroup->subsys[] to get to the blkio_cgroup, it would get NULL, which
    is wrong. I was running into this race condition with ltp running on an
    upstream-derived kernel, and that led to a crash.

    So ideally we should also fix the generic cgroup code to wait for an rcu
    grace period before setting the pointer to NULL. Li Zefan is not very keen
    on introducing synchronize_wait() as he thinks it will slow
    down mount/remount/umount operations.

    So for the time being at least, fix the kernel crash by taking a more
    direct route to the blkio_cgroup.

    One tester had reported a crash while running LTP on a derived kernel;
    with this fix the crash is no longer seen, with the test having been
    running for over 6 days.

    Signed-off-by: Vivek Goyal
    Reviewed-by: Li Zefan
    Signed-off-by: Jens Axboe

    Vivek Goyal
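    The more direct route amounts to asking for the task's css for the blkio
    subsystem straight away, roughly as in this schematic fragment (helper and
    field names are from memory and may not match the source exactly):

    /* Old, racy route: task -> cgroup -> cgrp->subsys[blkio_subsys_id],
     * which can observe the NULL written by rebind_subsystems().
     * Direct route: ask for the task's css for the blkio subsystem. */
    struct blkio_cgroup *blkcg;

    blkcg = container_of(task_subsys_state(current, blkio_subsys_id),
                         struct blkio_cgroup, css);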
     

19 Apr, 2011

1 commit

  • For some configurations of CONFIG_PREEMPT that is not true. So
    get rid of __call_for_each_cic() and always use the explicitly
    rcu_read_lock() protected call_for_each_cic() instead.

    This fixes a potential bug related to IO scheduler removal or
    online switching.

    Thanks to Paul McKenney for clarifying this.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Apr, 2011

1 commit

  • Instead of overloading __blk_run_queue to force an offload to kblockd,
    add a new blk_run_queue_async helper to do it explicitly. I've kept
    the blk_queue_stopped check for now, but I suspect it's not needed
    as the check we do when the workqueue item runs should be enough.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

31 Mar, 2011

1 commit


23 Mar, 2011

3 commits

  • Remove the think time checking. A high-thinktime queue might mean the queue
    dispatches several requests and then goes away. Limiting such a queue seems
    meaningless. This also simplifies the code. Suggested by Vivek.

    Signed-off-by: Shaohua Li
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Li, Shaohua
     
  • For v2, I added back lines to cfq_preempt_queue() that were removed
    during updates for accounting unaccounted_time. Thanks for pointing out
    that I'd missed these, Vivek.

    Previous commit "cfq-iosched: Don't set active queue in preempt" wrongly
    cleared stats for preempting queues when it shouldn't have, because when
    we choose a queue to preempt, it still isn't necessarily scheduled next.

    Thanks to Vivek Goyal for figuring this out and understanding how the
    preemption code works.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     
  • Commit "Add unaccounted time to timeslice_used" changed the behavior of
    cfq_preempt_queue to set cfqq active. Vivek pointed out that other
    preemption rules might get involved, so we shouldn't manually set which
    queue is active.

    This cleans up the code to just clear the queue stats at preemption
    time.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

17 Mar, 2011

1 commit

  • Version 3 is updated to apply to for-2.6.39/core.

    For version 2, I took Vivek's advice and made sure we update the group
    weight from cfq_group_service_tree_add().

    If a weight was updated while a group is on the service tree, the
    calculation for the total weight of the service tree can be adjusted
    improperly, which either leads to bad service tree weights, or
    potentially crashes (if total_weight becomes 0).

    This patch defers updates to the weight until a group is off the service
    tree.

    Signed-off-by: Justin TerAvest
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

12 Mar, 2011

1 commit

  • There are two kinds of time that tasks are not charged for: the first
    seek and the extra time slice used over the allocated timeslice. Both
    of these are exported as a new unaccounted_time stat.

    I think it would be good to have this reported in 'time' as well, but
    that is probably a separate discussion.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

10 Mar, 2011

1 commit