24 Apr, 2009

3 commits

  • Currently we look up the prio_tree index from ->ioprio, but ->ioprio
    can change if either the process gets its IO priority changed
    explicitly, or if cfq decides to temporarily boost it. So if we are
    unlucky, we can end up attempting to remove a node from a different
    rbtree root than the one it was added to.

    Fix this by using ->org_ioprio as the prio_tree index, since that
    will only change for explicit IO priority settings (not for a boost).
    Additionally cache the rbtree root inside the cfqq, then we don't have
    to add code to reinsert the cfqq in the prio_tree if IO priority changes.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
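The idea can be sketched in plain C. Structure and field names below are simplified stand-ins for the kernel's cfq_queue/cfq_data, and the actual rbtree insert/erase calls are elided; only the indexing and root-caching logic from the commit message is shown.

```c
#include <assert.h>
#include <stddef.h>

#define CFQ_PRIO_LEVELS 8   /* hypothetical number of priority levels */

struct rb_root { void *rb_node; };

/* Simplified stand-ins for the kernel's cfq_queue / cfq_data. */
struct cfq_queue {
    unsigned short ioprio;      /* may change: explicit setting or boost */
    unsigned short org_ioprio;  /* changes only on explicit settings */
    struct rb_root *p_root;     /* cached root this queue was added under */
};

struct cfq_data {
    struct rb_root prio_trees[CFQ_PRIO_LEVELS];
};

/* Insert: index by org_ioprio, and cache the chosen root in the queue. */
static void prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
    cfqq->p_root = &cfqd->prio_trees[cfqq->org_ioprio];
    /* ... rb_link_node()/rb_insert_color() into *cfqq->p_root ... */
}

/* Remove: use the cached root, so a temporary boost of ->ioprio can no
 * longer send us to the wrong tree. */
static struct rb_root *prio_tree_remove(struct cfq_queue *cfqq)
{
    struct rb_root *root = cfqq->p_root;
    /* ... rb_erase() from *root ... */
    cfqq->p_root = NULL;
    return root;
}
```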
  • cfq_prio_tree_lookup() should return the direct match, yet it always
    returns zero. Fix that.

    cfq_prio_tree_add() assumes that we don't get a direct match, while
    it is very possible that we do. Using O_DIRECT, you can have different
    cfqqs with matching requests, since you don't have the page cache
    to serialize things for you. Fix this bug by only adding the cfqq if
    there isn't an existing match.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
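Both fixes can be sketched with an ordinary binary search tree standing in for the kernel rbtree (names and the sector key are simplified for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy BST keyed by sector, standing in for the kernel rbtree. */
struct node {
    unsigned long long sector;
    struct node *left, *right;
};

/* Fix 1: the lookup must actually return a direct match. */
static struct node *prio_tree_lookup(struct node *root,
                                     unsigned long long sector)
{
    while (root) {
        if (sector < root->sector)
            root = root->left;
        else if (sector > root->sector)
            root = root->right;
        else
            return root;   /* direct match: return it, not NULL */
    }
    return NULL;
}

/* Fix 2: only add a new node if there is no existing match.  With
 * O_DIRECT, two cfqqs can carry requests for the same position, so a
 * direct match is very possible. */
static struct node *prio_tree_add(struct node **root,
                                  unsigned long long sector)
{
    struct node **p = root;
    while (*p) {
        if (sector < (*p)->sector)
            p = &(*p)->left;
        else if (sector > (*p)->sector)
            p = &(*p)->right;
        else
            return *p;     /* existing match: do not insert a duplicate */
    }
    struct node *n = calloc(1, sizeof(*n));
    n->sector = sector;
    *p = n;
    return n;
}
```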
  • Not strictly needed, but we should make it clear that we init the
    rbtree roots here.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

22 Apr, 2009

2 commits


15 Apr, 2009

7 commits

  • If we have processes that are working in close proximity to each
    other on disk, we don't want to idle wait. Instead allow the close
    process to issue a request, getting better aggregate bandwidth.
    The anticipatory scheduler has similar checks; noop and deadline do
    not need it, since they don't care about process IO mappings.

    The code for CFQ is a little more involved though, since we split
    request queues into per-process contexts.

    This fixes a performance problem with e.g. dump(8), since it uses
    several processes in some silly attempt to speed IO up. Even if
    dump(8) isn't really a valid case (it should be fixed by using
    CLONE_IO), there are other cases where we see close processes
    and where idling ends up hurting performance.

    Credit goes to Jeff Moyer for writing the
    initial implementation.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
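The core "closeness" test reduces to a seek-distance threshold. The threshold value below is made up for illustration; CFQ's real heuristic also tracks a per-process mean seek distance.

```c
#include <assert.h>

/* Hypothetical threshold, in sectors, for considering two processes
 * "close" on disk.  If a cooperating queue's pending request is within
 * this distance of the head position, dispatch from it instead of
 * idle waiting. */
#define CFQQ_CLOSE_THR 8192ULL

static unsigned long long seek_dist(unsigned long long last_pos,
                                    unsigned long long next_pos)
{
    return next_pos > last_pos ? next_pos - last_pos
                               : last_pos - next_pos;
}

static int cfqq_is_close(unsigned long long head_pos,
                         unsigned long long rq_pos)
{
    return seek_dist(head_pos, rq_pos) <= CFQQ_CLOSE_THR;
}
```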
  • Makes it easier to read the traces.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We only kick the dispatch for an idling queue if we think it's a
    (somewhat) fully merged request. Also allow a kick if we have other
    busy queues in the system, since we don't want to risk waiting for
    a potential merge in that case. It's better to get some work done and
    proceed.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • It's called from the workqueue handlers from process context, so
    we always have irqs enabled when entered.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • "Zhang, Yanmin" reports that commit
    b029195dda0129b427c6e579a3bb3ae752da3a93 introduced a regression
    of about 50% with sequential threaded read workloads. The test
    case is:

    tiotest -k0 -k1 -k3 -f 80 -t 32

    which starts 32 threads, each reading an 80MB file. Twiddle the kick
    queue logic so that we do start IO immediately, if it appears to be
    a fully merged request. We can't really detect that, so just check
    if the request is bigger than a page or not. The assumption is that
    since single bio issues will first queue a single request with just
    one page attached and then later do merges on that, if we already
    have more than a page worth of data in the request, then the request
    is most likely good to go.

    Verified that this doesn't cause a regression with the test case that
    commit b029195dda0129b427c6e579a3bb3ae752da3a93 was fixing. It does not,
    we still see maximum sized requests for the queue-then-merge cases.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
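The "probably fully merged" heuristic boils down to a size check, combined with the other-busy-queues condition from the earlier commit. A 4K page is assumed here, and the names are invented for illustration:

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096u  /* assumption: 4K pages */

/* A single-bio issue starts life as a one-page request and gets merged
 * into later.  If the request already carries more than a page, treat
 * it as merged enough and start IO immediately. */
static int rq_looks_merged(unsigned int rq_bytes)
{
    return rq_bytes > TOY_PAGE_SIZE;
}

/* Kick an idling queue if the request looks merged, or if other queues
 * are busy and waiting on a potential merge would risk stalling them. */
static int should_kick_dispatch(unsigned int rq_bytes,
                                int other_busy_queues)
{
    return rq_looks_merged(rq_bytes) || other_busy_queues > 0;
}
```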
  • We can just use the block layer BLK_RW_SYNC/ASYNC defines now.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     

07 Apr, 2009

3 commits

  • When CFQ is waiting for a new request from a process, currently it'll
    immediately restart queuing when it sees such a request. This doesn't
    work very well with streamed IO, since we then end up splitting IO
    that would otherwise have been merged nicely. For a simple dd test,
    this causes 10x as many requests to be issued as we should have.
    Normally this goes unnoticed due to the low overhead of requests
    at the device side, but some hardware is very sensitive to request
    sizes and there it can cause big slowdowns.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We only manipulate the must_dispatch and queue_new flags; they are not
    tested anymore. So get rid of them.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The IO scheduler core calls into the IO scheduler dispatch_request hook
    to move requests from the IO scheduler and into the driver dispatch
    list. It only does so when the dispatch list is empty. CFQ moves several
    requests to the dispatch list, which can cause higher latencies if we
    suddenly have to switch to some important sync IO. Change the logic to
    move one request at a time instead.

    This should almost be functionally equivalent to what we did before,
    except that we now honor 'quantum' as the maximum queue depth at the
    device side from any single cfqq. If there's just a single active
    cfqq, we allow up to 4 times the normal quantum.

    Signed-off-by: Jens Axboe

    Jens Axboe
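The resulting device-side depth limit can be sketched as follows (cfq_quantum defaulted to 4 in this era's CFQ; treat the function name as illustrative):

```c
#include <assert.h>

/* Maximum number of requests a single cfqq may have on the device-side
 * dispatch list.  With only one active queue there is no fairness to
 * protect, so allow it to drive the hardware four times harder. */
static unsigned int cfq_max_dispatch(unsigned int quantum,
                                     unsigned int busy_queues)
{
    return busy_queues == 1 ? quantum * 4 : quantum;
}
```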
     

06 Apr, 2009

1 commit

  • By default, CFQ will anticipate more IO from a given io context if the
    previously completed IO was sync. This used to be fine, since the only
    sync IO was reads and O_DIRECT writes. But with more "normal" sync writes
    being used now, we don't want to anticipate for those.

    Add a bio/request flag that informs the IO scheduler that this is a sync
    request that we should not idle for. Introduce WRITE_ODIRECT specifically
    for O_DIRECT writes, and make sure that the other sync writes set this
    flag.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     
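The flag-based decision can be sketched like this. The REQ_TOY_* flag names are invented for illustration; the real kernel flags differ.

```c
#include <assert.h>

/* Invented flag bits, for illustration only. */
#define REQ_TOY_SYNC    (1u << 0)
#define REQ_TOY_NOIDLE  (1u << 1) /* sync, but don't anticipate more IO */

/* Reads and O_DIRECT writes stay plain sync; the other sync writes also
 * set NOIDLE, so the scheduler won't sit idle waiting for a follow-up. */
static int cfq_should_idle_after(unsigned int rq_flags)
{
    return (rq_flags & REQ_TOY_SYNC) && !(rq_flags & REQ_TOY_NOIDLE);
}
```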

30 Jan, 2009

1 commit

  • This patch adds the ability to pre-empt an ongoing BE timeslice when a RT
    request is waiting for the current timeslice to complete. This reduces the
    wait time to disk for RT requests from an upper bound of 4 (current value
    of cfq_quantum) to 1 disk request.

    Applied Jens' suggested changes to avoid the rb lookup and use !cfq_class_rt()
    and retested.

    Latency (secs) for the RT task when doing sequential reads from a 10GB file:

                             only RT | RT + BE | RT + BE + this patch
    small (512 byte) reads |   143   |   163   |   145
    large (1MB) reads      |   142   |   158   |   146

    Signed-off-by: Divyesh Shah
    Signed-off-by: Jens Axboe

    Divyesh Shah
     
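The preemption check reduces to a class comparison, per the !cfq_class_rt() suggestion above; sketched here with a toy enum:

```c
#include <assert.h>

enum cfq_class { CFQ_CLASS_RT, CFQ_CLASS_BE, CFQ_CLASS_IDLE };

/* An arriving RT request preempts the ongoing timeslice unless the
 * active queue is itself RT, so RT waits for at most one in-flight
 * request instead of up to cfq_quantum of them. */
static int cfq_should_preempt(enum cfq_class active,
                              enum cfq_class incoming)
{
    return incoming == CFQ_CLASS_RT && active != CFQ_CLASS_RT;
}
```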

29 Dec, 2008

4 commits

  • Original patch from Nikanth Karthikesan

    When a queue exits the queue lock is taken and cfq_exit_queue() would free all
    the cic's associated with the queue.

    But when a task exits, cfq_exit_io_context() gets the cics one by one and then
    locks the associated queue to call __cfq_exit_single_io_context(). It looks like,
    between getting a cic from the ioc and locking the queue, the queue might have
    exited on another CPU.

    Fix this by rechecking the cfq_io_context queue key inside the queue lock
    again, and not calling into __cfq_exit_single_io_context() if somebody
    beat us to it.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This basically limits the hardware queue depth to 4*quantum at any
    point in time, which is 16 with the default settings. As CFQ uses
    other means to shrink the hardware queue when necessary in the first
    place, there's really no need for this extra heuristic. Additionally,
    it ends up hurting performance in some cases.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Just use struct elevator_queue everywhere instead.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • After many improvements to kblockd_flush_work, it is now identical to
    cancel_work_sync, so a direct call to cancel_work_sync is suggested.

    The only difference is that cancel_work_sync is a GPL-only symbol,
    so it is no longer available to non-GPL modules.

    Signed-off-by: Cheng Renquan
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Cheng Renquan
     

09 Oct, 2008

4 commits

  • We really need to know about the hardware tagging support as well,
    since if the SSD does not do tagging then we still want to idle.
    Otherwise we have the same dependent sync IO vs flooding async IO
    problem as on rotational media.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We don't want to idle in AS/CFQ if the device doesn't have a seek
    penalty. So add a QUEUE_FLAG_NONROT to indicate a non-rotational
    device, low level drivers should set this flag upon discovery of
    an SSD or similar device type.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • CFQ's detection of queueing devices assumes a non-queuing device and detects
    if the queue depth reaches a certain threshold. Under some workloads (e.g.
    synchronous reads), CFQ effectively forces a unit queue depth, thus defeating
    the detection logic. This leads to poor performance on queuing hardware,
    since the idle window remains enabled.

    This patch inverts the sense of the logic: assume a queuing-capable device,
    and detect if the depth does not exceed the threshold.

    Signed-off-by: Aaron Carroll
    Signed-off-by: Jens Axboe

    Aaron Carroll
     
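The inverted detection can be sketched as periodic sampling: assume a queuing-capable device, and only conclude otherwise when the observed depth never exceeds the threshold over a sample window while enough work was actually available. The window and threshold values here are made up; the real code's sampling conditions are more involved.

```c
#include <assert.h>

#define TOY_SAMPLE_WINDOW 50  /* hypothetical sampling window */
#define TOY_DEPTH_THR      4  /* hypothetical depth threshold */

struct hw_tag_det {
    int hw_tag;      /* start by assuming a queuing-capable device */
    int samples;
    int peak_depth;
};

static void hw_tag_init(struct hw_tag_det *d)
{
    d->hw_tag = 1;
    d->samples = d->peak_depth = 0;
}

/* Call per completion with the in-driver and scheduler-queued counts. */
static void hw_tag_sample(struct hw_tag_det *d, int in_driver, int queued)
{
    if (in_driver > d->peak_depth)
        d->peak_depth = in_driver;

    /* Too little work on offer to say anything about the hardware;
     * this is what stops a unit-depth sync workload from faking a
     * verdict either way. */
    if (queued <= TOY_DEPTH_THR && in_driver <= TOY_DEPTH_THR)
        return;

    if (++d->samples < TOY_SAMPLE_WINDOW)
        return;

    d->hw_tag = d->peak_depth > TOY_DEPTH_THR;
    d->samples = d->peak_depth = 0;
}
```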
  • Preparatory patch for checking queuing affinity.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 Jul, 2008

3 commits


28 May, 2008

2 commits


07 May, 2008

2 commits


10 Apr, 2008

1 commit

  • When switching scheduler from cfq, cfq_exit_queue() does not clear
    ioc->ioc_data, leaving a dangling pointer that can deceive the following
    lookups when the iosched is switched back to cfq. The pattern that can
    trigger that is the following:

    - elevator switch from cfq to something else;
    - module unloading, with elv_unregister() that calls cfq_free_io_context()
    on ioc freeing the cic (via the .trim op);
    - module gets reloaded and the elevator switches back to cfq;
    - reallocation of a cic at the same address as before (with a valid key).

    To fix it, just assign NULL to ioc_data in __cfq_exit_single_io_context(),
    that is called from the regular exit path and from the elevator switching
    code. The only path that frees a cic and is not covered is the error handling
    one, but cic's freed in this way are never cached in ioc_data.

    Signed-off-by: Fabio Checconi
    Signed-off-by: Jens Axboe

    Fabio Checconi
     
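The fix amounts to clearing the one-entry lookup cache whenever the cic it points to is torn down; sketched here with simplified stand-in structures:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins: the io_context caches the most recently used
 * cfq_io_context so the common-case lookup can skip the tree. */
struct cfq_io_context { int dummy; };
struct io_context { struct cfq_io_context *ioc_data; };

/* Corresponds to the teardown in __cfq_exit_single_io_context(): drop
 * the cached pointer, so a later cic allocated at the same address
 * (after a module reload and an elevator switch back to cfq) cannot
 * hit a stale cache entry. */
static void cic_exit(struct io_context *ioc, struct cfq_io_context *cic)
{
    if (ioc->ioc_data == cic)
        ioc->ioc_data = NULL;
    /* ... rest of the cic teardown ... */
}
```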

02 Apr, 2008

1 commit


19 Feb, 2008

1 commit


01 Feb, 2008

1 commit


28 Jan, 2008

4 commits

  • Use of inlines was a bit over the top; trim them down a bit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Currently you must be root to set the idle IO priority class on a
    process. This is due to the fact that the idle class is implemented as
    a true idle class, meaning that it will not make progress if someone
    else is requesting disk access. Unfortunately this means that it opens
    DoS opportunities by locking down file system resources, hence it is
    root only at the moment.

    This patch relaxes the idle class a little, by removing the truly idle
    part (which entails a grace period with an associated timer). The
    modifications make the idle class as close to zero impact as can be done
    while still guaranteeing progress. This means we can relax the root-only
    criteria as well.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The io context sharing introduced a per-ioc spinlock, that would protect
    the cfq io context lookup. That is a regression from the original, since
    we never needed any locking there because the ioc/cic were process private.

    The cic lookup is changed from an rbtree construct to a radix tree, which
    we can then use RCU to make the reader side lockless. That is the
    performance-critical path; modifying the radix tree is only done on process creation
    (when that process first does IO, actually) and on process exit (if that
    process has done IO).

    As it so happens, radix trees are also much faster for this type of
    lookup where the key is a pointer. It's a very sparse tree.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Changes in CFQ for io_context sharing.

    Signed-off-by: Jens Axboe

    Nikanth Karthikesan