14 Aug, 2009

1 commit

  • Conflicts:
    arch/sparc/kernel/smp_64.c
    arch/x86/kernel/cpu/perf_counter.c
    arch/x86/kernel/setup_percpu.c
    drivers/cpufreq/cpufreq_ondemand.c
    mm/percpu.c

    Conflicts in the core and arch percpu code are mostly from commit
    ed78e1e078dd44249f88b1dd8c76dafb39567161, which replaced many uses of
    num_possible_cpus() with nr_cpu_ids. As the for-next branch has moved
    all the first chunk allocators into mm/percpu.c, the changes are moved
    from arch code to mm/percpu.c.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

11 Jul, 2009

1 commit

  • In case memory is scarce, we now default to oom_cfqq. Once memory is
    available again, we should allocate a new cfqq and stop using oom_cfqq for
    a particular io context.

    Once a new request comes in, check if we are using oom_cfqq, and if yes,
    try to allocate a new cfqq.

    Tested the patch by forcing the use of oom_cfqq; upon the next
    request, the thread realized it was using oom_cfqq and allocated a new
    cfqq.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
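
    A minimal sketch of the retry described above, under the assumption of
    helper names modeled on cfq-iosched.c (cic_to_cfqq(), cic_set_cfqq(),
    cfq_get_queue()); the wrapper function itself is hypothetical, not the
    actual patch.

        /* Hypothetical helper: if an earlier allocation failure left this
         * io context on the shared oom_cfqq, retry a real allocation now
         * that a new request has come in and memory may be available. */
        static struct cfq_queue *
        cfqq_refresh(struct cfq_data *cfqd, struct cfq_io_context *cic,
                     bool is_sync, struct io_context *ioc, gfp_t gfp_mask)
        {
                struct cfq_queue *cfqq = cic_to_cfqq(cic, is_sync);

                if (!cfqq || cfqq == &cfqd->oom_cfqq) {
                        cfqq = cfq_get_queue(cfqd, is_sync, ioc, gfp_mask);
                        cic_set_cfqq(cic, cfqq, is_sync);
                }
                return cfqq;
        }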
     

04 Jul, 2009

1 commit

  • Pull linus#master to merge PER_CPU_DEF_ATTRIBUTES and alpha build fix
    changes. As alpha in the percpu tree uses the 'weak' attribute instead
    of inline assembly, there's no need for the __used attribute.

    Conflicts:
    arch/alpha/include/asm/percpu.h
    arch/mn10300/kernel/vmlinux.lds.S
    include/linux/percpu-defs.h

    Tejun Heo
     

01 Jul, 2009

3 commits


24 Jun, 2009

1 commit

  • Percpu variable definitions are about to be updated such that all
    percpu symbols, including static ones, must be unique. Update percpu
    variable definitions accordingly.

    * as,cfq: rename ioc_count uniquely

    * cpufreq: rename cpu_dbs_info uniquely

    * xen: move nesting_count out of xen_evtchn_do_upcall() and rename it

    * mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and
    rename it

    * ipv4,6: rename cookie_scratch uniquely

    * x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to
    pmc_irq_entry and nmi_entry to pmc_nmi_entry

    * perf_counter: rename disable_count to perf_disable_count

    * ftrace: rename test_event_disable to ftrace_test_event_disable

    * kmemleak: rename test_pointer to kmemleak_test_pointer

    * mce: rename next_interval to mce_next_interval

    [ Impact: percpu usage cleanups, no duplicate static percpu var names ]

    Signed-off-by: Tejun Heo
    Reviewed-by: Christoph Lameter
    Cc: Ivan Kokshaysky
    Cc: Jens Axboe
    Cc: Dave Jones
    Cc: Jeremy Fitzhardinge
    Cc: linux-mm
    Cc: David S. Miller
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Li Zefan
    Cc: Catalin Marinas
    Cc: Andi Kleen

    Tejun Heo
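
    As a concrete illustration of the rename pattern, using the
    perf_counter item above (the int type is shown for illustration only;
    the surrounding file context is omitted):

        /* Before: a static percpu symbol with a generic name. */
        static DEFINE_PER_CPU(int, disable_count);

        /* After: prefixed so that every percpu symbol, including static
         * ones, is unique across the kernel. */
        static DEFINE_PER_CPU(int, perf_disable_count);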
     

16 Jun, 2009

2 commits


11 Jun, 2009

1 commit

  • Currently io_context has an atomic_t (32-bit) as its refcount. In the
    case of cfq, a reference to the io_context is taken for each device
    against which a task does I/O. And when multiple processes share an
    io_context (CLONE_IO), each of them also holds a reference to the same
    io_context.

    Theoretically the possible maximum number of processes sharing the same
    io_context + the number of disks/cfq_data referring to the same io_context
    can overflow the 32-bit counter on a very high-end machine.

    Even though it is an improbable case, let us make it atomic_long_t.

    Signed-off-by: Nikanth Karthikesan
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Nikanth Karthikesan
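
    A hedged sketch of the type change (the real struct io_context has
    many more fields, and the helper name below is hypothetical; only the
    refcount and the corresponding increment are shown):

        struct io_context {
                atomic_long_t refcount;        /* was: atomic_t refcount; */
                /* other fields unchanged */
        };

        static void get_io_context_ref(struct io_context *ioc)
        {
                /* Taking a reference now uses the long atomic ops. */
                atomic_long_inc(&ioc->refcount);
        }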
     

11 May, 2009

3 commits

  • struct request has had a few different ways to represent some
    properties of a request. The ->hard_* fields represent the block
    layer's view of request progress (the completion cursor), and the ones
    without the prefix represent the issue cursor and are allowed to be
    updated as necessary by the low level drivers. The thing is that, as
    the block layer supports partial completion, the two cursors really
    aren't necessary and only cause confusion. In addition, manual
    management of request details from low level drivers is cumbersome and
    error-prone at the very least.

    Another interesting set of duplicate fields is rq->[hard_]nr_sectors
    and rq->{hard_cur|current}_nr_sectors versus rq->data_len and
    rq->bio->bi_size. This is more convoluted than the hard_ case.

    rq->[hard_]nr_sectors are initialized for requests with a bio, but
    blk_rq_bytes() uses them only for !pc requests. rq->data_len is
    initialized for all requests, but blk_rq_bytes() uses it only for pc
    requests. This causes a good amount of confusion throughout the block
    layer and its drivers, and determining the request length has been a
    bit of black magic which may or may not work depending on circumstances
    and what the specific LLD is actually doing.

    rq->{hard_cur|current}_nr_sectors represent the number of sectors in
    the contiguous data area at the front. This is mainly used by drivers
    which transfer data by walking the request segment by segment. This
    value always equals rq->bio->bi_size >> 9. However, the data length
    for pc requests may not be a multiple of 512 bytes, and using this
    field becomes a bit confusing.

    In general, having multiple fields to represent the same property
    leads only to confusion and subtle bugs. With recent block low level
    driver cleanups, no driver is accessing or manipulating these
    duplicate fields directly. Drop all the duplicates. Now rq->sector
    means the current sector, rq->data_len the current total length and
    rq->bio->bi_size the current segment length. Everything else is
    defined in terms of these three and available only through accessors.

    * blk_recalc_rq_sectors() is collapsed into blk_update_request() and
    now handles pc and fs requests equally, other than the rq->sector
    update. This means that pc requests can now use partial completion
    too (no in-kernel user yet, though).

    * bio_cur_sectors() is replaced with bio_cur_bytes() as block layer
    now uses byte count as the primary data length.

    * blk_rq_pos() is now guaranteed to always be correct. In-block users
    converted.

    * blk_rq_bytes() is now guaranteed to be always valid as is
    blk_rq_sectors(). In-block users converted.

    * blk_rq_sectors() is now guaranteed to equal blk_rq_bytes() >> 9.
    Whichever is more convenient is used.

    * blk_rq_bytes() and blk_rq_cur_bytes() are now inlined and take const
    pointer to request.

    [ Impact: API cleanup, single way to represent one property of a request ]

    Signed-off-by: Tejun Heo
    Cc: Boaz Harrosh
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With recent cleanups, there is no place where a low level driver
    directly manipulates request fields. This means that the 'hard'
    request fields always equal the !hard fields. Convert all
    rq->sector, nr_sectors and current_nr_sectors references to
    accessors.

    While at it, drop the superfluous blk_rq_pos() < 0 test in swim.c.

    [ Impact: use pos and nr_sectors accessors ]

    Signed-off-by: Tejun Heo
    Acked-by: Geert Uytterhoeven
    Tested-by: Grant Likely
    Acked-by: Grant Likely
    Tested-by: Adrian McMenamin
    Acked-by: Adrian McMenamin
    Acked-by: Mike Miller
    Cc: James Bottomley
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Borislav Petkov
    Cc: Sergei Shtylyov
    Cc: Eric Moore
    Cc: Alan Stern
    Cc: FUJITA Tomonori
    Cc: Pete Zaitcev
    Cc: Stephen Rothwell
    Cc: Paul Clements
    Cc: Tim Waugh
    Cc: Jeff Garzik
    Cc: Jeremy Fitzhardinge
    Cc: Alex Dubov
    Cc: David Woodhouse
    Cc: Martin Schwidefsky
    Cc: Dario Ballabio
    Cc: David S. Miller
    Cc: Rusty Russell
    Cc: unsik Kim
    Cc: Laurent Vivier
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Implement accessors - blk_rq_pos(), blk_rq_sectors() and
    blk_rq_cur_sectors() which return rq->hard_sector, rq->hard_nr_sectors
    and rq->hard_cur_sectors respectively and convert direct references of
    the said fields to the accessors.

    This is in preparation for the request data length handling cleanup.

    Geert : suggested adding const to struct request * parameter to accessors
    Sergei : spotted error in patch description

    [ Impact: cleanup ]

    Signed-off-by: Tejun Heo
    Acked-by: Geert Uytterhoeven
    Acked-by: Stephen Rothwell
    Tested-by: Grant Likely
    Acked-by: Grant Likely
    Acked-by: Sergei Shtylyov
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Borislav Petkov
    Cc: James Bottomley
    Signed-off-by: Jens Axboe

    Tejun Heo
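
    A hedged sketch of the three accessors described above, taking a const
    struct request * as per Geert's suggestion (the return types are
    illustrative). Note that, per the first commit of this day,
    blk_rq_sectors(rq) ends up guaranteed to equal blk_rq_bytes(rq) >> 9.

        static inline sector_t blk_rq_pos(const struct request *rq)
        {
                return rq->hard_sector;
        }

        static inline unsigned int blk_rq_sectors(const struct request *rq)
        {
                return rq->hard_nr_sectors;
        }

        static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
        {
                return rq->hard_cur_sectors;
        }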
     

28 Apr, 2009

1 commit

  • blk_start_queueing() is identical to __blk_run_queue() except that it
    doesn't check for recursion. None of the current users depends on
    blk_start_queueing() running request_fn directly. Replace usages of
    blk_start_queueing() with [__]blk_run_queue() and kill it.

    [ Impact: removal of mostly duplicate interface function ]

    Signed-off-by: Tejun Heo

    Tejun Heo
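
    An illustrative conversion, assuming a caller that already holds the
    queue lock (for unlocked call sites, blk_run_queue() is the variant
    that takes the lock itself):

        /* Before (interface removed by this commit): */
        blk_start_queueing(q);

        /* After, with q->queue_lock already held: */
        __blk_run_queue(q);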
     

24 Apr, 2009

3 commits

  • Currently we look up the prio_tree root from ->ioprio, but ->ioprio
    can change if either the process gets its IO priority changed
    explicitly, or if cfq decides to temporarily boost it. So if we are
    unlucky, we can end up attempting to remove a node from a different
    rbtree root than the one it was added to.

    Fix this by using ->org_ioprio as the prio_tree index, since that
    will only change for explicit IO priority settings (not for a boost).
    Additionally, cache the rbtree root inside the cfqq, so we don't have
    to add code to reinsert the cfqq in the prio_tree if the IO priority
    changes.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • cfq_prio_tree_lookup() should return the direct match, yet it always
    returns NULL. Fix that (an illustrative lookup follows this list).

    cfq_prio_tree_add() assumes that we don't get a direct match, while it
    is very possible that we do. Using O_DIRECT, you can have different
    cfqqs with matching requests, since you don't have the page cache to
    serialize things for you. Fix this bug by only adding the cfqq if
    there isn't an existing match.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Not strictly needed, but we should make it clear that we init the
    rbtree roots here.

    Signed-off-by: Jens Axboe

    Jens Axboe
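
    An illustrative rbtree lookup showing the contract fixed above; this
    is not cfq_prio_tree_lookup() verbatim, and the p_node / next_rq field
    names are assumptions modeled on cfq-iosched.c. The point is that an
    exact sector match is returned to the caller instead of falling
    through to NULL, so cfq_prio_tree_add() can skip the insert when a
    match already exists.

        static struct cfq_queue *
        prio_tree_lookup(struct rb_root *root, sector_t sector)
        {
                struct rb_node *n = root->rb_node;

                while (n) {
                        struct cfq_queue *cfqq =
                                rb_entry(n, struct cfq_queue, p_node);
                        sector_t pos = cfqq->next_rq->sector;

                        if (sector < pos)
                                n = n->rb_left;
                        else if (sector > pos)
                                n = n->rb_right;
                        else
                                return cfqq;    /* direct match */
                }
                return NULL;
        }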
     

22 Apr, 2009

2 commits


15 Apr, 2009

7 commits

  • If we have processes that are working in close proximity to each
    other on disk, we don't want to idle wait. Instead allow the close
    process to issue a request, getting better aggregate bandwidth.
    The anticipatory scheduler has similar checks; noop and deadline do
    not need them since they don't care about process io mappings.

    The code for CFQ is a little more involved though, since we split
    request queues into per-process contexts.

    This fixes a performance problem with e.g. dump(8), since it uses
    several processes in some silly attempt to speed IO up. Even if
    dump(8) isn't really a valid case (it should be fixed by using
    CLONE_IO), there are other cases where we see close processes
    and where idling ends up hurting performance.

    Credit goes to Jeff Moyer for writing the
    initial implementation.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Makes it easier to read the traces.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We only kick the dispatch for an idling queue if we think the request
    is (somewhat) fully merged. Also allow a kick if we have other
    busy queues in the system, since we don't want to risk waiting for
    a potential merge in that case. It's better to get some work done and
    proceed.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • It's called from the workqueue handlers from process context, so
    we always have irqs enabled when entered.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • "Zhang, Yanmin" reports that commit
    b029195dda0129b427c6e579a3bb3ae752da3a93 introduced a regression
    of about 50% with sequential threaded read workloads. The test
    case is:

    tiotest -k0 -k1 -k3 -f 80 -t 32

    which starts 32 threads, each reading an 80MB file. Twiddle the kick
    queue logic so that we do start IO immediately if the request appears
    to be fully merged. We can't really detect that, so just check whether
    the request is bigger than a page (see the sketch after this list).
    The assumption is that
    since single bio issues will first queue a single request with just
    one page attached and then later do merges on that, if we already
    have more than a page worth of data in the request, then the request
    is most likely good to go.

    Verified that this doesn't cause a regression with the test case that
    commit b029195dda0129b427c6e579a3bb3ae752da3a93 was fixing. It does not,
    we still see maximum sized requests for the queue-then-merge cases.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We can just use the block layer BLK_RW_SYNC/ASYNC defines now.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
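
    A minimal sketch of the size check from the tiotest regression fix
    above; the helper name is hypothetical, and the real logic sits inside
    CFQ's queue-kick path rather than standing alone.

        /* A request already carrying more than a page of data has most
         * likely been merged into, so it is worth dispatching immediately
         * instead of idling in the hope of further merges. */
        static bool rq_looks_fully_merged(struct request *rq)
        {
                return blk_rq_bytes(rq) > PAGE_SIZE;
        }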
     

07 Apr, 2009

3 commits

  • When CFQ is waiting for a new request from a process, currently it'll
    immediately restart queuing when it sees such a request. This doesn't
    work very well with streamed IO, since we then end up splitting IO
    that would otherwise have been merged nicely. For a simple dd test,
    this causes 10x as many requests to be issued as necessary.
    Normally this goes unnoticed due to the low overhead of requests
    at the device side, but some hardware is very sensitive to request
    sizes, and there it can cause big slowdowns.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We only manipulate the must_dispatch and queue_new flags; they are not
    tested anymore. So get rid of them.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The IO scheduler core calls into the IO scheduler dispatch_request hook
    to move requests from the IO scheduler and into the driver dispatch
    list. It only does so when the dispatch list is empty. CFQ moves several
    requests to the dispatch list, which can cause higher latencies if we
    suddenly have to switch to some important sync IO. Change the logic to
    move one request at a time instead.

    This should almost be functionally equivalent to what we did before,
    except that we now honor 'quantum' as the maximum queue depth at the
    device side from any single cfqq. If there's just a single active
    cfqq, we allow up to 4 times the normal quantum.

    Signed-off-by: Jens Axboe

    Jens Axboe
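
    A hedged sketch of the resulting dispatch-depth policy (field names
    such as cfq_quantum, busy_queues and dispatched are modeled on
    cfq-iosched.c; the helper itself is hypothetical):

        /* May this cfqq send another request to the dispatch list now? */
        static bool cfqq_may_dispatch(struct cfq_data *cfqd,
                                      struct cfq_queue *cfqq)
        {
                unsigned int max_dispatch = cfqd->cfq_quantum;

                /* A lone active queue may drive a deeper pipeline. */
                if (cfqd->busy_queues == 1)
                        max_dispatch *= 4;

                return cfqq->dispatched < max_dispatch;
        }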
     

06 Apr, 2009

1 commit

  • By default, CFQ will anticipate more IO from a given io context if the
    previously completed IO was sync. This used to be fine, since the only
    sync IO was reads and O_DIRECT writes. But with more "normal" sync writes
    being used now, we don't want to anticipate for those.

    Add a bio/request flag that informs the IO scheduler that this is a sync
    request that we should not idle for. Introduce WRITE_ODIRECT specifically
    for O_DIRECT writes, and make sure that the other sync writes set this
    flag.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

30 Jan, 2009

1 commit

  • This patch adds the ability to preempt an ongoing BE timeslice when an
    RT request is waiting for the current timeslice to complete. This
    reduces the wait time to disk for RT requests from an upper bound of 4
    disk requests (the current value of cfq_quantum) to 1.

    Applied Jens' suggested changes to avoid the rb lookup and use
    !cfq_class_rt(), and retested.

    Latency (secs) for the RT task when doing sequential reads from a 10G file:

                             | only RT | RT + BE | RT + BE + this patch
      small (512 byte) reads |   143   |   163   |         145
      large (1MB) reads      |   142   |   158   |         146

    Signed-off-by: Divyesh Shah
    Signed-off-by: Jens Axboe

    Divyesh Shah
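
    A minimal sketch of the class check, using cfq_class_rt() as referenced
    in the commit text; the surrounding preemption routine is illustrative
    only.

        /* An RT queue may preempt an ongoing best-effort (BE) slice, so an
         * RT request waits for at most one in-flight disk request instead
         * of a full BE timeslice. */
        static bool rt_should_preempt(struct cfq_queue *active_cfqq,
                                      struct cfq_queue *new_cfqq)
        {
                return cfq_class_rt(new_cfqq) && !cfq_class_rt(active_cfqq);
        }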
     

29 Dec, 2008

4 commits

  • Original patch from Nikanth Karthikesan

    When a queue exits, the queue lock is taken and cfq_exit_queue() frees
    all the cics associated with the queue.

    But when a task exits, cfq_exit_io_context() gets the cics one by one
    and then locks the associated queue to call
    __cfq_exit_single_io_context(). Between getting a cic from the ioc and
    locking the queue, the queue might have exited on another CPU.

    Fix this by rechecking the cfq_io_context queue key inside the queue
    lock, and not calling into __cfq_exit_single_io_context() if somebody
    beat us to it (a sketch of the recheck follows this list).

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This basically limits the hardware queue depth to 4*quantum at any
    point in time, which is 16 with the default settings. As CFQ uses
    other means to shrink the hardware queue when necessary in the first
    place, there's really no need for this extra heuristic. Additionally,
    it ends up hurting performance in some cases.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Just use struct elevator_queue everywhere instead.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • After many improvements to kblockd_flush_work, it is now identical to
    cancel_work_sync, so a direct call to cancel_work_sync is suggested.

    The only difference is that cancel_work_sync is a GPL-only symbol, so
    it is no longer available to non-GPL modules.

    Signed-off-by: Cheng Renquan
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Cheng Renquan
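
    Referring back to the cic/queue-exit race in the first commit above, a
    hedged sketch of the recheck under the queue lock (the wrapper name is
    hypothetical, and cic->key holding the cfq_data pointer is an
    assumption modeled on cfq-iosched.c):

        static void exit_cic_checked(struct cfq_data *cfqd,
                                     struct cfq_io_context *cic)
        {
                struct request_queue *q = cfqd->queue;
                unsigned long flags;

                spin_lock_irqsave(q->queue_lock, flags);
                /* The queue may have exited on another CPU between fetching
                 * the cic from the ioc and getting here; only proceed if
                 * the key still points at this queue. */
                if (cic->key == cfqd)
                        __cfq_exit_single_io_context(cfqd, cic);
                spin_unlock_irqrestore(q->queue_lock, flags);
        }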
     

09 Oct, 2008

4 commits

  • We really need to know about the hardware tagging support as well,
    since if the SSD does not do tagging then we still want to idle.
    Otherwise we have the same dependent sync IO vs flooding async IO
    problem as on rotational media.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We don't want to idle in AS/CFQ if the device doesn't have a seek
    penalty. So add a QUEUE_FLAG_NONROT flag to indicate a non-rotational
    device; low level drivers should set this flag upon discovery of an
    SSD or similar device type (an example call follows this list).

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • CFQ's detection of queueing devices assumes a non-queuing device and detects
    if the queue depth reaches a certain threshold. Under some workloads (e.g.
    synchronous reads), CFQ effectively forces a unit queue depth, thus defeating
    the detection logic. This leads to poor performance on queuing hardware,
    since the idle window remains enabled.

    This patch inverts the sense of the logic: assume a queuing-capable device,
    and detect if the depth does not exceed the threshold.

    Signed-off-by: Aaron Carroll
    Signed-off-by: Jens Axboe

    Aaron Carroll
     
  • Preparatory patch for checking queuing affinity.

    Signed-off-by: Jens Axboe

    Jens Axboe
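
    Referring back to the QUEUE_FLAG_NONROT commit above, a hedged example
    of how a low level driver might mark its queue; the setup function is
    hypothetical, and the flag helper follows the queue-flag API of that
    era.

        static void mydrv_setup_queue(struct request_queue *q)
        {
                /* No seek penalty (e.g. an SSD): AS/CFQ can skip idling,
                 * subject to the hardware tagging check described above. */
                queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
        }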
     

03 Jul, 2008

1 commit