12 May, 2009

1 commit

  • The current bio_vec array index out-of-bounds test within
    __end_that_request_first() does not seem correct.
    It checks bio->bi_idx against bio->bi_vcnt, but the subsequent code
    uses idx (which is bio->bi_idx + next_idx) as the index into the
    bio_vec array. This means that the test really makes sense only at
    the first iteration of the !(nr_bytes >= bio->bi_size) case (when
    next_idx == zero). Fix this by replacing bio->bi_idx with idx.
    (This patch applies to 2.6.30-rc4.)

    Signed-off-by: Kazuhisa Ichikawa
    Signed-off-by: Jens Axboe

    Kazuhisa Ichikawa
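
The difference between the two checks can be sketched as follows. This is only an illustration under stated assumptions: `toy_bio` is a hypothetical, cut-down stand-in for the kernel's struct bio, and the two helper names are invented for the example; they are not the code in __end_that_request_first().

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct bio; not the kernel type. */
struct toy_bio {
	unsigned short bi_idx;   /* index of the first unprocessed bio_vec */
	unsigned short bi_vcnt;  /* number of bio_vecs in the array */
};

/* Old-style test: looks only at bi_idx and ignores how far next_idx advanced. */
static bool old_index_out_of_bounds(const struct toy_bio *bio, int next_idx)
{
	(void)next_idx;  /* deliberately unused, which is the problem */
	return bio->bi_idx >= bio->bi_vcnt;
}

/* Fixed test: checks the index that is actually used to address the array. */
static bool new_index_out_of_bounds(const struct toy_bio *bio, int next_idx)
{
	int idx = bio->bi_idx + next_idx;

	return idx >= bio->bi_vcnt;
}
```

With bi_idx = 1, bi_vcnt = 3 and next_idx = 2, the array would be indexed at 3, one past the end. The old-style test does not notice this, while the idx-based test does.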
     

24 Apr, 2009

5 commits

  • Currently we look it up from ->ioprio, but ->ioprio can change if
    either the process gets its IO priority changed explicitly, or if
    cfq decides to temporarily boost it. So if we are unlucky, we can
    end up attempting to remove a node from a different rbtree root than
    where it was added.

    Fix this by using ->org_ioprio as the prio_tree index, since that
    will only change for explicit IO priority settings (not for a boost).
    Additionally cache the rbtree root inside the cfqq, then we don't have
    to add code to reinsert the cfqq in the prio_tree if IO priority changes.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • cfq_prio_tree_lookup() should return the direct match, yet it always
    returns zero. Fix that.

    cfq_prio_tree_add() assumes that we don't get a direct match, while
    it is very possible that we do. Using O_DIRECT, you can have different
    cfqq with matching requests, since you don't have the page cache
    to serialize things for you. Fix this bug by only adding the cfqq if
    there isn't an existing match.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Not strictly needed, but we should make it clear that we init the
    rbtree roots here.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Very rarely under stress testing of dm, oopses occur because
    something tampers with an old stack frame. This has been traced back
    to blk_abort_queue() leaving a timeout_list pointing to the stack.
    The reason is that sometimes blk_abort_request() won't delete the
    timer (if the request is marked as complete but the timer has not
    yet been removed, a small race window). Fix this by splicing back
    from the usually empty list to the q->timeout_list.

    Signed-off-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     
  • This simplifies I/O stat accounting switching code and separates it
    completely from I/O scheduler switch code.

    Requests are accounted according to the state of their request queue
    at the time of the request allocation. There is no need anymore to
    flush the request queue when switching I/O accounting state.

    Signed-off-by: Jerome Marchand
    Signed-off-by: Jens Axboe

    Jerome Marchand
     

22 Apr, 2009

6 commits

  • If the cfq io context doesn't have enough samples yet to provide a mean
    seek distance, then use the default threshold we have for seeky IO instead
    of defaulting to 0.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
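
A rough sketch of the fallback described above. The constants and the function name are hypothetical and chosen only for illustration; the commit message does not state the sample count or the default threshold CFQ actually uses.

```c
/* Hypothetical values for illustration; not taken from the commit. */
#define TOY_MIN_SAMPLES             80U
#define TOY_DEFAULT_SEEK_THRESHOLD  8192ULL  /* in sectors */

/*
 * Return the seek distance to compare against: the measured mean once
 * enough samples exist, otherwise the default "seeky" threshold rather
 * than 0.
 */
static unsigned long long toy_seek_threshold(unsigned int samples,
					     unsigned long long seek_mean)
{
	if (samples <= TOY_MIN_SAMPLES)
		return TOY_DEFAULT_SEEK_THRESHOLD;

	return seek_mean;
}
```

The point of the design is that a freshly created io context is not treated as having a mean seek distance of zero just because nothing has been measured yet.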
     
  • Right now, depending on the first sector to which a process issues I/O,
    the seek time may start out way out of whack. So make sure we start
    with 0 sectors in seek, instead of the offset of the first request
    issued.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     
  • There's nothing to do for those devices, since the timeout handling is
    based on requests.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • /proc/diskstats used to show stats for all disks whether they're
    zero-sized or not and their non-zero partitions. Commit
    074a7aca7afa6f230104e8e65eba3420263714a5 accidentally changed the
    behavior such that it doesn't print out zero sized disks. This patch
    implements DISK_PITER_INCL_EMPTY_PART0 flag to partition iterator and
    uses it in diskstats_show() such that empty part0 is shown in
    /proc/diskstats.

    Reported and bisected by Daniel Collins.

    Signed-off-by: Tejun Heo
    Reported-by: Daniel Collins
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Impact: don't set GFP_DMA in q->bounce_gfp unnecessarily

    All DMA address limits are expressed in terms of the last addressable
    unit (byte or page) instead of one plus that. However, when
    determining bounce_gfp for 64bit machines in blk_queue_bounce_limit(),
    it compares the specified limit against 0x100000000UL to determine
    whether it's below 4G, and so ends up falsely setting GFP_DMA in
    q->bounce_gfp.

    As DMA zone is very small on x86_64, this makes larger SG_IO transfers
    very eager to trigger OOM killer. Fix it. While at it, rename the
    parameter to @dma_mask for clarity and convert comment to proper
    winged style.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
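
The off-by-one can be illustrated with a small sketch. It is not the code from blk_queue_bounce_limit(): the helper names and `TOY_PAGE_SHIFT` (assuming 4 KiB pages) are made up for the example, and only the comparison against the 4G boundary is modelled.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Old-style comparison: treats the 4G limit as "one past the last byte". */
static bool old_needs_gfp_dma(uint64_t dma_mask)
{
	return (dma_mask >> TOY_PAGE_SHIFT) <
	       (UINT64_C(0x100000000) >> TOY_PAGE_SHIFT);
}

/* Corrected comparison: the limit is the last addressable byte. */
static bool new_needs_gfp_dma(uint64_t dma_mask)
{
	return (dma_mask >> TOY_PAGE_SHIFT) <
	       (UINT64_C(0xffffffff) >> TOY_PAGE_SHIFT);
}
```

A device that can address the full low 4G passes a mask of 0xffffffff. The old-style comparison classifies it as "below 4G" and asks for GFP_DMA; the corrected comparison does not, while a genuinely restricted mask such as 0x00ffffff still does.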
     
  • Impact: fix SG_IO behavior such that it matches the documentation

    SG_IO howto says that if ->dxfer_len and the sum of the iovec lengths
    disagree, the shorter one wins. However, the current implementation
    returns -EINVAL for such cases. Trim the iovec if it's longer than
    ->dxfer_len.

    This patch uses iov_*() helpers which take struct iovec * by casting
    struct sg_iovec * to it. sg_iovec is always identical to iovec and
    this will be further cleaned up with later patches.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
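
Trimming an iovec array to a byte limit can be sketched in user space as below. `trim_iovec` is a hypothetical helper written for this illustration; it is not the kernel's iov_*() helper and does not reproduce the patch.

```c
#include <stddef.h>
#include <sys/uio.h>

/*
 * Shorten an iovec array in place so that it describes at most max_len
 * bytes. Returns the number of segments still in use.
 */
static int trim_iovec(struct iovec *iov, int count, size_t max_len)
{
	size_t remaining = max_len;
	int used = 0;

	while (used < count && remaining > 0) {
		/* Cut the segment that crosses the limit. */
		if (iov[used].iov_len > remaining)
			iov[used].iov_len = remaining;
		remaining -= iov[used].iov_len;
		used++;
	}
	return used;
}
```

For two 8-byte segments and a limit of 10 bytes, both segments stay in use and the second one is cut to 2 bytes; with a limit of 8 bytes only the first segment remains.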
     

15 Apr, 2009

11 commits


08 Apr, 2009

2 commits

  • …nel/git/tip/linux-2.6-tip

    * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    branch tracer, intel-iommu: fix build with CONFIG_BRANCH_TRACER=y
    branch tracer: Fix for enabling branch profiling makes sparse unusable
    ftrace: Correct a text align for event format output
    Update /debug/tracing/README
    tracing/ftrace: alloc the started cpumask for the trace file
    tracing, x86: remove duplicated #include
    ftrace: Add check of sched_stopped for probe_sched_wakeup
    function-graph: add proper initialization for init task
    tracing/ftrace: fix missing include string.h
    tracing: fix incorrect return type of ns2usecs()
    tracing: remove CALLER_ADDR2 from wakeup tracer
    blktrace: fix pdu_len when tracing packet command requests
    blktrace: small cleanup in blk_msg_write()
    blktrace: NUL-terminate user space messages
    tracing: move scripts/trace/power.pl to scripts/tracing/power.pl

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    loop: mutex already unlocked in loop_clr_fd()
    cfq-iosched: don't let idling interfere with plugging
    block: remove unused REQ_UNPLUG
    cfq-iosched: kill two unused cfqq flags
    cfq-iosched: change dispatch logic to deal with single requests at the time
    mflash: initial support
    cciss: change to discover first memory BAR
    cciss: kernel scan thread for MSA2012
    cciss: fix residual count for block pc requests
    block: fix inconsistency in I/O stat accounting code
    block: elevator quiescing helpers

    Linus Torvalds
     

07 Apr, 2009

8 commits

  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
    sata_mv: shorten register names
    sata_mv: workaround errata SATA#13
    sata_mv: cosmetic renames
    sata_mv: workaround errata SATA#26
    sata_mv: workaround errata PCI#7
    sata_mv: replace 0x1f with ATA_PIO4 (v2)
    sata_mv: fix irq mask races
    sata_mv: revert SoC irq breakage
    libata: ahci enclosure management bios workaround
    ata: Add TRIM infrastructure
    ata_piix: VGN-BX297XP wants the controller power up on suspend
    libata: Remove some redundant casts from pata_octeon_cf.c
    pata_artop: typo

    Linus Torvalds
     
  • When CFQ is waiting for a new request from a process, currently it'll
    immediately restart queuing when it sees such a request. This doesn't
    work very well with streamed IO, since we then end up splitting IO
    that would otherwise have been merged nicely. For a simple dd test,
    this causes 10x as many requests to be issued as we should have.
    Normally this goes unnoticed due to the low overhead of requests
    at the device side, but some hardware is very sensitive to request
    sizes and there it can cause big slow downs.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The request inherits the unplug flag from the bio, but it isn't actually
    used. The bio flag stops at __make_request(), which tells it to unplug
    after submission. Passing it on to the request doesn't make any sense.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We only manipulate the must_dispatch and queue_new flags, they are not
    tested anymore. So get rid of them.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The IO scheduler core calls into the IO scheduler dispatch_request hook
    to move requests from the IO scheduler and into the driver dispatch
    list. It only does so when the dispatch list is empty. CFQ moves several
    requests to the dispatch list, which can cause higher latencies if we
    suddenly have to switch to some important sync IO. Change the logic to
    move one request at a time instead.

    This should almost be functionally equivalent to what we did before,
    except that we now honor 'quantum' as the maximum queue depth at the
    device side from any single cfqq. If there's just a single active
    cfqq, we allow up to 4 times the normal quantum.

    Signed-off-by: Jens Axboe

    Jens Axboe
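
The depth limit described above can be sketched as a single predicate. The function and parameter names are hypothetical, and the real CFQ dispatch path considers more conditions than this; only the 'quantum' rule and the 4x allowance for a single active cfqq are modelled.

```c
#include <stdbool.h>

/*
 * Decide whether one more request from a queue may be moved to the
 * dispatch list.
 *   dispatched  - requests from this queue already at the device side
 *   quantum     - normal per-queue depth limit
 *   busy_queues - number of currently active queues
 */
static bool toy_may_dispatch(int dispatched, int quantum, int busy_queues)
{
	int max_dispatch = quantum;

	/* A single active queue is allowed up to 4 times the normal quantum. */
	if (busy_queues == 1)
		max_dispatch = 4 * quantum;

	return dispatched < max_dispatch;
}
```

Because the check is made once per request, an important sync request arriving in the meantime waits behind at most one freshly dispatched request instead of a whole batch.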
     
  • This forces in_flight to be zero when turning off or on the I/O stat
    accounting and stops updating I/O stats in attempt_merge() when
    accounting is turned off.

    Signed-off-by: Jerome Marchand
    Signed-off-by: Jens Axboe

    Jerome Marchand
     
  • Simple helper functions to quiesce the request queue. These are
    currently only used for switching IO schedulers on-the-fly, but
    we can use them to properly switch IO accounting on and off as well.

    Signed-off-by: Jerome Marchand
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Fix a typo (this was in the original patch but, for some reason, was
    not merged when the code fixes were)

    Signed-off-by: Alan Cox
    Signed-off-by: Jeff Garzik

    Alan Cox
     

06 Apr, 2009

5 commits

  • By default, CFQ will anticipate more IO from a given io context if the
    previously completed IO was sync. This used to be fine, since the only
    sync IO was reads and O_DIRECT writes. But with more "normal" sync writes
    being used now, we don't want to anticipate for those.

    Add a bio/request flag that informs the IO scheduler that this is a sync
    request that we should not idle for. Introduce WRITE_ODIRECT specifically
    for O_DIRECT writes, and make sure that the other sync writes set this
    flag.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     
  • For the older SSD devices that don't do command queuing, we do want to
    enable plugging to get better merging.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     
  • This makes sure that we never wait on async IO for sync requests, instead
    of doing the split on writes vs reads.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     
  • * 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (413 commits)
    tracing, net: fix net tree and tracing tree merge interaction
    tracing, powerpc: fix powerpc tree and tracing tree interaction
    ring-buffer: do not remove reader page from list on ring buffer free
    function-graph: allow unregistering twice
    trace: make argument 'mem' of trace_seq_putmem() const
    tracing: add missing 'extern' keywords to trace_output.h
    tracing: provide trace_seq_reserve()
    blktrace: print out BLK_TN_MESSAGE properly
    blktrace: extract duplidate code
    blktrace: fix memory leak when freeing struct blk_io_trace
    blktrace: fix blk_probes_ref chaos
    blktrace: make classic output more classic
    blktrace: fix off-by-one bug
    blktrace: fix the original blktrace
    blktrace: fix a race when creating blk_tree_root in debugfs
    blktrace: fix timestamp in binary output
    tracing, Text Edit Lock: cleanup
    tracing: filter fix for TRACE_EVENT_FORMAT events
    ftrace: Using FTRACE_WARN_ON() to check "freed record" in ftrace_release()
    x86: kretprobe-booster interrupt emulation code fix
    ...

    Fix up trivial conflicts in
    arch/parisc/include/asm/ftrace.h
    include/linux/memory.h
    kernel/extable.c
    kernel/module.c

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask: (36 commits)
    cpumask: remove cpumask allocation from idle_balance, fix
    numa, cpumask: move numa_node_id default implementation to topology.h, fix
    cpumask: remove cpumask allocation from idle_balance
    x86: cpumask: x86 mmio-mod.c use cpumask_var_t for downed_cpus
    x86: cpumask: update 32-bit APM not to mug current->cpus_allowed
    x86: microcode: cleanup
    x86: cpumask: use work_on_cpu in arch/x86/kernel/microcode_core.c
    cpumask: fix CONFIG_CPUMASK_OFFSTACK=y cpu hotunplug crash
    numa, cpumask: move numa_node_id default implementation to topology.h
    cpumask: convert node_to_cpumask_map[] to cpumask_var_t
    cpumask: remove x86 cpumask_t uses.
    cpumask: use cpumask_var_t in uv_flush_tlb_others.
    cpumask: remove cpumask_t assignment from vector_allocation_domain()
    cpumask: make Xen use the new operators.
    cpumask: clean up summit's send_IPI functions
    cpumask: use new cpumask functions throughout x86
    x86: unify cpu_callin_mask/cpu_callout_mask/cpu_initialized_mask/cpu_sibling_setup_mask
    cpumask: convert struct cpuinfo_x86's llc_shared_map to cpumask_var_t
    cpumask: convert node_to_cpumask_map[] to cpumask_var_t
    x86: unify 32 and 64-bit node_to_cpumask_map
    ...

    Linus Torvalds
     

04 Apr, 2009

1 commit


03 Apr, 2009

1 commit

  • Impact: output all of packet commands - not just the first 4 / 8 bytes

    Since commit d7e3c3249ef23b4617393c69fe464765b4ff1645 ("block: add
    large command support"), struct request->cmd has been changed from
    unsigned char cmd[BLK_MAX_CDB] to unsigned char *cmd.

    v1 -> v2: by: FUJITA Tomonori

    - make sure rq->cmd_len is always initialized, and then we can use
    rq->cmd_len instead of BLK_MAX_CDB.

    Signed-off-by: Li Zefan
    Acked-by: FUJITA Tomonori
    Cc: Arnaldo Carvalho de Melo
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Jens Axboe
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
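
Why the array-to-pointer change would limit the output to 4 / 8 bytes can be shown with a short sketch. It assumes the length was derived with sizeof on the cmd member, which the commit message implies but does not state; the struct and function names, and the value of `TOY_MAX_CDB`, are hypothetical.

```c
#include <stddef.h>

#define TOY_MAX_CDB 16  /* assumption standing in for BLK_MAX_CDB */

/* Layout before the change: the command bytes live inside the struct. */
struct toy_rq_old {
	unsigned char cmd[TOY_MAX_CDB];
};

/* Layout after the change: only a pointer, plus an explicit length. */
struct toy_rq_new {
	unsigned char *cmd;
	unsigned short cmd_len;
};

/* sizeof on the array member yields the full command size. */
static size_t old_layout_size(void)
{
	return sizeof(((struct toy_rq_old *)0)->cmd);
}

/* sizeof on the pointer member yields the pointer size (4 or 8 bytes). */
static size_t new_layout_size(void)
{
	return sizeof(((struct toy_rq_new *)0)->cmd);
}
```

With the pointer layout the only reliable length is the explicit one, which is why the message above relies on rq->cmd_len always being initialized.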