04 Nov, 2009

2 commits

  • CFQ has an optimization for cooperating applications: if several
    io-contexts issue close requests, they get a boost. But the
    optimization can be abused. Consider threads a and b working on the
    same file: a reads sectors s, s+2, s+4, ...; b reads sectors s+1,
    s+3, s+5, ... Both look like sequential readers, so both can open
    an idle window. a reads sector s, enters its idle window, and wakes
    up b. b wants sector s+1, and since the current implementation of
    cfq_should_preempt() considers a and b cooperators, b preempts a.
    b then reads sector s+1, enters its idle window, and wakes up a.
    For the same reason, a preempts b and reads s+2. a and b continue
    this cycle indefinitely, occupying the whole disk queue, so other
    applications get almost no chance to run.

    Fix this by limiting coop preemption until a queue has been
    scheduled normally again.

    Signed-off-by: Shaohua Li
    Acked-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Commit a6151c3a5c8e1ff5a28450bc8d6a99a2a0add0a7 inadvertently reversed
    a preempt condition check, potentially causing a performance regression.
    Make the meta check correct again.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Oct, 2009

1 commit

  • With 2.6.32-rc5 in a KVM guest using dm and virtio_blk, we see the
    following errors:

    end_request: I/O error, dev vda, sector 0
    end_request: I/O error, dev vda, sector 0

    The errors go away if dm stops submitting empty barriers, by reverting:

    commit 52b1fd5a27c625c78373e024bf570af3c9d44a79
    Author: Mikulas Patocka
    dm: send empty barriers to targets in dm_flush

    We should silently error all barriers, even empty barriers, on devices
    like virtio_blk which don't support them.

    See also:

    https://bugzilla.redhat.com/514901

    Signed-off-by: Mark McLoughlin
    Signed-off-by: Mike Snitzer
    Acked-by: Alasdair G Kergon
    Acked-by: Mikulas Patocka
    Cc: Rusty Russell
    Cc: Neil Brown
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Mark McLoughlin
     

12 Oct, 2009

1 commit


09 Oct, 2009

1 commit

  • elv_iosched_store() ignores the return value of strstrip(), which
    leads to subtly inconsistent behavior.

    This patch fixes it.


    Before the patch:
    ====================================
    # cd /sys/block/{blockdev}/queue

    case1:
    # echo "anticipatory" > scheduler
    # cat scheduler
    noop [anticipatory] deadline cfq

    case2:
    # echo "anticipatory " > scheduler
    # cat scheduler
    noop [anticipatory] deadline cfq

    case3:
    # echo " anticipatory" > scheduler
    bash: echo: write error: Invalid argument


    After the patch:
    ====================================
    # cd /sys/block/{blockdev}/queue

    case1:
    # echo "anticipatory" > scheduler
    # cat scheduler
    noop [anticipatory] deadline cfq

    case2:
    # echo "anticipatory " > scheduler
    # cat scheduler
    noop [anticipatory] deadline cfq

    case3:
    # echo " anticipatory" > scheduler
    noop [anticipatory] deadline cfq

    Cc: Li Zefan
    Cc: Jens Axboe
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Jens Axboe

    KOSAKI Motohiro
     

08 Oct, 2009

3 commits


07 Oct, 2009

4 commits

  • We should subtract the slice residual from the rb tree key, since
    a negative residual count indicates that the cfqq overran its slice
    the last time. Hence we want to add the overrun time, to position
    it a bit further away in the service tree.

    Reported-by: Corrado Zoccolo
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This makes the whole thing easier to read; cfq_dispatch_requests()
    was a bit messy before.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Makes it easier to read than the 0.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Commit a9327cac440be4d8333bba975cbbf76045096275 added separate read
    and write statistics of in_flight requests, and exported the number
    of read and write requests in progress separately through sysfs.

    But Corrado Zoccolo reported getting strange
    output from "iostat -kx 2". Global values for service time and
    utilization were garbage. For interval values, utilization was always
    100%, and service time is higher than normal.

    So this was reverted by commit 0f78ab9899e9d6acb09d5465def618704255963b

    The problem was in part_round_stats_single(); I missed the following:

            if (now == part->stamp)
                    return;

    -       if (part->in_flight) {
    +       if (part_in_flight(part)) {
                    __part_stat_add(cpu, part, time_in_queue,
                                    part_in_flight(part) * (now - part->stamp));
                    __part_stat_add(cpu, part, io_ticks, (now - part->stamp));

    With this chunk included, the reported regression gets fixed.

    Signed-off-by: Nikanth Karthikesan

    Signed-off-by: Jens Axboe

    Nikanth Karthikesan
     

05 Oct, 2009

5 commits

  • It was briefly introduced to allow CFQ to do delayed scheduling,
    but we ended up removing that feature again. So let's kill the
    function and export, and just switch CFQ back to the normal work
    schedule, since it is now passing in a '0' delay from all call
    sites.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The RR service tree is indexed by a key that is relative to current jiffies.
    This can cause problems on jiffies wraparound.

    The patch fixes it using time_before comparison, and changing
    the add_front path to use a relative number, too.

    Signed-off-by: Corrado Zoccolo
    Signed-off-by: Jens Axboe

    Corrado Zoccolo
     
  • cfq uses rq->start_time as the fifo indicator, but that field may
    get modified prior to cfq doing its fifo list adjustment when
    a request gets merged with another request. This can cause the
    fifo list to become unordered.

    Reported-by: Corrado Zoccolo
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This reverts commit a9327cac440be4d8333bba975cbbf76045096275.

    Corrado Zoccolo reports:

    "with 2.6.32-rc1 I started getting the following strange output from
    "iostat -kx 2":
    Linux 2.6.31bisect (et2) 04/10/2009 _i686_ (2 CPU)

    avg-cpu: %user %nice %system %iowait %steal %idle
    10,70 0,00 3,16 15,75 0,00 70,38

    Device:  rrqm/s  wrqm/s    r/s   w/s    rkB/s  wkB/s avgrq-sz avgqu-sz await       svctm      %util
    sda       18,22    0,00   0,67  0,01    14,77   0,02    43,94     0,01 10,53 39043915,03 2629219,87
    sdb       60,89    9,68  50,79  3,04  1724,43  50,52    65,95     0,70 13,06   488437,47 2629219,87

    avg-cpu: %user %nice %system %iowait %steal %idle
    2,72 0,00 0,74 0,00 0,00 96,53

    Device:  rrqm/s  wrqm/s    r/s   w/s    rkB/s  wkB/s avgrq-sz avgqu-sz await  svctm  %util
    sda        0,00    0,00   0,00  0,00     0,00   0,00     0,00     0,00  0,00   0,00 100,00
    sdb        0,00    0,00   0,00  0,00     0,00   0,00     0,00     0,00  0,00   0,00 100,00

    avg-cpu: %user %nice %system %iowait %steal %idle
    6,68 0,00 0,99 0,00 0,00 92,33

    Device:  rrqm/s  wrqm/s    r/s   w/s    rkB/s  wkB/s avgrq-sz avgqu-sz await  svctm  %util
    sda        0,00    0,00   0,00  0,00     0,00   0,00     0,00     0,00  0,00   0,00 100,00
    sdb        0,00    0,00   0,00  0,00     0,00   0,00     0,00     0,00  0,00   0,00 100,00

    avg-cpu: %user %nice %system %iowait %steal %idle
    4,40 0,00 0,73 1,47 0,00 93,40

    Device:  rrqm/s  wrqm/s    r/s   w/s    rkB/s  wkB/s avgrq-sz avgqu-sz await  svctm  %util
    sda        0,00    0,00   0,00  0,00     0,00   0,00     0,00     0,00  0,00   0,00 100,00
    sdb        0,00    4,00   0,00  3,00     0,00  28,00    18,67     0,06 19,50 333,33 100,00

    Global values for service time and utilization are garbage. For
    interval values, utilization is always 100%, and service time is
    higher than normal.

    I bisected it down to:
    [a9327cac440be4d8333bba975cbbf76045096275] Seperate read and write
    statistics of in_flight requests
    and verified that reverting just that commit indeed solves the issue
    on 2.6.32-rc1."

    So until this is debugged, revert the bad commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We cannot delay for the first dispatch of the async queue if it
    hasn't dispatched at all, since that could present a local user
    DoS attack vector using an app that just did slow timed sync reads
    while filling memory.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

04 Oct, 2009

3 commits


03 Oct, 2009

3 commits

  • This slowly ramps up the async queue depth based on the time
    passed since the sync IO, and doesn't allow async at all until
    a sync slice period has passed.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Do not allow more than max_dispatch requests from an async queue if
    some sync request has finished recently. This is in the hope that
    sync activity is still going on in the system and we might receive
    a sync request soon, most likely from a sync queue that finished a
    request and on which we did not enable idling.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • This is basically identical to what Vivek Goyal posted, but combined
    into one and labelled 'desktop' instead of 'fairness'. The goal
    is to continue to improve on the latency side of things as it relates
    to interactiveness, keeping the questionable bits under this sysfs
    tunable so it would be easy for throughput-only people to turn off.

    Apart from adding the interactive sysfs knob, it also adds the
    behavioural change of allowing slice idling even if the hardware
    does tagged command queuing.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Oct, 2009

6 commits

  • Since 2.6.31 now has request-based device-mapper, it's useful to have
    a tracepoint for request-remapping as well as bio-remapping.
    This patch adds a tracepoint for request-remapping, trace_block_rq_remap().

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: Alasdair G Kergon
    Cc: Li Zefan
    Signed-off-by: Jens Axboe

    Jun'ichi Nomura
     
  • Currently we set the bio size to the byte equivalent of the blocks to
    be trimmed when submitting the initial DISCARD ioctl. That means it
    is subject to the max_hw_sectors limitation of the HBA which is
    much lower than the size of a DISCARD request we can support.
    Add a separate max_discard_sectors tunable to limit the size for discard
    requests.

    We limit the max discard request size in bytes to 32 bits, as that
    is the limit for bio->bi_size. This could be much larger if we had
    a way to pass that information through the block layer.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • prepare_discard_fn() was being called in a place where memory allocation
    was effectively impossible. This makes it inappropriate for all but
    the most trivial translations of Linux's DISCARD operation to the block
    command set. Additionally adding a payload there makes the ownership
    of the bio backing unclear as it's now allocated by the device driver
    and not the submitter as usual.

    It is replaced with QUEUE_FLAG_DISCARD which is used to indicate whether
    the queue supports discard operations or not. blkdev_issue_discard now
    allocates a one-page, sector-length payload which is the right thing
    for the common ATA and SCSI implementations.

    The mtd implementation of prepare_discard_fn() is replaced with simply
    checking for the request being a discard.

    Largely based on a previous patch from Matthew Wilcox, which handled
    prepare_discard_fn but not the different payload allocation yet.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Add the missing blk_trace_remove_sysfs to pair with
    blk_trace_init_sysfs, introduced in commit
    1d54ad6da9192fed5dd3b60224d9f2dfea0dcd82. Also release the kobject
    in case request_fn is NULL.

    The problem was noticed via a kmemleak backtrace when some sysfs
    entries were not properly destroyed during device removal:

    unreferenced object 0xffff88001aa76640 (size 80):
    comm "lvcreate", pid 2120, jiffies 4294885144
    hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 f0 65 a7 1a 00 88 ff ff .........e......
    90 66 a7 1a 00 88 ff ff 86 1d 53 81 ff ff ff ff .f........S.....
    backtrace:
    [] kmemleak_alloc+0x26/0x60
    [] kmem_cache_alloc+0x133/0x1c0
    [] sysfs_new_dirent+0x41/0x120
    [] sysfs_add_file_mode+0x3c/0xb0
    [] internal_create_group+0xc1/0x1a0
    [] sysfs_create_group+0x13/0x20
    [] blk_trace_init_sysfs+0x14/0x20
    [] blk_register_queue+0x3c/0xf0
    [] add_disk+0x94/0x160
    [] dm_create+0x598/0x6e0 [dm_mod]
    [] dev_create+0x51/0x350 [dm_mod]
    [] ctl_ioctl+0x1a3/0x240 [dm_mod]
    [] dm_compat_ctl_ioctl+0x12/0x20 [dm_mod]
    [] compat_sys_ioctl+0xcd/0x4f0
    [] sysenter_dispatch+0x7/0x2c
    [] 0xffffffffffffffff

    Signed-off-by: Zdenek Kabelac
    Reviewed-by: Li Zefan
    Signed-off-by: Jens Axboe

    Zdenek Kabelac
     
  • Stacking devices do not have an inherent max_hw_sector limit. Set the
    default to INT_MAX so we are bounded only by capabilities of the
    underlying storage.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The topology changes unintentionally caused SAFE_MAX_SECTORS to be set
    for stacking devices. Set the default limit to BLK_DEF_MAX_SECTORS and
    provide SAFE_MAX_SECTORS in blk_queue_make_request() for legacy hw
    drivers that depend on the old behavior.

    Acked-by: Mike Snitzer
    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

20 Sep, 2009

1 commit

  • This allows subsystems to provide devtmpfs with non-default permissions
    for the device node. Instead of the default mode of 0600, null, zero,
    random, urandom, full, tty, ptmx now have a mode of 0666, which allows
    non-privileged processes to access standard device nodes in case no
    other userspace process applies the expected permissions.

    This also fixes a wrong assignment in pktcdvd and a checkpatch.pl
    complaint.

    Signed-off-by: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     

16 Sep, 2009

3 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
    Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev
    debugfs: Modify default debugfs directory for debugging pktcdvd.
    debugfs: Modified default dir of debugfs for debugging UHCI.
    debugfs: Change debugfs directory of IWMC3200
    debugfs: Change debugfs directory of trace-events-sample.h
    debugfs: Fix mount directory of debugfs by default in events.txt
    hpilo: add poll f_op
    hpilo: add interrupt handler
    hpilo: staging for interrupt handling
    driver core: platform_device_add_data(): use kmemdup()
    Driver core: Add support for compatibility classes
    uio: add generic driver for PCI 2.3 devices
    driver-core: move dma-coherent.c from kernel to driver/base
    mem_class: fix bug
    mem_class: use minor as index instead of searching the array
    driver model: constify attribute groups
    UIO: remove 'default n' from Kconfig
    Driver core: Add accessor for device platform data
    Driver core: move dev_get/set_drvdata to drivers/base/dd.c
    Driver core: add new device to bus's list before probing

    Linus Torvalds
     
  • Let attribute group vectors be declared "const". We'd
    like to let most attribute metadata live in read-only
    sections... this is a start.

    Signed-off-by: David Brownell
    Signed-off-by: Greg Kroah-Hartman

    David Brownell
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
    powerpc64: convert to dynamic percpu allocator
    sparc64: use embedding percpu first chunk allocator
    percpu: kill lpage first chunk allocator
    x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
    percpu: update embedding first chunk allocator to handle sparse units
    percpu: use group information to allocate vmap areas sparsely
    vmalloc: implement pcpu_get_vm_areas()
    vmalloc: separate out insert_vmalloc_vm()
    percpu: add chunk->base_addr
    percpu: add pcpu_unit_offsets[]
    percpu: introduce pcpu_alloc_info and pcpu_group_info
    percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
    percpu: add @align to pcpu_fc_alloc_fn_t
    percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
    percpu: drop @static_size from first chunk allocators
    percpu: generalize first chunk allocator selection
    percpu: build first chunk allocators selectively
    percpu: rename 4k first chunk allocator to page
    percpu: improve boot messages
    percpu: fix pcpu_reclaim() locking
    ...

    Fix trivial conflict in kernel/sched.c as per Tejun Heo.

    Linus Torvalds
     

15 Sep, 2009

1 commit

  • * 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block: (29 commits)
    block: use blkdev_issue_discard in blk_ioctl_discard
    Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads
    block: don't assume device has a request list backing in nr_requests store
    block: Optimal I/O limit wrapper
    cfq: choose a new next_req when a request is dispatched
    Seperate read and write statistics of in_flight requests
    aoe: end barrier bios with EOPNOTSUPP
    block: trace bio queueing trial only when it occurs
    block: enable rq CPU completion affinity by default
    cfq: fix the log message after dispatched a request
    block: use printk_once
    cciss: memory leak in cciss_init_one()
    splice: update mtime and atime on files
    block: make blk_iopoll_prep_sched() follow normal 0/1 return convention
    cfq-iosched: get rid of must_alloc flag
    block: use interrupts disabled version of raise_softirq_irqoff()
    block: fix comment in blk-iopoll.c
    block: adjust default budget for blk-iopoll
    block: fix long lines in block/blk-iopoll.c
    block: add blk-iopoll, a NAPI like approach for block devices
    ...

    Linus Torvalds
     

14 Sep, 2009

5 commits

  • blk_ioctl_discard duplicates large amounts of code from
    blkdev_issue_discard; the only difference between the two is that
    blkdev_issue_discard needs to send a barrier discard request and
    blk_ioctl_discard a non-barrier one, and blk_ioctl_discard needs to
    wait on the request. To facilitate this, add a flags argument to
    blkdev_issue_discard to control both aspects of the behaviour. This
    will be very useful later on for using the waiting functionality
    for other callers.

    Based on an earlier patch from Matthew Wilcox.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Stacked devices do not have a request list backing nr_requests.
    For now, just error out with -EINVAL. Later we could make the limit
    apply to stacked devices too, for throttling reasons.

    This fixes

    5a54cd13353bb3b88887604e2c980aa01e314309

    and should go into 2.6.31 stable as well.

    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Implement blk_limits_io_opt() and make blk_queue_io_opt() a wrapper
    around it. DM needs this to avoid poking at the queue_limits directly.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • This patch addresses http://bugzilla.kernel.org/show_bug.cgi?id=13401, a
    regression introduced in 2.6.30.

    From the bug report:

    Signed-off-by: Jens Axboe

    Jeff Moyer
     
  • Currently, there is a single in_flight counter measuring the number
    of requests in the request_queue. But some monitoring tools would
    like to know how many read requests and write requests are in
    progress. Split the current in_flight counter into two separate
    counters for read and write.

    This information is exported as a sysfs attribute, as changing the
    currently available stat files would break the existing tools.

    Signed-off-by: Nikanth Karthikesan
    Signed-off-by: Jens Axboe

    Nikanth Karthikesan
     

11 Sep, 2009

1 commit

  • If a BIO is discarded or crosses over the end of the device, BIO
    queueing doesn't actually occur.

    Originally, the trace was called just before make_request:
    [PATCH] Block queue IO tracing support (blktrace) as of 2006-03-23
         2056a782f8e7e65fd4bfd027506b4ce1c5e9ccd4

    Then two patches added checks in between:
    [PATCH] md: check bio address after mapping through partitions
           5ddfe9691c91a244e8d1be597b6428fcefd58103,
    [BLOCK] Don't allow empty barriers to be passed down to
    queues that don't grok them
           51fd77bd9f512ab6cc9df0733ba1caaab89eb957

    This breaks the original goal, so trace the queueing trial only
    when it actually happens.

    Signed-off-by: Minchan Kim
    Acked-by: Wu Fengguang
    Cc: Li Zefan
    Signed-off-by: Jens Axboe

    Minchan Kim