30 Apr, 2007

20 commits

  • Jens Axboe
     
  • It's never grabbed from irq context, so just make it plain spin_lock().

    Signed-off-by: Jens Axboe

    Jens Axboe
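
    A minimal sketch of the difference, using a made-up lock and counter
    rather than the actual CFQ data: a lock that is never taken from
    interrupt context does not need the irq-disabling variant.

        #include <linux/spinlock.h>

        static DEFINE_SPINLOCK(example_lock);   /* stand-in for the lock in question */
        static unsigned long example_count;

        static void bump_count(void)
        {
                /*
                 * spin_lock_irq() would also disable local interrupts; since
                 * this lock is never taken from irq context, plain
                 * spin_lock() is enough.
                 */
                spin_lock(&example_lock);
                example_count++;
                spin_unlock(&example_lock);
        }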
     
  • We often look up the same queue many times in succession, so cache
    the last looked-up queue to avoid browsing the rbtree.

    Signed-off-by: Jens Axboe

    Jens Axboe
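
    A sketch of the caching idea in the commit above, using the kernel
    rbtree API but with simplified, made-up types and names (struct queue,
    last_q and queue_lookup are not the CFQ ones):

        #include <linux/rbtree.h>

        struct queue {
                struct rb_node rb_node;
                unsigned long key;
        };

        static struct queue *last_q;    /* most recently found queue */

        static struct queue *queue_lookup(struct rb_root *root, unsigned long key)
        {
                struct rb_node *n = root->rb_node;

                /* Fast path: repeat lookups of the same key skip the tree walk. */
                if (last_q && last_q->key == key)
                        return last_q;

                while (n) {
                        struct queue *q = rb_entry(n, struct queue, rb_node);

                        if (key < q->key) {
                                n = n->rb_left;
                        } else if (key > q->key) {
                                n = n->rb_right;
                        } else {
                                last_q = q;     /* cache the hit for next time */
                                return q;
                        }
                }
                return NULL;
        }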
     
  • To be used by as/cfq as they see fit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The cfq hash is no longer necessary: we can always get the cfqq from
    the io context. A cfq_get_io_context_noalloc() function is introduced,
    because we don't want to allocate a cic when merging or checking
    may_queue. Sync and async queues used to be told apart by the hash key
    (with CFQ_KEY_ASYNC marking async queues); since the hash is eliminated,
    another criterion is needed, so a sync flag is added to the queue. In
    all places where we dig into the rbtree we are in the current process
    context, so no additional locking is required.

    Advantages of this patch: no additional memory for the hash, no hash
    lookups, cleaner code. The cic now has to be looked up in a per-ioc
    rbtree instead, but that is faster:
    - most processes work only with few devices
    - most systems have only few block devices
    - it is a rb-tree

    Signed-off-by: Vasily Tarasov

    Changes by me:

    - Merge into CFQ devel branch
    - Get rid of cfq_get_io_context_noalloc()
    - Fix various bugs with dereferencing cic->cfqq[] with an offset other
      than 0 or 1.
    - Fix bug in cfqq setup: the is_sync condition was reversed.
    - Fix bug where only bio_sync() was checked; we need to check for a
      READ too

    Signed-off-by: Jens Axboe

    Vasily Tarasov
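
    A sketch of the sync/async split behind the cic->cfqq[] fixups listed
    above: a two-slot array indexed by a clamped sync flag. The type and
    helper names here are made up, not the CFQ ones.

        struct cfqq;                            /* opaque stand-in for the cfq queue */

        struct cic_sketch {
                struct cfqq *cfqq[2];           /* [0] = async queue, [1] = sync queue */
        };

        static struct cfqq *cic_to_cfqq(struct cic_sketch *cic, int is_sync)
        {
                /* !! clamps any non-zero flag to 1, so the index is always 0 or 1 */
                return cic->cfqq[!!is_sync];
        }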
     
  • For tagged devices, allow overlap of requests if the idle window
    isn't enabled on the current active queue.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We don't enable it by default, don't let it get enabled during
    runtime.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We can track it fairly accurately locally; let the slice handling
    take care of the rest.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We don't use it anymore in the slice expiry handling.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • It's only used for preemption now that the IDLE and RT queues also
    use the rbtree. If we pass an 'add_front' variable to
    cfq_service_tree_add(), we can set ->rb_key to 0 to force insertion
    at the front of the tree.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Use max_slice - cur_slice as the multiplier for the insertion offset.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Same treatment as the RT conversion, just put the sorted idle
    branch at the end of the tree.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Currently CFQ does a linked insert into the current list for RT
    queues. We can just factor the class into the rb insertion,
    and then we don't have to treat RT queues in a special way. It's
    faster, too.

    Signed-off-by: Jens Axboe

    Jens Axboe
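
    A loose sketch of folding the scheduling class into the sort key, so RT
    queues sort ahead of best-effort and IDLE sorts last without any
    special-case lists. The enum values and band width below are
    illustrative, not CFQ's actual constants.

        /* Illustrative only: give each scheduling class its own band of the
         * key space, so RT sorts first, then best-effort, then IDLE. */
        enum prio_class { CLASS_RT = 0, CLASS_BE = 1, CLASS_IDLE = 2 };

        #define CLASS_BAND      (1ULL << 32)    /* wide enough that bands never overlap */

        static unsigned long long service_key(enum prio_class pclass, unsigned long rb_key)
        {
                return (unsigned long long)pclass * CLASS_BAND + rb_key;
        }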
     
  • For cases where the rbtree is mainly used for sorting and min retrieval,
    a nice speedup of the rbtree code is to maintain a cache of the leftmost
    node in the tree.

    Also spotted in the CFS CPU scheduler code.

    Improved by Alan D. Brunelle by updating the
    leftmost hint in cfq_rb_first() if it isn't set, instead of only
    updating it on insert.

    Signed-off-by: Jens Axboe

    Jens Axboe
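
    A sketch of the leftmost-node cache with simplified, made-up types; the
    pattern is generic: remember the smallest node at insert time, and, per
    the improvement noted above, refresh the hint from rb_first() when it
    isn't set (an erase path would likewise need to clear or advance it).

        #include <linux/rbtree.h>

        struct cached_root {
                struct rb_root root;
                struct rb_node *leftmost;       /* cached smallest node, may be NULL */
        };

        struct item {
                struct rb_node rb_node;
                unsigned long key;
        };

        static void cached_insert(struct cached_root *cr, struct item *it)
        {
                struct rb_node **p = &cr->root.rb_node, *parent = NULL;
                int is_leftmost = 1;

                while (*p) {
                        struct item *cur = rb_entry(*p, struct item, rb_node);

                        parent = *p;
                        if (it->key < cur->key) {
                                p = &(*p)->rb_left;
                        } else {
                                p = &(*p)->rb_right;
                                is_leftmost = 0;        /* went right at least once */
                        }
                }
                /* Only a node that went left all the way down is the new minimum. */
                if (is_leftmost)
                        cr->leftmost = &it->rb_node;

                rb_link_node(&it->rb_node, parent, p);
                rb_insert_color(&it->rb_node, &cr->root);
        }

        static struct rb_node *cached_first(struct cached_root *cr)
        {
                /* Refresh the hint if it isn't set, instead of walking each time. */
                if (!cr->leftmost)
                        cr->leftmost = rb_first(&cr->root);
                return cr->leftmost;
        }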
     
  • Drawing on some inspiration from the CFS CPU scheduler design, overhaul
    the pending cfq_queue concept list management. Currently CFQ uses a
    doubly linked list per priority level for sorting and service uses.
    Kill those lists and maintain an rbtree of cfq_queue's, sorted by when
    to service them.

    This unfortunately means that the ionice levels aren't as strong
    anymore; I will work on improving those later. We only scale the slice
    time now, not the number of times we service. This means that latency
    is better (for all priority levels), but that the distinction between
    the highest and lower levels isn't as big.

    The diffstat speaks for itself.

    cfq-iosched.c | 363 +++++++++++++++++---------------------------------
    1 file changed, 125 insertions(+), 238 deletions(-)

    Signed-off-by: Jens Axboe

    Jens Axboe
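
    A sketch of the service-tree idea described above: each queue gets a
    key that roughly means "when it is due", the rbtree keeps queues sorted
    by that key, and picking the next queue is just taking the leftmost
    node. The types, names and key calculation are illustrative.

        #include <linux/rbtree.h>

        /* Made-up types: rb_key stands for "when this queue should be served". */
        struct squeue {
                struct rb_node rb_node;
                unsigned long long rb_key;
        };

        static void service_tree_add(struct rb_root *root, struct squeue *q,
                                     unsigned long long now, unsigned long offset)
        {
                struct rb_node **p = &root->rb_node, *parent = NULL;

                q->rb_key = now + offset;       /* lower key = serviced sooner */

                while (*p) {
                        struct squeue *cur = rb_entry(*p, struct squeue, rb_node);

                        parent = *p;
                        if (q->rb_key < cur->rb_key)
                                p = &(*p)->rb_left;
                        else
                                p = &(*p)->rb_right;
                }
                rb_link_node(&q->rb_node, parent, p);
                rb_insert_color(&q->rb_node, root);
        }

        /* The scheduler always serves the queue with the smallest key next. */
        static struct squeue *service_tree_next(struct rb_root *root)
        {
                struct rb_node *n = rb_first(root);

                return n ? rb_entry(n, struct squeue, rb_node) : NULL;
        }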
     
  • - Move the queue_new flag clear to when the queue is selected
    - Only select the non-first queue in cfq_get_best_queue(), if there's
    a substantial difference between the best and first.
    - Get rid of ->busy_rr
    - Only select a close cooperator, if the current queue is known to take
    a while to "think".

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • - Implement logic for detecting cooperating processes, so we
    choose the best available queue whenever possible.

    - Improve residual slice time accounting.

    - Remove dead code: we no longer see async requests coming in on
    sync queues. That part was removed a long time ago. That means
    that we can also remove the difference between cfq_cfqq_sync()
    and cfq_cfqq_class_sync(); they are now identical. And we can
    kill the on_dispatch array and just make it a counter.

    - Allow a process to go into the current list, if it hasn't been
    serviced in this scheduler tick yet.

    Possible future improvements include caching the cfqq lookup
    in cfq_close_cooperator(), so we don't have to look it up twice.
    cfq_get_best_queue() should just use that last decision instead
    of doing it again.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • When testing the syslet async io approach, I discovered that CFQ
    sometimes didn't perform as well as expected. cfq_should_preempt()
    needs to better check for cooperating tasks, so fix that by allowing
    preemption of an equal priority queue if the recently queued request
    is as good a candidate for IO as the one we are currently waiting for.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

25 Apr, 2007

1 commit

  • There's a really rare and obscure bug in CFQ that causes a crash in
    cfq_dispatch_insert() due to rq == NULL. One example of the resulting
    oops is seen here:

    http://lkml.org/lkml/2007/4/15/41

    Neil correctly diagnosed how this can happen: if two concurrent
    requests arrive with the exact same sector number (due to direct IO,
    or to aliasing between MD and the raw device access), the alias
    handling will add the request to the sortlist but leave next_rq NULL.

    Read the more complete analysis at:

    http://lkml.org/lkml/2007/4/25/57

    This looks like it requires md to trigger, even though it should
    potentially be possible to do with O_DIRECT (at least if you edit the
    kernel and doctor some of the unplug calls).

    The fix is to move the ->next_rq update to when we add a request to the
    rbtree. Then we remove the possibility for a request to exist in the
    rbtree but not have ->next_rq correctly updated.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
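
    A rough sketch of the shape of the fix, with simplified stand-in types:
    by updating the back-pointer at rbtree-insert time, an aliased request
    (same sector key) can never sit in the tree while ->next_rq is still
    NULL.

        /* Made-up, simplified types; only the ordering of the update matters. */
        struct request_sketch {
                unsigned long long sector;
                /* ... rbtree linkage elided ... */
        };

        struct queue_sketch {
                struct request_sketch *next_rq; /* next request to dispatch */
        };

        /* Called from the same place that links the request into the sort tree. */
        static void rq_added_to_tree(struct queue_sketch *q, struct request_sketch *rq)
        {
                /*
                 * Keep next_rq valid here, at insert time, rather than in a
                 * later step that the alias path used to skip.
                 */
                if (!q->next_rq || rq->sector < q->next_rq->sector)
                        q->next_rq = rq;
        }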
     

21 Apr, 2007

1 commit

  • We have a 10-15% performance regression for sequential writes on TCQ/NCQ
    enabled drives in 2.6.21-rcX after the CFQ update went in. It has been
    reported by Valerie Clement and the Intel
    testing folks. The regression is because of CFQ's now more aggressive
    queue control, limiting the depth available to the device.

    This patch fixes that regression by allowing a greater depth when only
    one queue is busy. It has been tested to not impact sync-vs-async
    workloads too much - we still do a lot better than 2.6.20.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
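
    A sketch of the heuristic described above; the field names and the
    scaling factor are illustrative, not the actual CFQ dispatch logic:
    when a tagged drive has only one busy queue there is nothing to isolate
    it from, so the dispatch depth can be opened up.

        /* Illustrative only; not the actual cfq dispatch code. */
        struct sched_state {
                int busy_queues;        /* queues with requests pending */
                int hw_tagged;          /* device does TCQ/NCQ */
                int quantum;            /* normal per-queue dispatch depth */
        };

        static int max_dispatch(const struct sched_state *s)
        {
                /* Alone on a tagged drive: nobody to isolate, so go deeper. */
                if (s->hw_tagged && s->busy_queues == 1)
                        return 4 * s->quantum;  /* factor chosen for illustration */

                return s->quantum;
        }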
     

05 Apr, 2007

1 commit

  • Revert all this. It can cause device-mapper to receive a different major from
    earlier kernels and it turns out that the Amanda backup program (via GNU tar,
    apparently) checks major numbers on files when performing incremental backups.

    Which is a bit broken of Amanda (or tar), but this feature isn't important
    enough to justify the churn.

    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

27 Mar, 2007

2 commits

  • Booting 2.6.21-rc3-g45592145 I noticed the following on one of my
    machines in the bootlog:

    io scheduler noop registeredTime: jiffies clocksource has been installed.

    io scheduler deadline registered (default)

    Looking at block/elevator.c, it appears that elv_register() uses two
    consecutive printks in a non-atomic way, leading to the above glitch. The
    attached trivial patch fixes this issue by using a single printk.

    Signed-off-by: Thibaut VARENE
    Signed-off-by: Jens Axboe

    Thibaut VARENE
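
    A sketch of the shape of the fix: emitting the whole line with a single
    printk() keeps it atomic with respect to messages from other CPUs. The
    function and its arguments are stand-ins for the elv_register()
    internals.

        #include <linux/kernel.h>

        static void report_registration(const char *name, int is_default)
        {
                /*
                 * Two back-to-back printk() calls can be interleaved by
                 * another CPU's message; one call emits the complete line.
                 */
                printk(KERN_INFO "io scheduler %s registered%s\n",
                       name, is_default ? " (default)" : "");
        }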
     
  • There is a small problem in handling page bounce.

    At the moment blk_max_pfn equals max_pfn, which is in fact not the
    maximum possible page frame _number_, but the _count_ of page frames.
    For example, on a 32-bit x86 node with 4 GB of RAM, max_pfn = 0x100000,
    not 0xFFFFF.

    The request_queue structure has a member q->bounce_pfn, and the queue
    needs bounce pages for pages _above_ this limit. This is handled by
    blk_queue_bounce(), where the following check is performed:

    if (q->bounce_pfn >= blk_max_pfn)
    return;

    Assume a driver has set q->bounce_pfn to 0xFFFF, but blk_max_pfn
    equals 0x10000. In that situation the check above fails, and for each
    bio we always fall through to iterating over the pages tied to the bio.

    Note that for quite a large range of device drivers (ide, md, ...)
    this problem doesn't occur, because they use BLK_BOUNCE_ANY for
    bounce_pfn. BLK_BOUNCE_ANY is defined as blk_max_pfn << PAGE_SHIFT, so
    the check above doesn't fail. But for other drivers, which obtain the
    required value from the device, the problem does occur; for example,
    sata_nv uses ATA_DMA_MASK or dev->dma_mask.

    I propose using (max_pfn - 1) for blk_max_pfn, and the same for
    blk_max_low_pfn. The patch also cleans up some checks related to
    bounce_pfn.

    Signed-off-by: Vasily Tarasov
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Vasily Tarasov
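
    The count-versus-highest-number distinction is plain off-by-one
    arithmetic; a tiny standalone demonstration (ordinary userspace C, not
    block-layer code):

        #include <stdio.h>

        int main(void)
        {
                unsigned long long ram = 4ULL << 30;            /* 4 GB */
                unsigned long long page_size = 4096;            /* 4 KB pages */
                unsigned long long nr_frames = ram / page_size;

                /* frames are numbered 0 .. nr_frames - 1 */
                printf("page frames: %#llx, highest pfn: %#llx\n",
                       nr_frames, nr_frames - 1);               /* 0x100000 vs 0xfffff */
                return 0;
        }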
     

21 Feb, 2007

2 commits

  • >=============================================
    >[ INFO: possible recursive locking detected ]
    >2.6.19-1.2909.fc7 #1
    >---------------------------------------------
    >anaconda/587 is trying to acquire lock:
    > (&bdev->bd_mutex){--..}, at: [] mutex_lock+0x21/0x24
    >
    >but task is already holding lock:
    > (&bdev->bd_mutex){--..}, at: [] mutex_lock+0x21/0x24
    >
    >other info that might help us debug this:
    >1 lock held by anaconda/587:
    > #0: (&bdev->bd_mutex){--..}, at: [] mutex_lock+0x21/0x24
    >
    >stack backtrace:
    > [] show_trace_log_lvl+0x1a/0x2f
    > [] show_trace+0x12/0x14
    > [] dump_stack+0x16/0x18
    > [] __lock_acquire+0x116/0xa09
    > [] lock_acquire+0x56/0x6f
    > [] __mutex_lock_slowpath+0xe5/0x24a
    > [] mutex_lock+0x21/0x24
    > [] blkdev_ioctl+0x600/0x76d
    > [] block_ioctl+0x1b/0x1f
    > [] do_ioctl+0x22/0x68
    > [] vfs_ioctl+0x252/0x265
    > [] sys_ioctl+0x49/0x63
    > [] syscall_call+0x7/0xb

    Annotate BLKPG_DEL_PARTITION's bd_mutex locking and add a little comment
    clarifying the bd_mutex locking, because I confused myself and initially
    thought the lock order was wrong too.

    Signed-off-by: Peter Zijlstra
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
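
    A sketch of the general annotation pattern for this kind of report,
    with a made-up parent/child pair rather than struct block_device: when
    two objects of the same lock class must be held at once, the inner lock
    is taken with mutex_lock_nested() and a distinct subclass so lockdep
    does not flag it as recursion.

        #include <linux/mutex.h>

        struct node {
                struct mutex lock;
                struct node *child;
        };

        static void lock_pair(struct node *parent)
        {
                mutex_lock(&parent->lock);
                /* Same lock class, one level deeper: give lockdep a distinct subclass. */
                mutex_lock_nested(&parent->child->lock, 1);

                /* ... operate on both objects ... */

                mutex_unlock(&parent->child->lock);
                mutex_unlock(&parent->lock);
        }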
     
  • Several people have reported failures in dynamic major device number handling
    due to the recent changes in there to avoid handing out the local/experimental
    majors.

    Rolf reports that this is due to a gcc-4.1.0 bug.

    The patch refactors that code a lot in an attempt to provoke the compiler into
    behaving.

    Cc: Rolf Eike Beer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

18 Feb, 2007

1 commit


13 Feb, 2007

2 commits

  • Many struct file_operations in the kernel can be "const". Marking them
    const moves them to the .rodata section, which avoids false sharing with
    potentially dirty data. In addition, it'll catch accidental writes to
    these shared resources at compile time.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
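
    A minimal sketch of the pattern (the handler and names are made up):
    with the const qualifier the ops table lands in .rodata, and any later
    assignment to its fields is rejected at compile time.

        #include <linux/fs.h>
        #include <linux/module.h>

        static int example_open(struct inode *inode, struct file *filp)
        {
                return 0;
        }

        /* const: read-only data, and "example_fops.open = ..." no longer builds */
        static const struct file_operations example_fops = {
                .owner  = THIS_MODULE,
                .open   = example_open,
        };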
     
  • As pointed out in http://bugzilla.kernel.org/show_bug.cgi?id=7922, dynamic
    blockdev major allocation can hand out majors which LANANA has defined as
    being for local/experimental use.

    Cc: Torben Mathiasen
    Cc: Greg KH
    Cc: Al Viro
    Cc: Tomas Klas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

12 Feb, 2007

10 commits