24 May, 2007

2 commits

  • Send an uevent to user space to indicate that a media change event has
    occurred.

    Signed-off-by: Kristen Carlson Accardi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kristen Carlson Accardi
     
  • Allow user space to determine if a disk supports Asynchronous Notification of
    media changes. This is done by adding a new sysfs file "capability_flags",
    which is documented in (insert file name). This sysfs file will export all
    disk capability flags to user space. We also add a new flag for the media
    change notification capability.

    Signed-off-by: Kristen Carlson Accardi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kristen Carlson Accardi
     

11 May, 2007

1 commit

  • When stacked block devices are in use (e.g. md or dm), the recursive
    calls to generic_make_request can use up a lot of space, and we would
    rather they didn't.

    As generic_make_request is a void function, and as it is generally not
    expected that it will have any effect immediately, it is safe to delay any
    call to generic_make_request until there is sufficient stack space
    available.

    As ->bi_next is reserved for the driver to use, it can have no valid value
    when generic_make_request is called, and as __make_request implicitly
    assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
    certain that all callers set it to NULL. We can therefore safely use
    bi_next to link pending requests together, providing we clear it before
    making the real call.

    So, we choose to allow each thread to only be active in one
    generic_make_request at a time. If a subsequent (recursive) call is made,
    the bio is linked into a per-thread list, and is handled when the active
    call completes.

    As the list of pending bios is per-thread, there are no locking issues to
    worry about.

    I say above that it is "safe to delay any call...". There are, however,
    some behaviours of a make_request_fn which would make it unsafe. These
    include any behaviour that assumes anything will have changed after a
    recursive call to generic_make_request.

    These could include:
    - waiting for that call to finish and call its bi_end_io function.
      md used to sometimes do this (marking the superblock dirty before
      completing a write) but doesn't any more.
    - inspecting the bio for fields that generic_make_request might
      change, such as bi_sector or bi_bdev. It is hard to see a good
      reason for this, and I don't think anyone actually does it.
    - inspecting the queue to see if, e.g., it is 'full' yet. Again, I
      think this is very unlikely to be useful, or to be done.
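
    The deferral mechanism described above can be sketched in userspace C:
    a per-thread pending list flattens recursion so that only the outermost
    call actually processes bios. This is an illustrative sketch, not the
    kernel code; the two-field bio struct, the driver callback, and the
    instrumentation counters are all invented for the demo.

```c
#include <assert.h>
#include <stddef.h>

struct bio {
    int remaining_splits;    /* how many child bios this one spawns */
    struct bio *bi_next;     /* reserved for the driver; NULL on entry */
};

/* Per-thread state; a single set of globals suffices for this
 * single-threaded demo. */
static struct bio *pending_head, *pending_tail;
static int active;               /* inside generic_make_request()? */
static int max_depth, depth;     /* instrumentation only */

static void make_request_fn(struct bio *bio);

static void generic_make_request(struct bio *bio)
{
    if (active) {                /* recursive call: queue it and return */
        bio->bi_next = NULL;
        if (pending_tail)
            pending_tail->bi_next = bio;
        else
            pending_head = bio;
        pending_tail = bio;
        return;
    }
    active = 1;
    do {                         /* drain the pending list iteratively */
        make_request_fn(bio);
        bio = pending_head;
        if (bio) {
            pending_head = bio->bi_next;
            if (!pending_head)
                pending_tail = NULL;
            bio->bi_next = NULL; /* clear before the real call */
        }
    } while (bio);
    active = 0;
}

/* A driver that "splits" a bio by resubmitting children, the way a
 * stacked device (md/dm) would. */
static void make_request_fn(struct bio *bio)
{
    static struct bio children[16];
    static int n;

    if (++depth > max_depth)
        max_depth = depth;
    for (int i = 0; i < bio->remaining_splits; i++) {
        children[n].remaining_splits = bio->remaining_splits - 1;
        generic_make_request(&children[n]);
        n++;
    }
    depth--;
}
```

    Submitting a bio that splits three ways (and whose children split in
    turn) drives the driver callback many times, but never nested: the
    recursive submissions are queued and handled after the active call
    returns, so stack depth stays constant.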

    Signed-off-by: Neil Brown
    Cc: Jens Axboe
    Cc:

    Alasdair G Kergon said:

    I can see nothing wrong with this in principle.

    For device-mapper at the moment though it's essential that, while the bio
    mappings may now get delayed, they still get processed in exactly
    the same order as they were passed to generic_make_request().

    My main concern is whether the timing changes implicit in this patch
    will make the rare data-corrupting races in the existing snapshot code
    more likely. (I'm working on a fix for these races, but the unfinished
    patch is already several hundred lines long.)

    It would be helpful if some people on this mailing list would test
    this patch in various scenarios and report back.

    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Neil Brown
     

10 May, 2007

5 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
    sound: convert "sound" subdirectory to UTF-8
    MAINTAINERS: Add cxacru website/mailing list
    include files: convert "include" subdirectory to UTF-8
    general: convert "kernel" subdirectory to UTF-8
    documentation: convert the Documentation directory to UTF-8
    Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
    remove broken URLs from net drivers' output
    Magic number prefix consistency change to Documentation/magic-number.txt
    trivial: s/i_sem /i_mutex/
    fix file specification in comments
    drivers/base/platform.c: fix small typo in doc
    misc doc and kconfig typos
    Remove obsolete fat_cvf help text
    Fix occurrences of "the the "
    Fix minor typoes in kernel/module.c
    Kconfig: Remove reference to external mqueue library
    Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
    Correct comments in genrtc.c to refer to correct /proc file.
    Fix more "deprecated" spellos.
    Fix "deprecated" typoes.
    ...

    Fix trivial comment conflict in kernel/relay.c.

    Linus Torvalds
     
  • Since nonboot CPUs are now disabled after tasks and devices have been
    frozen and the CPU hotplug infrastructure is used for this purpose, we need
    special CPU hotplug notifications that will help the CPU-hotplug-aware
    subsystems distinguish normal CPU hotplug events from CPU hotplug events
    related to a system-wide suspend or resume operation in progress. This
    patch introduces such notifications and causes them to be used during
    suspend and resume transitions. It also changes all of the
    CPU-hotplug-aware subsystems to take these notifications into consideration
    (for now they are handled in the same way as the corresponding "normal"
    ones).
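
    As a sketch of how a subsystem can consume these notifications: the
    suspend/resume variants carry an extra flag bit, so an existing notifier
    can mask it off and fall through to its normal handling. The constant
    values below mirror the kernel's CPU_TASKS_FROZEN scheme but should be
    treated as illustrative, as should the callback and counter names.

```c
#include <assert.h>

/* Illustrative notifier action values, modeled on the kernel's scheme:
 * the *_FROZEN variants are the normal action with one extra bit set. */
enum {
    CPU_ONLINE        = 0x0002,
    CPU_DEAD          = 0x0007,
    CPU_TASKS_FROZEN  = 0x0010,
    CPU_ONLINE_FROZEN = CPU_ONLINE | CPU_TASKS_FROZEN,
    CPU_DEAD_FROZEN   = CPU_DEAD | CPU_TASKS_FROZEN,
};

static int cpus_seen_online;

/* A hotplug-aware subsystem that handles suspend/resume events the same
 * way as normal ones: mask off the FROZEN bit before switching. */
static int cpu_callback(unsigned long action)
{
    switch (action & ~CPU_TASKS_FROZEN) {
    case CPU_ONLINE:
        cpus_seen_online++;
        break;
    case CPU_DEAD:
        cpus_seen_online--;
        break;
    }
    return 0;
}
```

    A subsystem that needs to distinguish the cases (e.g. to skip work
    during a system-wide suspend) can instead test the flag bit explicitly
    before masking it.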

    [oleg@tv-sign.ru: cleanups]
    Signed-off-by: Rafael J. Wysocki
    Cc: Gautham R Shenoy
    Cc: Pavel Machek
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • flush_work(wq, work) doesn't need the first parameter; we can use cwq->wq
    (this was possible from the very beginning, I missed this). So we can
    unify flush_work_keventd and flush_work.

    Also, rename flush_work() to cancel_work_sync() and fix all callers.
    Perhaps this is not the best name, but "flush_work" is really bad.

    (akpm: this is why the earlier patches bypassed maintainers)

    Signed-off-by: Oleg Nesterov
    Cc: Jeff Garzik
    Cc: "David S. Miller"
    Cc: Jens Axboe
    Cc: Tejun Heo
    Cc: Auke Kok ,
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Switch the kblockd flushing from a global flush to a more specific
    flush_work().

    (akpm: bypassed maintainers, sorry. There are other patches which depend on
    this)

    Cc: "Maciej W. Rozycki"
    Cc: David Howells
    Cc: Jens Axboe
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Display all possible partitions when the root filesystem is not mounted.
    This helps to track spell'o's and missing drivers.

    Updated to work with newer kernels.

    Example output:

    VFS: Cannot open root device "foobar" or unknown-block(0,0)
    Please append a correct "root=" boot option; here are the available partitions:
    0800 8388608 sda driver: sd
    0801 192748 sda1
    0802 8193150 sda2
    0810 4194304 sdb driver: sd
    Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

    [akpm@linux-foundation.org: cleanups, fix printk warnings]
    Signed-off-by: Jan Engelhardt
    Cc: Dave Gilbert
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Gilbert
     

09 May, 2007

4 commits

  • Signed-off-by: Michael Opdenacker
    Signed-off-by: Adrian Bunk

    Michael Opdenacker
     
  • * 'for-2.6.22' of git://git.kernel.dk/data/git/linux-2.6-block:
    [PATCH] ll_rw_blk: fix missing bounce in blk_rq_map_kern()
    [PATCH] splice: always call into page_cache_readahead()
    [PATCH] splice(): fix interaction with readahead

    Linus Torvalds
     
  • Fix units mismatch (jiffies vs msecs) in as-iosched.c, spotted by
    Xiaoning Ding.

    Signed-off-by: Nick Piggin
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • I think we might just need the blk_map_kern users now. For the async
    execute I added the bounce code already, and the block SG_IO has it
    already. I think the blk_map_kern bounce code got dropped because we
    thought the correct gfp_t would be passed in. But I think all we need is
    the patch below and all the paths are taken care of. The patch is not
    tested. Patch was made against scsi-misc.

    The last place that is sending non-sg commands may just be md/dm-emc.c,
    but that is just waiting on alasdair to take some patches that fix
    that and a bunch of junk in there, including adding bounce support. If
    the patch below is ok though and dm-emc finally gets converted, then it
    will have sg and bounce buffer support.

    Signed-off-by: Mike Christie
    Signed-off-by: Jens Axboe

    Mike Christie
     

08 May, 2007

2 commits

  • This patch provides a new macro

    KMEM_CACHE(&lt;struct&gt;, &lt;flags&gt;)

    to simplify slab creation. KMEM_CACHE creates a slab with the name of the
    struct, with the size of the struct and with the alignment of the struct.
    Additional slab flags may be specified if necessary.

    Example:

    struct test_slab {
            int a,b,c;
            struct list_head;
    } __cacheline_aligned_in_smp;

    test_slab_cache = KMEM_CACHE(test_slab, SLAB_PANIC)

    will create a new slab named "test_slab" of the size sizeof(struct
    test_slab) and aligned to the alignment of test_slab. If it fails then we
    panic.
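
    A userspace sketch of how such a macro can be built: stringification
    plus sizeof and __alignof__ derive the name, size, and alignment from
    the struct name alone. Here kmem_cache_create_stub is a hypothetical
    stand-in for the kernel's kmem_cache_create, recording its arguments so
    the expansion can be inspected.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct kmem_cache_info {
    const char *name;
    size_t size;
    size_t align;
    unsigned long flags;
};

/* Stand-in for kmem_cache_create(): just records what it was called with. */
static struct kmem_cache_info kmem_cache_create_stub(const char *name,
                size_t size, size_t align, unsigned long flags)
{
    struct kmem_cache_info c = { name, size, align, flags };
    return c;
}

/* The macro technique: #__struct stringifies the name, while sizeof and
 * __alignof__ (a GCC extension) pull size and alignment from the type. */
#define KMEM_CACHE(__struct, __flags)                                \
    kmem_cache_create_stub(#__struct, sizeof(struct __struct),       \
                           __alignof__(struct __struct), (__flags))

struct test_slab {
    int a, b, c;
};
```

    Expanding KMEM_CACHE(test_slab, flags) thus yields a cache named
    "test_slab" whose size and alignment track the struct definition
    automatically, which is the point of the macro.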

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't
    been used in 6 years (so akpm says).

    find * -name \*.[ch] | xargs grep -l invalidate_bdev |
    while read file; do
            quilt add $file;
            sed -ie 's/invalidate_bdev(\([^,]*\),[^)]*)/invalidate_bdev(\1)/g' $file;
    done

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

06 May, 2007

1 commit

  • * master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (87 commits)
    [SCSI] fusion: fix domain validation loops
    [SCSI] qla2xxx: fix regression on sparc64
    [SCSI] modalias for scsi devices
    [SCSI] sg: cap reserved_size values at max_sectors
    [SCSI] BusLogic: stop using check_region
    [SCSI] tgt: fix rdma transfer bugs
    [SCSI] aacraid: fix aacraid not finding device
    [SCSI] aacraid: Correct SMC products in aacraid.txt
    [SCSI] scsi_error.c: Add EH Start Unit retry
    [SCSI] aacraid: [Fastboot] Panics for AACRAID driver during 'insmod' for kexec test.
    [SCSI] ipr: Driver version to 2.3.2
    [SCSI] ipr: Faster sg list fetch
    [SCSI] ipr: Return better qc_issue errors
    [SCSI] ipr: Disrupt device error
    [SCSI] ipr: Improve async error logging level control
    [SCSI] ipr: PCI unblock config access fix
    [SCSI] ipr: Fix for oops following SATA request sense
    [SCSI] ipr: Log error for SAS dual path switch
    [SCSI] ipr: Enable logging of debug error data for all devices
    [SCSI] ipr: Add new PCI-E IDs to device table
    ...

    Linus Torvalds
     

30 Apr, 2007

20 commits

  • Jens Axboe
     
  • It's never grabbed from irq context, so just make it plain spin_lock().

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We often lookup the same queue many times in succession, so cache
    the last looked up queue to avoid browsing the rbtree.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • To be used by as/cfq as they see fit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The cfq hash is no longer necessary; we can always get the cfqq from the
    io context. The cfq_get_io_context_noalloc() function is introduced,
    because we don't want to allocate a cic on merging and when checking
    may_queue. To identify a sync queue we used the hash key CFQ_KEY_ASYNC;
    since the hash is eliminated, we need another criterion, so a sync flag
    for the queue is added. In all places where we dig in the rb_tree we're
    in current context, so no additional locking is required.

    Advantages of this patch: no additional memory for the hash, no seeking
    in the hash, and cleaner code. It is now necessary to seek the cic in a
    per-ioc rbtree instead, but that is faster:
    - most processes work with only a few devices
    - most systems have only a few block devices
    - it is an rb-tree

    Signed-off-by: Vasily Tarasov

    Changes by me:

    - Merge into CFQ devel branch
    - Get rid of cfq_get_io_context_noalloc()
    - Fix various bugs with dereferencing cic->cfqq[] with an offset other
      than 0 or 1
    - Fix a bug in cfqq setup: the is_sync condition was reversed
    - Fix a bug where only bio_sync() was checked; we need to check for a
      READ too

    Signed-off-by: Jens Axboe

    Vasily Tarasov
     
  • For tagged devices, allow overlap of requests if the idle window
    isn't enabled on the current active queue.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We don't enable it by default, don't let it get enabled during
    runtime.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We can track it fairly accurately locally, let the slice handling
    take care of the rest.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We don't use it anymore in the slice expiry handling.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • It's only used for preemption now that the IDLE and RT queues also
    use the rbtree. If we pass an 'add_front' variable to
    cfq_service_tree_add(), we can set ->rb_key to 0 to force insertion
    at the front of the tree.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
    Use the max_slice - cur_slice as the multiplier for the insertion offset.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Same treatment as the RT conversion, just put the sorted idle
    branch at the end of the tree.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Currently CFQ does a linked insert into the current list for RT
    queues. We can just factor the class into the rb insertion,
    and then we don't have to treat RT queues in a special way. It's
    faster, too.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • For cases where the rbtree is mainly used for sorting and min retrieval,
    a nice speedup of the rbtree code is to maintain a cache of the leftmost
    node in the tree.

    Also spotted in the CFS CPU scheduler code.

    Improved by Alan D. Brunelle by updating the
    leftmost hint in cfq_rb_first() if it isn't set, instead of only
    updating it on insert.
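
    The idea can be sketched with a plain binary search tree standing in for
    the kernel's rbtree (no rebalancing; all names here are illustrative):
    inserts keep a leftmost hint up to date, and first() falls back to a
    full left-spine walk only when the hint is unset, as in cfq_rb_first().

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int key;
    struct node *left, *right;
};

struct tree {
    struct node *root;
    struct node *leftmost;   /* cached minimum, or NULL if not known */
};

static void tree_insert(struct tree *t, struct node *n)
{
    struct node **p = &t->root;
    int is_leftmost = 1;     /* true while we only ever descend left */

    n->left = n->right = NULL;
    while (*p) {
        if (n->key < (*p)->key) {
            p = &(*p)->left;
        } else {
            p = &(*p)->right;
            is_leftmost = 0;
        }
    }
    *p = n;
    if (is_leftmost)         /* new node is the minimum: update the hint */
        t->leftmost = n;
}

/* Like cfq_rb_first(): use the cached hint, and set it if it isn't set. */
static struct node *tree_first(struct tree *t)
{
    if (!t->leftmost) {
        struct node *n = t->root;
        while (n && n->left)
            n = n->left;
        t->leftmost = n;
    }
    return t->leftmost;
}

static void tree_remove_first(struct tree *t)
{
    struct node *n = tree_first(t);
    struct node **p = &t->root;

    if (!n)
        return;
    /* the minimum has no left child, so splice in its right subtree */
    while (*p != n)
        p = &(*p)->left;
    *p = n->right;
    t->leftmost = NULL;      /* invalidate; tree_first() recomputes */
}
```

    The payoff is that the common min-retrieval operation becomes O(1) in
    the steady state, at the cost of one comparison per insert and a hint
    invalidation on removal.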

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Drawing on some inspiration from the CFS CPU scheduler design, overhaul
    the pending cfq_queue concept list management. Currently CFQ uses a
    doubly linked list per priority level for sorting and service uses.
    Kill those lists and maintain an rbtree of cfq_queue's, sorted by when
    to service them.

    This unfortunately means that the ionice levels aren't as strong anymore;
    I will work on improving those later. We only scale the slice time now,
    not the number of times we service. This means that latency is better
    (for all priority levels), but that the distinction between the highest
    and lower levels isn't as big.

    The diffstat speaks for itself.

    cfq-iosched.c | 363 +++++++++++++++++---------------------------------
    1 file changed, 125 insertions(+), 238 deletions(-)

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • - Move the queue_new flag clear to when the queue is selected
    - Only select the non-first queue in cfq_get_best_queue(), if there's
    a substantial difference between the best and first.
    - Get rid of ->busy_rr
    - Only select a close cooperator, if the current queue is known to take
    a while to "think".

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • - Implement logic for detecting cooperating processes, so we
    choose the best available queue whenever possible.

    - Improve residual slice time accounting.

    - Remove dead code: we no longer see async requests coming in on
    sync queues. That part was removed a long time ago. That means
    that we can also remove the difference between cfq_cfqq_sync()
    and cfq_cfqq_class_sync(), they are now identical. And we can
    kill the on_dispatch array, just make it a counter.

    - Allow a process to go into the current list, if it hasn't been
    serviced in this scheduler tick yet.

    Possible future improvements include caching the cfqq lookup
    in cfq_close_cooperator(), so we don't have to look it up twice.
    cfq_get_best_queue() should just use that last decision instead
    of doing it again.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • When testing the syslet async io approach, I discovered that CFQ
    sometimes didn't perform as well as expected. cfq_should_preempt()
    needs to better check for cooperating tasks, so fix that by allowing
    preemption of an equal priority queue if the recently queued request
    is as good a candidate for IO as the one we are currently waiting for.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

25 Apr, 2007

1 commit

  • There's a really rare and obscure bug in CFQ, that causes a crash in
    cfq_dispatch_insert() due to rq == NULL. One example of the resulting
    oops is seen here:

    http://lkml.org/lkml/2007/4/15/41

    Neil correctly diagnosed the situation for how this can happen: if two
    concurrent requests arrive with the exact same sector number (due to
    direct IO or aliasing between MD and the raw device access), the alias
    handling will add the request to the sortlist, but next_rq remains NULL.

    Read the more complete analysis at:

    http://lkml.org/lkml/2007/4/25/57

    This looks like it requires md to trigger, even though it should
    potentially be possible to do with O_DIRECT (at least if you edit the
    kernel and doctor some of the unplug calls).

    The fix is to move the ->next_rq update to when we add a request to the
    rbtree. Then we remove the possibility for a request to exist in the
    rbtree code, but not have ->next_rq correctly updated.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

21 Apr, 2007

1 commit

  • We have a 10-15% performance regression for sequential writes on TCQ/NCQ
    enabled drives in 2.6.21-rcX after the CFQ update went in. It has been
    reported by Valerie Clement and the Intel
    testing folks. The regression is because of CFQ's now more aggressive
    queue control, limiting the depth available to the device.

    This patch fixes that regression by allowing a greater depth when only
    one queue is busy. It has been tested to not impact sync-vs-async
    workloads too much - we still do a lot better than 2.6.20.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

18 Apr, 2007

1 commit

  • This patch (as857) modifies the SG_GET_RESERVED_SIZE and
    SG_SET_RESERVED_SIZE ioctls in the sg driver, capping the values at
    the device's request_queue's max_sectors value. This will permit
    cdrecord to obtain a legal value for the maximum transfer length,
    fixing Bugzilla #7026.

    The patch also caps the initial reserved_size value. There's no
    reason to have a reserved buffer larger than max_sectors, since it
    would be impossible to use the extra space.

    The corresponding ioctls in the block layer are modified similarly,
    and the initial value for the reserved_size is set as large as
    possible. This will effectively make it default to max_sectors.
    Note that the actual value is meaningless anyway, since block devices
    don't have a reserved buffer.

    Finally, the BLKSECTGET ioctl is added to sg, so that there will be a
    uniform way for users to determine the actual max_sectors value for
    any raw SCSI transport.
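
    The capping itself is simple arithmetic; a sketch (field and function
    names are illustrative, not sg's): sg counts reserved_size in bytes,
    while max_sectors is in 512-byte sectors, so the cap converts before
    clamping.

```c
#include <assert.h>

/* Illustrative queue limit, in 512-byte sectors (like max_sectors). */
struct queue_limits {
    unsigned int max_sectors;
};

/* Cap a requested reserved-buffer size (in bytes) at what the queue can
 * actually transfer in one request, as the sg ioctls now do. */
static unsigned int cap_reserved_size(unsigned int requested_bytes,
                                      const struct queue_limits *q)
{
    unsigned int max_bytes = q->max_sectors * 512;

    return requested_bytes > max_bytes ? max_bytes : requested_bytes;
}
```

    With this in place, a user asking SG_SET_RESERVED_SIZE for more than the
    device can move in one command simply gets the largest usable value back
    from SG_GET_RESERVED_SIZE, which is what cdrecord needs.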

    Signed-off-by: Alan Stern
    Acked-by: Jens Axboe
    Acked-by: Douglas Gilbert
    Signed-off-by: James Bottomley

    Alan Stern