15 Jun, 2006

1 commit

  • We don't clear the seek stat values in cfq_alloc_io_context(), and if
    ->seek_mean is unlucky enough to be set to -36 by chance, the first
    invocation of cfq_update_io_seektime() will oops with a divide by zero
    in do_div().

    Just memset the entire cic instead of filling individual values
    independently.
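
A minimal user-space sketch of the "zero the whole structure" approach described above. The struct `io_ctx` and the function `io_ctx_alloc()` are hypothetical stand-ins for illustration, not the kernel's cic or cfq_alloc_io_context(), and `malloc()` stands in for the kernel allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-context state; not the kernel struct. */
struct io_ctx {
    unsigned long last_end_request;
    unsigned long seek_samples;
    unsigned long long seek_total;
    long long seek_mean;
};

/* Allocate a context and zero every field in one step, so that no field
 * can keep whatever bytes the allocator happened to hand back. */
struct io_ctx *io_ctx_alloc(void)
{
    struct io_ctx *ctx = malloc(sizeof(*ctx));

    if (ctx == NULL)
        return NULL;
    memset(ctx, 0, sizeof(*ctx));
    /* Postcondition: the seek statistics start from zero. */
    assert(ctx->seek_samples == 0 && ctx->seek_mean == 0);
    return ctx;
}
```

Zeroing the whole object also covers fields added to the structure later, which field-by-field initialization can silently miss.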

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

09 Jun, 2006

1 commit

  • There's a race between shutting down one io scheduler and firing up the
    next, in which a new io could enter and cause the io scheduler to be
    invoked with bad or NULL data.

    To fix this, we need to maintain the queue lock for a bit longer.
    Unfortunately we cannot do that, since the elevator init must be
    run without the lock held. This isn't easily fixable without also
    changing the mempool API. So split the initialization into two parts,
    an alloc-init operation and an attach operation. Then we can
    preallocate the io scheduler and related structures, and run the attach
    inside the lock after we detach the old one.
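
The split into an alloc-init step and an attach step could be sketched in user space roughly as follows. All names here (`struct sched`, `struct queue`, `sched_alloc_init()`, `sched_switch()`) are hypothetical, and a pthread mutex stands in for the queue lock; this illustrates the pattern only, not the elevator code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct sched { int id; };            /* hypothetical scheduler state */

struct queue {
    pthread_mutex_t lock;            /* stands in for the queue lock */
    struct sched *sched;             /* currently attached scheduler */
};

/* Phase 1: allocation and setup. May block, so it runs without the lock. */
struct sched *sched_alloc_init(int id)
{
    struct sched *s = malloc(sizeof(*s));

    if (s != NULL)
        s->id = id;
    return s;
}

/* Phase 2: detach the old scheduler and attach the preallocated one in a
 * single critical section, so anyone taking the lock always finds a
 * fully initialized scheduler. Returns the old one for the caller to free. */
struct sched *sched_switch(struct queue *q, struct sched *new_sched)
{
    struct sched *old;

    assert(new_sched != NULL);
    pthread_mutex_lock(&q->lock);
    old = q->sched;
    q->sched = new_sched;
    pthread_mutex_unlock(&q->lock);
    return old;
}
```

Because nothing inside the critical section allocates, the lock is held only for the pointer swap.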

    This patch has survived 30 minutes of 1 second io scheduler switching
    with a very busy io load.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

02 Jun, 2006

1 commit


01 Jun, 2006

4 commits


31 May, 2006

1 commit


24 May, 2006

1 commit

  • While executing a barrier sequence, the bar_rq which carries the actual
    write was accounted as normal IO on completion, while it wasn't on
    queueing. This caused gendisk->in_flight to be decremented by 1 after
    each barrier, which messed up the statistics.

    This patch makes bar_rq not accounted as normal IO. As the containing
    barrier request as a whole is accounted, part of it shouldn't be.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

13 May, 2006

1 commit

  • This reverts commit 56cf6504fc1c0c221b82cebc16a444b684140fb7.

    Both Erik Mouw and Andrew Vasquez independently pinpointed this commit
    as causing problems, where the slab cache for a driver is never released
    (most obviously causing problems when immediately re-loading that
    driver, resulting in a "kmem_cache_create: duplicate cache "
    message, but it can also cause other trouble).

    James Bottomley dug into it, and reports:

    "OK, here's the scoop. The problem patch adds a get of driverfs_dev in
    add_disk(), but doesn't put it again until disk_release() (which occurs
    on final put_disk() of the gendisk).

    However, in SCSI, the driverfs_dev is the sdev_gendev. That means
    there's a reference held on sdev_gendev until final disk put.
    Unfortunately, we use the driver model driver_remove to trigger
    del_gendisk (which removes the gendisk from visibility and decrements
    the refcount), so we've introduced an unbreakable deadlock in the
    reference counting with this.

    I suggest simply reversing this patch at the moment. If Russell and
    Jens can tell me what they're trying to do I'll see if there's another
    way to do it."

    so hereby the patch gets reverted, waiting for a better fix.

    Cc: Jens Axboe
    Cc: Russell King
    Cc: James Bottomley
    Cc: Erik Mouw
    Cc: Andrew Vasquez
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

12 May, 2006

1 commit

  • Don't recurse back into the driver even if the unplug threshold is met,
    when the driver asks for a requeue. This is both silly from a logical
    point of view (requeues typically happen due to driver/hardware
    shortage), and also dangerous since we could hit an endless request_fn
    -> requeue -> unplug -> request_fn loop and crash on stack overrun.

    Also limit blk_run_queue() to one level of recursion, similar to how
    blk_start_queue() works.
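
One way to picture the recursion limit is a re-entry flag: a nested attempt to run the queue is deferred instead of calling back into the driver. The sketch below is a user-space illustration under that assumption; `struct queue`, `run_queue()` and `request_fn()` are hypothetical names and do not reproduce the block layer code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct queue {
    bool reenter;        /* true while a run is already in progress */
    bool deferred;       /* a nested run was postponed, not recursed into */
    int depth;           /* current nesting of request_fn() */
    int max_depth;       /* deepest nesting seen, for illustration */
    int requeues_left;   /* how often the fake driver asks for a requeue */
};

void run_queue(struct queue *q);

/* Hypothetical driver handler: on a requeue it tries to run the queue
 * again, which would recurse without the guard in run_queue(). */
void request_fn(struct queue *q)
{
    q->depth++;
    if (q->depth > q->max_depth)
        q->max_depth = q->depth;
    if (q->requeues_left > 0) {
        q->requeues_left--;
        run_queue(q);
    }
    q->depth--;
}

/* Run the queue unless a run is already active; in that case only
 * record that another pass is wanted and return immediately. */
void run_queue(struct queue *q)
{
    assert(q != NULL);
    if (q->reenter) {
        q->deferred = true;
        return;
    }
    q->reenter = true;
    request_fn(q);
    q->reenter = false;
}
```

With the guard in place the handler never nests, so the stack depth stays bounded no matter how often the driver asks for a requeue.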

    This patch fixed a real problem with SLES10 and lpfc, and it could hit
    any SCSI lld that returns non-zero from its ->queuecommand() handler.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

06 May, 2006

1 commit

  • The block layer keeps a reference (driverfs_dev) to the struct
    device associated with the block device, and uses it internally
    for generating uevents in block_uevent.

    Block device uevents include umounting the partition, which can
    occur after the backing device has been removed.

    Unfortunately, this reference is not counted. This means that
    if the struct device is removed from the device tree, the block
    layer's reference will become stale.

    Guard against this by holding a reference to the struct device
    in add_disk(), and only drop the reference when we're releasing
    the gendisk kobject - in other words when we can be sure that no
    further uevents will be generated for this block device.
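
The get-on-add, put-on-release pairing described above can be illustrated with a toy reference counter. Everything in this sketch (`struct dev`, `dev_get()`, `dev_put()`, `disk_add()`, `disk_release()`) is a hypothetical user-space model, not the driver core or gendisk API.

```c
#include <assert.h>
#include <stddef.h>

struct dev {
    int refcount;
    int released;        /* set once the last reference is dropped */
};

struct disk {
    struct dev *backing; /* counted reference to the backing device */
};

struct dev *dev_get(struct dev *d)
{
    d->refcount++;
    return d;
}

void dev_put(struct dev *d)
{
    assert(d->refcount > 0);
    if (--d->refcount == 0)
        d->released = 1;
}

/* Take a counted reference when the disk is registered... */
void disk_add(struct disk *disk, struct dev *d)
{
    disk->backing = dev_get(d);
}

/* ...and drop it only when the disk object itself is released, so the
 * pointer stays valid for as long as the disk can still use it. */
void disk_release(struct disk *disk)
{
    dev_put(disk->backing);
    disk->backing = NULL;
}
```

In this model the device outlives its original owner's put for as long as the disk still holds its reference.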

    Signed-off-by: Russell King
    Acked-by: Jens Axboe

    Russell King
     

26 Apr, 2006

1 commit

  • A few of the notifier_chain_register() callers use __devinitdata in the
    definition of the notifier_block data structure. This is incorrect, as
    the data structure should be available after initialization (they do
    not unregister them during initialization).

    This was leading to an oops when notifier_chain_register() call is
    invoked for those callback chains after initialization.

    This patch fixes all such usages to _not_ have the notifier_block data
    structure in the init data section.

    Signed-off-by: Chandra Seetharaman
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     

20 Apr, 2006

2 commits


19 Apr, 2006

1 commit


18 Apr, 2006

2 commits

  • When a queue dies, we set cic->key=NULL as a dead mark. So, when we
    traverse an rbtree, we must check whether the key is still valid. If it
    was invalidated, drop it, then restart the traversal from the top.

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Jens Axboe

    OGAWA Hirofumi
     
  • On the rmmod path, cfq/as waits to make sure all io-contexts were
    freed. However, it's using complete(), not wait_for_completion().

    I think barrier() is not enough here. To avoid the following case,
    this patch replaces barrier() with smp_wmb().

    cpu0                       visibility                    cpu1
                               [ioc_gone=NULL,ioc_count=1]

    ioc_gone = &all_gone       NULL,ioc_count=1
    atomic_read(&ioc_count)    NULL,ioc_count=1
    wait_for_completion()      NULL,ioc_count=0              atomic_sub_and_test()
                               NULL,ioc_count=0              if ( && ioc_gone)
                                                             [ioc_gone==NULL,
                                                             so doesn't call complete()]
                               &all_gone,ioc_count=0

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Jens Axboe

    OGAWA Hirofumi
     

13 Apr, 2006

1 commit

  • We currently have two implementations of this obsolete ioctl, one in
    the block layer and one in the scsi code. Both of them have drawbacks.

    This patch kills the scsi layer version after updating the block version
    with the missing bits:

    - argument checking
    - use scatterlist I/O
    - set number of retries based on the submitted command

    This is the last user of non-S/G I/O except for the gdth driver, so
    getting this in ASAP and through the scsi tree would be nice to kill
    the non-S/G I/O path. Jens, what do you think about adding a check
    for non-S/G I/O in the midlayer?

    Thanks to Or Gerlitz for testing this patch.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: James Bottomley

    Christoph Hellwig
     

02 Apr, 2006

1 commit


01 Apr, 2006

3 commits

  • The help text says that if you select CONFIG_LBD, then it will automatically
    select CONFIG_LSF. That isn't currently the case, so update the text.

    - Get rid of the cruft in the help text mentioning CONFIG_LBD

    - Tell unsure users to select CONFIG_LSF.

    - Remove the `default n'.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     
  • The boot cmdline is parsed in parse_early_param() and
    parse_args(,unknown_bootoption).

    And __setup() is used in obsolete_checksetup().

    start_kernel()
    -> parse_args()
    -> unknown_bootoption()
    -> obsolete_checksetup()

    If __setup()'s callback (->setup_func()) returns 1 in
    obsolete_checksetup(), obsolete_checksetup() thinks a parameter was
    handled.

    If ->setup_func() returns 0, obsolete_checksetup() tries the other
    ->setup_func()s. If every ->setup_func() that matched a parameter
    returns 0, the parameter is added to argv_init[].

    Then, when running /sbin/init or init=app, argv_init[] is passed to the
    app. If the app doesn't ignore those arguments, it will warn and exit.

    This patch fixes wrong usages of it, but only the obvious ones.
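
The return-value convention described above can be modelled with a small dispatcher. This is a simplified user-space sketch: `struct setup_entry`, `check_setup()` and the two handlers are hypothetical names, loosely modelled on the description of obsolete_checksetup() rather than copied from it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical handler table entry: an option prefix and its callback. */
struct setup_entry {
    const char *prefix;
    int (*setup_func)(const char *arg);
};

/* Correct usage: consume the option and report it as handled. */
int good_handler(const char *arg) { (void)arg; return 1; }

/* Wrong usage: the option was consumed, but 0 says "not handled". */
int bad_handler(const char *arg) { (void)arg; return 0; }

/* Returns 1 if a matching handler reported the option as handled.
 * A result of 0 means the caller would pass the option on to init. */
int check_setup(const struct setup_entry *tab, size_t n, const char *line)
{
    assert(tab != NULL && line != NULL);
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(tab[i].prefix);

        if (strncmp(line, tab[i].prefix, len) == 0 &&
            tab[i].setup_func(line + len))
            return 1;
    }
    return 0;
}
```

In this model an option whose handler returns 0 is indistinguishable from an option nobody recognised, which is how it ends up among the arguments handed to init.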

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • Make baby-simple the code for /proc/devices. Based on the proven design
    for /proc/interrupts.

    This also fixes the early-termination regression 2.6.16 introduced, as
    demonstrated by:

    # dd if=/proc/devices bs=1
    Character devices:
    1 mem
    27+0 records in
    27+0 records out

    This should also work (but is untested) when /proc/devices >4096 bytes,
    which I believe is what the original 2.6.16 rewrite fixed.

    [akpm@osdl.org: cleanups, simplifications]
    Signed-off-by: Joe Korty
    Cc: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Korty
     

29 Mar, 2006

2 commits


28 Mar, 2006

6 commits


27 Mar, 2006

6 commits

  • debugfs depends on sysfs, so make blktrace kconfig option depend
    on that.

    Reported by Adrian Bunk.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The kernel's representation of the disk statistics uses the type unsigned
    which is 32b on both 32b and 64b platforms. Unfortunately, most system
    tools that work with these numbers that are exported in /proc/diskstats
    including iostat read these numbers into unsigned longs. This works fine
    on 32b platforms and when the number of IO transactions are small on 64b
    platforms. However, when the numbers wrap on 64b platforms & you read the
    numbers into unsigned longs, and compare the numbers to previous readings,
    then you get an unsigned representation of a negative number. This looks
    like a very large 64b number & gives you bizarre readouts in iostat:

    ilc4: Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
    ilc4: sda 5.50 0.00 143.96 0.00 307496983987862656.00 0.00 153748491993931328.00 0.00 2136028725038430.00 7.94 55.12 5.59 80.42

    Though fixing iostat in user space is possible, a quick survey
    indicates that several other similar tools also use unsigned longs when
    processing /proc/diskstats. Therefore, it seems like a better approach
    would be to extend the length of the disk_stats structure on 64b
    architectures to 64b. The following patch does that. It should not affect
    the operation on 32b platforms.
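
The failure mode can be reproduced with plain integer arithmetic. The sketch below only illustrates the wraparound described above; the function names `delta_wide()` and `delta_native()` are hypothetical, and this is not the patch itself, which widens the kernel counters instead.

```c
#include <assert.h>
#include <stdint.h>

/* Difference as computed by a tool that reads 32-bit counters into
 * 64-bit variables: the subtraction happens in 64 bits, so a counter
 * that wrapped between two readings yields a huge value. */
uint64_t delta_wide(uint32_t prev, uint32_t curr)
{
    uint64_t p = prev;
    uint64_t c = curr;

    return c - p;
}

/* Difference computed in the counter's own width: arithmetic modulo
 * 2^32 absorbs a single wrap and gives the true increment. */
uint32_t delta_native(uint32_t prev, uint32_t curr)
{
    return curr - prev;
}
```

For a previous reading of 0xFFFFFFFA and a current reading of 4, `delta_native()` returns 10, while `delta_wide()` returns a value just below 2^64, which is the kind of number that shows up in the iostat output above.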

    Signed-off-by: Ben Woodard
    Cc: Rick Lindsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Ben Woodard
     
  • Both elv_add_request() and generic_unplug_device() grab the queue lock
    and disable interrupts. Do that locally and use the __ variants.

    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Andrew Morton
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Add blkcnt_t as the type of inode.i_blocks. This enables you to make the size
    of blkcnt_t either 4 bytes or 8 bytes on 32-bit architectures with CONFIG_LSF.

    - CONFIG_LSF
    Add new configuration parameter.
    - blkcnt_t
    On h8300, i386, mips, powerpc, s390 and sh that define sector_t,
    blkcnt_t is defined as u64 if CONFIG_LSF is enabled; otherwise it is
    defined as unsigned long.
    On other architectures, it is defined as unsigned long.
    - inode.i_blocks
    Change the type from sector_t to blkcnt_t.

    Signed-off-by: Takashi Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Takashi Sato
     
  • Modify well over a dozen mempool users to call mempool_create_slab_pool()
    rather than calling mempool_create() with extra arguments, saving about 30
    lines of code and increasing readability.

    Signed-off-by: Matthew Dobson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Dobson
     

25 Mar, 2006

1 commit


24 Mar, 2006

1 commit


23 Mar, 2006

1 commit

  • Semaphore to mutex conversion.

    The conversion was generated via scripts, and the result was validated
    automatically via a script as well.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven