25 Apr, 2007

1 commit

  • There's a really rare and obscure bug in CFQ that causes a crash in
    cfq_dispatch_insert() due to rq == NULL. One example of the resulting
    oops is seen here:

    http://lkml.org/lkml/2007/4/15/41

    Neil correctly diagnosed how this can happen: if two concurrent
    requests have the exact same sector number (due to direct IO or
    aliasing between MD and raw device access), the alias handling
    will add the request to the sortlist, but next_rq remains NULL.

    Read the more complete analysis at:

    http://lkml.org/lkml/2007/4/25/57

    This looks like it requires md to trigger, even though it should
    potentially be possible to do with O_DIRECT (at least if you edit the
    kernel and doctor some of the unplug calls).

    The fix is to move the ->next_rq update to when we add a request to the
    rbtree. Then we remove the possibility for a request to exist in the
    rbtree code, but not have ->next_rq correctly updated.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

21 Apr, 2007

1 commit

  • We have a 10-15% performance regression for sequential writes on TCQ/NCQ
    enabled drives in 2.6.21-rcX after the CFQ update went in. It has been
    reported by Valerie Clement and the Intel
    testing folks. The regression is because of CFQ's now more aggressive
    queue control, limiting the depth available to the device.

    This patch fixes that regression by allowing a greater depth when only
    one queue is busy. It has been tested to not impact sync-vs-async
    workloads too much - we still do a lot better than 2.6.20.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

05 Apr, 2007

1 commit

  • Revert all this. It can cause device-mapper to receive a different major from
    earlier kernels and it turns out that the Amanda backup program (via GNU tar,
    apparently) checks major numbers on files when performing incremental backups.

    Which is a bit broken of Amanda (or tar), but this feature isn't important
    enough to justify the churn.

    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

27 Mar, 2007

2 commits

  • Booting 2.6.21-rc3-g45592145 I noticed the following on one of my
    machines in the bootlog:

    io scheduler noop registeredTime: jiffies clocksource has been installed.

    io scheduler deadline registered (default)

    Looking at block/elevator.c, it appears that elv_register() uses two
    consecutive printks in a non-atomic way, leading to the above glitch. The
    attached trivial patch fixes this issue, by using a single printk.

    Signed-off-by: Thibaut VARENE
    Signed-off-by: Jens Axboe

    Thibaut VARENE
     
  • There is a small problem in handling page bounce.

    At the moment blk_max_pfn equals max_pfn, which is in fact not the
    maximum possible _number_ of a page frame, but the _amount_ of page
    frames. For example, for a 32-bit x86 node with 4GB RAM,
    max_pfn = 0x100000, not 0xFFFFF.

    The request_queue structure has a member q->bounce_pfn, and the queue
    needs bounce pages for the pages _above_ this limit. This is handled by
    blk_queue_bounce(), which performs the following check:

    if (q->bounce_pfn >= blk_max_pfn)
            return;

    Assume that a driver has set q->bounce_pfn to 0xFFFF, but blk_max_pfn
    equals 0x10000. In that situation the check above fails, and for each
    bio we always fall through to iterating over the pages tied to the bio.

    Note that for quite a big range of device drivers (ide, md, ...) this
    problem doesn't happen, because they use BLK_BOUNCE_ANY for bounce_pfn.
    BLK_BOUNCE_ANY is defined as blk_max_pfn << PAGE_SHIFT, and then the
    check above doesn't fail. But for other drivers, which obtain the
    required value from drivers, it fails. For example, sata_nv uses
    ATA_DMA_MASK or dev->dma_mask.

    I propose to use (max_pfn - 1) for blk_max_pfn, and likewise for
    blk_max_low_pfn. The patch also cleans up some checks related to
    bounce_pfn.

    Signed-off-by: Vasily Tarasov
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Vasily Tarasov
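
The off-by-one in the blk_max_pfn commit above can be sketched in user space as follows. This is only a model of the early-return check, not the kernel code: pfn_t, old_blk_max_pfn(), new_blk_max_pfn() and needs_bounce_scan() are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical type for a page frame number. */
typedef unsigned long pfn_t;

/* Before the patch: blk_max_pfn holds the page-frame count (max_pfn). */
static pfn_t old_blk_max_pfn(pfn_t max_pfn)
{
    return max_pfn;
}

/* After the patch: blk_max_pfn holds the highest valid page frame number. */
static pfn_t new_blk_max_pfn(pfn_t max_pfn)
{
    assert(max_pfn > 0);
    return max_pfn - 1;
}

/*
 * Models the check "if (q->bounce_pfn >= blk_max_pfn) return;".
 * Returns true when the early return is NOT taken, i.e. when the
 * per-page walk over the bio would still run.
 */
static bool needs_bounce_scan(pfn_t bounce_pfn, pfn_t blk_max_pfn)
{
    return bounce_pfn < blk_max_pfn;
}
```

With bounce_pfn = 0xFFFF and 0x10000 page frames, needs_bounce_scan() returns true for the old value and false for the new one, which is the behaviour change the commit message describes.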
     

21 Feb, 2007

2 commits

  • >=============================================
    >[ INFO: possible recursive locking detected ]
    >2.6.19-1.2909.fc7 #1
    >---------------------------------------------
    >anaconda/587 is trying to acquire lock:
    > (&bdev->bd_mutex){--..}, at: [] mutex_lock+0x21/0x24
    >
    >but task is already holding lock:
    > (&bdev->bd_mutex){--..}, at: [] mutex_lock+0x21/0x24
    >
    >other info that might help us debug this:
    >1 lock held by anaconda/587:
    > #0: (&bdev->bd_mutex){--..}, at: [] mutex_lock+0x21/0x24
    >
    >stack backtrace:
    > [] show_trace_log_lvl+0x1a/0x2f
    > [] show_trace+0x12/0x14
    > [] dump_stack+0x16/0x18
    > [] __lock_acquire+0x116/0xa09
    > [] lock_acquire+0x56/0x6f
    > [] __mutex_lock_slowpath+0xe5/0x24a
    > [] mutex_lock+0x21/0x24
    > [] blkdev_ioctl+0x600/0x76d
    > [] block_ioctl+0x1b/0x1f
    > [] do_ioctl+0x22/0x68
    > [] vfs_ioctl+0x252/0x265
    > [] sys_ioctl+0x49/0x63
    > [] syscall_call+0x7/0xb

    Annotate BLKPG_DEL_PARTITION's bd_mutex locking and add a little comment
    clarifying the bd_mutex locking, because I confused myself and initially
    thought the lock order was wrong too.

    Signed-off-by: Peter Zijlstra
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Several people have reported failures in dynamic major device number handling
    due to the recent changes in there to avoid handing out the local/experimental
    majors.

    Rolf reports that this is due to a gcc-4.1.0 bug.

    The patch refactors that code a lot in an attempt to provoke the compiler into
    behaving.

    Cc: Rolf Eike Beer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

18 Feb, 2007

1 commit


13 Feb, 2007

2 commits

  • Many struct file_operations in the kernel can be "const". Marking them const
    moves these to the .rodata section, which avoids false sharing with potential
    dirty data. In addition it'll catch accidental writes at compile time to
    these shared resources.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • As pointed out in http://bugzilla.kernel.org/show_bug.cgi?id=7922, dynamic
    blockdev major allocation can hand out majors which LANANA has defined as
    being for local/experimental use.

    Cc: Torben Mathiasen
    Cc: Greg KH
    Cc: Al Viro
    Cc: Tomas Klas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

12 Feb, 2007

15 commits


11 Feb, 2007

1 commit

  • Some partitioning systems create special partitions that
    span the entire disk. One example is Sun partitions, where
    this whole-disk partition exists to tell the firmware the
    extent of the entire device so it can load the boot block
    and do other things.

    Such partitions should not be treated as normal partitions,
    because all the other partitions overlap this whole-disk one.
    So we'd see multiple instances of the same UUID etc. which
    we do not want. udev and friends can thus search for this
    'whole_disk' attribute and use it to decide to ignore the
    partition.

    Signed-off-by: Fabio Massimo Di Nitto
    Signed-off-by: David S. Miller

    Fabio Massimo Di Nitto
     

10 Feb, 2007

1 commit

  • It is possible for raid5 to be sent a bio that is too big for an underlying
    device. So if it is a READ that we pass straight down to a device, it will
    fail and confuse RAID5.

    So in 'chunk_aligned_read' we check that the bio fits within the parameters
    for the target device and if it doesn't fit, fall back on reading through
    the stripe cache and making lots of one-page requests.

    Note that this is the earliest time we can check against the device because
    earlier we don't have a lock on the device, so it could change underneath
    us.

    Also, the code for handling a retry through the cache when a read fails has
    not been tested and was badly broken. This patch fixes that code.

    Signed-off-by: Neil Brown
    Cc: "Kai"
    Cc:
    Cc:
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
     

30 Jan, 2007

1 commit

  • Commit 85e04e371b5a321b5df2bc3f8e0099a64fb087d7 cleaned up the timeout
    conversion, but did it exactly the wrong way. We get msecs from user
    space, and should convert them into jiffies. Not the other way around.

    Here is a fix with the overflow check sg.c has added in. This fixes DVD
    burning with Nero.

    Signed-off-by: Mike Christie
    [ "you'll be wanting a comma there" - Andrew ]
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Christie
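
A rough user-space sketch of the conversion direction and the overflow guard discussed in the timeout commit above. It is not the kernel's msecs_to_jiffies() or the sg.c helper: ms_to_ticks() and ticks_to_ms() are hypothetical names, the tick rate is passed in as a parameter instead of using HZ, and the round-up and INT_MAX clamp are my own assumptions about reasonable behaviour.

```c
#include <assert.h>
#include <limits.h>

/*
 * Correct direction: user space hands us milliseconds, the timer wants ticks.
 * Rounds up so a short nonzero timeout never becomes zero, and saturates at
 * INT_MAX instead of overflowing.
 */
static unsigned int ms_to_ticks(unsigned int msecs, unsigned int hz)
{
    unsigned long long ticks;

    assert(hz > 0);
    ticks = ((unsigned long long)msecs * hz + 999ULL) / 1000ULL;
    if (ticks > (unsigned long long)INT_MAX)
        return (unsigned int)INT_MAX;
    return (unsigned int)ticks;
}

/* Wrong direction, shown only for contrast: treats the input as ticks. */
static unsigned long long ticks_to_ms(unsigned int ticks, unsigned int hz)
{
    assert(hz > 0);
    return (unsigned long long)ticks * 1000ULL / hz;
}
```

At 250 ticks per second, a 60000 ms timeout should become 15000 ticks. Pushing the same value through the wrong conversion gives 240000, four times the number that was passed in. At 1000 ticks per second both directions give the same number, so the mistake would be easy to miss on such a configuration.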
     

24 Jan, 2007

1 commit

  • A flag was recently added to the elevator code to avoid
    performing an unplug when requests are being re-queued.
    The goal of this flag was to avoid a deep recursion that
    can occur when re-queueing requests after a SCSI device/host
    reset. See http://lkml.org/lkml/2006/5/17/254

    However, that fix added the flag near the bottom of a case
    statement, where an earlier break (in an if statement) could
    transport one out of the case, without setting the flag.
    This patch sets the flag earlier in the case statement.

    I re-discovered the deep recursion recently during testing;
    I was told that it was a known problem, and the fix to it was
    in the kernel I was testing. Indeed it was ... but it didn't
    fix the bug. With the patch below, I no longer see the bug.

    Signed-off-by: Linas Vepstas
    Signed-off-by: Jens Axboe
    Cc: Chris Wright
    Signed-off-by: Linus Torvalds

    Linas Vepstas
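
The control-flow problem in the requeue commit above is easy to reproduce in a toy. The sketch below is not the elevator code: the enum, the function names and the ordered_flush_active parameter are all hypothetical, and only the position of the flag assignment relative to the early break is meant to match the description.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical insertion types. */
enum insert_where { INSERT_FRONT, INSERT_REQUEUE };

/* Buggy shape: the flag is cleared near the bottom of the case. */
static bool should_unplug_buggy(enum insert_where where, bool ordered_flush_active)
{
    bool unplug = true;

    switch (where) {
    case INSERT_REQUEUE:
        if (!ordered_flush_active)
            break;              /* early exit skips the assignment below */
        unplug = false;
        break;
    case INSERT_FRONT:
        break;
    default:
        assert(!"unknown insertion type");
        break;
    }
    return unplug;
}

/* Fixed shape: the flag is cleared before any path can leave the case. */
static bool should_unplug_fixed(enum insert_where where, bool ordered_flush_active)
{
    bool unplug = true;

    switch (where) {
    case INSERT_REQUEUE:
        unplug = false;         /* set first, so every exit path sees it */
        if (!ordered_flush_active)
            break;
        /* ordered-flush specific work would go here */
        break;
    case INSERT_FRONT:
        break;
    default:
        assert(!"unknown insertion type");
        break;
    }
    return unplug;
}
```

For a requeue with no ordered flush in progress, the buggy variant still asks for an unplug while the fixed one does not. Both agree on every other input.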
     

03 Jan, 2007

1 commit

  • Two issues:

    - The final return 1 should be a return 0, otherwise comparing cfqq is
    a noop.

    - bio_sync() only checks the sync flag, while rq_is_sync() checks both
    for READ and sync. The latter is what we want. Expand the bio check
    to include reads, and relax the restriction to allow merging of async
    io into sync requests.

    In the future we want to clean up the SYNC logic; right now it means
    both sync request (such as READ and O_DIRECT WRITE) and unplug-on-issue.
    Leave that for later.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
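
The two issues in the CFQ merge commit above, sketched as a toy. This is a model under assumptions rather than the CFQ code: struct toy_io, mk_io(), io_is_sync() and toy_allow_merge() are hypothetical names, and an integer queue id stands in for the cfqq pointer comparison.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for both a request and a bio. */
struct toy_io {
    bool is_read;     /* data direction is READ */
    bool sync_flag;   /* explicit sync flag, e.g. an O_DIRECT write */
    int  queue_id;    /* which per-process queue the io belongs to */
};

static struct toy_io mk_io(bool is_read, bool sync_flag, int queue_id)
{
    struct toy_io io = { is_read, sync_flag, queue_id };

    assert(queue_id >= 0);
    return io;
}

/* Sync means READ or explicitly flagged sync, not the flag alone. */
static bool io_is_sync(struct toy_io io)
{
    return io.is_read || io.sync_flag;
}

static bool toy_allow_merge(struct toy_io rq, struct toy_io bio)
{
    /* Refuse to merge sync io into an async request; the reverse is allowed. */
    if (io_is_sync(bio) && !io_is_sync(rq))
        return false;

    /* The final answer depends on the queue comparison (the "return 0" fix). */
    return bio.queue_id == rq.queue_id;
}
```

In this model, a read is kept out of an async write request even when both belong to the same queue, an async write may join a sync request in the same queue, and io from a different queue is always refused.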
     

23 Dec, 2006

2 commits


22 Dec, 2006

1 commit

  • The recent io scheduler allow_merge commit left the block layer with
    no merging, oops. This patch fixes that up.

    That means the CFQ change needs to be verified again, it might not fix
    the original bug now. But that's a separate thing, I'll double check
    that tomorrow.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

20 Dec, 2006

1 commit

  • Currently we allow any merge, even if the io originates from different
    processes. This can cause really bad starvation and unfairness, if those
    ios happen to be synchronous (reads or direct writes).

    So add an allow_merge hook to the io scheduler ops, so an io scheduler can
    help decide whether a bio/process combination may be merged with an
    existing request.

    Signed-off-by: Jens Axboe

    Jens Axboe
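
The shape of an optional veto hook like the allow_merge one above, as a user-space sketch. None of these types are the kernel's: struct toy_request, struct toy_bio, struct toy_elevator_ops, same_owner_only() and toy_may_merge() are hypothetical, and the owner pid is just an assumed way of telling processes apart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_request { int owner_pid; };
struct toy_bio     { int owner_pid; };

struct toy_elevator_ops {
    /* Optional: return false to veto a merge. NULL means "no opinion". */
    bool (*allow_merge)(const struct toy_request *rq, const struct toy_bio *bio);
};

/* Example policy: only merge io that comes from the same process. */
static bool same_owner_only(const struct toy_request *rq, const struct toy_bio *bio)
{
    return rq->owner_pid == bio->owner_pid;
}

/* Core-side helper: ask the scheduler if it has a hook, otherwise allow. */
static bool toy_may_merge(const struct toy_elevator_ops *ops,
                          const struct toy_request *rq,
                          const struct toy_bio *bio)
{
    assert(ops != NULL);
    if (ops->allow_merge)
        return ops->allow_merge(rq, bio);
    return true;
}
```

The point of the NULL default is that a scheduler without the hook keeps the old merge-anything behaviour, and only a scheduler that installs a policy can refuse a merge.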
     

19 Dec, 2006

5 commits


13 Dec, 2006

1 commit