06 Dec, 2008

1 commit

  • There's no point in having SG_IO timeouts that are too short, since if
    the command does end up timing out, we'll end up going through the reset
    sequence that is several seconds long in order to abort the command that
    timed out.

    As a result, shorter timeouts than a few seconds simply do not make
    sense, as the recovery would be longer than the timeout itself.

    Add a BLK_MIN_SG_TIMEOUT to match the existing BLK_DEFAULT_SG_TIMEOUT.

    Suggested-by: Alan Cox
    Acked-by: Tejun Heo
    Acked-by: Jens Axboe
    Cc: Jeff Garzik
    Signed-off-by: Linus Torvalds

    Linus Torvalds
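
    A minimal userspace sketch of the clamping idea described above. The
    helper name clamp_sg_timeout() and the concrete values (1000 ticks per
    second, a 7 second floor, a 60 second default) are assumptions made for
    illustration; they are not taken from the patch itself.

```c
/* Assumed values, for illustration only. */
#define TOY_HZ                 1000L
#define TOY_DEFAULT_SG_TIMEOUT (60 * TOY_HZ)
#define TOY_MIN_SG_TIMEOUT     (7 * TOY_HZ)

/*
 * Hypothetical helper: pick the timeout (in ticks) to use for a command.
 * An unset timeout falls back to the default, and anything shorter than
 * the floor is raised to it, since recovery from a timed-out command
 * would take longer than such a timeout anyway.
 */
static long clamp_sg_timeout(long requested)
{
    if (requested <= 0)
        return TOY_DEFAULT_SG_TIMEOUT;
    if (requested < TOY_MIN_SG_TIMEOUT)
        return TOY_MIN_SG_TIMEOUT;
    return requested;
}
```

    With these assumed values, a request for 500 ticks is raised to 7000,
    while a request for 30000 ticks is left unchanged.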
     

04 Dec, 2008

2 commits

  • Update FMODE_NDELAY before each ioctl call so that we can kill the
    magic FMODE_NDELAY_NOW. It would be even better to do this directly
    in setfl(), but for that we'd need to have FMODE_NDELAY for all files,
    not just block special files.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Commit 33c2dca4957bd0da3e1af7b96d0758d97e708ef6 (trim file propagation
    in block/compat_ioctl.c) removed the handling of some ioctls from
    compat_blkdev_driver_ioctl. That caused them to be rejected as unknown
    by the compat layer.

    Signed-off-by: Andreas Schwab
    Cc: Al Viro
    Signed-off-by: Al Viro

    Andreas Schwab
     

03 Dec, 2008

4 commits

  • Fix setting of max_segment_size and seg_boundary mask for stacked md/dm
    devices.

    When stacking devices (LVM over MD over SCSI), some of the request queue
    parameters are not set up correctly by default in some cases, namely
    max_segment_size and seg_boundary_mask.

    If you create an MD device over SCSI, these attributes are zeroed.

    The problem arises when another device-mapper mapping is stacked on top
    of this one. With DM on top, the queue attributes end up set this way:

    request_queue   max_segment_size   seg_boundary_mask
    SCSI            65536              0xffffffff
    MD RAID1        0                  0
    LVM             65536              -1 (64bit)

    Unfortunately bio_add_page (resp. bio_phys_segments) calculates the
    number of physical segments according to these parameters.

    During generic_make_request() the segment count is recalculated and can
    increase bio->bi_phys_segments over the allowed limit (after bio_clone()
    in the stacking operation).

    This is especially a problem in the CCISS driver, where it produces an
    oops here:

    BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);

    (MAXSGENTRIES is 31 by default.)

    Sometimes even this command is enough to cause an oops:

    dd iflag=direct if=/dev// of=/dev/null bs=128000 count=10

    This command generates bios with 250 sectors, allocated in 32 4k-pages
    (last page uses only 1024 bytes).

    For the LVM layer, it allocates a bio with 31 segments (still OK for
    CCISS); unfortunately, on the lower layer it is recalculated to 32
    segments, which violates the CCISS restriction and triggers BUG_ON().

    The patch tries to fix it by:

    * initializing the attributes above in the request queue constructor
    blk_queue_make_request()

    * making sure that blk_queue_stack_limits() inherits the settings

    (DM uses its own function to set the limits because
    blk_queue_stack_limits() was introduced later. It should probably switch
    to use the generic stack limit function too.)

    * setting the default seg_boundary value in one place (blkdev.h)

    * using this mask as the default in DM (instead of -1, which differs in
    64bit)

    Bugs related to this:
    https://bugzilla.redhat.com/show_bug.cgi?id=471639
    http://bugzilla.kernel.org/show_bug.cgi?id=8672

    Signed-off-by: Milan Broz
    Reviewed-by: Alasdair G Kergon
    Cc: Neil Brown
    Cc: FUJITA Tomonori
    Cc: Tejun Heo
    Cc: Mike Miller
    Signed-off-by: Jens Axboe

    Milan Broz
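
    To make the dependence on the queue limits concrete, here is a toy
    segment counter in userspace C. It is not the kernel's code: the names
    (struct chunk, fill_pages, count_segments) are hypothetical, the merge
    rule is a simplification, and the pages are assumed to be physically
    contiguous starting at an arbitrary address, so the counts it produces
    are not the 31 and 32 quoted in the commit message. It only shows that
    the same 128000-byte buffer yields different segment counts under
    different max_segment_size and seg_boundary_mask values.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u

/* One physically contiguous piece of the I/O buffer. */
struct chunk {
    uint64_t phys;  /* assumed physical start address */
    uint32_t len;   /* length in bytes */
};

/* Split `total` bytes into page-sized chunks laid out back to back. */
static size_t fill_pages(struct chunk *c, size_t cap, uint64_t base,
                         uint32_t total)
{
    size_t n = 0;

    while (total > 0 && n < cap) {
        uint32_t len = total < TOY_PAGE_SIZE ? total : TOY_PAGE_SIZE;

        c[n].phys = base + (uint64_t)n * TOY_PAGE_SIZE;
        c[n].len = len;
        total -= len;
        n++;
    }
    return n;
}

/*
 * Count segments: a chunk joins the current segment only if it is
 * physically adjacent, the merged size stays within max_segment_size,
 * and the merged segment does not cross a seg_boundary_mask window.
 */
static unsigned count_segments(const struct chunk *c, size_t n,
                               uint32_t max_segment_size,
                               uint64_t seg_boundary_mask)
{
    unsigned nsegs = 0;
    uint64_t seg_start = 0;
    uint64_t seg_len = 0;

    for (size_t i = 0; i < n; i++) {
        if (nsegs > 0) {
            uint64_t last = c[i].phys + c[i].len - 1;
            int adjacent = (c[i].phys == seg_start + seg_len);
            int fits = (seg_len + c[i].len <= max_segment_size);
            int same_window = ((seg_start | seg_boundary_mask) ==
                               (last | seg_boundary_mask));

            if (adjacent && fits && same_window) {
                seg_len += c[i].len;
                continue;
            }
        }
        nsegs++;
        seg_start = c[i].phys;
        seg_len = c[i].len;
    }
    return nsegs;
}
```

    For a 128000-byte buffer, fill_pages() produces 32 chunks, the last one
    1024 bytes long. Under the assumed contiguous layout, count_segments()
    returns 2 with the SCSI-style limits (65536, 0xffffffff) and 32 with
    zeroed limits, because a zero max_segment_size prevents any merging.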
     
  • blkdev_dequeue_request() and elv_dequeue_request() are equivalent and
    both start the timeout timer. Barrier code dequeues the original
    barrier request but doesn't pass the request itself to the lower level
    driver, only broken down proxy requests; however, the original
    barrier request goes through the same dequeue path and the timeout
    timer is started on it. If the barrier sequence takes long enough, this
    timer expires but the low level driver has no idea about this request
    and an oops follows.

    The timeout timer shouldn't have been started on the original barrier
    request as it never goes through actual IO. This patch unexports
    elv_dequeue_request(), which has no external user anyway, and makes it
    operate on the elevator proper w/o adding the timer, and makes
    blkdev_dequeue_request() call elv_dequeue_request() and add the timer.
    Internal users which don't pass the request to the driver - barrier code
    and end_that_request_last() - are converted to use
    elv_dequeue_request().

    Signed-off-by: Tejun Heo
    Cc: Mike Anderson
    Signed-off-by: Jens Axboe

    Tejun Heo
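
    The split described above can be sketched with a toy model in userspace
    C. The struct and the toy_ functions are hypothetical stand-ins, not the
    kernel's definitions; the point is only that the internal dequeue leaves
    the timer alone while the driver-facing dequeue arms it.

```c
#include <stdbool.h>

/* Toy request: just enough state to show the two dequeue paths. */
struct toy_request {
    bool queued;       /* still on the elevator queue */
    bool timer_armed;  /* timeout timer running */
};

/* Internal dequeue: take the request off the queue, never arm the timer. */
static void toy_elv_dequeue(struct toy_request *rq)
{
    rq->queued = false;
}

/* Driver-facing dequeue: the request is about to be issued, so arm the timer. */
static void toy_blkdev_dequeue(struct toy_request *rq)
{
    toy_elv_dequeue(rq);
    rq->timer_armed = true;
}
```

    In this model a barrier request that is only dequeued internally never
    gets a timer, so nothing can expire on a request the driver has not seen.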
     
  • disk->node_id is referenced during allocation in disk_expand_part_tbl(),
    so we should set it before that point.

    Signed-off-by: Cheng Renquan
    Signed-off-by: Jens Axboe

    Cheng Renquan
     
  • mapping. Which is good if pages were mapped - but if they were provided
    by someone else and just copied then bad things happen - pages are
    released once here, and once by caller, leading to user triggerable BUG
    at include/linux/mm.h:246.

    Signed-off-by: Petr Vandrovec
    Signed-off-by: Jens Axboe

    Petr Vandrovec
     

18 Nov, 2008

3 commits

  • If the size passed in is OK but we end up mapping too many segments,
    we call the unmap path directly like from IO completion. But from IO
    completion we have an extra reference to the bio, so this error case
    goes OOPS when it attempts to free an already freed bio.

    Fix it by getting an extra reference to the bio before calling the
    unmap failure case.

    Reported-by: Petr Vandrovec

    Signed-off-by: Jens Axboe

    Jens Axboe
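
    A toy reference-counting model of this failure, in userspace C. All
    names are hypothetical and the sequence of drops is a simplification of
    what the commit message describes: the shared unmap path drops two
    references, which is only safe when the caller holds an extra one.

```c
/* Toy bio with a reference count and counters to observe misuse. */
struct toy_bio {
    int refcount;
    int frees;     /* how many times the object was freed */
    int bad_puts;  /* puts on an already freed object */
};

static void toy_bio_get(struct toy_bio *b)
{
    b->refcount++;
}

static void toy_bio_put(struct toy_bio *b)
{
    if (b->refcount <= 0) {
        b->bad_puts++;  /* would be a use-after-free in real code */
        return;
    }
    if (--b->refcount == 0)
        b->frees++;
}

/* Shared unmap path: assumes the completion-time extra reference exists. */
static void toy_unmap_path(struct toy_bio *b)
{
    toy_bio_put(b);  /* reference dropped when the bio is ended */
    toy_bio_put(b);  /* reference dropped by the unmap itself */
}

/* Error path: with take_extra_ref set, it mirrors the fix described above. */
static void toy_error_path(struct toy_bio *b, int take_extra_ref)
{
    if (take_extra_ref)
        toy_bio_get(b);
    toy_unmap_path(b);
}
```

    Starting from a single reference, the path without the extra get ends
    with one free and one bad put; with the extra get it ends with exactly
    one free and no bad puts.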
     
  • We ran into system boot failures with kernel 2.6.28-rc. We found them on
    a couple of machines, including a T61 notebook, a nehalem machine, and
    another HPC NX6325 notebook. All the machines use FedoraCore 8 or
    FedoraCore 9. With kernels prior to 2.6.28-rc, system boot doesn't fail.

    I debugged it and located the root cause. Please see
    http://bugzilla.kernel.org/show_bug.cgi?id=11899
    https://bugzilla.redhat.com/show_bug.cgi?id=471517

    As a matter of fact, there are 2 bugs.

    1) root=/dev/sda1: system boot fails randomly, roughly once in every 5
    boots. nash has a bug. Some of its functions misuse the return
    value 0. Sometimes 0 means timeout and no uevent available. Sometimes
    0 means nash got a uevent, but the uevent isn't block-related (for
    example, usb). If by coincidence the kernel tells nash that uevents are
    available, but the kernel also sets the timeout, nash might stop
    collecting other uevents in the queue if the current uevent isn't
    block-related. I worked out a patch for nash to fix it.
    http://bugzilla.kernel.org/attachment.cgi?id=18858

    2) root=LABEL=/: system can never boot. initrd init reports that
    switchroot fails. Here is the execution path of nash when booting:
    (1) nash reads /sys/block/sda/dev; assume the major is 8 (on my desktop)
    (2) nash queries /proc/devices with the major number; it finds the line
    "8 sd";
    (3) nash uses 'sd' to search its own probe table to find the device
    (DISK) type for the device and adds it to its own list;
    (4) Later on, it probes all devices in its list to get filesystem
    labels; scsi always registers "8 sd".

    When the major is 259, nash fails to find the device (DISK) type. I
    enabled CONFIG_DEBUG_BLOCK_EXT_DEVT=y when compiling the kernel, so 259
    is picked for device /dev/sda1, which causes nash to fail to find the
    device (DISK) type.

    To fix issue 2), I created a patch for nash and another patch for the
    kernel.

    http://bugzilla.kernel.org/attachment.cgi?id=18859
    http://bugzilla.kernel.org/attachment.cgi?id=18837

    Below is the patch for kernel 2.6.28-rc4. It registers blkext, a new
    block device in /proc/devices.

    With 2 patches on nash and 1 patch on the kernel, I booted my machines
    dozens of times without failure.

    Signed-off-by: Zhang Yanmin
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Zhang, Yanmin
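
    The lookup in step (2) above can be sketched in userspace C. This is
    not nash's code: find_block_driver() is a hypothetical helper, and the
    sample text in the usage note is made up to resemble /proc/devices. It
    shows why a major with no entry in the block section (such as 259
    before "blkext" was registered) leaves the tool without a driver name.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: search the "Block devices:" section of a
 * /proc/devices-style text for `major` and copy the driver name into
 * `name`. Returns 1 if found, 0 otherwise.
 */
static int find_block_driver(const char *text, unsigned major,
                             char *name, size_t cap)
{
    const char *p = strstr(text, "Block devices:");

    if (!p || cap == 0)
        return 0;

    /* Walk line by line, starting after the section header. */
    p = strchr(p, '\n');
    while (p && *++p) {
        unsigned m;
        char drv[64];

        if (sscanf(p, "%u %63s", &m, drv) == 2 && m == major) {
            snprintf(name, cap, "%s", drv);
            return 1;
        }
        p = strchr(p, '\n');
    }
    return 0;
}
```

    Given a text whose block section lists "8 sd" and "259 blkext", the
    helper resolves major 8 to "sd" and major 259 to "blkext"; a major that
    is not listed returns 0.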
     
  • Make add_partition() return pointer to the new hd_struct on success
    and ERR_PTR() value on failure. This change will be used to fix md
    autodetection bug.

    Signed-off-by: Tejun Heo
    Cc: Neil Brown
    Signed-off-by: Jens Axboe

    Tejun Heo
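
    The return convention described above (a valid pointer on success, an
    error code encoded in the pointer on failure) can be sketched in
    userspace C. The toy_ helpers imitate the idea behind ERR_PTR() and its
    companions; toy_add_partition() and its fixed-size table are
    hypothetical and only serve to exercise the convention.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Error codes are assumed to fit in the top 4095 values of the address space. */
#define TOY_MAX_ERRNO 4095

static inline void *toy_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static inline long toy_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

static inline int toy_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_part {
    int partno;
};

static struct toy_part toy_table[4];
static int toy_used[4];

/* Hypothetical: returns the new partition, or an encoded errno on failure. */
static struct toy_part *toy_add_partition(int partno)
{
    if (partno < 0 || partno >= 4)
        return toy_err_ptr(-EINVAL);
    if (toy_used[partno])
        return toy_err_ptr(-EBUSY);

    toy_used[partno] = 1;
    toy_table[partno].partno = partno;
    return &toy_table[partno];
}
```

    A caller checks the result with toy_is_err() before dereferencing it
    and recovers the error code with toy_ptr_err(); adding the same
    partition twice yields -EBUSY on the second call in this model.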
     

06 Nov, 2008

4 commits

  • This patch (as1159b) changes the timeout routines in the block core to
    use round_jiffies_up(). There's no point in rounding the timer
    deadline down, since if it expires too early we will have to restart
    it.

    The patch also removes some unnecessary tests when a request is
    removed from the queue's timer list.

    Signed-off-by: Alan Stern
    Signed-off-by: Jens Axboe

    Alan Stern
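
    A simplified illustration of why rounding a deadline up is the safe
    direction. The two helpers below are hypothetical and round to a whole
    multiple of an assumed tick rate; the kernel's round_jiffies_up() has
    its own rounding rules, which this sketch does not try to reproduce.

```c
/* Round a deadline down to a multiple of hz: may fire before the deadline. */
static unsigned long toy_round_down(unsigned long deadline, unsigned long hz)
{
    return deadline - (deadline % hz);
}

/* Round a deadline up to a multiple of hz: never fires before the deadline. */
static unsigned long toy_round_up(unsigned long deadline, unsigned long hz)
{
    unsigned long rem = deadline % hz;

    return rem ? deadline + (hz - rem) : deadline;
}
```

    With hz assumed to be 1000, a deadline of 1999 rounds down to 1000,
    which is 999 ticks early and would force the timer to be restarted,
    whereas rounding up gives 2000.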
     
  • Move the call to blk_delete_timer to later in end_that_request_last to
    address an issue where blkdev_dequeue_request may have added a timer
    for the request.

    Signed-off-by: Mike Anderson
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Mike Anderson
     
  • The block queue supports two usage models - one where the block driver
    peeks at the front of the queue using elv_next_request(), processes it
    and finishes it, and the other where the block driver peeks at the front
    of the queue, dequeues the request using blkdev_dequeue_request() and
    finishes it. The latter is more flexible as it allows the driver to
    process multiple commands concurrently.

    These two inconsistent usage models make the block layer
    implementation confusing. For some, elv_next_request() is considered
    the issue point while others consider blkdev_dequeue_request() the
    issue point.

    Till now the inconsistency mostly affected only accounting, so it didn't
    really break anything seriously; however, with the block layer timeout,
    this inconsistency hits hard. The block layer considers
    elv_next_request() the issue point and adds the timer, but the SCSI
    layer thinks it was just peeking and, when it can't process the
    command right away, the request is just left there without further
    processing. This leaves the request dangling on the timer list and, when
    the timer goes off, the request which the SCSI layer and below think is
    still on the block queue ends up in the EH queue, causing various
    problems - EH hang (failed count goes over busy count and EH never wakes
    up), WARN_ON() and oopses as the low level driver tries to handle the
    unknown command, etc., depending on the timing.

    As the SCSI midlayer is the only user of the block layer timer at the
    moment, moving blk_add_timer() to elv_dequeue_request() fixes the
    problem; however, these two usage models definitely need to be cleaned
    up in the future.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Signed-off-by: FUJITA Tomonori
    Signed-off-by: Jens Axboe

    FUJITA Tomonori
     

24 Oct, 2008

2 commits

  • * 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc: (35 commits)
    proc: remove fs/proc/proc_misc.c
    proc: move /proc/vmcore creation to fs/proc/vmcore.c
    proc: move pagecount stuff to fs/proc/page.c
    proc: move all /proc/kcore stuff to fs/proc/kcore.c
    proc: move /proc/schedstat boilerplate to kernel/sched_stats.h
    proc: move /proc/modules boilerplate to kernel/module.c
    proc: move /proc/diskstats boilerplate to block/genhd.c
    proc: move /proc/zoneinfo boilerplate to mm/vmstat.c
    proc: move /proc/vmstat boilerplate to mm/vmstat.c
    proc: move /proc/pagetypeinfo boilerplate to mm/vmstat.c
    proc: move /proc/buddyinfo boilerplate to mm/vmstat.c
    proc: move /proc/vmallocinfo to mm/vmalloc.c
    proc: move /proc/slabinfo boilerplate to mm/slub.c, mm/slab.c
    proc: move /proc/slab_allocators boilerplate to mm/slab.c
    proc: move /proc/interrupts boilerplate code to fs/proc/interrupts.c
    proc: move /proc/stat to fs/proc/stat.c
    proc: move rest of /proc/partitions code to block/genhd.c
    proc: move /proc/cpuinfo code to fs/proc/cpuinfo.c
    proc: move /proc/devices code to fs/proc/devices.c
    proc: move rest of /proc/locks to fs/locks.c
    ...

    Linus Torvalds
     
  • Variable 'ret' is no longer used. Don't declare it.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Oct, 2008

2 commits


21 Oct, 2008

12 commits


18 Oct, 2008

2 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: remove __generic_unplug_device() from exports
    block: move q->unplug_work initialization
    blktrace: pass zfcp driver data
    blktrace: add support for driver data
    block: fix current kernel-doc warnings
    block: only call ->request_fn when the queue is not stopped
    block: simplify string handling in elv_iosched_store()
    block: fix kernel-doc for blk_alloc_devt()
    block: fix nr_phys_segments miscalculation bug
    block: add partition attribute for partition number
    block: add BIG FAT WARNING to CONFIG_DEBUG_BLOCK_EXT_DEVT
    softirq: Add support for triggering softirq work on softirqs.

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (39 commits)
    [SCSI] sd: fix compile failure with CONFIG_BLK_DEV_INTEGRITY=n
    libiscsi: fix locking in iscsi_eh_device_reset
    libiscsi: check reason why we are stopping iscsi session to determine error value
    [SCSI] iscsi_tcp: return a descriptive error value during connection errors
    [SCSI] libiscsi: rename host reset to target reset
    [SCSI] iscsi class: fix endpoint id handling
    [SCSI] libiscsi: Support drivers initiating session removal
    [SCSI] libiscsi: fix data corruption when target has to resend data-in packets
    [SCSI] sd: Switch kernel printing level for DIF messages
    [SCSI] sd: Correctly handle all combinations of DIF and DIX
    [SCSI] sd: Always print actual protection_type
    [SCSI] sd: Issue correct protection operation
    [SCSI] scsi_error: fix target reset handling
    [SCSI] lpfc 8.2.8 v2 : Add statistical reporting control and additional fc vendor events
    [SCSI] lpfc 8.2.8 v2 : Add sysfs control of target queue depth handling
    [SCSI] lpfc 8.2.8 v2 : Revert target busy in favor of transport disrupted
    [SCSI] scsi_dh_alua: remove REQ_NOMERGE
    [SCSI] lpfc 8.2.8 : update driver version to 8.2.8
    [SCSI] lpfc 8.2.8 : Add MSI-X support
    [SCSI] lpfc 8.2.8 : Update driver to use new Host byte error code DID_TRANSPORT_DISRUPTED
    ...

    Linus Torvalds
     

17 Oct, 2008

8 commits

  • The only out-of-core user is IDE, and that should be using
    blk_start_queueing() instead.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • modprobe loop; rmmod loop effectively creates a blk_queue and destroys it
    which results in q->unplug_work being canceled without it ever being
    initialized.

    Therefore, move the initialization of q->unplug_work from
    blk_queue_make_request() to blk_alloc_queue*().

    Reported-by: Alexey Dobriyan
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Jens Axboe

    Peter Zijlstra
     
  • Fix block kernel-doc warnings:

    Warning(linux-2.6.27-git4//fs/block_dev.c:1272): No description found for parameter 'path'
    Warning(linux-2.6.27-git4//block/blk-core.c:1021): No description found for parameter 'cpu'
    Warning(linux-2.6.27-git4//block/blk-core.c:1021): No description found for parameter 'part'
    Warning(/var/linsrc/linux-2.6.27-git4//block/genhd.c:544): No description found for parameter 'partno'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Jens Axboe

    Randy Dunlap
     
  • Callers should use either blk_run_queue/__blk_run_queue, or
    blk_start_queueing() to invoke request handling instead of calling
    ->request_fn() directly as that does not take the queue stopped
    flag into account.

    Also add appropriate comments on the above functions to detail
    their usage.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • strlcpy() guarantees the dest buffer is NUL terminated.

    Signed-off-by: Li Zefan
    Signed-off-by: Jens Axboe

    Li Zefan
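
    A small userspace sketch of the guarantee this commit relies on.
    toy_strlcpy() is a hypothetical stand-in that follows the usual
    strlcpy() contract (copy at most size - 1 bytes, always terminate when
    size is non-zero, return the length of the source); it is not the
    kernel's implementation.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy src into dst, which holds `size` bytes. The result is always
 * NUL terminated when size > 0. Returns strlen(src) so that callers
 * can detect truncation by comparing the result with size.
 */
static size_t toy_strlcpy(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);

    if (size > 0) {
        size_t n = (len >= size) ? size - 1 : len;

        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}
```

    Because the destination is terminated by the copy itself, a caller does
    not need to write a trailing '\0' by hand afterwards.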
     
  • No argument 'gfp_mask' for blk_alloc_devt().

    Signed-off-by: Li Zefan
    Signed-off-by: Jens Axboe

    Li Zefan
     
  • This fixes the bug reported by Nikanth Karthikesan:

    http://lkml.org/lkml/2008/10/2/203

    The root cause of the bug is that blk_phys_contig_segment
    miscalculates q->max_segment_size.

    blk_phys_contig_segment checks:

    req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size

    But blk_recalc_rq_segments might expect that req->biotail and the
    previous bio in the req are supposed to be merged into one
    segment. blk_recalc_rq_segments might also expect that next_req->bio
    and the next bio in the next_req are supposed to be merged into one
    segment. In such a case, we merge two requests that can't be merged
    here. Later, blk_rq_map_sg gives more segments than it should.

    We need to keep track of segment size in blk_recalc_rq_segments and
    use it to see if two requests can be merged. This patch implements it
    in a way similar to what we used to do for hw merging (virtual
    merging).

    Signed-off-by: FUJITA Tomonori
    Signed-off-by: Jens Axboe

    FUJITA Tomonori
     
  • Now that device_create() has been audited, rename things back to the
    original call to be sane.

    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman