07 Jan, 2009

1 commit


31 Dec, 2008

1 commit


30 Dec, 2008

1 commit


29 Dec, 2008

22 commits

  • Original patch from Nikanth Karthikesan

    When a queue exits the queue lock is taken and cfq_exit_queue() would free all
    the cic's associated with the queue.

    But when a task exits, cfq_exit_io_context() gets cic one by one and then
    locks the associated queue to call __cfq_exit_single_io_context. It looks like
    between getting a cic from the ioc and locking the queue, the queue might have
    exited on another cpu.

    Fix this by rechecking the cfq_io_context queue key inside the queue lock
    again, and not calling into __cfq_exit_single_io_context() if somebody
    beat us to it.

    Signed-off-by: Jens Axboe

    Jens Axboe
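
    The recheck-under-lock pattern described above can be sketched in
    user-space C. This is a minimal single-threaded model, not the CFQ code:
    the struct names, the integer "lock", and every function here are
    hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the request queue and the per-task context. */
struct ex_queue {
    int locked;     /* models the queue lock: 1 while held */
    int exits_run;  /* how many times the exit helper actually ran */
};

struct ex_io_context {
    struct ex_queue *key;  /* NULL once the context has been torn down */
};

void ex_queue_lock(struct ex_queue *q)   { assert(!q->locked); q->locked = 1; }
void ex_queue_unlock(struct ex_queue *q) { assert(q->locked);  q->locked = 0; }

/* Tear-down helper; must only run with the queue lock held. */
void ex_exit_single_locked(struct ex_queue *q, struct ex_io_context *cic)
{
    assert(q->locked);
    cic->key = NULL;
    q->exits_run++;
}

/*
 * Task-exit path: the key was read before the lock was taken, so it may be
 * stale. Recheck it inside the lock and skip the tear-down if somebody else
 * already did it. Returns 1 if this call performed the tear-down.
 */
int ex_task_exit_one(struct ex_queue *q, struct ex_io_context *cic)
{
    int did_exit = 0;

    ex_queue_lock(q);
    if (cic->key == q) {
        ex_exit_single_locked(q, cic);
        did_exit = 1;
    }
    ex_queue_unlock(q);
    return did_exit;
}
```

    Calling ex_task_exit_one() twice on the same context runs the tear-down
    only once, which is the property the fix relies on.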
     
  • We have two separate config entries for large devices/files. One
    is CONFIG_LBD that guards just the devices, the other is CONFIG_LSF
    that handles large files. This doesn't make a lot of sense, you typically
    want both or none. So get rid of CONFIG_LSF and change CONFIG_LBD wording
    to indicate that it covers both.

    Acked-by: Jean Delvare
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Sparse asked whether these could be static.

    Signed-off-by: Roel Kluin
    Signed-off-by: Jens Axboe

    Roel Kluin
     
  • Zero is invalid for max_phys_segments, max_hw_segments, and
    max_segment_size. It's better to use min_not_zero() instead of
    min(). min() works, though, because commit 0e435ac makes sure that
    these values are set to their non-zero defaults if a queue is
    initialized properly.

    With this patch, blk_queue_stack_limits() does almost the same thing
    that dm's combine_restrictions_low() does. I think that it's easy to
    remove dm's combine_restrictions_low().

    Signed-off-by: FUJITA Tomonori
    Signed-off-by: Jens Axboe

    FUJITA Tomonori
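
    A minimal sketch of the idea, assuming zero means "limit not set". The
    function and struct below are illustrative user-space stand-ins, not the
    kernel's min_not_zero() macro or its queue limit structures.

```c
#include <assert.h>

/* Smaller of two limits, treating 0 as "not set" rather than as a value. */
unsigned int ex_min_not_zero(unsigned int a, unsigned int b)
{
    if (a == 0)
        return b;
    if (b == 0)
        return a;
    return a < b ? a : b;
}

/* Hypothetical subset of the limits that get combined when stacking. */
struct ex_limits {
    unsigned int max_phys_segments;
    unsigned int max_hw_segments;
    unsigned int max_segment_size;
};

/* Combine the top device's limits with those of the device below it. */
void ex_stack_limits(struct ex_limits *top, const struct ex_limits *bottom)
{
    assert(top && bottom);
    top->max_phys_segments =
        ex_min_not_zero(top->max_phys_segments, bottom->max_phys_segments);
    top->max_hw_segments =
        ex_min_not_zero(top->max_hw_segments, bottom->max_hw_segments);
    top->max_segment_size =
        ex_min_not_zero(top->max_segment_size, bottom->max_segment_size);
}
```

    With a plain min(), an unset (zero) limit on either side would win and
    produce an invalid zero limit on the stacked device.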
     
  • disk_map_sector_rcu() returns a partition from a sector offset,
    which we use for IO statistics on a per-partition basis. The
    lookup itself is an O(N) list lookup, where N is the number of
    partitions. This actually hurts performance quite a bit, even
    on the lower end partitions. On higher numbered partitions,
    it can get pretty bad.

    Solve this by adding a one-hit cache for partition lookup.
    This makes the lookup O(1) for the case where we do most IO to
    one partition. Even for mixed partition workloads, amortized cost
    is pretty close to O(1) since the natural IO batching makes the
    one-hit cache last for lots of IOs.

    Signed-off-by: Jens Axboe

    Jens Axboe
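
    The one-hit cache can be sketched as follows. This is a simplified
    user-space model: the structs, the scan counter, and returning NULL on a
    miss are assumptions made for illustration and do not mirror
    disk_map_sector_rcu() itself.

```c
#include <assert.h>
#include <stddef.h>

struct ex_part {
    unsigned long start;     /* first sector of the partition */
    unsigned long nr_sects;  /* length in sectors */
};

struct ex_disk {
    const struct ex_part *parts;
    size_t nr_parts;
    const struct ex_part *last_lookup;  /* one-hit cache, NULL when empty */
    unsigned long scans;                /* linear scans done, for illustration */
};

int ex_sector_in_part(const struct ex_part *p, unsigned long sector)
{
    return sector >= p->start && sector - p->start < p->nr_sects;
}

/* Map a sector to its partition, trying the last hit before scanning. */
const struct ex_part *ex_map_sector(struct ex_disk *d, unsigned long sector)
{
    size_t i;

    assert(d != NULL);

    /* Fast path: O(1) when IO keeps hitting the same partition. */
    if (d->last_lookup && ex_sector_in_part(d->last_lookup, sector))
        return d->last_lookup;

    /* Slow path: O(N) scan, then remember the result. */
    d->scans++;
    for (i = 0; i < d->nr_parts; i++) {
        if (ex_sector_in_part(&d->parts[i], sector)) {
            d->last_lookup = &d->parts[i];
            return d->last_lookup;
        }
    }
    return NULL;
}
```

    Repeated lookups in the same partition leave the scan counter unchanged;
    only a switch to a different partition pays for a new scan.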
     
  • This basically limits the hardware queue depth to 4*quantum at any
    point in time, which is 16 with the default settings. As CFQ uses
    other means to shrink the hardware queue when necessary in the first
    place, there's really no need for this extra heuristic. Additionally,
    it ends up hurting performance in some cases.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Just use struct elevator_queue everywhere instead.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We just want to hand the first bits of IO to the device as fast
    as possible. Gains a few percent on the IOPS rate.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Empty barrier on write-through (or no cache) w/ ordered tag has no
    command to execute and without any command to execute ordered tag is
    never issued to the device and the ordering is never achieved. Force
    draining for such cases.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Empty barrier required special handling in __elv_next_request() to
    complete it without letting the low level driver see it.

    With previous changes, barrier code is now flexible enough to skip the
    BAR step using the same barrier sequence selection mechanism. Drop
    the special handling and mask off q->ordered from start_ordered().

    Remove blk_empty_barrier() test which now has no user.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Barrier completion had the following assumptions.

    * start_ordered() couldn't finish the whole sequence properly. If all
    actions are to be skipped, q->ordseq is set correctly but the actual
    completion was never triggered thus hanging the barrier request.

    * Drain completion in elv_complete_request() assumed that there's
    always at least one request in the queue when drain completes.

    Both assumptions are true but these assumptions need to be removed to
    improve empty barrier implementation. This patch makes the following
    changes.

    * Make start_ordered() use blk_ordered_complete_seq() to mark skipped
    steps complete and notify __elv_next_request() that it should fetch
    the next request if the whole barrier has completed inside
    start_ordered().

    * Make drain completion path in elv_complete_request() check whether
    the queue is empty. Empty queue also indicates drain completion.

    * While at it, convert 0/1 return from blk_do_ordered() to false/true.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • In all barrier sequences, the barrier write itself was always assumed
    to be issued and thus didn't have a corresponding control flag. This
    patch adds QUEUE_ORDERED_DO_BAR and unifies action mask handling in
    start_ordered() such that any barrier action can be skipped.

    This patch doesn't introduce any visible behavior changes.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
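
    A rough sketch of how a unified action mask lets any step be skipped.
    The flag names follow the commit messages, but the numeric values and
    both helper functions are assumptions for illustration only.

```c
#include <assert.h>

/* Illustrative bit layout: ordering types in the low bits, actions above. */
enum {
    QUEUE_ORDERED_BY_DRAIN     = 0x01,
    QUEUE_ORDERED_BY_TAG       = 0x02,
    QUEUE_ORDERED_DO_PREFLUSH  = 0x10,
    QUEUE_ORDERED_DO_BAR       = 0x20,
    QUEUE_ORDERED_DO_POSTFLUSH = 0x40,
    QUEUE_ORDERED_DO_FUA       = 0x80
};

/* Count the steps that would be issued; cleared bits are simply skipped. */
int ex_count_issued_steps(unsigned int ordered)
{
    int steps = 0;

    if (ordered & QUEUE_ORDERED_DO_PREFLUSH)
        steps++;
    if (ordered & QUEUE_ORDERED_DO_BAR)
        steps++;
    if (ordered & QUEUE_ORDERED_DO_POSTFLUSH)
        steps++;
    return steps;
}

/* Hypothetical helper: an empty barrier carries no data, so drop the BAR step. */
unsigned int ex_mask_for_empty_barrier(unsigned int ordered)
{
    return ordered & ~(unsigned int)QUEUE_ORDERED_DO_BAR;
}
```

    Because the barrier write is now just another bit, a mode with every
    action cleared yields zero steps, which is the case the later
    empty-barrier changes have to complete correctly.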
     
  • * Because barrier mode can be changed dynamically, whether barrier is
    supported or not can be determined only when actually issuing the
    barrier and there is no point in checking it earlier. Drop barrier
    support check in generic_make_request() and __make_request(), and
    update comment around the support check in blk_do_ordered().

    * There is no reason to check discard support in both
    generic_make_request() and __make_request(). Drop the check in
    __make_request(). While at it, move error action block to the end
    of the function and add unlikely() to q existence test.

    * Barrier request, be it empty or not, is never passed to low level
    driver and thus it's meaningless to try to copy back req->sector to
    bio->bi_sector on error. In addition, the notion of failed sector
    doesn't make any sense for empty barrier to begin with. Drop the
    code block from __end_that_request_first().

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Separate out ordering type (drain, tag) and action masks (preflush,
    postflush, fua) from visible ordering mode selectors
    (QUEUE_ORDERED_*). Ordering types are now named QUEUE_ORDERED_BY_*
    while action masks are named QUEUE_ORDERED_DO_*.

    This change is necessary to add QUEUE_ORDERED_DO_BAR and make it
    optional to improve empty barrier implementation.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • After many improvements to kblockd_flush_work, it is now identical to
    cancel_work_sync, so calling cancel_work_sync directly is suggested.

    The only difference is that cancel_work_sync is a GPL-only symbol,
    so non-GPL modules can no longer use this path.

    Signed-off-by: Cheng Renquan
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Cheng Renquan
     
  • Allow the scsi request REQ_QUIET flag to be propagated to the buffer
    file system layer. The basic idea is to pass the flag from the scsi
    request to the bio (block IO) and then to the buffer layer. The buffer
    layer can then suppress needless printks.

    This patch declutters the kernel log by removing the 40-50 (per lun)
    buffer io error messages seen during a boot in my multipath setup. There
    is a good chance any real errors will be missed in the "noise" in the
    logs without this patch.

    During boot I see blocks of messages like
    "
    __ratelimit: 211 callbacks suppressed
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242847
    Buffer I/O error on device sdm, logical block 1
    Buffer I/O error on device sdm, logical block 5242878
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242872
    "
    in my logs.

    My disk environment is multipath fiber channel using the SCSI_DH_RDAC
    code and multipathd. This topology includes an "active" and "ghost"
    path for each lun. IOs to the "ghost" path will never complete, and the
    SCSI layer, via the scsi device handler rdac code, quickly returns the IOs
    to these paths and sets the REQ_QUIET scsi flag to suppress the scsi
    layer messages.

    I want to extend the QUIET behavior to include the buffer file
    system layer to deal with these errors as well. I have been running this
    patch for a while now on several boxes without issue. A few runs of
    bonnie++ show no noticeable difference in performance in my setup.

    Thanks to John Stultz for the quiet_error finalization.

    Submitted-by: Keith Mannthey
    Signed-off-by: Jens Axboe

    Keith Mannthey
     
  • There's no need to take queue_lock or kernel_lock when modifying
    bdi->ra_pages. So remove them. Also remove out of date comment for
    queue_max_sectors_store().

    Signed-off-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Wu Fengguang
     
  • There is no argument named @tags in blk_init_tags,
    so remove its comment.

    Signed-off-by: Qinghuang Feng
    Signed-off-by: Jens Axboe

    Qinghuang Feng
     
  • Convert the timeout ioctl scaling to use the clock_t functions
    which are much more accurate with some USER_HZ vs HZ combinations.

    Signed-off-by: Milton Miller
    Signed-off-by: Jens Axboe

    Milton Miller
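
    A small sketch of why scaling with a precomputed integer ratio can be
    inaccurate for some tick-rate combinations. The rates below (250 and
    100) and both function names are assumptions for illustration; this does
    not reproduce the kernel's clock_t helpers.

```c
#include <assert.h>

/* Assumed example tick rates; real values depend on kernel configuration. */
#define EX_HZ      250UL
#define EX_USER_HZ 100UL

/* Naive scaling: the ratio 250 / 100 truncates to 2 before multiplying. */
unsigned long ex_ticks_to_jiffies_naive(unsigned long user_ticks)
{
    return user_ticks * (EX_HZ / EX_USER_HZ);
}

/* Multiply first and divide last, using a wider intermediate value. */
unsigned long ex_ticks_to_jiffies_scaled(unsigned long user_ticks)
{
    return (unsigned long)((unsigned long long)user_ticks * EX_HZ / EX_USER_HZ);
}
```

    With these rates, one second of user ticks (100) becomes 200 jiffies in
    the naive version but 250 in the scaled one, a 20% error that vanishes
    only when one rate is an exact multiple of the other.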
     
  • Sync IO is often issued serialized. This means we'll be touching
    the queue timer for every IO, as opposed to only occasionally like we
    do for queued IO. Instead of deleting the timer when the last request
    is removed, just let it continue running. If a new request comes in soon,
    we then don't have to re-add the timer. If no new requests arrive,
    the timer will expire later without side effects.

    This improves high iops sync IO by ~1%.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Now the rq->deadline can't be zero if the request is in the
    timeout_list, so there is no need to have next_set. There is no need to
    access a request's deadline field if blk_rq_timed_out is called on it.

    Signed-off-by: Malahal Naineni
    Signed-off-by: Jens Axboe

    malahal@us.ibm.com
     

26 Dec, 2008

1 commit


19 Dec, 2008

1 commit


06 Dec, 2008

1 commit

  • There's no point in having too short SG_IO timeouts, since if the
    command does end up timing out, we'll end up going through the reset
    sequence that is several seconds long in order to abort the command
    that timed out.

    As a result, shorter timeouts than a few seconds simply do not make
    sense, as the recovery would be longer than the timeout itself.

    Add a BLK_MIN_SG_TIMEOUT to match the existing BLK_DEFAULT_SG_TIMEOUT.

    Suggested-by: Alan Cox
    Acked-by: Tejun Heo
    Acked-by: Jens Axboe
    Cc: Jeff Garzik
    Signed-off-by: Linus Torvalds

    Linus Torvalds
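
    A minimal sketch of clamping a requested timeout to a floor. The tick
    rate, the 60 second default, the 7 second minimum, and the function name
    are all assumptions for illustration; the commit message only says that
    the minimum should be a few seconds.

```c
#include <assert.h>

#define EX_HZ                 1000U          /* assumed ticks per second */
#define EX_DEFAULT_SG_TIMEOUT (60U * EX_HZ)  /* assumed default */
#define EX_MIN_SG_TIMEOUT     (7U * EX_HZ)   /* assumed floor */

/*
 * Pick the timeout for an SG_IO command: the caller's value if given,
 * else the queue's, else the default; never below the floor.
 */
unsigned int ex_effective_sg_timeout(unsigned int requested,
                                     unsigned int queue_timeout)
{
    unsigned int t = requested;

    if (t == 0)
        t = queue_timeout;
    if (t == 0)
        t = EX_DEFAULT_SG_TIMEOUT;
    if (t < EX_MIN_SG_TIMEOUT)
        t = EX_MIN_SG_TIMEOUT;
    return t;
}
```

    A request for half a second is raised to the floor, since recovery from
    a timed-out command would take longer than that anyway.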
     

05 Dec, 2008

1 commit


04 Dec, 2008

2 commits

  • Update FMODE_NDELAY before each ioctl call so that we can kill the
    magic FMODE_NDELAY_NOW. It would be even better to do this directly
    in setfl(), but for that we'd need to have FMODE_NDELAY for all files,
    not just block special files.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Commit 33c2dca4957bd0da3e1af7b96d0758d97e708ef6 (trim file propagation
    in block/compat_ioctl.c) removed the handling of some ioctls from
    compat_blkdev_driver_ioctl. That caused them to be rejected as unknown
    by the compat layer.

    Signed-off-by: Andreas Schwab
    Cc: Al Viro
    Signed-off-by: Al Viro

    Andreas Schwab
     

03 Dec, 2008

4 commits

  • Fix setting of max_segment_size and seg_boundary_mask for stacked md/dm
    devices.

    When stacking devices (LVM over MD over SCSI), some of the request queue
    parameters are not set up correctly by default in some cases, namely
    max_segment_size and seg_boundary_mask.

    If you create an MD device over SCSI, these attributes are zeroed.

    The problem appears when another device-mapper mapping is stacked on
    top of this one. The queue attributes then end up set like this:

    request_queue   max_segment_size   seg_boundary_mask
    SCSI            65536              0xffffffff
    MD RAID1        0                  0
    LVM             65536              -1 (64bit)

    Unfortunately bio_add_page (resp. bio_phys_segments) calculates the
    number of physical segments according to these parameters.

    During generic_make_request() the segment count is recalculated and can
    increase bio->bi_phys_segments over the allowed limit (after
    bio_clone() in the stacking operation).

    This is a problem especially in the CCISS driver, where it produces an
    oops here:

    BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);

    (MAXSGENTRIES is 31 by default.)

    Sometimes even this command is enough to cause oops:

    dd iflag=direct if=/dev// of=/dev/null bs=128000 count=10

    This command generates bios with 250 sectors, allocated in 32 4k-pages
    (last page uses only 1024 bytes).

    For the LVM layer, it allocates a bio with 31 segments (still OK for
    CCISS); unfortunately, on the lower layer it is recalculated to 32
    segments, which violates the CCISS restriction and triggers BUG_ON().

    The patch tries to fix it by:

    * initializing the attributes above in the request queue constructor
    blk_queue_make_request()

    * making sure that blk_queue_stack_limits() inherits the settings

    (DM uses its own function to set the limits because
    blk_queue_stack_limits() was introduced later. It should probably switch
    to the generic stack limit function too.)

    * setting the default seg_boundary value in one place (blkdev.h)

    * using this mask as the default in DM (instead of -1, which differs in 64bit)

    Bugs related to this:
    https://bugzilla.redhat.com/show_bug.cgi?id=471639
    http://bugzilla.kernel.org/show_bug.cgi?id=8672

    Signed-off-by: Milan Broz
    Reviewed-by: Alasdair G Kergon
    Cc: Neil Brown
    Cc: FUJITA Tomonori
    Cc: Tejun Heo
    Cc: Mike Miller
    Signed-off-by: Jens Axboe

    Milan Broz
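
    The numbers in the dd example above can be checked with a few lines of
    arithmetic. This sketch assumes 512-byte sectors, 4k pages, a
    page-aligned buffer, and one segment per page with no merging; the
    function names are hypothetical.

```c
#include <assert.h>

#define EX_SECTOR_SIZE 512UL
#define EX_PAGE_SIZE   4096UL

/* Number of whole sectors in a request of the given size. */
unsigned long ex_bytes_to_sectors(unsigned long bytes)
{
    return bytes / EX_SECTOR_SIZE;
}

/* Pages needed to hold the request, assuming a page-aligned buffer. */
unsigned long ex_pages_spanned(unsigned long bytes)
{
    return (bytes + EX_PAGE_SIZE - 1) / EX_PAGE_SIZE;
}

/* Bytes used in the final page of the request. */
unsigned long ex_last_page_bytes(unsigned long bytes)
{
    unsigned long rem;

    assert(bytes > 0);
    rem = bytes % EX_PAGE_SIZE;
    return rem ? rem : EX_PAGE_SIZE;
}

/* Mirrors the shape of the driver check: more segments than the limit. */
int ex_exceeds_limit(unsigned long segments, unsigned long limit)
{
    return segments > limit;
}
```

    For bs=128000 this gives 250 sectors spread over 32 pages, the last one
    holding 1024 bytes. If every page counts as its own segment, that is one
    more than a limit of 31.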
     
  • blkdev_dequeue_request() and elv_dequeue_request() are equivalent and
    both start the timeout timer. Barrier code dequeues the original
    barrier request but doesn't pass the request itself to the lower level
    driver, only broken down proxy requests; however, the original
    barrier request goes through the same dequeue path, so the timeout
    timer is started on it. If the barrier sequence takes long enough, this
    timer expires but the low level driver has no idea about this request
    and an oops follows.

    Timeout timer shouldn't have been started on the original barrier
    request as it never goes through actual IO. This patch unexports
    elv_dequeue_request(), which has no external user anyway, and makes it
    operate on the elevator proper w/o adding the timer, and makes
    blkdev_dequeue_request() call elv_dequeue_request() and add the timer.
    Internal users which don't pass the request to the driver - barrier code
    and end_that_request_last() - are converted to use
    elv_dequeue_request().

    Signed-off-by: Tejun Heo
    Cc: Mike Anderson
    Signed-off-by: Jens Axboe

    Tejun Heo
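
    The split described above can be sketched as two layered helpers. The
    struct and function names are hypothetical user-space stand-ins; only
    the layering (a plain dequeue, plus a wrapper that also arms the timer)
    is taken from the commit message.

```c
#include <assert.h>

struct ex_request {
    int queued;       /* 1 while the request sits in the elevator */
    int timer_armed;  /* 1 once a timeout timer has been started */
};

/* Elevator-only dequeue: removes the request, never touches the timer. */
void ex_elv_dequeue(struct ex_request *rq)
{
    assert(rq->queued);
    rq->queued = 0;
}

/* Driver-facing dequeue: the request is about to do real IO, so arm the timer. */
void ex_blkdev_dequeue(struct ex_request *rq)
{
    ex_elv_dequeue(rq);
    rq->timer_armed = 1;
}
```

    A request that never reaches the driver, such as the original barrier
    request, goes through ex_elv_dequeue() and so never has a timer that
    could expire on it.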
     
  • disk->node_id is referenced during allocation in disk_expand_part_tbl(),
    so we should set it before it is referenced.

    Signed-off-by: Cheng Renquan
    Signed-off-by: Jens Axboe

    Cheng Renquan
     
  • mapping. Which is good if pages were mapped - but if they were provided
    by someone else and just copied then bad things happen - pages are
    released once here, and once by caller, leading to user triggerable BUG
    at include/linux/mm.h:246.

    Signed-off-by: Petr Vandrovec
    Signed-off-by: Jens Axboe

    Petr Vandrovec
     

26 Nov, 2008

2 commits

  • Port to the new tracepoints API: split DEFINE_TRACE() and DECLARE_TRACE()
    sites. Spread them out to the usage sites, as suggested by
    Mathieu Desnoyers.

    Signed-off-by: Ingo Molnar
    Acked-by: Mathieu Desnoyers

    Ingo Molnar
     
  • This was a forward port of work done by Mathieu Desnoyers, I changed it to
    encode the 'what' parameter on the tracepoint name, so that one can register
    interest in specific events and not on classes of events to then check the
    'what' parameter.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: Jens Axboe
    Signed-off-by: Ingo Molnar

    Arnaldo Carvalho de Melo
     

18 Nov, 2008

3 commits

  • If the size passed in is OK but we end up mapping too many segments,
    we call the unmap path directly, as is done from IO completion. But from
    IO completion we have an extra reference to the bio, so this error case
    oopses when it attempts to free an already freed bio.

    Fix it by getting an extra reference to the bio before calling the
    unmap failure case.

    Reported-by: Petr Vandrovec

    Signed-off-by: Jens Axboe

    Jens Axboe
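
    A simplified reference-counting model of the bug and the fix. This is
    not the kernel's bio code: the struct, the counters, and the assumption
    that the unmap path drops exactly two references are illustrative only.

```c
#include <assert.h>

struct ex_bio {
    int refs;      /* live references */
    int frees;     /* times the last reference was dropped */
    int bad_puts;  /* puts on an already freed bio (an oops in a kernel) */
};

void ex_bio_init(struct ex_bio *b)
{
    assert(b);
    b->refs = 1;
    b->frees = 0;
    b->bad_puts = 0;
}

void ex_bio_get(struct ex_bio *b)
{
    b->refs++;
}

void ex_bio_put(struct ex_bio *b)
{
    if (b->refs == 0) {   /* already freed: record instead of crashing */
        b->bad_puts++;
        return;
    }
    if (--b->refs == 0)
        b->frees++;
}

/* Unmap path written for IO completion: expects one extra reference. */
void ex_unmap(struct ex_bio *b)
{
    ex_bio_put(b);  /* the reference held across IO */
    ex_bio_put(b);  /* the mapping's own reference */
}

/* Error path before the fix: only one reference exists. */
void ex_map_error_buggy(struct ex_bio *b)
{
    ex_unmap(b);
}

/* Error path after the fix: take the extra reference first. */
void ex_map_error_fixed(struct ex_bio *b)
{
    ex_bio_get(b);
    ex_unmap(b);
}
```

    In this model the buggy path frees the bio once and then puts it again,
    while the fixed path frees it exactly once.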
     
  • We ran into system boot failures with kernel 2.6.28-rc. We found them on
    a couple of machines, including a T61 notebook, a nehalem machine, and
    another HPC NX6325 notebook. All the machines use FedoraCore 8 or
    FedoraCore 9. With kernels prior to 2.6.28-rc, system boot doesn't fail.

    I debugged it and located the root cause. Please see
    http://bugzilla.kernel.org/show_bug.cgi?id=11899
    https://bugzilla.redhat.com/show_bug.cgi?id=471517

    As a matter of fact, there are 2 bugs.

    1) root=/dev/sda1, system boot randomly fails. Typically, out of 5 boots
    one fails. nash has a bug. Some of its functions misuse return
    value 0. Sometimes, 0 means timeout and no uevent available. Sometimes,
    0 means nash got a uevent, but the uevent isn't block-related (for
    example, usb). If, by coincidence, the kernel tells nash that uevents are
    available but the kernel also sets a timeout, nash might stop collecting
    other uevents in the queue if the current uevent isn't block-related. I
    worked out a patch for nash to fix it.
    http://bugzilla.kernel.org/attachment.cgi?id=18858

    2) root=LABEL=/, system can never boot. initrd init reports that
    switchroot fails. Here is an execution branch of nash when booting:
    (1) nash reads /sys/block/sda/dev; assume major is 8 (on my desktop);
    (2) nash queries /proc/devices with the major number; it finds the line
    "8 sd";
    (3) nash uses 'sd' to search its own probe table to find the device (DISK)
    type for the device and adds it to its own list;
    (4) Later on, it probes all devices in its list to get filesystem
    labels; scsi always registers "8 sd".

    When major is 259, nash fails to find the device (DISK) type. I enabled
    CONFIG_DEBUG_BLOCK_EXT_DEVT=y when compiling the kernel, so 259 is picked
    for device /dev/sda1, which causes nash to fail to find the device (DISK)
    type.

    To fix issue 2), I created a patch for nash and another patch for the
    kernel.

    http://bugzilla.kernel.org/attachment.cgi?id=18859
    http://bugzilla.kernel.org/attachment.cgi?id=18837

    Below is the patch for kernel 2.6.28-rc4. It registers blkext, a new
    block device, in /proc/devices.

    With 2 patches on nash and 1 patch on the kernel, I booted my machines
    dozens of times without failure.

    Signed-off-by: Zhang Yanmin
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Zhang, Yanmin
     
  • Make add_partition() return pointer to the new hd_struct on success
    and ERR_PTR() value on failure. This change will be used to fix md
    autodetection bug.

    Signed-off-by: Tejun Heo
    Cc: Neil Brown
    Signed-off-by: Jens Axboe

    Tejun Heo
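
    A user-space sketch of the pointer-or-error return convention. The
    ex_* helpers below are re-implementations written for this example (the
    4095 limit and all names are assumptions), and ex_add_part() is a
    hypothetical function, not add_partition().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define EX_MAX_ERRNO 4095  /* assumed size of the reserved error range */

/* Encode a negative errno value in a pointer. */
void *ex_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
long ex_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if the pointer lies in the top EX_MAX_ERRNO addresses. */
int ex_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)(-EX_MAX_ERRNO);
}

struct ex_part_info {
    int partno;
};

/* Returns the new object on success or an encoded errno on failure. */
struct ex_part_info *ex_add_part(int partno, int max_parts)
{
    struct ex_part_info *p;

    if (partno < 0 || partno >= max_parts)
        return ex_err_ptr(-EINVAL);

    p = malloc(sizeof(*p));
    if (!p)
        return ex_err_ptr(-ENOMEM);

    p->partno = partno;
    return p;
}
```

    The caller checks ex_is_err() before using the result, so one return
    value carries either the new object or the reason it could not be
    created.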