02 Oct, 2009

4 commits

  • Currently we set the bio size to the byte equivalent of the blocks to
    be trimmed when submitting the initial DISCARD ioctl. That means it
    is subject to the max_hw_sectors limitation of the HBA which is
    much lower than the size of a DISCARD request we can support.
    Add a separate max_discard_sectors tunable to limit the size for discard
    requests.

    We limit the max discard request size in bytes to what fits in 32 bits,
    as that is the limit for bio->bi_size. This could be much larger if we
    had a way to pass that information through the block layer.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
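
    A minimal sketch of how a driver might use the new tunable (the driver
    function and the 32 MiB cap are made-up for illustration):

        /* sketch: discard requests get their own, much higher cap */
        static void mydrv_set_limits(struct request_queue *q)
        {
                blk_queue_max_sectors(q, 256);           /* HBA data limit    */
                blk_queue_max_discard_sectors(q, 65536); /* 32 MiB of discard */
        }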
     
  • prepare_discard_fn() was being called in a place where memory allocation
    was effectively impossible. This makes it inappropriate for all but
    the most trivial translations of Linux's DISCARD operation to the block
    command set. Additionally adding a payload there makes the ownership
    of the bio backing unclear as it's now allocated by the device driver
    and not the submitter as usual.

    It is replaced with QUEUE_FLAG_DISCARD which is used to indicate whether
    the queue supports discard operations or not. blkdev_issue_discard now
    allocates a one-page, sector-length payload which is the right thing
    for the common ATA and SCSI implementations.

    The mtd implementation of prepare_discard_fn() is replaced with simply
    checking for the request being a discard.

    Largely based on a previous patch from Matthew Wilcox, which removed
    prepare_discard_fn() but did not yet do the different payload
    allocation.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
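
    Roughly, a driver that can translate discards now just sets the queue
    flag, and submitters test it (a sketch, not lifted verbatim from the
    patch):

        /* driver side: advertise discard support */
        queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);

        /* submitter side: bail out early if the queue cannot discard,
         * much as blkdev_issue_discard() now does */
        if (!blk_queue_discard(q))
                return -EOPNOTSUPP;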
     
  • Stacking devices do not have an inherent max_hw_sector limit. Set the
    default to INT_MAX so we are bounded only by the capabilities of the
    underlying storage.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The topology changes unintentionally caused SAFE_MAX_SECTORS to be set
    for stacking devices. Set the default limit to BLK_DEF_MAX_SECTORS and
    provide SAFE_MAX_SECTORS in blk_queue_make_request() for legacy hw
    drivers that depend on the old behavior.

    Acked-by: Mike Snitzer
    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

28 Jul, 2009

1 commit

  • Move the assignment of a default lock below blk_init_queue() to
    blk_queue_make_request(), so we also get to set the default lock
    for ->make_request_fn() based drivers. This is important since the
    queue flag locking requires a lock to be in place.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 Jun, 2009

1 commit

  • blk_queue_bounce_limit() is more than a simple wrapper around the request
    queue's limits.bounce_pfn variable. Introduce blk_queue_bounce_pfn(), which can
    be called by stacking drivers that wish to set the bounce limit
    explicitly.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
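
    A sketch of the intended use; bounce_pfn stands for a limit the stacking
    driver computed while combining its component devices:

        /* set the bounce limit directly, bypassing the dma_mask
         * heuristics that blk_queue_bounce_limit() applies */
        blk_queue_bounce_pfn(q, bounce_pfn);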
     

23 May, 2009

4 commits

  • To support devices with physical block sizes bigger than 512 bytes we
    need to ensure proper alignment. This patch adds support for exposing
    I/O topology characteristics as devices are stacked.

    logical_block_size is the smallest unit the device can address.

    physical_block_size indicates the smallest I/O the device can write
    without incurring a read-modify-write penalty.

    The io_min parameter is the smallest preferred I/O size reported by
    the device. In many cases this is the same as the physical block
    size. However, the io_min parameter can be scaled up when stacking
    (RAID5 chunk size > physical block size).

    The io_opt characteristic indicates the optimal I/O size reported by
    the device. This is usually the stripe width for arrays.

    The alignment_offset parameter indicates the number of bytes the start
    of the device/partition is offset from the device's natural alignment.
    Partition tools and MD/DM utilities can use this to pad their offsets
    so filesystems start on proper boundaries.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
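
    As a sketch, a driver for a hypothetical drive with 4KB physical sectors
    addressed in 512-byte logical blocks might advertise its topology like
    this:

        blk_queue_logical_block_size(q, 512);    /* smallest addressable unit */
        blk_queue_physical_block_size(q, 4096);  /* no RMW at or above this   */
        blk_queue_alignment_offset(q, 0);        /* naturally aligned         */
        blk_queue_io_min(q, 4096);               /* smallest preferred I/O    */
        blk_queue_io_opt(q, 65536);              /* e.g. a stripe width       */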
     
  • To accommodate stacking drivers that do not have an associated request
    queue we're moving the limits to a separate, embedded structure.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Convert all external users of queue limits to using wrapper functions
    instead of poking the request queue variables directly.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
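
    The conversion is mechanical; for example (illustrative, not a hunk from
    the patch):

        unsigned int max  = queue_max_sectors(q);        /* not q->max_sectors */
        unsigned short bs = queue_logical_block_size(q); /* not the raw field  */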
     
  • Until now we have had a 1:1 mapping between storage device physical
    block size and the logical block size used when addressing the device.
    With SATA 4KB drives coming out that will no longer be the case. The
    sector size will be 4KB but the logical block size will remain
    512 bytes. Hence we need to distinguish between the physical block size
    and the logical one.

    This patch renames hardsect_size to logical_block_size.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

22 Apr, 2009

1 commit

  • Impact: don't set GFP_DMA in q->bounce_gfp unnecessarily

    All DMA address limits are expressed in terms of the last addressable
    unit (byte or page) instead of one plus that. However, when
    determining bounce_gfp for 64bit machines in blk_queue_bounce_limit(),
    it compares the specified limit against 0x100000000UL to determine
    whether it's below 4G, ending up falsely setting GFP_DMA in
    q->bounce_gfp.

    As the DMA zone is very small on x86_64, this makes larger SG_IO
    transfers very eager to trigger the OOM killer. Fix it. While at it,
    rename the
    parameter to @dma_mask for clarity and convert comment to proper
    winged style.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
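
    The off-by-one, sketched with 4K pages (PAGE_SHIFT == 12):

        /* a device that can address the full low 4G has dma_mask 0xffffffff */
        u64 dma_mask = 0xffffffffULL;
        unsigned long b_pfn = dma_mask >> PAGE_SHIFT;  /* 0xfffff */

        /* old, buggy test: compares a last-byte mask against a size */
        int below_4g = b_pfn < (0x100000000ULL >> PAGE_SHIFT); /* 1: "below 4G" */

        /* so GFP_DMA got set even though the device needs no bouncing */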
     

07 Apr, 2009

1 commit

  • Fix a typo (this fix was in the original patch but for some reason was
    not merged along with the code changes).

    Signed-off-by: Alan Cox
    Signed-off-by: Jeff Garzik

    Alan Cox
     

29 Dec, 2008

1 commit

  • Zero is invalid for max_phys_segments, max_hw_segments, and
    max_segment_size. It's better to use min_not_zero instead of min.
    min() works, though, because commit 0e435ac makes sure these values
    are set to their non-zero defaults when a queue is initialized
    properly.

    With this patch, blk_queue_stack_limits() does almost the same thing
    that dm's combine_restrictions_low() does, so it should be easy to
    remove dm's combine_restrictions_low().

    Signed-off-by: FUJITA Tomonori
    Signed-off-by: Jens Axboe

    FUJITA Tomonori
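
    The stacking logic then looks roughly like this, with zero treated as
    "no limit set" rather than as the smallest value:

        /* min() alone would let a zeroed field win over a real limit */
        t->max_segment_size = min_not_zero(t->max_segment_size,
                                           b->max_segment_size);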
     

03 Dec, 2008

1 commit

  • Fix setting of max_segment_size and seg_boundary mask for stacked md/dm
    devices.

    When stacking devices (LVM over MD over SCSI), some of the request queue
    parameters are not always set up correctly by default, namely
    max_segment_size and the seg_boundary mask.

    If you create an MD device over SCSI, these attributes are zeroed.

    The problem arises when another device-mapper mapping is created on top
    of this stack; the queue attributes are then set in DM this way:

    request_queue    max_segment_size    seg_boundary_mask
    SCSI             65536               0xffffffff
    MD RAID1         0                   0
    LVM              65536               -1 (64bit)

    Unfortunately bio_add_page() (resp. bio_phys_segments()) calculates the
    number of physical segments according to these parameters.

    During generic_make_request() the segment count is recalculated (after
    bio_clone() in the stacking operation) and can increase
    bio->bi_phys_segments beyond the allowed limit.

    This is a particular problem in the CCISS driver, where it produces an
    oops here:

    BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);

    (MAXSGENTRIES is 31 by default.)

    Sometimes even this command is enough to cause an oops:

    dd iflag=direct if=/dev// of=/dev/null bs=128000 count=10

    This command generates bios with 250 sectors, allocated in 32 4k-pages
    (last page uses only 1024 bytes).

    For the LVM layer, it allocates a bio with 31 segments (still OK for
    CCISS); unfortunately at the lower layer it is recalculated to 32
    segments, which violates the CCISS restriction and triggers the BUG_ON().

    The patch tries to fix it by:

    * initializing the attributes above in the queue request constructor,
    blk_queue_make_request()

    * making sure that blk_queue_stack_limits() inherits these settings
    (DM uses its own function to set the limits because
    blk_queue_stack_limits() was introduced later; it should probably
    switch to the generic stack-limits function too)

    * setting the default seg_boundary value in one place (blkdev.h)

    * using this mask as the default in DM (instead of -1, which differs
    on 64bit)

    Bugs related to this:
    https://bugzilla.redhat.com/show_bug.cgi?id=471639
    http://bugzilla.kernel.org/show_bug.cgi?id=8672

    Signed-off-by: Milan Broz
    Reviewed-by: Alasdair G Kergon
    Cc: Neil Brown
    Cc: FUJITA Tomonori
    Cc: Tejun Heo
    Cc: Mike Miller
    Signed-off-by: Jens Axboe

    Milan Broz
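
    With the patch the defaults come from one place; a sketch of what
    blk_queue_make_request() now establishes:

        blk_queue_segment_boundary(q, BLK_SEG_BOUNDARY_MASK); /* 0xffffffff */
        blk_queue_max_segment_size(q, MAX_SEGMENT_SIZE);      /* 65536 */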
     

17 Oct, 2008

1 commit

  • 'modprobe loop; rmmod loop' effectively creates a blk_queue and destroys
    it, which results in q->unplug_work being canceled without ever having
    been initialized.

    Therefore, move the initialization of q->unplug_work from
    blk_queue_make_request() to blk_alloc_queue*().

    Reported-by: Alexey Dobriyan
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Jens Axboe

    Peter Zijlstra
     

09 Oct, 2008

6 commits

  • This patch adds a new interface, blk_lld_busy(), to check an LLD's busy
    state from the block layer. blk_lld_busy() calls down into the low-level
    driver if the driver has registered q->lld_busy_fn() using
    blk_queue_lld_busy().

    This resolves a performance problem on request-stacking devices,
    described below.

    Some drivers, such as the SCSI mid layer, stop dispatching requests when
    they detect a busy state on the underlying device (host/target/device).
    This lets other requests stay in the I/O scheduler's queue for a chance
    of merging.

    Request-stacking drivers such as request-based dm should follow the same
    logic. However, there has been no generic interface for a stacked device
    to check whether its underlying device(s) are busy. If the
    request-stacking driver dispatches and submits requests to a busy
    underlying device, the requests sit in the underlying device's queue
    with no chance of merging, which causes performance problems under
    bursty I/O load.

    With this patch, the busy state of the underlying device is exported via
    q->lld_busy_fn(), so the request-stacking driver can check it and stop
    dispatching requests while the device is busy.

    The underlying device driver must return the busy state appropriately:
    1: when the device driver cannot process requests immediately.
    0: when the device driver can process requests immediately,
    including abnormal situations where the device driver needs
    to kill all requests.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: Andrew Morton
    Signed-off-by: Jens Axboe

    Kiyoshi Ueda
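
    A sketch of a low-level driver opting in (the mydev structure and its
    fields are hypothetical):

        /* return 1 while the device cannot take requests, 0 otherwise */
        static int mydev_lld_busy(struct request_queue *q)
        {
                struct mydev *dev = q->queuedata;

                return dev->in_flight >= dev->queue_depth;
        }

        /* during queue setup: */
        blk_queue_lld_busy(q, mydev_lld_busy);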
     
  • Right now SCSI and others do their own command timeout handling.
    Move those bits to the block layer.

    Instead of having a timer per command, we try to be a bit more clever
    and simply have one per-queue. This avoids the overhead of having to
    tear down and setup a timer for each command, so it will result in a lot
    less timer fiddling.

    Signed-off-by: Mike Anderson
    Signed-off-by: Jens Axboe

    Jens Axboe
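
    A sketch of a driver opting in to the shared queue timer (the handler
    name and its policy are made up):

        static enum blk_eh_timer_return mydev_timed_out(struct request *rq)
        {
                /* e.g. kick the hardware, then ask for the timer to rearm */
                return BLK_EH_RESET_TIMER;
        }

        /* during queue setup: */
        blk_queue_rq_timed_out(q, mydev_timed_out);
        blk_queue_rq_timeout(q, 30 * HZ);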
     
  • Noticed by sparse:
    block/blk-softirq.c:156:12: warning: symbol 'blk_softirq_init' was not declared. Should it be static?
    block/genhd.c:583:28: warning: function 'bdget_disk' with external linkage has definition
    block/genhd.c:659:17: warning: incorrect type in argument 1 (different base types)
    block/genhd.c:659:17: expected unsigned int [unsigned] [usertype] size
    block/genhd.c:659:17: got restricted gfp_t
    block/genhd.c:659:29: warning: incorrect type in argument 2 (different base types)
    block/genhd.c:659:29: expected restricted gfp_t [usertype] flags
    block/genhd.c:659:29: got unsigned int
    block: kmalloc args reversed

    Signed-off-by: Harvey Harrison
    Signed-off-by: Jens Axboe

    Harvey Harrison
     
  • This patch adds support for controlling the IO completion CPU of
    either all requests on a queue, or on a per-request basis. We export
    a sysfs variable (rq_affinity) which, if set, migrates completions
    of requests to the CPU that originally submitted them. A bio helper
    (bio_set_completion_cpu()) is also added, so that queuers can ask
    for completion on a specific CPU.

    In testing, this has been shown to cut the system time by as much
    as 20-40% on synthetic workloads where CPU affinity is desired.

    This requires a little help from the architecture, so it'll only
    work as designed for archs that are using the new generic smp
    helper infrastructure.

    Signed-off-by: Jens Axboe

    Jens Axboe
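
    Queue-wide behaviour is toggled through the rq_affinity sysfs file;
    per-request, a queuer can ask for a completion CPU with the new helper,
    roughly:

        /* complete this bio on the CPU that is submitting it */
        bio_set_completion_cpu(bio, raw_smp_processor_id());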
     
  • …in them as needed. Fix changed function parameter names. Fix typos/spellos. In comments, change REQ_SPECIAL to REQ_TYPE_SPECIAL and REQ_BLOCK_PC to REQ_TYPE_BLOCK_PC.

    Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
    Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

    Randy Dunlap
     
  • Some block devices benefit from a hint that they can forget the contents
    of certain sectors. Add basic support for this to the block core, along
    with a 'blkdev_issue_discard()' helper function which issues such
    requests.

    The caller doesn't get to provide an end_io function, since
    blkdev_issue_discard() will automatically split the request up into
    multiple bios if appropriate. Neither does the function wait for
    completion -- it's expected that callers won't care about when, or even
    _if_, the request completes. It's only a hint to the device anyway. By
    definition, the file system doesn't _care_ about these sectors any more.

    [With feedback from OGAWA Hirofumi and Jens Axboe]
    Signed-off-by: Jens Axboe

    David Woodhouse
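
    Typical fire-and-forget use, sketched (the gfp argument matches early
    versions of the helper; later kernels grew an extra flags argument):

        /* hint that nr_sects starting at sector are no longer needed */
        int err = blkdev_issue_discard(bdev, sector, nr_sects, GFP_KERNEL);

        /* -EOPNOTSUPP just means the device has no discard support;
         * for a pure hint that is safe to ignore */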
     

04 Jul, 2008

1 commit

  • This adds blk_queue_update_dma_pad() to prevent LLDs from wrongly
    overwriting the dma pad mask (we added blk_queue_update_dma_alignment
    for the same reason).

    This also converts libata to use blk_queue_update_dma_pad instead of
    blk_queue_dma_pad.

    Signed-off-by: FUJITA Tomonori
    Cc: Tejun Heo
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Thomas Bogendoerfer
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    FUJITA Tomonori
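
    libata's use, roughly (ATA DMA transfers are padded to 4-byte
    boundaries, hence a mask of ATA_DMA_PAD_SZ - 1):

        /* only ever raise the pad mask, never shrink what is already set */
        blk_queue_update_dma_pad(sdev->request_queue, ATA_DMA_PAD_SZ - 1);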
     

15 May, 2008

1 commit

  • As setting and clearing queue flags now requires that we hold a spinlock
    on the queue, and as blk_queue_stack_limits is called without that lock,
    get the lock inside blk_queue_stack_limits.

    For blk_queue_stack_limits to be able to find the right lock, each md
    personality needs to set q->queue_lock to point to the appropriate lock.
    Those personalities which didn't previously use a spin_lock use
    q->__queue_lock, so that lock is now always initialised on allocation.

    With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
    longer cause warnings as it will be clear that the proper lock is held.

    Thanks to Dan Williams for review and fixing the silly bugs.

    Signed-off-by: NeilBrown
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Alistair John Strachan
    Cc: Nick Piggin
    Cc: "Rafael J. Wysocki"
    Cc: Jacek Luczak
    Cc: Prakash Punnoor
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
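
    The md side of the change, sketched for a raid1-like personality
    (conf->device_lock is that personality's existing lock):

        /* personalities with a natural lock point the queue at it ... */
        mddev->queue->queue_lock = &conf->device_lock;

        /* ... the rest fall back to the lock embedded in the queue */
        mddev->queue->queue_lock = &mddev->queue->__queue_lock;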
     
