23 Feb, 2010

1 commit

  • This reverts commit fb1e75389bd06fd5987e9cda1b4e0305c782f854.

    "Benjamin S." reports that the patch in question
    causes a big drop in sequential throughput for him, dropping from
    200MB/sec down to only 70MB/sec.

    Needs to be investigated more fully, for now lets just revert the
    offending commit.

    Conflicts:

    include/linux/blkdev.h

    Signed-off-by: Jens Axboe


29 Jan, 2010

1 commit

  • Updated 'nomerges' tunable to accept a value of '2' - indicating that _no_
    merges at all are to be attempted (not even the simple one-hit cache).

    The following table illustrates the additional benefit - 5 minute runs of
    a random I/O load were applied to a dozen devices on a 16-way x86_64 system.

    nomerges   Throughput     %System       Improvement (tput / %sys)
    --------   ------------   -----------   --------------------------
       0       12.45 MB/sec   0.669365609
       1       12.50 MB/sec   0.641519199    0.40% / 2.71%
       2       12.52 MB/sec   0.639849750    0.56% / 2.96%
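
    As a usage note (not part of the patch itself): the tunable lives in
    sysfs, so the new mode can be selected per device. A minimal userspace
    sketch, assuming a device at /dev/sda:

        /* Hedged sketch: put one device into "no merges at all" mode
         * by writing '2' to its nomerges tunable. */
        #include <stdio.h>

        int main(void)
        {
                FILE *f = fopen("/sys/block/sda/queue/nomerges", "w");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                fputs("2\n", f);  /* 0 = all merges, 1 = one-hit cache only,
                                     2 = no merges at all */
                fclose(f);
                return 0;
        }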

    Signed-off-by: Alan D. Brunelle
    Signed-off-by: Jens Axboe


29 Dec, 2009

1 commit

  • queue_sector_alignment_offset returned the wrong value, which caused
    partitions to report an incorrect alignment_offset. Since the offset
    alignment calculation is needed in several places, it has been split
    into a separate helper function. The topology stacking function has
    been updated accordingly.

    Furthermore, comments have been added to clarify how the stacking
    function works.
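
    Since the patch itself isn't reproduced here, the following is only a
    simplified sketch of what such a helper has to compute, assuming a
    power-of-two granularity and a hypothetical name:

        #include <linux/kernel.h>
        #include <linux/blkdev.h>

        /* Alignment offset reported for a region starting at 'sector',
         * given the device's own alignment_offset. */
        static unsigned int limit_alignment_offset(struct queue_limits *lim,
                                                   sector_t sector)
        {
                unsigned int granularity = max(lim->physical_block_size,
                                               lim->io_min);
                unsigned int start = (sector << 9) & (granularity - 1);

                return (granularity + lim->alignment_offset - start)
                        & (granularity - 1);
        }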

    Signed-off-by: Martin K. Petersen
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe


03 Dec, 2009

1 commit

  • The discard ioctl is used by mkfs utilities to clear a block device
    prior to putting metadata down. However, not all devices return zeroed
    blocks after a discard. Some drives return stale data, potentially
    containing old superblocks. It is therefore important to know whether
    discarded blocks are properly zeroed.

    Both ATA and SCSI drives have configuration bits that indicate whether
    zeroes are returned after a discard operation. Implement a block level
    interface that allows this information to be bubbled up the stack and
    queried via a new block device ioctl.
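
    On the query side, a userspace caller can use the new BLKDISCARDZEROES
    ioctl roughly as in this sketch (device path illustrative):

        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>           /* BLKDISCARDZEROES */

        int main(void)
        {
                unsigned int zeroes = 0;
                int fd = open("/dev/sda", O_RDONLY);

                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                /* 1 if discarded blocks read back as zeroes, else 0 */
                if (ioctl(fd, BLKDISCARDZEROES, &zeroes) < 0) {
                        perror("ioctl");
                        close(fd);
                        return 1;
                }
                printf("discard zeroes data: %u\n", zeroes);
                close(fd);
                return 0;
        }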

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe


26 Nov, 2009

1 commit

  • The mtdblock driver doesn't call flush_dcache_page for pages in a request,
    which causes problems on architectures where the icache doesn't fill from
    the dcache or where dcache aliases exist. The patch fixes this.

    The ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE symbol was introduced to avoid
    pointless empty cache-thrashing loops on architectures for which
    flush_dcache_page() is a no-op. The new helpers flush a request's pages
    on architectures where ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE is 1 and do
    nothing otherwise.
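
    In kernel context, the request-level helper amounts to something like
    the sketch below (name hypothetical; the real helper is only built
    where the symbol is 1):

        #include <linux/blkdev.h>
        #include <linux/highmem.h>

        /* Run flush_dcache_page() over every page backing a request. */
        static void flush_request_dcache_pages(struct request *rq)
        {
                struct req_iterator iter;
                struct bio_vec *bvec;

                rq_for_each_segment(bvec, rq, iter)
                        flush_dcache_page(bvec->bv_page);
        }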

    See "fix mtd_blkdevs problem with caches on some architectures" discussion
    on LKML for more information.

    Signed-off-by: Ilya Loginov
    Cc: Ingo Molnar
    Cc: David Woodhouse
    Cc: Peter Horton
    Cc: "Ed L. Cashin"
    Signed-off-by: Jens Axboe


10 Nov, 2009

1 commit

  • While SSDs track block usage on a per-sector basis, RAID arrays often
    have allocation blocks that are bigger. Allow the discard granularity
    and alignment to be set and teach the topology stacking logic how to
    handle them.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe


05 Oct, 2009

1 commit

  • It was briefly introduced to allow CFQ to do delayed scheduling,
    but we ended up removing that feature again. So let's kill the
    function and export, and just switch CFQ back to the normal work
    schedule, since it is now passing in a '0' delay from all call
    sites.

    Signed-off-by: Jens Axboe


04 Oct, 2009

1 commit

  • Not all users of the topology information want to use libblkid. Provide
    the topology information through bdev ioctls.

    Also clarify sector size comments for existing BLK ioctls.
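
    A sketch of the userspace side, assuming the BLKPBSZGET, BLKIOMIN,
    BLKIOOPT and BLKALIGNOFF ioctls and an illustrative device path:

        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>

        int main(void)
        {
                unsigned int pbsz = 0, iomin = 0, ioopt = 0;
                int aoff = 0;
                int fd = open("/dev/sda", O_RDONLY);

                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                ioctl(fd, BLKPBSZGET, &pbsz);   /* physical block size */
                ioctl(fd, BLKIOMIN, &iomin);    /* minimum I/O size    */
                ioctl(fd, BLKIOOPT, &ioopt);    /* optimal I/O size    */
                ioctl(fd, BLKALIGNOFF, &aoff);  /* alignment offset    */
                printf("pbs=%u min=%u opt=%u align=%d\n",
                       pbsz, iomin, ioopt, aoff);
                close(fd);
                return 0;
        }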

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe


02 Oct, 2009

2 commits

  • Currently we set the bio size to the byte equivalent of the blocks to
    be trimmed when submitting the initial DISCARD ioctl. That means it
    is subject to the max_hw_sectors limitation of the HBA which is
    much lower than the size of a DISCARD request we can support.
    Add a separate max_discard_sectors tunable to limit the size for discard
    requests.

    We limit the maximum discard request size in bytes to 32 bits, as that is
    the limit for bio->bi_size. This could be much larger if we had a way to
    pass that information through the block layer.
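
    On the driver side this boils down to a one-liner at queue setup time;
    a hedged sketch, capping at the largest byte count that fits bi_size:

        #include <linux/kernel.h>
        #include <linux/blkdev.h>

        static void mydrv_init_queue(struct request_queue *q)
        {
                /* largest discard, in 512-byte sectors, whose byte count
                 * still fits in the 32-bit bio->bi_size field */
                blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
        }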

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

  • prepare_discard_fn() was being called in a place where memory allocation
    was effectively impossible. This makes it inappropriate for all but
    the most trivial translations of Linux's DISCARD operation to the block
    command set. Additionally adding a payload there makes the ownership
    of the bio backing unclear as it's now allocated by the device driver
    and not the submitter as usual.

    It is replaced with QUEUE_FLAG_DISCARD which is used to indicate whether
    the queue supports discard operations or not. blkdev_issue_discard now
    allocates a one-page, sector-length payload which is the right thing
    for the common ATA and SCSI implementations.

    The mtd implementation of prepare_discard_fn() is replaced with simply
    checking for the request being a discard.
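
    Under the new scheme, a driver's involvement reduces to roughly the
    following sketch:

        #include <linux/blkdev.h>

        static void mydrv_enable_discard(struct request_queue *q)
        {
                /* advertise discard support at queue setup time */
                queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
        }

        static int mydrv_check_discard(struct request_queue *q)
        {
                /* submitters test the flag before issuing discards */
                if (!blk_queue_discard(q))
                        return -EOPNOTSUPP;
                return 0;
        }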

    Largely based on a previous patch from Matthew Wilcox which did the
    prepare_discard_fn change but not the different payload allocation
    yet.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe


14 Sep, 2009

2 commits

  • blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard;
    the only differences between the two are that blkdev_issue_discard needs to
    send a barrier discard request and blk_ioctl_discard a non-barrier one,
    and that blk_ioctl_discard needs to wait on the request. To facilitate
    this, add a flags argument to blkdev_issue_discard to control both aspects
    of the behaviour. This will be very useful later on for using the waiting
    functionality for other callers.
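
    The resulting call style looks roughly like this sketch (the flag names
    DISCARD_FL_WAIT and DISCARD_FL_BARRIER are assumed from this series):

        #include <linux/blkdev.h>

        static int trim_range(struct block_device *bdev, sector_t start,
                              sector_t nr_sects)
        {
                /* ioctl-style caller: non-barrier discard, wait for it;
                 * a filesystem would pass DISCARD_FL_BARRIER instead */
                return blkdev_issue_discard(bdev, start, nr_sects,
                                            GFP_KERNEL, DISCARD_FL_WAIT);
        }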

    Based on an earlier patch from Matthew Wilcox.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

  • Implement blk_limits_io_opt() and make blk_queue_io_opt() a wrapper
    around it. DM needs this to avoid poking at the queue_limits directly.
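
    A minimal sketch of the wrapper relationship described above:

        #include <linux/blkdev.h>

        void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt)
        {
                limits->io_opt = opt;
        }

        void blk_queue_io_opt(struct request_queue *q, unsigned int opt)
        {
                /* the queue variant is now just a thin wrapper */
                blk_limits_io_opt(&q->limits, opt);
        }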

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe


11 Sep, 2009

5 commits

  • Test results here look good, and on big OLTP runs it has also been shown
    to significantly increase the cycles attributed to the database, giving
    a performance boost.

    Signed-off-by: Jens Axboe

  • Instead of just checking whether this device uses block layer
    tagging, we can improve the detection by looking at the maximum
    queue depth it has reached. If that crosses 4, then deem it a
    queuing device.

    This is important on high IOPS devices, since plugging hurts
    the performance there (it can be as much as 10-15% of the sys
    time).
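
    A hedged sketch of the heuristic (names hypothetical, threshold as
    described above):

        #include <linux/blkdev.h>

        /* Deem a device a queuing device if it uses block layer tagging,
         * or if the deepest queue depth seen so far has crossed 4. */
        static bool device_does_queuing(struct request_queue *q,
                                        unsigned int max_depth_seen)
        {
                return blk_queue_tagged(q) || max_depth_seen > 4;
        }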

    Signed-off-by: Jens Axboe

  • Get rid of any functions that test for these bits and make callers
    use bio_rw_flagged() directly. Then it is at least directly apparent
    what variable and flag they check.
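
    After the change, callers test the flag directly, along these lines
    (illustrative sketch):

        #include <linux/bio.h>

        static bool is_sync_io(struct bio *bio)
        {
                /* both the variable and the flag checked are explicit */
                return bio_rw_flagged(bio, BIO_RW_SYNCIO);
        }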

    Signed-off-by: Jens Axboe

  • Failfast has different characteristics from other attributes. When
    issuing, executing and successfully completing requests, failfast doesn't
    make any difference. It only affects how a request is handled on failure.
    Allowing requests with different failfast settings to be merged causes
    normal IOs to fail prematurely, while disallowing merging carries a
    performance penalty, as failfast is used for readaheads which are likely
    to be located near in-flight or to-be-issued normal IOs.

    This patch introduces the concept of 'mixed merge'. A request is a
    mixed merge if it is a merge of segments which require different
    handling on failure. Currently the only mixable attributes are the
    failfast ones (or lack thereof).

    When a bio with different failfast settings is added to an existing
    request or requests of different failfast settings are merged, the
    merged request is marked mixed. Each bio carries failfast settings
    and the request always tracks failfast state of the first bio. When
    the request fails, blk_rq_err_bytes() can be used to determine how
    many bytes can be safely failed without crossing into an area which
    requires further retries.

    This allows request merging regardless of failfast settings while
    keeping the failure handling correct.
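
    A hedged sketch of how an error path can use this (completion call
    illustrative):

        #include <linux/blkdev.h>

        static void fail_safe_portion(struct request *rq)
        {
                /* bytes that may fail without clipping a part of the
                 * request that still wants retries */
                unsigned int safe_bytes = blk_rq_err_bytes(rq);

                blk_end_request(rq, -EIO, safe_bytes);
                /* the remainder stays on the normal retry path */
        }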

    This patch only implements mixed merge but doesn't enable it. The
    next one will update SCSI to make use of mixed merge.

    Signed-off-by: Tejun Heo
    Cc: Niel Lambrechts
    Signed-off-by: Jens Axboe

  • bio and request use the same set of failfast bits. This patch makes
    the following changes to simplify things.

    * enumify BIO_RW* bits and reorder bits such that BIO_RW_FAILFAST_*
    bits coincide with __REQ_FAILFAST_* bits.

    * The above pushes BIO_RW_AHEAD out of sync with __REQ_FAILFAST_DEV
    but the matching is useless anyway. init_request_from_bio() is
    responsible for setting FAILFAST bits on FS requests and non-FS
    requests never use BIO_RW_AHEAD. Drop the code and comment from
    blk_rq_bio_prep().

    * Define REQ_FAILFAST_MASK which is OR of all FAILFAST bits and
    simplify FAILFAST flags handling in init_request_from_bio().
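
    The mask is then simply the OR of the failfast bits, i.e. (sketch
    consistent with the description above):

        #define REQ_FAILFAST_MASK \
                (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | \
                 REQ_FAILFAST_DRIVER)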

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe


12 Jul, 2009

1 commit

  • Move the definition of BLK_RW_ASYNC/BLK_RW_SYNC into linux/backing-dev.h
    so that it is available to all callers of set/clear_bdi_congested().

    This replaces commit 097041e576ee3a50d92dd643ee8ca65bf6a62e21 ("fuse:
    Fix build error"), which will be reverted.

    Signed-off-by: Trond Myklebust
    Acked-by: Larry Finger
    Cc: Jens Axboe
    Cc: Miklos Szeredi
    Signed-off-by: Linus Torvalds


11 Jul, 2009

2 commits

  • I overlooked SG_DXFER_TO_FROM_DEV support when I converted sg to use
    the block layer mapping API (2.6.28).

    Douglas Gilbert explained SG_DXFER_TO_FROM_DEV:

    http://www.spinics.net/lists/linux-scsi/msg37135.html

    =
    The semantics of SG_DXFER_TO_FROM_DEV were:
    - copy user space buffer to kernel (LLD) buffer
    - do SCSI command which is assumed to be of the DATA_IN
    (data from device) variety. This would overwrite
    some or all of the kernel buffer
    - copy kernel (LLD) buffer back to the user space.

    The idea was to detect short reads by filling the original
    user space buffer with some marker bytes ("0xec" it would
    seem in this report). The "resid" value is a better way
    of detecting short reads but that was only added this century
    and requires co-operation from the LLD.
    =

    This patch changes the block layer mapping API to support these
    semantics. It simply adds another field to struct rq_map_data and
    enables __bio_copy_iov() to copy data from user space even for READ
    requests.

    It would be better to add a flags field and kill the null_mapped and the
    new from_user fields in struct rq_map_data, but that approach would make
    it difficult to send this patch to the stable trees, because the st and
    osst drivers use struct rq_map_data (they were converted to use the
    block layer in 2.6.29 and 2.6.30). The block layer mapping API should be
    cleaned up later.
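
    For illustration only, the shape of the change described above (field
    order hedged; from_user is the addition):

        struct rq_map_data {
                struct page **pages;
                int page_order;
                int nr_entries;
                unsigned long offset;
                int null_mapped;
                int from_user;  /* new: copy user data in even for READs */
        };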

    zhou sf reported this regression and tested this patch:

    http://www.spinics.net/lists/linux-scsi/msg37128.html
    http://www.spinics.net/lists/linux-scsi/msg37168.html

    Reported-by: zhou sf
    Tested-by: zhou sf
    Cc: stable@kernel.org
    Signed-off-by: FUJITA Tomonori
    Signed-off-by: Jens Axboe

  • Commit 1faa16d22877f4839bd433547d770c676d1d964c accidentally broke
    the bdi congestion wait queue logic, causing us to wait on congestion
    for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.

    Signed-off-by: Jens Axboe


01 Jul, 2009

1 commit

  • The initial patches to support this through sysfs export were broken
    and have been #if 0'ed out in every release since. So let's just kill the
    code and reclaim some space in struct request_queue; if anyone would
    later like to fix up the sysfs bits, the git history can easily restore
    the removed bits.

    Signed-off-by: Jens Axboe


13 Jun, 2009

1 commit

  • * 'for-2.6.31' of git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: (29 commits)
    ide: re-implement ide_pci_init_one() on top of ide_pci_init_two()
    ide: unexport ide_find_dma_mode()
    ide: fix PowerMac bootup oops
    ide: skip probe if there are no devices on the port (v2)
    sl82c105: add printk() logging facility
    ide-tape: fix proc warning
    ide: add IDE_DFLAG_NIEN_QUIRK device flag
    ide: respect quirk_drives[] list on all controllers
    hpt366: enable all quirks for devices on quirk_drives[] list
    hpt366: sync quirk_drives[] list with pdc202xx_{new,old}.c
    ide: remove superfluous SELECT_MASK() call from do_rw_taskfile()
    ide: remove superfluous SELECT_MASK() call from ide_driveid_update()
    icside: remove superfluous ->maskproc method
    ide-tape: fix IDE_AFLAG_* atomic accesses
    ide-tape: change IDE_AFLAG_IGNORE_DSC non-atomically
    pdc202xx_old: kill resetproc() method
    pdc202xx_old: don't call pdc202xx_reset() on IRQ timeout
    pdc202xx_old: use ide_dma_test_irq()
    ide: preserve Host Protected Area by default (v2)
    ide-gd: implement block device ->set_capacity method (v2)
    ...


11 Jun, 2009

1 commit

  • This patch adds the following two interfaces for request-stacking drivers:

    - blk_rq_prep_clone(struct request *clone, struct request *orig,
                        struct bio_set *bs, gfp_t gfp_mask,
                        int (*bio_ctr)(struct bio *, struct bio *, void *),
                        void *data)
      * Clones the bios in the original request into the clone request
        (bio_ctr is called for each cloned bio).
      * Copies the attributes of the original request to the clone request.
        The actual data parts (e.g. ->cmd, ->buffer, ->sense) are not
        copied.

    - blk_rq_unprep_clone(struct request *clone)
      * Frees the cloned bios from the clone request.

    Request stacking drivers (e.g. request-based dm) need to make a clone
    request for a submitted request and dispatch it to other devices.

    To allocate a request for the clone, request-stacking drivers may not
    be able to use blk_get_request(), because the allocation may be done
    in an irq-disabled context. So blk_rq_prep_clone() takes a request
    allocated by the caller as an argument.

    For each clone bio in the clone request, request stacking drivers
    should be able to set up their own completion handler.
    So blk_rq_prep_clone() takes a callback function which is called
    for each clone bio, and a pointer for private data which is passed
    to the callback.

    NOTE:
    blk_rq_prep_clone() doesn't copy any actual data of the original
    request. Pages are shared between original bios and cloned bios.
    So the caller must not complete the original request before the clone
    request.
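
    A hedged usage sketch from a request-stacking driver's point of view
    (all my_* names hypothetical):

        #include <linux/blkdev.h>

        static struct bio_set *my_bioset;       /* set up at driver init */

        static void my_clone_endio(struct bio *bio, int error)
        {
                /* per-clone-bio completion bookkeeping goes here */
        }

        static int my_bio_ctr(struct bio *clone, struct bio *orig, void *data)
        {
                clone->bi_end_io = my_clone_endio;
                clone->bi_private = data;
                return 0;
        }

        static int my_dispatch(struct request *clone, struct request *orig)
        {
                /* 'clone' was allocated by us beforehand; blk_get_request()
                 * may not be usable in an irq-disabled context */
                if (blk_rq_prep_clone(clone, orig, my_bioset, GFP_ATOMIC,
                                      my_bio_ctr, NULL))
                        return -ENOMEM;
                /* ...dispatch 'clone' to the underlying device... */
                return 0;
        }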

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: Boaz Harrosh
    Signed-off-by: Jens Axboe


07 Jun, 2009

1 commit

  • * Add a ->set_capacity block device method and use it in rescan_partitions()
    to attempt enabling the native capacity of the device upon detecting a
    partition which exceeds the device capacity.

    * Add a GENHD_FL_NATIVE_CAPACITY flag to limit attempts at enabling the
    native capacity during the partition scan.

    Together with the consecutive patch implementing ->set_capacity method in
    ide-gd device driver this allows automatic disabling of Host Protected Area
    (HPA) if any partitions overlapping HPA are detected.

    Cc: Robert Hancock
    Cc: Frans Pop
    Cc: "Andries E. Brouwer"
    Acked-by: Al Viro
    Emphatically-Acked-by: Alan Cox
    Signed-off-by: Bartlomiej Zolnierkiewicz


03 Jun, 2009

1 commit

  • blk_queue_bounce_limit() is more than a simple wrapper around the request
    queue's limits.bounce_pfn variable. Introduce blk_queue_bounce_pfn(),
    which can be called by stacking drivers that wish to set the bounce limit
    explicitly.
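
    The new setter amounts to the following sketch:

        #include <linux/blkdev.h>

        void blk_queue_bounce_pfn(struct request_queue *q, u64 pfn)
        {
                /* set the bounce limit directly, bypassing the extra
                 * logic applied by blk_queue_bounce_limit() */
                q->limits.bounce_pfn = pfn;
        }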

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe


23 May, 2009

1 commit

  • To support devices with physical block sizes bigger than 512 bytes we
    need to ensure proper alignment. This patch adds support for exposing
    I/O topology characteristics as devices are stacked.

    logical_block_size is the smallest unit the device can address.

    physical_block_size indicates the smallest I/O the device can write
    without incurring a read-modify-write penalty.

    The io_min parameter is the smallest preferred I/O size reported by
    the device. In many cases this is the same as the physical block
    size. However, the io_min parameter can be scaled up when stacking
    (RAID5 chunk size > physical block size).

    The io_opt characteristic indicates the optimal I/O size reported by
    the device. This is usually the stripe width for arrays.

    The alignment_offset parameter indicates the number of bytes the start
    of the device/partition is offset from the device's natural alignment.
    Partition tools and MD/DM utilities can use this to pad their offsets
    so filesystems start on proper boundaries.
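
    To make the parameters concrete, a hedged sketch of a driver describing
    a RAID5-like device (illustrative numbers: 4KB physical sectors, 64KB
    chunks, 3 data disks):

        #include <linux/blkdev.h>

        static void mydrv_set_topology(struct request_queue *q)
        {
                blk_queue_logical_block_size(q, 512);
                blk_queue_physical_block_size(q, 4096);
                blk_queue_io_min(q, 64 * 1024);         /* chunk size   */
                blk_queue_io_opt(q, 3 * 64 * 1024);     /* stripe width */
                blk_queue_alignment_offset(q, 0);       /* well aligned */
        }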

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe
