23 Aug, 2010

1 commit

  • The bi_rw tests no longer return a bool after commit 74450be1, but
    their results are still stored in bools. With some compilers (gcc 4.5
    here) the value doesn't fit, so either use !! to get a real bool or
    use ulong where the result is assigned.

    Signed-off-by: Jiri Slaby
    Cc: Christoph Hellwig
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jiri Slaby
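
The hazard described in this commit can be illustrated with a minimal sketch. It assumes a hypothetical flag, DEMO_REQ_SYNC, placed above bit 7, and uses a narrow integer as a stand-in for the destination variable. None of these names come from the kernel; the point is only that a masked flag test should be normalised with !! (or kept in a type as wide as the flags word) before it is stored.

```c
#include <stdint.h>

/* Hypothetical flag for illustration only; not a real kernel definition. */
#define DEMO_REQ_SYNC (UINT64_C(1) << 9)

/* Stores the raw mask result in a narrow type: the set bit is truncated away. */
static uint8_t flag_test_truncated(uint64_t rw)
{
	return (uint8_t)(rw & DEMO_REQ_SYNC);
}

/* Normalises the mask result to 0 or 1 before storing it. */
static uint8_t flag_test_normalised(uint64_t rw)
{
	return (uint8_t)!!(rw & DEMO_REQ_SYNC);
}
```

With the flag set, the truncating variant reports 0 while the normalised variant reports 1, which is the kind of silent mismatch the !! idiom avoids.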
     

12 Aug, 2010

1 commit

  • Secure discard is the same as discard except that all copies of the
    discarded sectors (perhaps created by garbage collection) must also be
    erased.

    Signed-off-by: Adrian Hunter
    Acked-by: Jens Axboe
    Cc: Kyungmin Park
    Cc: Madhusudhan Chikkature
    Cc: Christoph Hellwig
    Cc: Ben Gardiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Hunter
     

09 Aug, 2010

1 commit


08 Aug, 2010

6 commits

  • Reviewed-by: FUJITA Tomonori

    Signed-off-by: Jens Axboe

    James Bottomley
     
  • Didn't cause a merge conflict, so fixed this one up manually
    post merge.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Allocating a fixed payload for discard requests always was a horrible hack,
    and it's now coming back to bite us when adding support for discard in DM/MD.

    So change the code to leave the allocation of a payload to the lowlevel
    driver. Unfortunately that means we'll need another hack, which allows
    us to update the various block layer length fields indicating that we
    have a payload. Instead of hiding this in sd.c, which we already partially
    do for UNMAP support, add a documented helper in the core block layer for it.

    Signed-off-by: Christoph Hellwig
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Remove the current bio flags and reuse the request flags for the bio, too.
    This makes it easier to trace the type of I/O from the filesystem
    down to the block driver. There were two flags in the bio that were
    missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
    renamed two request flags that had a superfluous RW in them.

    Note that the flags are in bio.h despite having the REQ_ name - as
    blkdev.h includes bio.h that is the only way to go for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Remove all the trivial wrappers for the cmd_type and cmd_flags fields in
    struct request. This allows much easier grepping for different request
    types instead of unwinding through macros.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • There are two reasons for doing this:

    - On SSDs, the completion times aren't as random as they
    are for rotational drives. So it's questionable whether they
    should contribute to the random pool in the first place.

    - Calling add_disk_randomness() has a lot of overhead.

    This adds /sys/block//queue/add_random that will allow you to
    switch it off on a per-device basis. The default setting is on, so there
    should be no functional changes from this patch.

    Signed-off-by: Jens Axboe

    Jens Axboe
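
The add_random switch from the last commit above can be sketched in plain C. This is a simplified illustration under stated assumptions: struct demo_queue, DEMO_QUEUE_FLAG_ADD_RANDOM and the helper names are hypothetical, and the counter merely stands in for a call to add_disk_randomness().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-queue flag; the bit position is chosen for illustration. */
#define DEMO_QUEUE_FLAG_ADD_RANDOM (1UL << 0)

struct demo_queue {
	unsigned long flags;
	unsigned long entropy_samples; /* completions that fed the random pool */
};

/* The default setting is on, matching the behaviour described in the commit. */
static void demo_queue_init(struct demo_queue *q)
{
	assert(q != NULL);
	q->flags = DEMO_QUEUE_FLAG_ADD_RANDOM;
	q->entropy_samples = 0;
}

/* What a write to the per-device attribute would do in this sketch. */
static void demo_queue_set_add_random(struct demo_queue *q, bool on)
{
	if (on)
		q->flags |= DEMO_QUEUE_FLAG_ADD_RANDOM;
	else
		q->flags &= ~DEMO_QUEUE_FLAG_ADD_RANDOM;
}

/* On completion, only contribute entropy if the queue allows it. */
static void demo_complete_request(struct demo_queue *q)
{
	if (q->flags & DEMO_QUEUE_FLAG_ADD_RANDOM)
		q->entropy_samples++; /* stand-in for add_disk_randomness() */
}
```

Because the flag is checked before the expensive call, switching it off removes both the questionable entropy source and the per-completion overhead for that device.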
     

24 Jun, 2010

1 commit

  • In submit_bio, we count vm events by checking READ/WRITE.
    But actually DISCARD_NOBARRIER also has the WRITE flag set.
    It looks as if blkdev_issue_discard also adds a page as the
    payload, so the bio_has_data check isn't enough.
    So add another check for discard bios.

    Signed-off-by: Tao Ma
    Signed-off-by: Jens Axboe

    Tao Ma
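
The accounting rule from this commit can be sketched as follows. The flag values, struct demo_bio and the event counters are hypothetical stand-ins, not the kernel's definitions; the sketch only shows why a data check alone is not enough once a discard carries a payload page.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for illustration. */
#define DEMO_RW_WRITE   (1UL << 0)
#define DEMO_RW_DISCARD (1UL << 1)

struct demo_bio {
	unsigned long rw;  /* request flags */
	unsigned int size; /* payload size in bytes */
};

struct demo_vm_events {
	unsigned long pgpgin;  /* sectors read */
	unsigned long pgpgout; /* sectors written */
};

/*
 * Count a bio towards the paging statistics only if it really moves data.
 * A discard has WRITE set and a non-empty payload, so it needs its own check.
 */
static void demo_account_bio(const struct demo_bio *bio,
			     struct demo_vm_events *ev)
{
	assert(bio != NULL && ev != NULL);

	if (bio->size == 0 || (bio->rw & DEMO_RW_DISCARD))
		return;

	if (bio->rw & DEMO_RW_WRITE)
		ev->pgpgout += bio->size >> 9;
	else
		ev->pgpgin += bio->size >> 9;
}
```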
     

17 Jun, 2010

1 commit

  • Filesystems assume that DISCARD_BARRIER requests are full barriers, so that
    they don't have to track in-progress discard operations when submitting new
    I/O. But currently we only treat them as elevator barriers, which don't
    actually do the necessary queue drains.

    Also remove the unlikely around both the DISCARD and BARRIER requests -
    they happen far too often for a static mispredict.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Jun, 2010

2 commits

  • blk_init_allocated_queue_node may fail and the caller _could_ retry.
    Accommodate the unlikely event that blk_init_allocated_queue_node is
    called on an already initialized (possibly partially) request_queue.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • On blk_init_allocated_queue_node failure, only free the request_queue if
    it wasn't previously allocated outside the block layer
    (e.g. blk_init_queue_node was the caller of blk_init_allocated_queue_node).

    This addresses an interface bug introduced by the following commit:
    01effb0 block: allow initialization of previously allocated
    request_queue

    Otherwise the request_queue may be free'd out from underneath a caller
    that is managing the request_queue directly (e.g. caller uses
    blk_alloc_queue + blk_init_allocated_queue_node).

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
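
The ownership rule behind the two commits above can be sketched in user-space C. The demo_ names and the simulate_failure parameter are hypothetical and exist only to make the failure path observable; the kernel functions take different arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct demo_queue {
	bool initialized;
};

/*
 * Initialise a queue the caller already allocated.
 * Returns q on success and NULL on failure, and never frees q:
 * the memory belongs to whoever allocated it.
 */
static struct demo_queue *demo_init_allocated_queue(struct demo_queue *q,
						    bool simulate_failure)
{
	if (q == NULL || simulate_failure)
		return NULL;
	q->initialized = true;
	return q;
}

/*
 * Allocate-and-initialise wrapper. It performed the allocation itself,
 * so it is the one place that frees the queue when initialisation fails.
 */
static struct demo_queue *demo_init_queue(bool simulate_failure)
{
	struct demo_queue *uninit_q, *q;

	uninit_q = calloc(1, sizeof(*uninit_q));
	if (uninit_q == NULL)
		return NULL;

	q = demo_init_allocated_queue(uninit_q, simulate_failure);
	if (q == NULL)
		free(uninit_q);
	return q;
}
```

A caller that manages its own queue can therefore retry demo_init_allocated_queue() after a failure without worrying that the structure was freed underneath it.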
     

11 May, 2010

1 commit

  • blk_init_queue() allocates the request_queue structure and then
    initializes it as needed (request_fn, elevator, etc).

    Split initialization out to blk_init_allocated_queue_node.
    Introduce blk_init_allocated_queue wrapper function to model existing
    blk_init_queue and blk_init_queue_node interfaces.

    Export elv_register_queue to allow a newly added elevator to be
    registered with sysfs. Export elv_unregister_queue for symmetry.

    These changes allow DM to initialize a device's request_queue with more
    precision. In particular, DM no longer unconditionally initializes a
    full request_queue (elevator et al). It only does so for a
    request-based DM device.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

09 Apr, 2010

1 commit

  • This includes both the number of bios merged into requests belonging to this
    cgroup as well as the number of requests merged together.
    In the past, we've observed different merging behavior across upstream kernels,
    some by design, some due to actual bugs. This stat helps a lot in debugging such
    problems when applications report decreased throughput with a new kernel
    version.

    This needed adding an extra elevator function to capture bios being merged as I
    did not want to pollute elevator code with blkiocg knowledge and hence needed
    the accounting invocation to come from CFQ.

    Signed-off-by: Divyesh Shah
    Signed-off-by: Jens Axboe

    Divyesh Shah
     

06 Apr, 2010

1 commit

  • One of the features of laptop-mode is that it forces a writeout of dirty
    pages if something else triggers a physical read or write from a device.
    The current implementation flushes pages on all devices, rather than only
    the one that triggered the flush. This patch alters the behaviour so that
    only the recently accessed block device is flushed, preventing other
    disks being spun up for no terribly good reason.

    Signed-off-by: Matthew Garrett
    Signed-off-by: Jens Axboe

    Matthew Garrett
     

02 Apr, 2010

1 commit

  • We also add start_time_ns and io_start_time_ns fields to struct request
    here to record the time when a request is created and when it is
    dispatched to the device. We use ns units here as ms and jiffies are
    not very useful for non-rotational media.

    Signed-off-by: Divyesh Shah
    Signed-off-by: Jens Axboe

    Divyesh Shah
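
A small sketch of how the two timestamps could be used, assuming the caller passes in nanosecond clock readings explicitly. The struct and helper names are hypothetical, and the completion timestamp is an extra assumption added to show a service-time calculation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_request {
	uint64_t start_time_ns;    /* when the request was created */
	uint64_t io_start_time_ns; /* when it was dispatched to the device */
};

static void demo_rq_created(struct demo_request *rq, uint64_t now_ns)
{
	assert(rq != NULL);
	rq->start_time_ns = now_ns;
	rq->io_start_time_ns = 0;
}

static void demo_rq_dispatched(struct demo_request *rq, uint64_t now_ns)
{
	rq->io_start_time_ns = now_ns;
}

/* Time spent queued before the device saw the request. */
static uint64_t demo_rq_wait_ns(const struct demo_request *rq)
{
	return rq->io_start_time_ns - rq->start_time_ns;
}

/* Time the device took, given a completion timestamp from the caller. */
static uint64_t demo_rq_service_ns(const struct demo_request *rq,
				   uint64_t done_ns)
{
	return done_ns - rq->io_start_time_ns;
}
```

For a request that waits 150 microseconds, the wait is 150000 ns but rounds down to 0 when expressed in whole milliseconds, which is why coarser units say little about fast non-rotational devices.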
     

26 Feb, 2010

1 commit


25 Feb, 2010

1 commit


23 Feb, 2010

2 commits


26 Nov, 2009

1 commit

  • The mtdblock driver doesn't call flush_dcache_page() for pages in a
    request. This causes problems on architectures where the icache doesn't
    fill from the dcache or with dcache aliases. The patch fixes this.

    The ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE symbol was introduced to avoid
    pointless empty cache-thrashing loops on architectures for which
    flush_dcache_page() is a no-op. Every architecture was provided with this
    symbol. Pages are flushed on architectures where
    ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE equals 1; otherwise nothing is done.

    See "fix mtd_blkdevs problem with caches on some architectures" discussion
    on LKML for more information.

    Signed-off-by: Ilya Loginov
    Cc: Ingo Molnar
    Cc: David Woodhouse
    Cc: Peter Horton
    Cc: "Ed L. Cashin"
    Signed-off-by: Jens Axboe

    Ilya Loginov
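
The compile-time switch described here can be sketched like this. Everything with a demo_ or DEMO_ prefix is hypothetical: the macro mirrors the role of ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE, and setting a field stands in for a real cache flush.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the per-architecture symbol; defaults to 1 in this sketch. */
#ifndef DEMO_ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
#define DEMO_ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
#endif

struct demo_page {
	int flushed;
};

#if DEMO_ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
/* Stand-in for a real data-cache flush of one page. */
static void demo_flush_dcache_page(struct demo_page *page)
{
	page->flushed = 1;
}

/* Walk every page of the request and flush it. */
static void demo_rq_flush_dcache_pages(struct demo_page *pages, int n)
{
	assert(pages != NULL || n == 0);
	for (int i = 0; i < n; i++)
		demo_flush_dcache_page(&pages[i]);
}
#else
/* Flushing is a no-op on this architecture, so skip the loop entirely. */
static inline void demo_rq_flush_dcache_pages(struct demo_page *pages, int n)
{
	(void)pages;
	(void)n;
}
#endif
```

Compiling the helper out, rather than looping over calls that do nothing, is what avoids the empty cache-thrashing loops the commit message mentions.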
     

24 Oct, 2009

1 commit

  • With 2.6.32-rc5 in a KVM guest using dm and virtio_blk, we see the
    following errors:

    end_request: I/O error, dev vda, sector 0
    end_request: I/O error, dev vda, sector 0

    The errors go away if dm stops submitting empty barriers, by reverting:

    commit 52b1fd5a27c625c78373e024bf570af3c9d44a79
    Author: Mikulas Patocka
    dm: send empty barriers to targets in dm_flush

    We should silently error all barriers, even empty barriers, on devices
    like virtio_blk which don't support them.

    See also:

    https://bugzilla.redhat.com/514901

    Signed-off-by: Mark McLoughlin
    Signed-off-by: Mike Snitzer
    Acked-by: Alasdair G Kergon
    Acked-by: Mikulas Patocka
    Cc: Rusty Russell
    Cc: Neil Brown
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Mark McLoughlin
     

07 Oct, 2009

1 commit

  • Commit a9327cac440be4d8333bba975cbbf76045096275 added separate read
    and write statistics of in_flight requests, and exported the number
    of read and write requests in progress separately through sysfs.

    But Corrado Zoccolo reported getting strange
    output from "iostat -kx 2". Global values for service time and
    utilization were garbage. For interval values, utilization was always
    100%, and service time was higher than normal.

    So this was reverted by commit 0f78ab9899e9d6acb09d5465def618704255963b

    The problem was in part_round_stats_single(), I missed the following:

            if (now == part->stamp)
                    return;

    -       if (part->in_flight) {
    +       if (part_in_flight(part)) {
                    __part_stat_add(cpu, part, time_in_queue,
                                    part_in_flight(part) * (now - part->stamp));
                    __part_stat_add(cpu, part, io_ticks, (now - part->stamp));

    With this chunk included, the reported regression gets fixed.

    Signed-off-by: Nikanth Karthikesan

    --
    Signed-off-by: Jens Axboe

    Nikanth Karthikesan
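
A user-space sketch of the fixed logic follows. It assumes the split counters live in a two-element array, which this log does not show, and all demo_ names are hypothetical. Under that assumption, testing the array member directly would always be true in C, so busy time would be charged even to an idle partition; summing the counters through a helper avoids that.

```c
#include <assert.h>
#include <stddef.h>

struct demo_part {
	unsigned int in_flight[2]; /* [0] = reads, [1] = writes (assumed layout) */
	unsigned long stamp;       /* time of the last stats update */
	unsigned long time_in_queue;
	unsigned long io_ticks;
};

/* Total requests in progress, regardless of direction. */
static unsigned int demo_part_in_flight(const struct demo_part *part)
{
	return part->in_flight[0] + part->in_flight[1];
}

/* Charge the elapsed interval to the partition only if it was busy. */
static void demo_part_round_stats(struct demo_part *part, unsigned long now)
{
	assert(part != NULL);

	if (now == part->stamp)
		return;

	if (demo_part_in_flight(part)) {
		part->time_in_queue +=
			demo_part_in_flight(part) * (now - part->stamp);
		part->io_ticks += now - part->stamp;
	}
	part->stamp = now;
}
```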
     

05 Oct, 2009

2 commits

  • It was briefly introduced to allow CFQ to do delayed scheduling,
    but we ended up removing that feature again. So let's kill the
    function and export, and just switch CFQ back to the normal work
    schedule since it is now passing in a '0' delay from all call
    sites.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This reverts commit a9327cac440be4d8333bba975cbbf76045096275.

    Corrado Zoccolo reports:

    "with 2.6.32-rc1 I started getting the following strange output from
    "iostat -kx 2":
    Linux 2.6.31bisect (et2) 04/10/2009 _i686_ (2 CPU)

    avg-cpu:  %user  %nice %system %iowait %steal  %idle
              10,70   0,00    3,16   15,75   0,00  70,38

    Device: rrqm/s wrqm/s   r/s  w/s   rkB/s wkB/s avgrq-sz avgqu-sz await       svctm      %util
    sda      18,22   0,00  0,67 0,01   14,77  0,02    43,94     0,01 10,53 39043915,03 2629219,87
    sdb      60,89   9,68 50,79 3,04 1724,43 50,52    65,95     0,70 13,06   488437,47 2629219,87

    avg-cpu:  %user  %nice %system %iowait %steal  %idle
               2,72   0,00    0,74    0,00   0,00  96,53

    Device: rrqm/s wrqm/s   r/s  w/s   rkB/s wkB/s avgrq-sz avgqu-sz await       svctm      %util
    sda       0,00   0,00  0,00 0,00    0,00  0,00     0,00     0,00  0,00        0,00     100,00
    sdb       0,00   0,00  0,00 0,00    0,00  0,00     0,00     0,00  0,00        0,00     100,00

    avg-cpu:  %user  %nice %system %iowait %steal  %idle
               6,68   0,00    0,99    0,00   0,00  92,33

    Device: rrqm/s wrqm/s   r/s  w/s   rkB/s wkB/s avgrq-sz avgqu-sz await       svctm      %util
    sda       0,00   0,00  0,00 0,00    0,00  0,00     0,00     0,00  0,00        0,00     100,00
    sdb       0,00   0,00  0,00 0,00    0,00  0,00     0,00     0,00  0,00        0,00     100,00

    avg-cpu:  %user  %nice %system %iowait %steal  %idle
               4,40   0,00    0,73    1,47   0,00  93,40

    Device: rrqm/s wrqm/s   r/s  w/s   rkB/s wkB/s avgrq-sz avgqu-sz await       svctm      %util
    sda       0,00   0,00  0,00 0,00    0,00  0,00     0,00     0,00  0,00        0,00     100,00
    sdb       0,00   4,00  0,00 3,00    0,00 28,00    18,67     0,06 19,50      333,33     100,00

    Global values for service time and utilization are garbage. For
    interval values, utilization is always 100%, and service time is
    higher than normal.

    I bisected it down to:
    [a9327cac440be4d8333bba975cbbf76045096275] Seperate read and write
    statistics of in_flight requests
    and verified that reverting just that commit indeed solves the issue
    on 2.6.32-rc1."

    So until this is debugged, revert the bad commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 Oct, 2009

1 commit


02 Oct, 2009

3 commits

  • Since 2.6.31 now has request-based device-mapper, it's useful to have
    a tracepoint for request-remapping as well as bio-remapping.
    This patch adds a tracepoint for request-remapping, trace_block_rq_remap().

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: Alasdair G Kergon
    Cc: Li Zefan
    Signed-off-by: Jens Axboe

    Jun'ichi Nomura
     
  • Currently we set the bio size to the byte equivalent of the blocks to
    be trimmed when submitting the initial DISCARD ioctl. That means it
    is subject to the max_hw_sectors limitation of the HBA, which is
    much lower than the size of a DISCARD request we can support.
    Add a separate max_discard_sectors tunable to limit the size for discard
    requests.

    We limit the max discard request size in bytes to 32bit as that is the
    limit for bio->bi_size. This could be much larger if we had a way to pass
    that information through the block layer.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • prepare_discard_fn() was being called in a place where memory allocation
    was effectively impossible. This makes it inappropriate for all but
    the most trivial translations of Linux's DISCARD operation to the block
    command set. Additionally adding a payload there makes the ownership
    of the bio backing unclear as it's now allocated by the device driver
    and not the submitter as usual.

    It is replaced with QUEUE_FLAG_DISCARD which is used to indicate whether
    the queue supports discard operations or not. blkdev_issue_discard now
    allocates a one-page, sector-length payload which is the right thing
    for the common ATA and SCSI implementations.

    The mtd implementation of prepare_discard_fn() is replaced with simply
    checking for the request being a discard.

    Largely based on a previous patch from Matthew Wilcox
    which did the prepare_discard_fn but not the different payload allocation
    yet.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
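
The max_discard_sectors limit from the second commit above can be sketched as a clamp plus a chunk count. This is an illustration under stated assumptions: the function names are hypothetical, 512-byte sectors are assumed, and the 32-bit cap models the bio->bi_size limit the commit message mentions.

```c
#include <stdint.h>

#define DEMO_SECTOR_SHIFT 9 /* assumes 512-byte sectors */

/*
 * Per-request discard limit in sectors: the queue's own limit, capped so
 * that the byte length still fits in a 32-bit size field.
 */
static uint32_t demo_max_discard_sectors(uint32_t queue_limit)
{
	uint32_t cap = UINT32_MAX >> DEMO_SECTOR_SHIFT;

	return queue_limit < cap ? queue_limit : cap;
}

/* Number of discard requests needed to cover nr_sects sectors. */
static uint64_t demo_discard_chunks(uint64_t nr_sects, uint32_t queue_limit)
{
	uint64_t max = demo_max_discard_sectors(queue_limit);

	if (max == 0)
		return 0; /* a zero limit is treated here as "no discard support" */
	return (nr_sects + max - 1) / max;
}
```

Keeping this limit separate from max_hw_sectors lets a device accept large discards even when its ordinary transfer size is small.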
     

15 Sep, 2009

1 commit

  • * 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block: (29 commits)
    block: use blkdev_issue_discard in blk_ioctl_discard
    Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads
    block: don't assume device has a request list backing in nr_requests store
    block: Optimal I/O limit wrapper
    cfq: choose a new next_req when a request is dispatched
    Seperate read and write statistics of in_flight requests
    aoe: end barrier bios with EOPNOTSUPP
    block: trace bio queueing trial only when it occurs
    block: enable rq CPU completion affinity by default
    cfq: fix the log message after dispatched a request
    block: use printk_once
    cciss: memory leak in cciss_init_one()
    splice: update mtime and atime on files
    block: make blk_iopoll_prep_sched() follow normal 0/1 return convention
    cfq-iosched: get rid of must_alloc flag
    block: use interrupts disabled version of raise_softirq_irqoff()
    block: fix comment in blk-iopoll.c
    block: adjust default budget for blk-iopoll
    block: fix long lines in block/blk-iopoll.c
    block: add blk-iopoll, a NAPI like approach for block devices
    ...

    Linus Torvalds
     

14 Sep, 2009

1 commit

  • Currently, there is a single in_flight counter measuring the number of
    requests in the request_queue. But some monitoring tools would like to
    know how many read requests and write requests are in progress. Split the
    current in_flight counter into two separate counters for read and write.

    This information is exported as a sysfs attribute, as changing the
    currently available stat files would break the existing tools.

    Signed-off-by: Nikanth Karthikesan
    Signed-off-by: Jens Axboe

    Nikanth Karthikesan
     

11 Sep, 2009

6 commits

  • If a bio is discarded or crosses the end of the device,
    no bio queueing attempt occurs.

    Actually the trace was called just before make_request at first:
    [PATCH] Block queue IO tracing support (blktrace) as of 2006-03-23
         2056a782f8e7e65fd4bfd027506b4ce1c5e9ccd4

    And then 2 patches added some checks between them:
    [PATCH] md: check bio address after mapping through partitions
           5ddfe9691c91a244e8d1be597b6428fcefd58103,
    [BLOCK] Don't allow empty barriers to be passed down to
    queues that don't grok them
           51fd77bd9f512ab6cc9df0733ba1caaab89eb957

    This breaks the original goal.
    Let's trace it only when it happens.

    Signed-off-by: Minchan Kim
    Acked-by: Wu Fengguang
    Cc: Li Zefan
    Signed-off-by: Jens Axboe

    Minchan Kim
     
  • Instead of just checking whether this device uses block layer
    tagging, we can improve the detection by looking at the maximum
    queue depth it has reached. If that crosses 4, then deem it a
    queuing device.

    This is important on high IOPS devices, since plugging hurts
    the performance there (it can be as much as 10-15% of the sys
    time).

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Get rid of any functions that test for these bits and make callers
    use bio_rw_flagged() directly. Then it is at least directly apparent
    what variable and flag they check.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Failfast has characteristics that differ from other attributes. When
    issuing, executing and successfully completing requests, failfast doesn't
    make any difference. It only affects how a request is handled on failure.
    Allowing requests with different failfast settings to be merged causes
    normal IOs to fail prematurely, while not allowing it has performance
    penalties, as failfast is used for read aheads which are likely to be
    located near in-flight or to-be-issued normal IOs.

    This patch introduces the concept of 'mixed merge'. A request is a
    mixed merge if it is a merge of segments which require different
    handling on failure. Currently the only mixable attributes are
    failfast ones (or lack thereof).

    When a bio with different failfast settings is added to an existing
    request or requests of different failfast settings are merged, the
    merged request is marked mixed. Each bio carries failfast settings
    and the request always tracks failfast state of the first bio. When
    the request fails, blk_rq_err_bytes() can be used to determine how
    many bytes can be safely failed without crossing into an area which
    requires further retrials.

    This allows request merging regardless of failfast settings while
    keeping the failure handling correct.

    This patch only implements mixed merge but doesn't enable it. The
    next one will update SCSI to make use of mixed merge.

    Signed-off-by: Tejun Heo
    Cc: Niel Lambrechts
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bio and request use the same set of failfast bits. This patch makes
    the following changes to simplify things.

    * enumify BIO_RW* bits and reorder bits such that BIO_RW_FAILFAST_*
    bits coincide with __REQ_FAILFAST_* bits.

    * The above pushes BIO_RW_AHEAD out of sync with __REQ_FAILFAST_DEV
    but the matching is useless anyway. init_request_from_bio() is
    responsible for setting FAILFAST bits on FS requests and non-FS
    requests never use BIO_RW_AHEAD. Drop the code and comment from
    blk_rq_bio_prep().

    * Define REQ_FAILFAST_MASK which is OR of all FAILFAST bits and
    simplify FAILFAST flags handling in init_request_from_bio().

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • This enables us to track who does what and print info. Its main use
    is catching dirty inodes on the default_backing_dev_info, so we can
    fix that up.

    Signed-off-by: Jens Axboe

    Jens Axboe
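
The failfast mask and the mixed-merge marking from Tejun Heo's two commits above can be sketched together. The bit positions and demo_ names are hypothetical; the sketch only shows a mask built as the OR of all failfast bits and a merge that flags the request when the failfast settings differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit layout for illustration. */
enum {
	DEMO_FAILFAST_DEV       = 1u << 0,
	DEMO_FAILFAST_TRANSPORT = 1u << 1,
	DEMO_FAILFAST_DRIVER    = 1u << 2,
	DEMO_MIXED_MERGE        = 1u << 3,
};

/* OR of all failfast bits, so they can be compared or copied in one step. */
#define DEMO_FAILFAST_MASK \
	(DEMO_FAILFAST_DEV | DEMO_FAILFAST_TRANSPORT | DEMO_FAILFAST_DRIVER)

struct demo_request {
	unsigned int flags; /* tracks the failfast state of the first bio */
};

/*
 * Merge a bio into an existing request. The request keeps its own failfast
 * bits; it is only marked mixed when the incoming bio's settings differ.
 */
static void demo_merge_bio(struct demo_request *rq, unsigned int bio_flags)
{
	assert(rq != NULL);

	if ((rq->flags & DEMO_FAILFAST_MASK) != (bio_flags & DEMO_FAILFAST_MASK))
		rq->flags |= DEMO_MIXED_MERGE;
}
```

On failure, a mixed request is the signal that only part of it may be failed immediately, which is the role the commit message gives to blk_rq_err_bytes().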
     

29 Jul, 2009

1 commit

  • Prior to the change for more sane end_io functions, we exported
    the helpers with the normal EXPORT_SYMBOL(). That got changed
    to _GPL() for the new interface. Revert that particular change,
    on the basis that this is basic functionality and doesn't dip
    into internal structures. If these exports can't be non-GPL,
    then we may as well make EXPORT_SYMBOL() imply GPL for
    everything.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 Jul, 2009

1 commit

  • Move the assignment of a default lock below blk_init_queue() to
    blk_queue_make_request(), so we also get to set the default lock
    for ->make_request_fn() based drivers. This is important since the
    queue flag locking requires a lock to be in place.

    Signed-off-by: Jens Axboe

    Jens Axboe