23 Aug, 2010

1 commit

  • Currently drivers must do an elevator_exit() + elevator_init()
    to switch IO schedulers. There are a few problems with this:

    - Since commit 1abec4fdbb142e3ccb6ce99832fae42129134a96,
    elevator_init() requires a zeroed out q->elevator
    pointer. The two existing in-kernel users don't do that.

    - It will only work at initialization time, since using the
    above two-staged construct does not properly quisce the queue.

    So add elevator_change() which takes care of this, and convert
    the elv_iosched_store() sysfs interface to use this helper as well.

    Reported-by: Peter Oberparleiter
    Reported-by: Kevin Vigor
    Signed-off-by: Jens Axboe

    Jens Axboe
     

12 Aug, 2010

1 commit

  • Secure discard is the same as discard except that all copies of the
    discarded sectors (perhaps created by garbage collection) must also be
    erased.

    Signed-off-by: Adrian Hunter
    Acked-by: Jens Axboe
    Cc: Kyungmin Park
    Cc: Madhusudhan Chikkature
    Cc: Christoph Hellwig
    Cc: Ben Gardiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Hunter
     

08 Aug, 2010

2 commits

  • Remove the current bio flags and reuse the request flags for the bio, too.
    This allows to more easily trace the type of I/O from the filesystem
    down to the block driver. There were two flags in the bio that were
    missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
    renamed two request flags that had a superflous RW in them.

    Note that the flags are in bio.h despite having the REQ_ name - as
    blkdev.h includes bio.h that is the only way to go for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Remove all the trivial wrappers for the cmd_type and cmd_flags fields in
    struct requests. This allows much easier grepping for different request
    types instead of unwinding through macros.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Jun, 2010

1 commit


24 May, 2010

1 commit

  • Bio-based DM doesn't use an elevator (queue is !blk_queue_stackable()).

    Longer-term DM will not allocate an elevator for bio-based DM. But even
    then there will be small potential for an elevator to be allocated for
    a request-based DM table only to have a bio-based table be loaded in the
    end.

    Displaying "none" for bio-based DM will help avoid user confusion.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

11 May, 2010

1 commit

  • blk_init_queue() allocates the request_queue structure and then
    initializes it as needed (request_fn, elevator, etc).

    Split initialization out to blk_init_allocated_queue_node.
    Introduce blk_init_allocated_queue wrapper function to model existing
    blk_init_queue and blk_init_queue_node interfaces.

    Export elv_register_queue to allow a newly added elevator to be
    registered with sysfs. Export elv_unregister_queue for symmetry.

    These changes allow DM to initialize a device's request_queue with more
    precision. In particular, DM no longer unconditionally initializes a
    full request_queue (elevator et al). It only does so for a
    request-based DM device.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

09 Apr, 2010

1 commit

  • This includes both the number of bios merged into requests belonging to this
    cgroup as well as the number of requests merged together.
    In the past, we've observed different merging behavior across upstream kernels,
    some by design some actual bugs. This stat helps a lot in debugging such
    problems when applications report decreased throughput with a new kernel
    version.

    This needed adding an extra elevator function to capture bios being merged as I
    did not want to pollute elevator code with blkiocg knowledge and hence needed
    the accounting invocation to come from CFQ.

    Signed-off-by: Divyesh Shah
    Signed-off-by: Jens Axboe

    Divyesh Shah
     

02 Apr, 2010

1 commit


08 Mar, 2010

1 commit

  • Constify struct sysfs_ops.

    This is part of the ops structure constification
    effort started by Arjan van de Ven et al.

    Benefits of this constification:

    * prevents modification of data that is shared
    (referenced) by many other structure instances
    at runtime

    * detects/prevents accidental (but not intentional)
    modification attempts on archs that enforce
    read-only kernel data at runtime

    * potentially better optimized code as the compiler
    can assume that the const data cannot be changed

    * the compiler/linker move const data into .rodata
    and therefore exclude them from false sharing

    Signed-off-by: Emese Revfy
    Acked-by: David Teigland
    Acked-by: Matt Domsch
    Acked-by: Maciej Sosnowski
    Acked-by: Hans J. Koch
    Acked-by: Pekka Enberg
    Acked-by: Jens Axboe
    Acked-by: Stephen Hemminger
    Signed-off-by: Greg Kroah-Hartman

    Emese Revfy
     

29 Jan, 2010

1 commit

  • Updated 'nomerges' tunable to accept a value of '2' - indicating that _no_
    merges at all are to be attempted (not even the simple one-hit cache).

    The following table illustrates the additional benefit - 5 minute runs of
    a random I/O load were applied to a dozen devices on a 16-way x86_64 system.

    nomerges Throughput %System Improvement (tput / %sys)
    -------- ------------ ----------- -------------------------
    0 12.45 MB/sec 0.669365609
    1 12.50 MB/sec 0.641519199 0.40% / 2.71%
    2 12.52 MB/sec 0.639849750 0.56% / 2.96%

    Signed-off-by: Alan D. Brunelle
    Signed-off-by: Jens Axboe

    Alan D. Brunelle
     

13 Oct, 2009

1 commit


09 Oct, 2009

1 commit

  • elv_iosched_store() ignore the return value of strstrip(). It makes small
    inconsistent behavior.

    This patch fixes it.


    ====================================
    # cd /sys/block/{blockdev}/queue

    case1:
    # echo "anticipatory" > scheduler
    # cat scheduler
    noop [anticipatory] deadline cfq

    case2:
    # echo "anticipatory " > scheduler
    # cat scheduler
    noop [anticipatory] deadline cfq

    case3:
    # echo " anticipatory" > scheduler
    bash: echo: write error: Invalid argument


    ====================================
    # cd /sys/block/{blockdev}/queue

    case1:
    # echo "anticipatory" > scheduler
    # cat scheduler
    noop [anticipatory] deadline cfq

    case2:
    # echo "anticipatory " > scheduler
    # cat scheduler
    noop [anticipatory] deadline cfq

    case3:
    # echo " anticipatory" > scheduler
    noop [anticipatory] deadline cfq

    Cc: Li Zefan
    Cc: Jens Axboe
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Jens Axboe

    KOSAKI Motohiro
     

03 Oct, 2009

1 commit

  • AS is mostly a subset of CFQ, so there's little point in still
    providing this separate IO scheduler. Hopefully at some point we
    can get down to one single IO scheduler again, at least this brings
    us closer by having only one intelligent IO scheduler.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Sep, 2009

2 commits

  • Get rid of any functions that test for these bits and make callers
    use bio_rw_flagged() directly. Then it is at least directly apparent
    what variable and flag they check.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Update scsi_io_completion() such that it only fails requests till the
    next error boundary and retry the leftover. This enables block layer
    to merge requests with different failfast settings and still behave
    correctly on errors. Allow merge of requests of different failfast
    settings.

    As SCSI is currently the only subsystem which follows failfast status,
    there's no need to worry about other block drivers for now.

    Signed-off-by: Tejun Heo
    Cc: Niel Lambrechts
    Cc: James Bottomley
    Signed-off-by: Jens Axboe

    Tejun Heo
     

17 Jul, 2009

1 commit

  • Commit ab0fd1debe730ec9998678a0c53caefbd121ed10 tries to prevent merge
    of requests with different failfast settings. In elv_rq_merge_ok(),
    it compares new bio's failfast flags against the merge target
    request's. However, the flag testing accessors for bio and blk don't
    return boolean but the tested bit value directly and FAILFAST on bio
    and blk don't match, so directly comparing them with == results in
    false negative unnecessary preventing merge of readahead requests.

    This patch convert the results to boolean by negating them before
    comparison.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Boaz Harrosh
    Cc: FUJITA Tomonori
    Cc: James Bottomley
    Cc: Jeff Garzik

    Tejun Heo
     

04 Jul, 2009

1 commit

  • Block layer used to merge requests and bios with different failfast
    settings. This caused regular IOs to fail prematurely when they were
    merged into failfast requests for readahead.

    Niel Lambrechts could trigger the problem semi-reliably on ext4 when
    resuming from STR. ext4 uses readahead when reading inodes and
    combined with the deterministic extra SATA PHY exception cycle during
    resume on the specific configuration, non-readahead inode read would
    fail causing ext4 errors. Please read the following thread for
    details.

    http://lkml.org/lkml/2009/5/23/21

    This patch makes block layer reject merging if the failfast settings
    don't match. This is correct but likely to lower IO performance by
    preventing regular IOs from mingling into surrounding readahead
    requests. Changes to allow such mixed merges and handle errors
    correctly will be added later.

    Signed-off-by: Tejun Heo
    Reported-by: Niel Lambrechts
    Cc: Theodore Tso
    Signed-off-by: Jens Axboe

    Tejun Heo
     

12 Jun, 2009

1 commit

  • * 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
    block: add request clone interface (v2)
    floppy: fix hibernation
    ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
    fs/bio.c: add missing __user annotation
    block: prevent possible io_context->refcount overflow
    Add serial number support for virtio_blk, V4a
    block: Add missing bounce_pfn stacking and fix comments
    Revert "block: Fix bounce limit setting in DM"
    cciss: decode unit attention in SCSI error handling code
    cciss: Remove no longer needed sendcmd reject processing code
    cciss: change SCSI error handling routines to work with interrupts enabled.
    cciss: separate error processing and command retrying code in sendcmd_withirq_core()
    cciss: factor out fix target status processing code from sendcmd functions
    cciss: simplify interface of sendcmd() and sendcmd_withirq()
    cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
    cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
    block: needs to set the residual length of a bidi request
    Revert "block: implement blkdev_readpages"
    block: Fix bounce limit setting in DM
    Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
    ...

    Manually fix conflicts with tracing updates in:
    block/blk-sysfs.c
    drivers/ide/ide-atapi.c
    drivers/ide/ide-cd.c
    drivers/ide/ide-floppy.c
    drivers/ide/ide-tape.c
    include/trace/events/block.h
    kernel/trace/blktrace.c

    Linus Torvalds
     

10 Jun, 2009

1 commit

  • TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
    these new capabilities to this tracepoint:

    - zero-copy and per-cpu splice() tracing
    - binary tracing without printf overhead
    - structured logging records exposed under /debug/tracing/events
    - trace events embedded in function tracer output and other plugins
    - user-defined, per tracepoint filter expressions
    ...

    Cons:

    - no dev_t info for the output of plug, unplug_timer and unplug_io events.
    no dev_t info for getrq and sleeprq events if bio == NULL.
    no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.

    This is mainly because we can't get the deivce from a request queue.
    But this may change in the future.

    - A packet command is converted to a string in TP_assign, not TP_print.
    While blktrace do the convertion just before output.

    Since pc requests should be rather rare, this is not a big issue.

    - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
    has a unique format, which means we have some unused data in a trace entry.

    The overhead is minimized by using __dynamic_array() instead of __array().

    I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:

    dd dd + ioctl blktrace dd + TRACE_EVENT (splice)
    1 7.36s, 42.7 MB/s 7.50s, 42.0 MB/s 7.41s, 42.5 MB/s
    2 7.43s, 42.3 MB/s 7.48s, 42.1 MB/s 7.43s, 42.4 MB/s
    3 7.38s, 42.6 MB/s 7.45s, 42.2 MB/s 7.41s, 42.5 MB/s

    So the overhead of tracing is very small, and no regression when using
    those trace events vs blktrace.

    And the binary output of TRACE_EVENT is much smaller than blktrace:

    # ls -l -h
    -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
    -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
    -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out

    Following are some comparisons between TRACE_EVENT and blktrace:

    plug:
    kjournald-480 [000] 303.084981: block_plug: [kjournald]
    kjournald-480 [000] 303.084981: 8,0 P N [kjournald]

    unplug_io:
    kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1
    kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1

    remap:
    kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8 v3:

    - use the newly introduced __dynamic_array().

    Changelog from v1 -> v2:

    - use __string() instead of __array() to minimize the memory required
    to store hex dump of rq->cmd().

    - support large pc requests.

    - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.

    - some cleanups.

    Signed-off-by: Li Zefan
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Li Zefan
     

02 Jun, 2009

1 commit

  • I found one more mis-conversion to the 'request is always dequeued
    when completing' model in elv_abort_queue() during code inspection.
    Although I haven't hit any problem caused by this mis-conversion yet
    and just done compile/boot test, please apply if you have no problem.

    Request must be dequeued when it completes.
    However, elv_abort_queue() completes requests without dequeueing.
    This will cause oops in the __blk_end_request_all().
    This patch fixes the oops.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Jens Axboe

    Kiyoshi Ueda
     

23 May, 2009

1 commit

  • Currently stacking devices do not have a queue directory in sysfs.
    However, many of the I/O characteristics like sector size, maximum
    request size, etc. are queue properties.

    This patch enables the queue directory for MD/DM devices. The elevator
    code has been modified to deal with queues that do not have an I/O
    scheduler.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

20 May, 2009

1 commit


11 May, 2009

1 commit

  • With recent cleanups, there is no place where low level driver
    directly manipulates request fields. This means that the 'hard'
    request fields always equal the !hard fields. Convert all
    rq->sectors, nr_sectors and current_nr_sectors references to
    accessors.

    While at it, drop superflous blk_rq_pos() < 0 test in swim.c.

    [ Impact: use pos and nr_sectors accessors ]

    Signed-off-by: Tejun Heo
    Acked-by: Geert Uytterhoeven
    Tested-by: Grant Likely
    Acked-by: Grant Likely
    Tested-by: Adrian McMenamin
    Acked-by: Adrian McMenamin
    Acked-by: Mike Miller
    Cc: James Bottomley
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Borislav Petkov
    Cc: Sergei Shtylyov
    Cc: Eric Moore
    Cc: Alan Stern
    Cc: FUJITA Tomonori
    Cc: Pete Zaitcev
    Cc: Stephen Rothwell
    Cc: Paul Clements
    Cc: Tim Waugh
    Cc: Jeff Garzik
    Cc: Jeremy Fitzhardinge
    Cc: Alex Dubov
    Cc: David Woodhouse
    Cc: Martin Schwidefsky
    Cc: Dario Ballabio
    Cc: David S. Miller
    Cc: Rusty Russell
    Cc: unsik Kim
    Cc: Laurent Vivier
    Signed-off-by: Jens Axboe

    Tejun Heo
     

28 Apr, 2009

3 commits

  • There are many [__]blk_end_request() call sites which call it with
    full request length and expect full completion. Many of them ensure
    that the request actually completes by doing BUG_ON() the return
    value, which is awkward and error-prone.

    This patch adds [__]blk_end_request_all() which takes @rq and @error
    and fully completes the request. BUG_ON() is added to to ensure that
    this actually happens.

    Most conversions are simple but there are a few noteworthy ones.

    * cdrom/viocd: viocd_end_request() replaced with direct calls to
    __blk_end_request_all().

    * s390/block/dasd: dasd_end_request() replaced with direct calls to
    __blk_end_request_all().

    * s390/char/tape_block: tapeblock_end_request() replaced with direct
    calls to blk_end_request_all().

    [ Impact: cleanup ]

    Signed-off-by: Tejun Heo
    Cc: Russell King
    Cc: Stephen Rothwell
    Cc: Mike Miller
    Cc: Martin Schwidefsky
    Cc: Jeff Garzik
    Cc: Rusty Russell
    Cc: Jeremy Fitzhardinge
    Cc: Alex Dubov
    Cc: James Bottomley

    Tejun Heo
     
  • Impact: code reorganization

    elv_next_request() and elv_dequeue_request() are public block layer
    interface than actual elevator implementation. They mostly deal with
    how requests interact with block layer and low level drivers at the
    beginning of rqeuest processing whereas __elv_next_request() is the
    actual eleveator request fetching interface.

    Move the two functions to blk-core.c. This prepares for further
    interface cleanup.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • blk_start_queueing() is identical to __blk_run_queue() except that it
    doesn't check for recursion. None of the current users depends on
    blk_start_queueing() running request_fn directly. Replace usages of
    blk_start_queueing() with [__]blk_run_queue() and kill it.

    [ Impact: removal of mostly duplicate interface function ]

    Signed-off-by: Tejun Heo

    Tejun Heo
     

15 Apr, 2009

1 commit


07 Apr, 2009

2 commits


06 Apr, 2009

1 commit


29 Dec, 2008

3 commits

  • Just use struct elevator_queue everywhere instead.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Empty barrier required special handling in __elv_next_request() to
    complete it without letting the low level driver see it.

    With previous changes, barrier code is now flexible enough to skip the
    BAR step using the same barrier sequence selection mechanism. Drop
    the special handling and mask off q->ordered from start_ordered().

    Remove blk_empty_barrier() test which now has no user.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Barrier completion had the following assumptions.

    * start_ordered() couldn't finish the whole sequence properly. If all
    actions are to be skipped, q->ordseq is set correctly but the actual
    completion was never triggered thus hanging the barrier request.

    * Drain completion in elv_complete_request() assumed that there's
    always at least one request in the queue when drain completes.

    Both assumptions are true but these assumptions need to be removed to
    improve empty barrier implementation. This patch makes the following
    changes.

    * Make start_ordered() use blk_ordered_complete_seq() to mark skipped
    steps complete and notify __elv_next_request() that it should fetch
    the next request if the whole barrier has completed inside
    start_ordered().

    * Make drain completion path in elv_complete_request() check whether
    the queue is empty. Empty queue also indicates drain completion.

    * While at it, convert 0/1 return from blk_do_ordered() to false/true.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

05 Dec, 2008

1 commit


03 Dec, 2008

1 commit

  • blkdev_dequeue_request() and elv_dequeue_request() are equivalent and
    both start the timeout timer. Barrier code dequeues the original
    barrier request but doesn't passes the request itself to lower level
    driver, only broken down proxy requests; however, as the original
    barrier code goes through the same dequeue path and timeout timer is
    started on it. If barrier sequence takes long enough, this timer
    expires but the low level driver has no idea about this request and
    oops follows.

    Timeout timer shouldn't have been started on the original barrier
    request as it never goes through actual IO. This patch unexports
    elv_dequeue_request(), which has no external user anyway, and makes it
    operate on elevator proper w/o adding the timer and make
    blkdev_dequeue_request() call elv_dequeue_request() and add timer.
    Internal users which don't pass the request to driver - barrier code
    and end_that_request_last() - are converted to use
    elv_dequeue_request().

    Signed-off-by: Tejun Heo
    Cc: Mike Anderson
    Signed-off-by: Jens Axboe

    Tejun Heo
     

26 Nov, 2008

2 commits

  • Port to the new tracepoints API: split DEFINE_TRACE() and DECLARE_TRACE()
    sites. Spread them out to the usage sites, as suggested by
    Mathieu Desnoyers.

    Signed-off-by: Ingo Molnar
    Acked-by: Mathieu Desnoyers

    Ingo Molnar
     
  • This was a forward port of work done by Mathieu Desnoyers, I changed it to
    encode the 'what' parameter on the tracepoint name, so that one can register
    interest in specific events and not on classes of events to then check the
    'what' parameter.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: Jens Axboe
    Signed-off-by: Ingo Molnar

    Arnaldo Carvalho de Melo
     

06 Nov, 2008

1 commit

  • Block queue supports two usage models - one where block driver peeks
    at the front of queue using elv_next_request(), processes it and
    finishes it and the other where block driver peeks at the front of
    queue, dequeue the request using blkdev_dequeue_request() and finishes
    it. The latter is more flexible as it allows the driver to process
    multiple commands concurrently.

    These two inconsistent usage models affect the block layer
    implementation confusing. For some, elv_next_request() is considered
    the issue point while others consider blkdev_dequeue_request() the
    issue point.

    Till now the inconsistency mostly affect only accounting, so it didn't
    really break anything seriously; however, with block layer timeout,
    this inconsistency hits hard. Block layer considers
    elv_next_request() the issue point and adds timer but SCSI layer
    thinks it was just peeking and when the request can't process the
    command right away, it's just left there without further processing.
    This makes the request dangling on the timer list and, when the timer
    goes off, the request which the SCSI layer and below think is still on
    the block queue ends up in the EH queue, causing various problems - EH
    hang (failed count goes over busy count and EH never wakes up),
    WARN_ON() and oopses as low level driver trying to handle the unknown
    command, etc. depending on the timing.

    As SCSI midlayer is the only user of block layer timer at the moment,
    moving blk_add_timer() to elv_dequeue_request() fixes the problem;
    however, this two usage models definitely need to be cleaned up in the
    future.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

17 Oct, 2008

1 commit

  • Callers should use either blk_run_queue/__blk_run_queue, or
    blk_start_queueing() to invoke request handling instead of calling
    ->request_fn() directly as that does not take the queue stopped
    flag into account.

    Also add appropriate comments on the above functions to detail
    their usage.

    Signed-off-by: Jens Axboe

    Jens Axboe