03 Jun, 2011

1 commit

  • Hi, Jens,

    If you recall, I posted an RFC patch for this back in July of last year:
    http://lkml.org/lkml/2010/7/13/279

    The basic problem is that a process can issue a never-ending stream of
    async direct I/Os to the same sector on a device, thus starving out
    other I/O in the system (due to the way the alias handling works in both
    cfq and deadline). The solution I proposed back then was to start
    dispatching from the fifo after a certain number of aliases had been
    dispatched. Vivek asked why we had to treat aliases differently at all,
    and I never had a good answer. So, I put together a simple patch which
    allows aliases to be added to the rb tree (it adds them to the right,
    though that doesn't matter as the order isn't guaranteed anyway). I
    think this is the preferred solution, as it doesn't break up time slices
    in CFQ or batches in deadline. I've tested it, and it does solve the
    starvation issue. Let me know what you think.

    Cheers,
    Jeff

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

21 May, 2011

2 commits

  • We don't need them anymore, so kill:

    - REQ_ON_PLUG checks in various places
    - !rq_mergeable() check in plug merging

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Since for-2.6.40/core was forked off the 2.6.39 devel tree, we've
    had churn in the core area that makes it difficult to handle
    patches for eg cfq or blk-throttle. Instead of requiring that they
    be based in older versions with bugs that have been fixed later
    in the rc cycle, merge in 2.6.39 final.

    Also fixes up conflicts in the below files.

    Conflicts:
    drivers/block/paride/pcd.c
    drivers/cdrom/viocd.c
    drivers/ide/ide-cd.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     

06 May, 2011

1 commit

  • After the anticipatory scheduler was dropped, there was no need to
    special-case the request_module string. As such, drop the redundant
    sprintf and stack variable.

    Signed-off-by: Kees Cook
    Signed-off-by: Jens Axboe

    Kees Cook
     

22 Apr, 2011

1 commit


18 Apr, 2011

1 commit

  • Instead of overloading __blk_run_queue to force an offload to kblockd
    add a new blk_run_queue_async helper to do it explicitly. I've kept
    the blk_queue_stopped check for now, but I suspect it's not needed
    as the check we do when the workqueue items runs should be enough.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

06 Apr, 2011

1 commit


21 Mar, 2011

1 commit

  • One of the disadvantages of on-stack plugging is that we potentially
    lose out on merging since all pending IO isn't always visible to
    everybody. When we flush the on-stack plugs, right now we don't do
    any checks to see if potential merge candidates could be utilized.

    Correct this by adding a new insert variant, ELEVATOR_INSERT_SORT_MERGE.
    It works just ELEVATOR_INSERT_SORT, but first checks whether we can
    merge with an existing request before doing the insertion (if we fail
    merging).

    This fixes a regression with multiple processes issuing IO that
    can be merged.

    Thanks to Shaohua Li for testing and fixing
    an accounting bug.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Mar, 2011

3 commits

  • Conflicts:
    block/blk-core.c
    block/blk-flush.c
    drivers/md/raid1.c
    drivers/md/raid10.c
    drivers/md/raid5.c
    fs/nilfs2/btnode.c
    fs/nilfs2/mdt.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So lets kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This patch adds support for creating a queuing context outside
    of the queue itself. This enables us to batch up pieces of IO
    before grabbing the block device queue lock and submitting them to
    the IO scheduler.

    The context is created on the stack of the process and assigned in
    the task structure, so that we can auto-unplug it if we hit a schedule
    event.

    The current queue plugging happens implicitly if IO is submitted to
    an empty device, yet callers have to remember to unplug that IO when
    they are going to wait for it. This is an ugly API and has caused bugs
    in the past. Additionally, it requires hacks in the vm (->sync_page()
    callback) to handle that logic. By switching to an explicit plugging
    scheme we make the API a lot nicer and can get rid of the ->sync_page()
    hack in the vm.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Mar, 2011

1 commit

  • This merge creates two set of conflicts. One is simple context
    conflicts caused by removal of throtl_scheduled_delayed_work() in
    for-linus and removal of throtl_shutdown_timer_wq() in
    for-2.6.39/core.

    The other is caused by commit 255bb490c8 (block: blk-flush shouldn't
    call directly into q->request_fn() __blk_run_queue()) in for-linus
    crashing with FLUSH reimplementation in for-2.6.39/core. The conflict
    isn't trivial but the resolution is straight-forward.

    * __blk_run_queue() calls in flush_end_io() and flush_data_end_io()
    should be called with @force_kblockd set to %true.

    * elv_insert() in blk_kick_flush() should use
    %ELEVATOR_INSERT_REQUEUE.

    Both changes are to avoid invoking ->request_fn() directly from
    request completion path and closely match the changes in the commit
    255bb490c8.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

02 Mar, 2011

1 commit

  • __blk_run_queue() automatically either calls q->request_fn() directly
    or schedules kblockd depending on whether the function is recursed.
    blk-flush implementation needs to be able to explicitly choose
    kblockd. Add @force_kblockd.

    All the current users are converted to specify %false for the
    parameter and this patch doesn't introduce any behavior change.

    stable: This is prerequisite for fixing ide oops caused by the new
    blk-flush implementation.

    Signed-off-by: Tejun Heo
    Cc: Jan Beulich
    Cc: James Bottomley
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     

11 Feb, 2011

1 commit

  • Flush requests are never put on the IO scheduler. Convert request
    structure's elevator_private* into an array and have the flush fields
    share a union with it.

    Reclaim the space lost in 'struct request' by moving 'completion_data'
    back in the union with 'rb_node'.

    Signed-off-by: Mike Snitzer
    Acked-by: Vivek Goyal
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

25 Jan, 2011

1 commit

  • The current FLUSH/FUA support has evolved from the implementation
    which had to perform queue draining. As such, sequencing is done
    queue-wide one flush request after another. However, with the
    draining requirement gone, there's no reason to keep the queue-wide
    sequential approach.

    This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
    request is sequenced individually. The actual FLUSH execution is
    double buffered and whenever a request wants to execute one for either
    PRE or POSTFLUSH, it queues on the pending queue. Once certain
    conditions are met, a flush request is issued and on its completion
    all pending requests proceed to the next sequence.

    This allows arbitrary merging of different type of flushes. How they
    are merged can be primarily controlled and tuned by adjusting the
    above said 'conditions' used to determine when to issue the next
    flush.

    This is inspired by Darrick's patches to merge multiple zero-data
    flushes which helps workloads with highly concurrent fsync requests.

    * As flush requests are never put on the IO scheduler, request fields
    used for flush share space with rq->rb_node. rq->completion_data is
    moved out of the union. This increases the request size by one
    pointer.

    As rq->elevator_private* are used only by the iosched too, it is
    possible to reduce the request size further. However, to do that,
    we need to modify request allocation path such that iosched data is
    not allocated for flush requests.

    * FLUSH/FUA processing happens on insertion now instead of dispatch.

    - Comments updated as per Vivek and Mike.

    Signed-off-by: Tejun Heo
    Cc: "Darrick J. Wong"
    Cc: Shaohua Li
    Cc: Christoph Hellwig
    Cc: Vivek Goyal
    Cc: Mike Snitzer
    Signed-off-by: Jens Axboe

    Tejun Heo
     

10 Nov, 2010

1 commit

  • REQ_HARDBARRIER is dead now, so remove the leftovers. What's left
    at this point is:

    - various checks inside the block layer.
    - sanity checks in bio based drivers.
    - now unused bio_empty_barrier helper.
    - Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it's dead for a while,
    but Xen really needs to sort out it's barrier situaton.
    - setting of ordered tags in uas - dead code copied from old scsi
    drivers.
    - scsi different retry for barriers - it's dead and should have been
    removed when flushes were converted to FS requests.
    - blktrace handling of barriers - removed. Someone who knows blktrace
    better should add support for REQ_FLUSH and REQ_FUA, though.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

19 Oct, 2010

1 commit


07 Oct, 2010

1 commit

  • 2.6.36 introduces an API for drivers to switch the IO scheduler
    instead of manually calling the elevator exit and init functions.
    This API was added since q->elevator must be cleared in between
    those two calls. And since we already have this functionality
    directly from use by the sysfs interface to switch schedulers
    online, it was prudent to reuse it internally too.

    But this API needs the queue to be in a fully initialized state
    before it is called, or it will attempt to unregister elevator
    kobjects before they have been added. This results in an oops
    like this:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000051
    IP: [] sysfs_create_dir+0x2e/0xc0
    PGD 47ddfc067 PUD 47c6a1067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP
    last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:04:00.1/irq
    CPU 2
    Modules linked in: t(+) loop hid_apple usbhid ahci ehci_hcd uhci_hcd libahci usbcore nls_base igb

    Pid: 7319, comm: modprobe Not tainted 2.6.36-rc6+ #132 QSSC-S4R/QSSC-S4R
    RIP: 0010:[] [] sysfs_create_dir+0x2e/0xc0
    RSP: 0018:ffff88027da25d08 EFLAGS: 00010246
    RAX: ffff88047c68c528 RBX: 00000000fffffffe RCX: 0000000000000000
    RDX: 000000000000002f RSI: 000000000000002f RDI: ffff88047e196c88
    RBP: ffff88027da25d38 R08: 0000000000000000 R09: d84156c5635688c0
    R10: d84156c5635688c0 R11: 0000000000000000 R12: ffff88047e196c88
    R13: 0000000000000000 R14: 0000000000000000 R15: ffff88047c68c528
    FS: 00007fcb0b26f6e0(0000) GS:ffff880287400000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000051 CR3: 000000047e76e000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process modprobe (pid: 7319, threadinfo ffff88027da24000, task ffff88027d377090)
    Stack:
    ffff88027da25d58 ffff88047c68c528 00000000fffffffe ffff88047e196c88
    ffff88047c68c528 ffff88047e05bd90 ffff88027da25d78 ffffffff8123fb77
    ffff88047e05bd90 0000000000000000 ffff88047e196c88 ffff88047c68c528
    Call Trace:
    [] kobject_add_internal+0xe7/0x1f0
    [] kobject_add_varg+0x38/0x60
    [] kobject_add+0x69/0x90
    [] ? sysfs_remove_dir+0x20/0xa0
    [] ? sub_preempt_count+0x9d/0xe0
    [] ? _raw_spin_unlock+0x30/0x50
    [] ? sysfs_remove_dir+0x20/0xa0
    [] ? sysfs_remove_dir+0x34/0xa0
    [] elv_register_queue+0x34/0xa0
    [] elevator_change+0xfd/0x250
    [] ? t_init+0x0/0x361 [t]
    [] ? t_init+0x0/0x361 [t]
    [] t_init+0xa8/0x361 [t]
    [] do_one_initcall+0x3e/0x170
    [] sys_init_module+0xbd/0x220
    [] system_call_fastpath+0x16/0x1b
    Code: e5 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 85 ff 74 52 48 8b 47 18 49 c7 c5 00 46 61 81 48 85 c0 74 04 4c 8b 68 30 45 31 f6 80 7d 51 00 74 0e 49 8b 44 24 28 4c 89 e7 ff 50 20 49 89 c6
    RIP [] sysfs_create_dir+0x2e/0xc0
    RSP
    CR2: 0000000000000051
    ---[ end trace a6541d3bf07945df ]---

    Fix this by adding a registered bit to the elevator queue, which is
    set when the sysfs kobjects have been registered.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Sep, 2010

1 commit

  • Filesystems will take all the responsibilities for ordering requests
    around commit writes and will only indicate how the commit writes
    themselves should be handled by block layers. This patch drops
    barrier ordering by queue draining from block layer. Ordering by
    draining implementation was somewhat invasive to request handling.
    List of notable changes follow.

    * Each queue has 1 bit color which is flipped on each barrier issue.
    This is used to track whether a given request is issued before the
    current barrier or not. REQ_ORDERED_COLOR flag and coloring
    implementation in __elv_add_request() are removed.

    * Requests which shouldn't be processed yet for draining were stalled
    by returning -EAGAIN from blk_do_ordered() according to the test
    result between blk_ordered_req_seq() and blk_blk_ordered_cur_seq().
    This logic is removed.

    * Draining completion logic in elv_completed_request() removed.

    * All barrier sequence requests were queued to request queue and then
    trckled to lower layer according to progress and thus maintaining
    request orders during requeue was necessary. This is replaced by
    queueing the next request in the barrier sequence only after the
    current one is complete from blk_ordered_complete_seq(), which
    removes the need for multiple proxy requests in struct request_queue
    and the request sorting logic in the ELEVATOR_INSERT_REQUEUE path of
    elv_insert().

    * As barriers no longer have ordering constraints, there's no need to
    dump the whole elevator onto the dispatch queue on each barrier.
    Insert barriers at the front instead.

    * If other barrier requests come to the front of the dispatch queue
    while one is already in progress, they are stored in
    q->pending_barriers and restored to dispatch queue one-by-one after
    each barrier completion from blk_ordered_complete_seq().

    Signed-off-by: Tejun Heo
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Tejun Heo
     

23 Aug, 2010

1 commit

  • Currently drivers must do an elevator_exit() + elevator_init()
    to switch IO schedulers. There are a few problems with this:

    - Since commit 1abec4fdbb142e3ccb6ce99832fae42129134a96,
    elevator_init() requires a zeroed out q->elevator
    pointer. The two existing in-kernel users don't do that.

    - It will only work at initialization time, since using the
    above two-staged construct does not properly quisce the queue.

    So add elevator_change() which takes care of this, and convert
    the elv_iosched_store() sysfs interface to use this helper as well.

    Reported-by: Peter Oberparleiter
    Reported-by: Kevin Vigor
    Signed-off-by: Jens Axboe

    Jens Axboe
     

12 Aug, 2010

1 commit

  • Secure discard is the same as discard except that all copies of the
    discarded sectors (perhaps created by garbage collection) must also be
    erased.

    Signed-off-by: Adrian Hunter
    Acked-by: Jens Axboe
    Cc: Kyungmin Park
    Cc: Madhusudhan Chikkature
    Cc: Christoph Hellwig
    Cc: Ben Gardiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Hunter
     

08 Aug, 2010

2 commits

  • Remove the current bio flags and reuse the request flags for the bio, too.
    This allows to more easily trace the type of I/O from the filesystem
    down to the block driver. There were two flags in the bio that were
    missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
    renamed two request flags that had a superflous RW in them.

    Note that the flags are in bio.h despite having the REQ_ name - as
    blkdev.h includes bio.h that is the only way to go for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Remove all the trivial wrappers for the cmd_type and cmd_flags fields in
    struct requests. This allows much easier grepping for different request
    types instead of unwinding through macros.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Jun, 2010

1 commit


24 May, 2010

1 commit

  • Bio-based DM doesn't use an elevator (queue is !blk_queue_stackable()).

    Longer-term DM will not allocate an elevator for bio-based DM. But even
    then there will be small potential for an elevator to be allocated for
    a request-based DM table only to have a bio-based table be loaded in the
    end.

    Displaying "none" for bio-based DM will help avoid user confusion.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

11 May, 2010

1 commit

  • blk_init_queue() allocates the request_queue structure and then
    initializes it as needed (request_fn, elevator, etc).

    Split initialization out to blk_init_allocated_queue_node.
    Introduce blk_init_allocated_queue wrapper function to model existing
    blk_init_queue and blk_init_queue_node interfaces.

    Export elv_register_queue to allow a newly added elevator to be
    registered with sysfs. Export elv_unregister_queue for symmetry.

    These changes allow DM to initialize a device's request_queue with more
    precision. In particular, DM no longer unconditionally initializes a
    full request_queue (elevator et al). It only does so for a
    request-based DM device.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

09 Apr, 2010

1 commit

  • This includes both the number of bios merged into requests belonging to this
    cgroup as well as the number of requests merged together.
    In the past, we've observed different merging behavior across upstream kernels,
    some by design some actual bugs. This stat helps a lot in debugging such
    problems when applications report decreased throughput with a new kernel
    version.

    This needed adding an extra elevator function to capture bios being merged as I
    did not want to pollute elevator code with blkiocg knowledge and hence needed
    the accounting invocation to come from CFQ.

    Signed-off-by: Divyesh Shah
    Signed-off-by: Jens Axboe

    Divyesh Shah
     

02 Apr, 2010

1 commit


08 Mar, 2010

1 commit

  • Constify struct sysfs_ops.

    This is part of the ops structure constification
    effort started by Arjan van de Ven et al.

    Benefits of this constification:

    * prevents modification of data that is shared
    (referenced) by many other structure instances
    at runtime

    * detects/prevents accidental (but not intentional)
    modification attempts on archs that enforce
    read-only kernel data at runtime

    * potentially better optimized code as the compiler
    can assume that the const data cannot be changed

    * the compiler/linker move const data into .rodata
    and therefore exclude them from false sharing

    Signed-off-by: Emese Revfy
    Acked-by: David Teigland
    Acked-by: Matt Domsch
    Acked-by: Maciej Sosnowski
    Acked-by: Hans J. Koch
    Acked-by: Pekka Enberg
    Acked-by: Jens Axboe
    Acked-by: Stephen Hemminger
    Signed-off-by: Greg Kroah-Hartman

    Emese Revfy
     

29 Jan, 2010

1 commit

  • Updated 'nomerges' tunable to accept a value of '2' - indicating that _no_
    merges at all are to be attempted (not even the simple one-hit cache).

    The following table illustrates the additional benefit - 5 minute runs of
    a random I/O load were applied to a dozen devices on a 16-way x86_64 system.

    nomerges Throughput %System Improvement (tput / %sys)
    -------- ------------ ----------- -------------------------
    0 12.45 MB/sec 0.669365609
    1 12.50 MB/sec 0.641519199 0.40% / 2.71%
    2 12.52 MB/sec 0.639849750 0.56% / 2.96%

    Signed-off-by: Alan D. Brunelle
    Signed-off-by: Jens Axboe

    Alan D. Brunelle
     

13 Oct, 2009

1 commit


09 Oct, 2009

1 commit

  • elv_iosched_store() ignore the return value of strstrip(). It makes small
    inconsistent behavior.

    This patch fixes it.


    ====================================
    # cd /sys/block/{blockdev}/queue

    case1:
    # echo "anticipatory" > scheduler
    # cat scheduler
    noop [anticipatory] deadline cfq

    case2:
    # echo "anticipatory " > scheduler
    # cat scheduler
    noop [anticipatory] deadline cfq

    case3:
    # echo " anticipatory" > scheduler
    bash: echo: write error: Invalid argument


    ====================================
    # cd /sys/block/{blockdev}/queue

    case1:
    # echo "anticipatory" > scheduler
    # cat scheduler
    noop [anticipatory] deadline cfq

    case2:
    # echo "anticipatory " > scheduler
    # cat scheduler
    noop [anticipatory] deadline cfq

    case3:
    # echo " anticipatory" > scheduler
    noop [anticipatory] deadline cfq

    Cc: Li Zefan
    Cc: Jens Axboe
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Jens Axboe

    KOSAKI Motohiro
     

03 Oct, 2009

1 commit

  • AS is mostly a subset of CFQ, so there's little point in still
    providing this separate IO scheduler. Hopefully at some point we
    can get down to one single IO scheduler again, at least this brings
    us closer by having only one intelligent IO scheduler.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Sep, 2009

2 commits

  • Get rid of any functions that test for these bits and make callers
    use bio_rw_flagged() directly. Then it is at least directly apparent
    what variable and flag they check.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Update scsi_io_completion() such that it only fails requests till the
    next error boundary and retry the leftover. This enables block layer
    to merge requests with different failfast settings and still behave
    correctly on errors. Allow merge of requests of different failfast
    settings.

    As SCSI is currently the only subsystem which follows failfast status,
    there's no need to worry about other block drivers for now.

    Signed-off-by: Tejun Heo
    Cc: Niel Lambrechts
    Cc: James Bottomley
    Signed-off-by: Jens Axboe

    Tejun Heo
     

17 Jul, 2009

1 commit

  • Commit ab0fd1debe730ec9998678a0c53caefbd121ed10 tries to prevent merge
    of requests with different failfast settings. In elv_rq_merge_ok(),
    it compares new bio's failfast flags against the merge target
    request's. However, the flag testing accessors for bio and blk don't
    return boolean but the tested bit value directly and FAILFAST on bio
    and blk don't match, so directly comparing them with == results in
    false negative unnecessary preventing merge of readahead requests.

    This patch convert the results to boolean by negating them before
    comparison.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Boaz Harrosh
    Cc: FUJITA Tomonori
    Cc: James Bottomley
    Cc: Jeff Garzik

    Tejun Heo
     

04 Jul, 2009

1 commit

  • Block layer used to merge requests and bios with different failfast
    settings. This caused regular IOs to fail prematurely when they were
    merged into failfast requests for readahead.

    Niel Lambrechts could trigger the problem semi-reliably on ext4 when
    resuming from STR. ext4 uses readahead when reading inodes and
    combined with the deterministic extra SATA PHY exception cycle during
    resume on the specific configuration, non-readahead inode read would
    fail causing ext4 errors. Please read the following thread for
    details.

    http://lkml.org/lkml/2009/5/23/21

    This patch makes block layer reject merging if the failfast settings
    don't match. This is correct but likely to lower IO performance by
    preventing regular IOs from mingling into surrounding readahead
    requests. Changes to allow such mixed merges and handle errors
    correctly will be added later.

    Signed-off-by: Tejun Heo
    Reported-by: Niel Lambrechts
    Cc: Theodore Tso
    Signed-off-by: Jens Axboe

    Tejun Heo
     

12 Jun, 2009

1 commit

  • * 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
    block: add request clone interface (v2)
    floppy: fix hibernation
    ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
    fs/bio.c: add missing __user annotation
    block: prevent possible io_context->refcount overflow
    Add serial number support for virtio_blk, V4a
    block: Add missing bounce_pfn stacking and fix comments
    Revert "block: Fix bounce limit setting in DM"
    cciss: decode unit attention in SCSI error handling code
    cciss: Remove no longer needed sendcmd reject processing code
    cciss: change SCSI error handling routines to work with interrupts enabled.
    cciss: separate error processing and command retrying code in sendcmd_withirq_core()
    cciss: factor out fix target status processing code from sendcmd functions
    cciss: simplify interface of sendcmd() and sendcmd_withirq()
    cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
    cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
    block: needs to set the residual length of a bidi request
    Revert "block: implement blkdev_readpages"
    block: Fix bounce limit setting in DM
    Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
    ...

    Manually fix conflicts with tracing updates in:
    block/blk-sysfs.c
    drivers/ide/ide-atapi.c
    drivers/ide/ide-cd.c
    drivers/ide/ide-floppy.c
    drivers/ide/ide-tape.c
    include/trace/events/block.h
    kernel/trace/blktrace.c

    Linus Torvalds
     

10 Jun, 2009

1 commit

  • TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
    these new capabilities to this tracepoint:

    - zero-copy and per-cpu splice() tracing
    - binary tracing without printf overhead
    - structured logging records exposed under /debug/tracing/events
    - trace events embedded in function tracer output and other plugins
    - user-defined, per tracepoint filter expressions
    ...

    Cons:

    - no dev_t info for the output of plug, unplug_timer and unplug_io events.
    no dev_t info for getrq and sleeprq events if bio == NULL.
    no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.

    This is mainly because we can't get the deivce from a request queue.
    But this may change in the future.

    - A packet command is converted to a string in TP_assign, not TP_print.
    While blktrace do the convertion just before output.

    Since pc requests should be rather rare, this is not a big issue.

    - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
    has a unique format, which means we have some unused data in a trace entry.

    The overhead is minimized by using __dynamic_array() instead of __array().

    I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:

    dd dd + ioctl blktrace dd + TRACE_EVENT (splice)
    1 7.36s, 42.7 MB/s 7.50s, 42.0 MB/s 7.41s, 42.5 MB/s
    2 7.43s, 42.3 MB/s 7.48s, 42.1 MB/s 7.43s, 42.4 MB/s
    3 7.38s, 42.6 MB/s 7.45s, 42.2 MB/s 7.41s, 42.5 MB/s

    So the overhead of tracing is very small, and no regression when using
    those trace events vs blktrace.

    And the binary output of TRACE_EVENT is much smaller than blktrace:

    # ls -l -h
    -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
    -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
    -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out

    Following are some comparisons between TRACE_EVENT and blktrace:

    plug:
    kjournald-480 [000] 303.084981: block_plug: [kjournald]
    kjournald-480 [000] 303.084981: 8,0 P N [kjournald]

    unplug_io:
    kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1
    kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1

    remap:
    kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8 v3:

    - use the newly introduced __dynamic_array().

    Changelog from v1 -> v2:

    - use __string() instead of __array() to minimize the memory required
    to store hex dump of rq->cmd().

    - support large pc requests.

    - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.

    - some cleanups.

    Signed-off-by: Li Zefan
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Li Zefan
     

02 Jun, 2009

1 commit

  • I found one more mis-conversion to the 'request is always dequeued
    when completing' model in elv_abort_queue() during code inspection.
    Although I haven't hit any problem caused by this mis-conversion yet
    and just done compile/boot test, please apply if you have no problem.

    Request must be dequeued when it completes.
    However, elv_abort_queue() completes requests without dequeueing.
    This will cause oops in the __blk_end_request_all().
    This patch fixes the oops.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Jens Axboe

    Kiyoshi Ueda