10 Jun, 2009

1 commit

  • TRACE_EVENT is a more generic way to define tracepoints. Converting the
    block tracepoints to use TRACE_EVENT adds these new capabilities (see the
    sketch after the list):

    - zero-copy and per-cpu splice() tracing
    - binary tracing without printf overhead
    - structured logging records exposed under /debug/tracing/events
    - trace events embedded in function tracer output and other plugins
    - user-defined, per tracepoint filter expressions
    ...
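
    As a rough illustration of the mechanism (a minimal sketch of the macro
    shape, not necessarily the exact definition from this patch), a
    TRACE_EVENT declaration bundles the prototype, the binary record layout,
    the assignment and the text format in one place:

    TRACE_EVENT(block_plug,

        TP_PROTO(struct request_queue *q),

        TP_ARGS(q),

        TP_STRUCT__entry(
            __array(char, comm, TASK_COMM_LEN)
        ),

        TP_fast_assign(
            memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
        ),

        TP_printk("[%s]", __entry->comm)
    );

    The record is written to the ring buffer in binary form; TP_printk is
    applied only when the buffer is read, which is where the "binary tracing
    without printf overhead" above comes from.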

    Cons:

    - no dev_t info for the output of plug, unplug_timer and unplug_io events;
      no dev_t info for getrq and sleeprq events if bio == NULL;
      no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.

    This is mainly because we can't get the device from a request queue,
    but this may change in the future.

    - A packet command is converted to a string in TP_assign, not TP_print,
    while blktrace does the conversion just before output.

    Since pc requests should be rather rare, this is not a big issue.

    - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
    has a unique format, which means we have some unused data in a trace entry.

    The overhead is minimized by using __dynamic_array() instead of __array().
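
    For instance (a sketch; field and length names are illustrative), a fixed
    __array() always reserves its full size in every trace entry, while
    __dynamic_array() reserves only what this particular event needs:

    TP_STRUCT__entry(
        __array(char, fixed_buf, 32)            /* always 32 bytes */
        __dynamic_array(char, cmd, cmd_len)     /* cmd_len bytes this time */
    ),

    TP_fast_assign(
        memcpy(__entry->fixed_buf, buf, 32);
        memcpy(__get_dynamic_array(cmd), cmd, cmd_len);
    ),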

    I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:

    run    dd                  dd + ioctl blktrace    dd + TRACE_EVENT (splice)
    1      7.36s, 42.7 MB/s    7.50s, 42.0 MB/s       7.41s, 42.5 MB/s
    2      7.43s, 42.3 MB/s    7.48s, 42.1 MB/s       7.43s, 42.4 MB/s
    3      7.38s, 42.6 MB/s    7.45s, 42.2 MB/s       7.41s, 42.5 MB/s

    So the overhead of tracing is very small, and there is no regression when
    using these trace events versus blktrace.

    And the binary output of TRACE_EVENT is much smaller than blktrace:

    # ls -l -h
    -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
    -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
    -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out

    Following are some comparisons between TRACE_EVENT output (first line of
    each pair) and blktrace output (second line):

    plug:
    kjournald-480 [000] 303.084981: block_plug: [kjournald]
    kjournald-480 [000] 303.084981: 8,0 P N [kjournald]

    unplug_io:
    kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1
    kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1

    remap:
    kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8

    Changelog from v2 -> v3:

    - use the newly introduced __dynamic_array().

    Changelog from v1 -> v2:

    - use __string() instead of __array() to minimize the memory required
    to store the hex dump of rq->cmd.

    - support large pc requests.

    - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.

    - some cleanups.

    Signed-off-by: Li Zefan
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Li Zefan
     

09 Apr, 2009

8 commits

  • Barriers are submitted to a worker thread that issues them in-order.

    The thread is modified so that when it sees a barrier request, it first
    waits for all I/O pending before the barrier, then submits the barrier
    and waits for it to complete. (We must wait; otherwise the barrier could
    be intermixed with following requests. See the sketch below.)

    Errors from the barrier request are recorded in a per-device barrier_error
    variable. There may be only one barrier request in progress at once.

    For now, the barrier request is converted to a non-barrier request when
    sending it to the underlying device.
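
    A hypothetical sketch of that ordering (the helper names here are
    illustrative, not the actual dm functions):

    static void process_barrier(struct mapped_device *md, struct bio *bio)
    {
        /* 1. Drain: wait for all I/O submitted before the barrier. */
        wait_event(md->wait, !atomic_read(&md->pending));

        /* 2. For now, strip the barrier flag before passing it down. */
        bio->bi_rw &= ~(1 << BIO_RW_BARRIER);

        /* 3. Submit it and wait again, so nothing submitted later can
         *    be intermixed with it; errors land in md->barrier_error. */
        __split_and_process_bio(md, bio);
        wait_event(md->wait, !atomic_read(&md->pending));
    }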

    This patch guarantees correct barrier behavior if the underlying device
    doesn't perform write-back caching. The same requirement existed before
    barriers were supported in dm.

    Bottom layer barrier support (sending barriers by target drivers) and
    handling devices with write-back caches will be done in further patches.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Remove queue_io return value and a loop in dm_request.

    IO may be submitted to a worker thread with queue_io(). queue_io() sets
    DMF_QUEUE_IO_TO_THREAD so that all further IO is queued for the thread. When
    the thread finishes its work, it clears DMF_QUEUE_IO_TO_THREAD and from this
    point on, requests are submitted from dm_request again. This will be used
    for processing barriers.

    Remove the loop in dm_request: queue_io() can now submit I/Os to the
    worker thread even if DMF_QUEUE_IO_TO_THREAD was not set. (The resulting
    flow is sketched below.)
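
    The resulting control flow looks roughly like this (a sketch, not the
    verbatim code):

    static int dm_request(struct request_queue *q, struct bio *bio)
    {
        struct mapped_device *md = q->queuedata;

        down_read(&md->io_lock);

        if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags))) {
            up_read(&md->io_lock);
            queue_io(md, bio);  /* the worker thread will process it */
            return 0;
        }

        __split_and_process_bio(md, bio);
        up_read(&md->io_lock);
        return 0;
    }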

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Rework shutting down on suspend and document the associated rules.

    Drop write lock in __split_and_process_bio to allow more processing
    concurrency.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Refactor the code in dm_request().

    Require the new DMF_BLOCK_IO_FOR_SUSPEND flag to be set before discarding
    readahead bios, so we don't drop such bios while merely processing a
    barrier.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Split the DMF_BLOCK_IO flag into two.

    DMF_BLOCK_IO_FOR_SUSPEND is set when I/O must be blocked while suspending a
    device. DMF_QUEUE_IO_TO_THREAD is set when I/O must be queued to a
    worker thread for later processing.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Refactor dm_wq_work() to make later patch more readable.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Prepare for full barrier implementation: first remove the restricted support.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • This patch provides support for data integrity passthrough in the device
    mapper.

    - If one or more component devices support integrity, an integrity
    profile is preallocated for the DM device.

    - If all component devices have compatible profiles, the DM device is
    flagged as capable.

    - Handle integrity metadata when splitting and cloning bios.
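
    The cloning part reduces to something like this (a sketch; the
    bio-integrity API signatures have varied between kernel versions):

    /* When cloning a bio, clone its integrity payload as well. */
    if (bio_integrity(bio))
        bio_integrity_clone(clone, bio, GFP_NOIO);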

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Alasdair G Kergon

    Martin K. Petersen
     

03 Apr, 2009

10 commits


17 Mar, 2009

1 commit

  • The following oops has been reported when dm-crypt runs over a loop device.

    ...
    [ 70.381058] Process loop0 (pid: 4268, ti=cf3b2000 task=cf1cc1f0 task.ti=cf3b2000)
    ...
    [ 70.381058] Call Trace:
    [ 70.381058] [] ? crypt_dec_pending+0x5e/0x62 [dm_crypt]
    [ 70.381058] [] ? crypt_endio+0xa2/0xaa [dm_crypt]
    [ 70.381058] [] ? crypt_endio+0x0/0xaa [dm_crypt]
    [ 70.381058] [] ? bio_endio+0x2b/0x2e
    [ 70.381058] [] ? dec_pending+0x224/0x23b [dm_mod]
    [ 70.381058] [] ? clone_endio+0x79/0xa4 [dm_mod]
    [ 70.381058] [] ? clone_endio+0x0/0xa4 [dm_mod]
    [ 70.381058] [] ? bio_endio+0x2b/0x2e
    [ 70.381058] [] ? loop_thread+0x380/0x3b7
    [ 70.381058] [] ? do_lo_send_aops+0x0/0x165
    [ 70.381058] [] ? autoremove_wake_function+0x0/0x33
    [ 70.381058] [] ? loop_thread+0x0/0x3b7

    When a table is being replaced, dm waits for I/O to complete
    before destroying the mempool, but the endio function doesn't
    call mempool_free() until after completing the bio.

    Fix it by swapping the order of those two operations (see the sketch
    below).

    The same problem occurs in dm.c with md referenced after dec_pending.
    Again, we swap the order.
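
    The fixed ordering in dm-crypt looks roughly like this (a sketch based
    on the description above):

    static void crypt_dec_pending(struct dm_crypt_io *io)
    {
        struct crypt_config *cc = io->target->private;
        struct bio *base_bio = io->base_bio;
        int error = io->error;

        if (!atomic_dec_and_test(&io->pending))
            return;

        /* Free into the mempool BEFORE completing the bio: completion
         * may allow the old table, and with it the mempool, to be
         * destroyed. */
        mempool_free(io, cc->io_pool);
        bio_endio(base_bio, error);
    }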

    Cc: stable@kernel.org
    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     

06 Jan, 2009

5 commits

  • Implement simple read-only sysfs entry for device-mapper block device.

    This patch adds a simple sysfs directory named "dm" under block device
    properties and implements
    - name attribute (string containing mapped device name)
    - uuid attribute (string containing UUID, or empty string if not set)

    The kobject is embedded in mapped_device struct, so no additional
    memory allocation is needed for initializing sysfs entry.

    While processing a sysfs attribute we need to lock the mapped device;
    this is done by a new function, dm_get_from_kobj(), which returns the md
    associated with the kobject and increases its usage count.

    Each 'show attribute' function is responsible for its own locking.
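
    The name attribute then reduces to something like this (a sketch, close
    to but not necessarily verbatim):

    static ssize_t dm_attr_name_show(struct mapped_device *md, char *buf)
    {
        if (dm_copy_name_and_uuid(md, buf, NULL))
            return -EIO;

        strcat(buf, "\n");
        return strlen(buf);
    }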

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • Rework table reference counting.

    The existing code uses a reference counter. When the last reference is
    dropped and the counter reaches zero, the table destructor is called.
    Table reference counters are acquired/released from upcalls from other
    kernel code (dm_any_congested, dm_merge_bvec, dm_unplug_all).
    If the reference counter reaches zero in one of the upcalls, the table
    destructor is called from almost random kernel code.

    This leads to various problems:
    * dm_any_congested being called under a spinlock, which calls the
    destructor, which calls some sleeping function.
    * the destructor attempting to take a lock that is already taken by the
    same process.
    * stale reference from some other kernel code keeps the table
    constructed, which keeps some devices open, even after successful
    return from "dmsetup remove". This can confuse lvm and prevent closing
    of underlying devices or reusing device minor numbers.

    The patch changes reference counting so that the table destructor can be
    called only at predetermined places.

    The table always has exactly one reference, from either mapped_device->map
    or hash_cell->new_map. After this patch, this reference is not counted
    in table->holders. A pair of dm_create_table/dm_destroy_table functions
    is used for table creation/destruction.

    Temporary references from the other code increase table->holders. A pair
    of dm_table_get/dm_table_put functions is used to manipulate it.

    When the table is about to be destroyed, we wait for table->holders to
    reach 0, then call the table destructor. We use active waiting with
    msleep(1) because the situation happens rarely (to one user in 5 years)
    and removing the device isn't a performance-critical task: the user
    doesn't care whether it takes one tick more or not.

    This way, the destructor is called only at specific points
    (dm_table_destroy function) and the above problems associated with lazy
    destruction can't happen.
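
    In sketch form (dm_table_free_rest() is a hypothetical stand-in for the
    actual teardown logic):

    void dm_table_destroy(struct dm_table *t)
    {
        /* Wait for transient holders (upcalls) to drop their references;
         * active waiting is acceptable because this almost never triggers. */
        while (atomic_read(&t->holders))
            msleep(1);
        smp_mb();

        dm_table_free_rest(t);  /* hypothetical name for the teardown */
    }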

    Finally remove the temporary protection added to dm_any_congested().

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Implement barrier support for single device DM devices

    This patch implements barrier support in DM for the common case of dm linear
    just remapping a single underlying device. In this case we can safely
    pass the barrier through because there can be no reordering between
    devices.

    NB. Any DM device might cease to support barriers if it gets
    reconfigured so code must continue to allow for a possible
    -EOPNOTSUPP on every barrier bio submitted. - agk

    Signed-off-by: Andi Kleen
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Andi Kleen
     
  • This patch prepares some kmem_caches for request-based dm.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • Move one dm_table_put() so that the last reference in the thread
    gets dropped in __unbind().

    This is required for a following patch,
    dm-table-rework-reference-counting.patch, which changes the logic so that
    the table destructor is called only at specific points in the code.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

29 Dec, 2008

1 commit

  • Instead of having a global bio slab cache, add a reference to one
    in each bio_set that is created. This allows for personalized slabs
    in each bio_set, so that they can have bios of different sizes.

    This means we can personalize the bios we return. File systems may want
    to embed the bio inside another structure, to avoid allocating more items
    (and stuffing them in ->bi_private) after they get a bio. Or we may want
    to embed a number of bio_vecs directly at the end of a bio, to avoid
    doing two allocations to return a bio. This is now possible.
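
    A hypothetical user of the new per-bio_set allocation might embed the
    bio like this (all names here are illustrative):

    struct my_io {
        void        *extra_state;   /* would otherwise hang off ->bi_private */
        struct bio  bio;            /* must be the last member */
    };

    static struct bio_set *my_bs;

    static int my_init(void)
    {
        /* front_pad reserves room for the wrapper fields in front
         * of every bio allocated from this set. */
        my_bs = bioset_create(64, offsetof(struct my_io, bio));
        return my_bs ? 0 : -ENOMEM;
    }

    static struct my_io *my_io_alloc(void)
    {
        struct bio *bio = bio_alloc_bioset(GFP_NOIO, 4, my_bs);

        return bio ? container_of(bio, struct my_io, bio) : NULL;
    }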

    Signed-off-by: Jens Axboe

    Jens Axboe
     

26 Nov, 2008

2 commits

  • Port to the new tracepoints API: split DEFINE_TRACE() and DECLARE_TRACE()
    sites. Spread them out to the usage sites, as suggested by
    Mathieu Desnoyers.

    Signed-off-by: Ingo Molnar
    Acked-by: Mathieu Desnoyers

    Ingo Molnar
     
  • This was a forward port of work done by Mathieu Desnoyers. I changed it
    to encode the 'what' parameter in the tracepoint name, so that one can
    register interest in specific events rather than in classes of events
    that then require checking the 'what' parameter.
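
    So instead of one generic tracepoint taking a 'what' argument, each
    event gets its own tracepoint (a sketch; the exact macro spellings,
    TPPROTO vs TP_PROTO, changed across kernel versions):

    DECLARE_TRACE(block_rq_insert,
        TP_PROTO(struct request_queue *q, struct request *rq),
        TP_ARGS(q, rq));

    DECLARE_TRACE(block_rq_issue,
        TP_PROTO(struct request_queue *q, struct request *rq),
        TP_ARGS(q, rq));

    /* A subscriber now registers for exactly the event it cares about,
     * e.g. register_trace_block_rq_issue(my_probe), instead of filtering
     * on a 'what' argument in one catch-all probe. */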

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: Jens Axboe
    Signed-off-by: Ingo Molnar

    Arnaldo Carvalho de Melo
     

14 Nov, 2008

2 commits

  • dm_any_congested() just checks for the DMF_BLOCK_IO flag and has no
    code to make sure that suspend waits for dm_any_congested() to
    complete. This patch adds such a check.

    Without it, a race can occur with dm_table_put() attempting to
    destroy the table in the wrong thread: the one running
    dm_any_congested(), which is meant to be quick and return
    immediately.

    Two examples of problems:
    1. Sleeping functions called from congested code, the caller
    of which holds a spin lock.
    2. An ABBA deadlock between pdflush and multipathd. The two locks
    in contention are inode lock and kernel lock.

    Signed-off-by: Chandra Seetharaman
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Chandra Seetharaman
     
  • This doesn't fix any bug; it just moves the wake_up to immediately after
    decrementing md->pending, for better code readability.

    It must be clear to anyone manipulating md->pending that the queue is to
    be woken up when md->pending reaches zero, so move the wakeup as close to
    the decrement as possible (see the snippet below).
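
    After the move, the pattern is the usual one-liner (sketch):

    /* Decrement and, if we were the last, wake any waiter immediately. */
    if (!atomic_dec_return(&md->pending))
        wake_up(&md->wait);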

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

24 Oct, 2008

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits)
    [PATCH] kill the rest of struct file propagation in block ioctls
    [PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET
    [PATCH] get rid of blkdev_locked_ioctl()
    [PATCH] get rid of blkdev_driver_ioctl()
    [PATCH] sanitize blkdev_get() and friends
    [PATCH] remember mode of reiserfs journal
    [PATCH] propagate mode through swsusp_close()
    [PATCH] propagate mode through open_bdev_excl/close_bdev_excl
    [PATCH] pass fmode_t to blkdev_put()
    [PATCH] kill the unused bsize on the send side of /dev/loop
    [PATCH] trim file propagation in block/compat_ioctl.c
    [PATCH] end of methods switch: remove the old ones
    [PATCH] switch sr
    [PATCH] switch sd
    [PATCH] switch ide-scsi
    [PATCH] switch tape_block
    [PATCH] switch dcssblk
    [PATCH] switch dasd
    [PATCH] switch mtd_blkdevs
    [PATCH] switch mmc
    ...

    Linus Torvalds
     

22 Oct, 2008

3 commits

  • This patch tidies local_init() in preparation for request-based dm.
    No functional change.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • This patch removes the unnecessary DM_WQ_FLUSH_ALL state.

    The dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL) in dm_suspend()
    is never invoked because:
    - 'goto flush_and_out' is equivalent to 'goto out', because
    'goto flush_and_out' is taken only when '!noflush';
    - if r is non-zero, the code above will invoke 'goto out'
    and skip this code.

    No functional change.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • When a bio gets split, mark its fragments with the BIO_CLONED flag.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Alasdair G Kergon

    Martin K. Petersen
     

21 Oct, 2008

2 commits

  • ioctl() doesn't need BKL here

    Signed-off-by: Al Viro

    Al Viro
     
  • To keep the size of changesets sane we split the switch by drivers;
    to keep the damn thing bisectable we do the following:
    1) rename the affected methods, add ones with correct
    prototypes, make (few) callers handle both. That's this changeset.
    2) for each driver convert to new methods. *ALL* drivers
    are converted in this series.
    3) kill the old (renamed) methods.

    Note that it _is_ a flagday; all in-tree drivers are converted and by the
    end of this series no trace of the old methods remains. The only reason
    we do it this way is to keep the damn thing bisectable and to allow
    per-driver debugging if anything goes wrong.

    New methods:
    open(bdev, mode)
    release(disk, mode)
    ioctl(bdev, mode, cmd, arg) /* Called without BKL */
    compat_ioctl(bdev, mode, cmd, arg)
    locked_ioctl(bdev, mode, cmd, arg) /* Called with BKL, legacy */
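
    A driver converted in step 2 ends up filling in the structure like this
    (hypothetical driver, illustrative names):

    static struct block_device_operations my_fops = {
        .owner   = THIS_MODULE,
        .open    = my_open,     /* my_open(struct block_device *, fmode_t) */
        .release = my_release,  /* my_release(struct gendisk *, fmode_t) */
        .ioctl   = my_ioctl,    /* called without the BKL */
    };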

    Signed-off-by: Al Viro

    Al Viro