04 Jan, 2012

1 commit

  • Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
    kill_bdev as well, so brd doesn't have to open code it. Reduce
    buffer_head.h requirement accordingly.

    Removed a rather large comment from invalidate_bdev, as it looked a bit
    obsolete to bother moving. The small comment replacing it says enough.

    Signed-off-by: Nick Piggin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Al Viro
     

05 Nov, 2011

1 commit

  • * 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
    block: don't call blk_drain_queue() if elevator is not up
    blk-throttle: use queue_is_locked() instead of lockdep_is_held()
    blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
    blk-throttle: Free up policy node associated with deleted rule
    block: warn if tag is greater than real_max_depth.
    block: make gendisk hold a reference to its queue
    blk-flush: move the queue kick into
    blk-flush: fix invalid BUG_ON in blk_insert_flush
    block: Remove the control of complete cpu from bio.
    block: fix a typo in the blk-cgroup.h file
    block: initialize the bounce pool if high memory may be added later
    block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
    block: drop @tsk from attempt_plug_merge() and explain sync rules
    block: make get_request[_wait]() fail if queue is dead
    block: reorganize throtl_get_tg() and blk_throtl_bio()
    block: reorganize queue draining
    block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
    block: pass around REQ_* flags instead of broken down booleans during request alloc/free
    block: move blk_throtl prototypes to block/blk.h
    block: fix genhd refcounting in blkio_policy_parse_and_set()
    ...

    Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
    and making the request functions be of type "void" instead of "int" in
    - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
    - drivers/staging/zram/zram_drv.c

    Linus Torvalds
     

02 Aug, 2011

4 commits

  • DM has always advertised both REQ_FLUSH and REQ_FUA flush capabilities
    regardless of whether or not a given DM device's underlying devices
    also advertised a need for them.

    Block's flush-merge changes from 2.6.39 have proven to be more costly
    for DM devices. Performance regressions have been reported even when
    DM's underlying devices do not advertise that they have a write cache.

    Fix the performance regressions by configuring a DM device's flushing
    capabilities based on those of the underlying devices' capabilities.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
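
    A minimal sketch of the capability walk described above, not the actual
    patch: device_flush_capable() and dm_table_set_flush() are hypothetical
    helpers, and the member queue's flush_flags field is assumed to carry
    the REQ_FLUSH/REQ_FUA bits.

    #include <linux/device-mapper.h>
    #include <linux/blkdev.h>

    /* iterate_devices callout: does this underlying device advertise 'flush'? */
    static int device_flush_capable(struct dm_target *ti, struct dm_dev *dev,
                                    sector_t start, sector_t len, void *data)
    {
            unsigned flush = *(unsigned *)data;
            struct request_queue *q = bdev_get_queue(dev->bdev);

            return q && (q->flush_flags & flush);
    }

    /* hypothetical helper: does any target device advertise 'flush'? */
    static bool dm_table_supports_flush(struct dm_table *t, unsigned flush)
    {
            struct dm_target *ti;
            unsigned i;

            for (i = 0; i < dm_table_get_num_targets(t); i++) {
                    ti = dm_table_get_target(t, i);
                    if (ti->num_flush_requests && ti->type->iterate_devices &&
                        ti->type->iterate_devices(ti, device_flush_capable, &flush))
                            return true;
            }
            return false;
    }

    /* applied when the table is bound, instead of an unconditional FLUSH|FUA */
    static void dm_table_set_flush(struct dm_table *t, struct request_queue *q)
    {
            unsigned flush = 0;

            if (dm_table_supports_flush(t, REQ_FLUSH)) {
                    flush |= REQ_FLUSH;
                    if (dm_table_supports_flush(t, REQ_FUA))
                            flush |= REQ_FUA;
            }
            blk_queue_flush(q, flush);
    }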
     
  • Add a new flag DMF_MERGE_IS_OPTIONAL to struct mapped_device to indicate
    whether the device can accept bios larger than the size its merge
    function returns. When set, use this to send large bios to snapshots
    which can split them if necessary. Snapshot I/O may be significantly
    fragmented and this approach seems to improve performance.

    Before the patch, dm_set_device_limits restricted bio size to page size
    if the underlying device had a merge function and the target didn't
    provide a merge function. After the patch, dm_set_device_limits
    restricts bio size to page size if the underlying device has a merge
    function, doesn't have DMF_MERGE_IS_OPTIONAL flag and the target doesn't
    provide a merge function.

    The snapshot target can't provide a merge function because when the merge
    function is called, it is impossible to determine where the bio will be
    remapped. Previously this led us to impose a 4k limit, which we can
    now remove if the snapshot store is located on a device without a merge
    function. Together with another patch for optimizing full chunk writes,
    it improves performance from 29MB/s to 40MB/s when writing to the
    filesystem on snapshot store.

    If the snapshot store is placed on a non-dm device with a merge function
    (such as md-raid), device mapper still limits all bios to page size.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
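
    A hedged sketch of the resulting restriction in dm_set_device_limits(),
    not the exact kernel code; the merge_is_optional argument stands in for
    testing the new DMF_MERGE_IS_OPTIONAL flag on the mapped_device.

    #include <linux/device-mapper.h>
    #include <linux/blkdev.h>

    static void restrict_bio_size(struct dm_target *ti,
                                  struct request_queue *q,   /* underlying queue */
                                  struct queue_limits *limits,
                                  bool merge_is_optional)
    {
            /* only clamp to PAGE_SIZE when a merge function on the underlying
             * device cannot be honoured and splitting is not optional */
            if (q->merge_bvec_fn && !ti->type->merge && !merge_is_optional)
                    blk_limits_max_hw_sectors(limits,
                                              (unsigned int)(PAGE_SIZE >> 9));
    }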
     
  • Remove 'discards_supported' from the dm_table structure. The same
    information can be easily discovered from the table's target(s) in
    dm_table_supports_discards().

    Before this fix dm_table_supports_discards() would skip checking the
    individual targets' 'discards_supported' flag if any one target in the
    table didn't set num_discard_requests > 0. Now the per-target
    'discards_supported' flag is effective at ensuring the final DM device
    advertises discard support. But, to be clear, targets that don't
    support discards (!num_discard_requests) will not receive discard
    requests.

    Also DMWARN if a target sets 'discards_supported' override but forgets
    to set 'num_discard_requests'.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
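
    A minimal sketch of the reworked check, assuming dm internals such as
    ti->num_discard_requests and ti->discards_supported; not the verbatim
    implementation of dm_table_supports_discards().

    #include <linux/device-mapper.h>
    #include <linux/blkdev.h>

    static int device_discard_capable(struct dm_target *ti, struct dm_dev *dev,
                                      sector_t start, sector_t len, void *data)
    {
            struct request_queue *q = bdev_get_queue(dev->bdev);

            return q && blk_queue_discard(q);
    }

    bool dm_table_supports_discards(struct dm_table *t)
    {
            struct dm_target *ti;
            unsigned i;

            for (i = 0; i < dm_table_get_num_targets(t); i++) {
                    ti = dm_table_get_target(t, i);

                    if (!ti->num_discard_requests)
                            continue;       /* target never receives discards */

                    if (ti->discards_supported)     /* per-target override */
                            return true;

                    if (ti->type->iterate_devices &&
                        ti->type->iterate_devices(ti, device_discard_capable, NULL))
                            return true;
            }

            return false;
    }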
     
  • Destroy _minor_idr when unloading the core dm module. (Found by kmemleak.)

    Cc: stable@kernel.org
    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     

17 Mar, 2011

1 commit

  • MD and DM create a new bio_set for every metadevice. Each bio_set has an
    integrity mempool attached regardless of whether the metadevice is
    capable of passing integrity metadata. This is a waste of memory.

    Instead we defer the allocation decision to MD and DM since we know at
    metadevice creation time whether integrity passthrough is needed or not.

    Automatic integrity mempool allocation can then be removed from
    bioset_create() and we make an explicit integrity allocation for the
    fs_bio_set.

    Signed-off-by: Martin K. Petersen
    Reported-by: Zdenek Kabelac
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
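
    A hedged sketch of the new split, using a hypothetical wrapper name;
    bioset_integrity_create() is the explicit opt-in that replaces the
    automatic allocation previously done inside bioset_create().

    #include <linux/bio.h>

    static struct bio_set *md_bioset_create(unsigned pool_size,
                                            unsigned front_pad,
                                            bool integrity_capable)
    {
            struct bio_set *bs = bioset_create(pool_size, front_pad);

            if (!bs)
                    return NULL;

            /* attach an integrity mempool only when metadata can be passed
             * through to the underlying devices */
            if (integrity_capable && bioset_integrity_create(bs, pool_size)) {
                    bioset_free(bs);
                    return NULL;
            }

            return bs;
    }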
     

10 Mar, 2011

1 commit

  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So let's kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
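
    The replacement referred to above is the explicit on-stack plug; a
    minimal usage sketch (submit_many() is a hypothetical caller):

    #include <linux/blkdev.h>
    #include <linux/fs.h>

    static void submit_many(struct bio **bios, int nr)
    {
            struct blk_plug plug;
            int i;

            blk_start_plug(&plug);          /* batch I/O on this task's stack */
            for (i = 0; i < nr; i++)
                    submit_bio(bios[i]->bi_rw, bios[i]);
            blk_finish_plug(&plug);         /* flush the batch to the queue(s) */
    }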
     

14 Jan, 2011

5 commits

  • This patch changes spin_lock_irq() to spin_lock() in dm_request_fn().
    This patch is just a clean-up; there is no functional change.

    The spin_lock_irq() was leftover from the early request-based dm code,
    where map_request() used to enable interrupts.
    Since current map_request() never enables interrupts, we can change it
    to spin_lock() to match the prior spin_unlock().

    Auditing the dm and block-layer code called from map_request(), I
    confirmed that all functions save/restore the interrupt status, so no
    function returns with interrupts enabled.
    Also, I haven't observed any problems in my test environment, which
    uses the scsi and lpfc drivers, after heavy I/O testing with occasional
    path down/up events.

    Added a BUG_ON() to detect breakage in the future.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
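
    A stripped-down sketch of the locking pattern described above; the real
    dm_request_fn() does more, and map_request() is elided here.

    #include <linux/blkdev.h>

    static void dm_request_fn(struct request_queue *q)
    {
            struct request *rq;

            /* ->request_fn() is entered with q->queue_lock held and IRQs off */
            while ((rq = blk_peek_request(q)) != NULL) {
                    blk_start_request(rq);
                    spin_unlock(q->queue_lock);

                    /* map_request() would run here; it never enables IRQs, so */
                    BUG_ON(!irqs_disabled());

                    spin_lock(q->queue_lock);       /* was: spin_lock_irq() */
            }
    }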
     
  • kmirrord_wq, kcopyd_work and md->wq are created per dm instance and
    serve only a single work item from the dm instance, so non-reentrant
    workqueues would provide the same ordering guarantees as ordered ones
    while allowing CPU affinity and use of the workqueues for other
    purposes. Switch them to non-reentrant workqueues.

    Signed-off-by: Tejun Heo
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Tejun Heo
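
    A hedged before/after fragment for one of the workqueues mentioned above
    (flag choice and error handling simplified; md->wq is a dm internal):

    /* before (ordered, single threaded):
     *      md->wq = create_singlethread_workqueue("kdmflush");
     * after (non-reentrant, not bound to one CPU): */
    md->wq = alloc_workqueue("kdmflush", WQ_NON_REENTRANT | WQ_MEM_RECLAIM, 0);
    if (!md->wq)
            return -ENOMEM;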
     
  • Convert all create[_singlethread]_work() users to the new
    alloc[_ordered]_workqueue(). This conversion is mechanical and
    doesn't introduce any behavior change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Tejun Heo
     
  • This patch replaces dm_mutex with _minor_lock in dm_blk_close()
    and then removes it.

    During the BKL conversion, commit 6e9624b8caec290d28b4c6d9ec75749df6372b87
    (block: push down BKL into .open and .release) pushed lock_kernel()
    down into dm_blk_open/close calls.
    Commit 2a48fc0ab24241755dc93bfd4f01d68efab47f5a
    (block: autoconvert trivial BKL users to private mutex) converted it to a
    local mutex, but _minor_lock is sufficient.

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • No longer needlessly hold md->bdev->bd_inode->i_mutex when changing the
    size of a DM device. This additional locking is unnecessary because
    i_size_write() is already protected by the existing critical section in
    dm_swap_table(). DM already has a reference on md->bdev so the
    associated bd_inode may be changed without lifetime concerns.

    A negative side-effect of having held md->bdev->bd_inode->i_mutex was
    that a concurrent DM device resize and flush (via fsync) would deadlock.
    Dropping md->bdev->bd_inode->i_mutex eliminates this potential for
    deadlock. The following reproducer no longer deadlocks:
    https://www.redhat.com/archives/dm-devel/2009-July/msg00284.html

    Signed-off-by: Mike Snitzer
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon
    Cc: stable@kernel.org

    Mike Snitzer
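
    A hedged sketch of the resize helper after the change (md->disk and
    md->bdev are mapped_device internals; 512-byte sectors assumed):

    #include <linux/fs.h>
    #include <linux/genhd.h>

    static void __set_size(struct mapped_device *md, sector_t size)
    {
            set_capacity(md->disk, size);

            /* already serialised by dm_swap_table()'s critical section,
             * so bd_inode->i_mutex is no longer taken around this */
            i_size_write(md->bdev->bd_inode, (loff_t)size << 9);
    }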
     

07 Jan, 2011

1 commit

  • The "error" field in block_bio_complete is not assigned, leaving the memory area
    uninitialized (keeping garbage data). Pass an additional tracepoint argument to
    this event to initialize this field.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Mathieu Desnoyers
    CC: Steven Rostedt
    CC: Frederic Weisbecker
    CC: Ingo Molnar
    CC: Thomas Gleixner
    CC: Li Zefan
    CC: Alan.Brunelle@hp.com
    Signed-off-by: Jens Axboe

    Jeff Moyer
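
    A simplified sketch of how a tracepoint gains the extra argument so the
    field is actually assigned (field list trimmed; not the verbatim kernel
    definition of block_bio_complete):

    TRACE_EVENT(block_bio_complete,

            TP_PROTO(struct request_queue *q, struct bio *bio, int error),

            TP_ARGS(q, bio, error),

            TP_STRUCT__entry(
                    __field( dev_t,    dev    )
                    __field( sector_t, sector )
                    __field( int,      error  )
            ),

            TP_fast_assign(
                    __entry->dev    = bio->bi_bdev->bd_dev;
                    __entry->sector = bio->bi_sector;
                    __entry->error  = error;        /* previously left unset */
            ),

            TP_printk("%d,%d sector=%llu error=%d",
                      MAJOR(__entry->dev), MINOR(__entry->dev),
                      (unsigned long long)__entry->sector, __entry->error)
    );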
     

23 Oct, 2010

1 commit

  • * 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
    xen-blkfront: disable barrier/flush write support
    Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
    block: remove BLKDEV_IFL_WAIT
    aic7xxx_old: removed unused 'req' variable
    block: remove the BH_Eopnotsupp flag
    block: remove the BLKDEV_IFL_BARRIER flag
    block: remove the WRITE_BARRIER flag
    swap: do not send discards as barriers
    fat: do not send discards as barriers
    ext4: do not send discards as barriers
    jbd2: replace barriers with explicit flush / FUA usage
    jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
    jbd: replace barriers with explicit flush / FUA usage
    nilfs2: replace barriers with explicit flush / FUA usage
    reiserfs: replace barriers with explicit flush / FUA usage
    gfs2: replace barriers with explicit flush / FUA usage
    btrfs: replace barriers with explicit flush / FUA usage
    xfs: replace barriers with explicit flush / FUA usage
    block: pass gfp_mask and flags to sb_issue_discard
    dm: convey that all flushes are processed as empty
    ...

    Linus Torvalds
     

05 Oct, 2010

1 commit

  • The block device drivers have all gained new lock_kernel
    calls from a recent pushdown, and some of the drivers
    were already using the BKL before.

    This turns the BKL into a set of per-driver mutexes.
    Still need to check whether this is safe to do.

    file=$1
    name=$2
    if grep -q lock_kernel ${file} ; then
        if grep -q 'include.*linux.mutex.h' ${file} ; then
            sed -i '/include.*/d' ${file}
        else
            sed -i 's/include.*.*$/include /g' ${file}
        fi
        sed -i ${file} \
            -e "/^#include.*linux.mutex.h/,$ {
                    1,/^\(static\|int\|long\)/ {
                            /^\(static\|int\|long\)/istatic DEFINE_MUTEX(${name}_mutex);

                    } }" \
            -e "s/\(un\)*lock_kernel\>[ ]*()/mutex_\1lock(\&${name}_mutex)/g" \
            -e '/[ ]*cycle_kernel_lock();/d'
    else
        sed -i -e '/include.*\/d' ${file} \
               -e '/cycle_kernel_lock()/d'
    fi

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
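
    In C terms, the transformation the script applies to each block driver
    looks roughly like this (driver and function names are hypothetical):

    #include <linux/mutex.h>
    #include <linux/blkdev.h>

    static DEFINE_MUTEX(example_mutex);     /* inserted before the first function */

    static int example_open(struct block_device *bdev, fmode_t mode)
    {
            mutex_lock(&example_mutex);     /* was: lock_kernel();   */
            /* ... driver-specific open logic ... */
            mutex_unlock(&example_mutex);   /* was: unlock_kernel(); */
            return 0;
    }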
     

10 Sep, 2010

6 commits

  • Rename __clone_and_map_flush to __clone_and_map_empty_flush for added
    clarity.

    Simplify logic associated with REQ_FLUSH conditionals.

    Introduce a BUG_ON() and add a few more helpful comments to the code
    so that it is clear that all flushes are empty.

    Cleanup __split_and_process_bio() so that an empty flush isn't processed
    by a 'sector_count' focused while loop.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • Now queue_io() is called from dec_pending(), which may be called with
    interrupts disabled, so queue_io() must not enable interrupts
    unconditionally and must save/restore the current interrupt status.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Kiyoshi Ueda
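
    A hedged sketch of queue_io() after the fix (md->deferred_lock,
    md->deferred, md->wq and md->work are mapped_device internals):

    static void queue_io(struct mapped_device *md, struct bio *bio)
    {
            unsigned long flags;

            /* may be reached from dec_pending() with interrupts already off,
             * so save and restore the state instead of blindly enabling */
            spin_lock_irqsave(&md->deferred_lock, flags);  /* was: spin_lock_irq() */
            bio_list_add(&md->deferred, bio);
            spin_unlock_irqrestore(&md->deferred_lock, flags);

            queue_work(md->wq, &md->work);
    }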
     
  • Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering
    against other bio's. This patch relaxes ordering around flushes.

    * A flush bio is no longer deferred to workqueue directly. It's
    processed like other bio's but __split_and_process_bio() uses
    md->flush_bio as the clone source. md->flush_bio is initialized to
    empty flush during md initialization and shared for all flushes.

    * As a flush bio now travels through the same execution path as other
    bio's, there's no need for dedicated error handling path either. It
    can use the same error handling path in dec_pending(). Dedicated
    error handling removed along with md->flush_error.

    * When dec_pending() detects that a flush has completed, it checks
    whether the original bio has data. If so, the bio is queued to the
    deferred list w/ REQ_FLUSH cleared; otherwise, it's completed.

    * As flush sequencing is handled in the usual issue/completion path,
    dm_wq_work() no longer needs to handle flushes differently. Now its
    only responsibility is re-issuing deferred bio's the same way as
    _dm_request() would. REQ_FLUSH handling logic including
    process_flush() is dropped.

    * There's no reason for queue_io() and dm_wq_work() to write-lock
    dm->io_lock. queue_io() now only uses md->deferred_lock and
    dm_wq_work() read-locks dm->io_lock.

    * bio's no longer need to be queued on the deferred list while a flush
    is in progress, making DMF_QUEUE_IO_TO_THREAD unnecessary. Drop it.

    This avoids stalling the device during flushes and simplifies the
    implementation.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
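
    A hedged fragment of the completion-side handling described above, as it
    might appear in a dec_pending()-style path; md, bio and io_error are
    assumed to be in scope:

    if ((bio->bi_rw & REQ_FLUSH) && bio->bi_size) {
            /* preflush of a flush+data bio is done: strip REQ_FLUSH and
             * requeue so the data portion is issued like any other bio */
            bio->bi_rw &= ~REQ_FLUSH;
            queue_io(md, bio);
    } else {
            /* flush (or ordinary bio) finished: use the normal error path */
            bio_endio(bio, io_error);
    }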
     
  • This patch converts request-based dm to support the new REQ_FLUSH/FUA.

    The original request-based flush implementation depended on
    request_queue blocking other requests while a barrier sequence is in
    progress, which is no longer true for the new REQ_FLUSH/FUA.

    In general, request-based dm doesn't have infrastructure for cloning
    one source request to multiple targets, but the original flush
    implementation had a special mostly independent path which can issue
    flushes to multiple targets and sequence them. However, the
    capability isn't currently in use and adds a lot of complexity.
    Moreover, it's unlikely to be useful in its current form as it
    doesn't make sense to be able to send out flushes to multiple targets
    when write requests can't be.

    This patch rips out the special flush code path and handles
    REQ_FLUSH/FUA requests the same way as other requests. The only
    special treatment is that REQ_FLUSH requests use block address 0
    when finding the target, which is enough for now.

    * added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as
    suggested by Mike Snitzer

    Signed-off-by: Tejun Heo
    Acked-by: Mike Snitzer
    Tested-by: Kiyoshi Ueda
    Signed-off-by: Jens Axboe

    Tejun Heo
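
    A minimal sketch of the "block address 0" rule mentioned above;
    rq_find_target() is a hypothetical helper, while dm_table_find_target()
    and dm_target_is_valid() are existing dm internals:

    static struct dm_target *rq_find_target(struct dm_table *map,
                                            struct request *rq)
    {
            sector_t pos = 0;
            struct dm_target *ti;

            /* flushes carry no sector, so look the target up at 0 */
            if (!(rq->cmd_flags & REQ_FLUSH))
                    pos = blk_rq_pos(rq);

            ti = dm_table_find_target(map, pos);
            BUG_ON(!dm_target_is_valid(ti));  /* as suggested by Mike Snitzer */

            return ti;
    }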
     
  • This patch converts bio-based dm to support REQ_FLUSH/FUA instead of
    now deprecated REQ_HARDBARRIER.

    * -EOPNOTSUPP handling logic dropped.

    * Preflush is handled as before but postflush is dropped and replaced
    with passing down REQ_FUA to member request_queues. This replaces
    one array wide cache flush w/ member specific FUA writes.

    * __split_and_process_bio() now calls __clone_and_map_flush() directly
    for flushes and guarantees all FLUSH bio's going to targets are zero
    length.

    * It's now guaranteed that all FLUSH bio's which are passed onto dm
    targets are zero length. bio_empty_barrier() tests are replaced
    with REQ_FLUSH tests.

    * Empty WRITE_BARRIERs are replaced with WRITE_FLUSHes.

    * Dropped unlikely() around REQ_FLUSH tests. Flushes are not unlikely
    enough to be marked with unlikely().

    * Block layer now filters out REQ_FLUSH/FUA bio's if the request_queue
    doesn't support cache flushing. Advertise REQ_FLUSH | REQ_FUA
    capability.

    * Request-based dm isn't converted yet. dm_init_request_based_queue()
    resets flush support to 0 for now. To avoid disturbing request-based
    dm code, dm->flush_error is added for bio-based dm while
    request-based dm continues to use dm->barrier_error.

    Lightly tested linear, stripe, raid1, snap and crypt targets. Please
    proceed with caution as I'm not familiar with the code base.

    Signed-off-by: Tejun Heo
    Cc: dm-devel@redhat.com
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA
    requests. Deprecate barrier. All REQ_HARDBARRIERs are failed with
    -EOPNOTSUPP and blk_queue_ordered() is replaced with simpler
    blk_queue_flush().

    blk_queue_flush() takes combinations of REQ_FLUSH and FUA. If a
    device has write cache and can flush it, it should set REQ_FLUSH. If
    the device can handle FUA writes, it should also set REQ_FUA.

    All blk_queue_ordered() users are converted.

    * ORDERED_DRAIN is mapped to 0 which is the default value.
    * ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH.
    * ORDERED_DRAIN_FLUSH_FUA is mapped to REQ_FLUSH | REQ_FUA.

    Signed-off-by: Tejun Heo
    Acked-by: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Nick Piggin
    Cc: Michael S. Tsirkin
    Cc: Jeremy Fitzhardinge
    Cc: Chris Wright
    Cc: FUJITA Tomonori
    Cc: Geert Uytterhoeven
    Cc: David S. Miller
    Cc: Alasdair G Kergon
    Cc: Pierre Ossman
    Cc: Stefan Weinhuber
    Signed-off-by: Jens Axboe

    Tejun Heo
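
    A usage sketch of the new interface for the three mappings listed above
    (q is the driver's request_queue):

    blk_queue_flush(q, 0);                   /* ORDERED_DRAIN: no cache  */
    blk_queue_flush(q, REQ_FLUSH);           /* ORDERED_DRAIN_FLUSH      */
    blk_queue_flush(q, REQ_FLUSH | REQ_FUA); /* ORDERED_DRAIN_FLUSH_FUA  */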
     

12 Aug, 2010

9 commits

  • Update __clone_and_map_discard to loop across all targets in a DM
    device's table when it processes a discard bio. If a discard crosses a
    target boundary it must be split accordingly.

    Update __issue_target_requests and __issue_target_request to allow a
    cloned discard bio to have a custom start sector and size.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Split max_io_len_target_boundary out of max_io_len so that the discard
    support can make use of it without duplicating max_io_len code.

    Avoiding max_io_len's split_io logic enables DM's discard support to
    submit the entire discard request to a target. But discards must still
    be split on target boundaries.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Rename __flush_target to __issue_target_request now that it is used to
    issue both flush and discard requests.

    Introduce __issue_target_requests as a convenient wrapper that calls
    __issue_target_request 'num_flush_requests' or 'num_discard_requests'
    times per target.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
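
    A hedged sketch of the wrapper described above; struct clone_info and
    __issue_target_request() are dm.c internals and the exact signatures may
    differ:

    static void __issue_target_requests(struct clone_info *ci,
                                        struct dm_target *ti,
                                        unsigned num_requests)
    {
            unsigned request_nr;

            for (request_nr = 0; request_nr < num_requests; request_nr++)
                    __issue_target_request(ci, ti, request_nr);
    }

    /* callers pass ti->num_flush_requests or ti->num_discard_requests */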
     
  • Allow discards to be passed through to linear mappings if at least one
    underlying device supports it. Discards will be forwarded only to
    devices that support them.

    A target that supports discards should set num_discard_requests to
    indicate how many times each discard request must be submitted to it.

    Verify table's underlying devices support discards prior to setting the
    associated DM device as capable of discards (via QUEUE_FLAG_DISCARD).

    Signed-off-by: Mike Snitzer
    Signed-off-by: Mikulas Patocka
    Reviewed-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • 'target_request_nr' is a more generic name that reflects the fact that
    it will be used for both flush and discard support.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Change bio-based mapped devices so that they no longer have a fully
    initialized request_queue (request_fn, elevator, etc). This means bio-based DM
    devices no longer register elevator sysfs attributes ('iosched/' tree
    or 'scheduler' other than "none").

    In contrast, a request-based DM device will continue to have a full
    request_queue and will register elevator sysfs attributes. Therefore
    a user can determine a DM device's type by checking if elevator sysfs
    attributes exist.

    First allocate a minimalist request_queue structure for a DM device
    (needed for both bio and request-based DM).

    Initialization of a full request_queue is deferred until it is known
    that the DM device is request-based, at the end of the table load
    sequence.

    Factor DM device's request_queue initialization:
    - common to both request-based and bio-based into dm_init_md_queue().
    - specific to request-based into dm_init_request_based_queue().

    The md->type_lock mutex is used to protect md->queue, in addition to
    md->type, during table_load().

    A DM device's first table_load will establish the immutable md->type.
    But md->queue initialization, based on md->type, may fail at that time
    (because blk_init_allocated_queue cannot allocate memory). Therefore
    any subsequent table_load must (re)try dm_setup_md_queue independently of
    establishing md->type.

    Signed-off-by: Mike Snitzer
    Acked-by: Kiyoshi Ueda
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Determine whether a mapped device is bio-based or request-based when
    loading its first (inactive) table and don't allow that to be changed
    later.

    This patch performs different device initialisation in each of the two
    cases. (We don't think it's necessary to add code to support changing
    between the two types.)

    Allowed md->type transitions:
    DM_TYPE_NONE to DM_TYPE_BIO_BASED
    DM_TYPE_NONE to DM_TYPE_REQUEST_BASED

    We now prevent table_load from replacing the inactive table with a
    conflicting type of table even after an explicit table_clear.

    Introduce 'type_lock' into the struct mapped_device to protect md->type
    and to prepare for the next patch that will change the queue
    initialization and allocate memory while md->type_lock is held.

    Signed-off-by: Mike Snitzer
    Acked-by: Kiyoshi Ueda
    Signed-off-by: Alasdair G Kergon

    drivers/md/dm-ioctl.c | 15 +++++++++++++++
    drivers/md/dm.c | 37 ++++++++++++++++++++++++++++++-------
    drivers/md/dm.h | 5 +++++
    include/linux/dm-ioctl.h | 4 ++--
    4 files changed, 52 insertions(+), 9 deletions(-)

    Mike Snitzer
     
  • When processing barriers, skip the second flush if processing the bio
    failed with -EOPNOTSUPP. This can happen with discard+barrier requests.
    If the device doesn't support discard, there would be two useless
    SYNCHRONIZE CACHE commands. The first dm_flush cannot be so easily
    optimized out, so we leave it there.

    Previously, -EOPNOTSUPP could be received in dec_pending only with empty
    barriers and we ignored that error, assuming the device not supporting
    cache flushes has cache always consistent. With the addition of discard
    barriers, this -EOPNOTSUPP can also be generated by discards and we
    must record it in md->barrier_error for process_barrier.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • This patch separates the device deletion code from dm_put()
    to make sure the deletion happens in the process context.

    By this patch, device deletion always occurs in an ioctl (process)
    context and dm_put() can be called in interrupt context.
    As a result, the request-based dm's bad dm_put() usage pointed out
    by Mikulas below disappears.
    http://marc.info/?l=dm-devel&m=126699981019735&w=2

    Without this patch, I confirmed there is a case that crashes the system:
    dm_put() => dm_table_destroy() => vfree() => BUG_ON(in_interrupt())

    Some more backgrounds and details:
    In request-based dm, a device opener can remove a mapped_device
    while the last request is still completing, because bios in the last
    request complete first and then the device opener can close and remove
    the mapped_device before the last request completes:
    CPU0                                        CPU1
    =================================================================
    <>
    blk_end_request_all(clone_rq)
      blk_update_request(clone_rq)
        bio_endio(clone_bio) == end_clone_bio
          blk_update_request(orig_rq)
            bio_endio(orig_bio)
                                                <>
                                                dm_blk_close()
                                                dev_remove()
                                                  dm_put(md)
                                                    <>
    blk_finish_request(clone_rq)
      ....
      dm_end_request(clone_rq)
        free_rq_clone(clone_rq)
        blk_end_request_all(orig_rq)
        rq_completed(md)

    So request-based dm used dm_get()/dm_put() to hold md for each I/O
    until its request completion handling is fully done.
    However, the final dm_put() can call the device deletion code which
    must not be run in interrupt context and may cause kernel panic.

    To solve the problem, this patch moves the device deletion code,
    dm_destroy(), to predetermined places that actually delete the
    mapped_device in ioctl (process) context, and changes dm_put() so that
    it just decrements the reference count of the mapped_device.
    By this change, dm_put() can be used in any context and the symmetric
    model below is introduced:
    dm_create(): create a mapped_device
    dm_destroy(): destroy a mapped_device
    dm_get(): increment the reference count of a mapped_device
    dm_put(): decrement the reference count of a mapped_device

    dm_destroy() waits for all references of the mapped_device to disappear,
    then deletes the mapped_device.

    dm_destroy() uses active waiting with msleep(1), since deleting
    the mapped_device isn't a performance-critical task.
    And since at this point nobody can open the mapped_device and no new
    references will be taken, the pending counts exist only because of
    racing completion activity and will eventually decrease to zero.

    For the unlikely case of the forced module unload, dm_destroy_immediate(),
    which doesn't wait and forcibly deletes the mapped_device, is also
    introduced and used in dm_hash_remove_all(). Otherwise, "rmmod -f"
    may be stuck and never return.
    And now, because the mapped_device is deleted at this point, subsequent
    accesses to the mapped_device may cause NULL pointer references.

    Cc: stable@kernel.org
    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
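
    A hedged sketch of the resulting split between reference counting and
    deletion; md->holders is a mapped_device internal, free_dev() stands in
    for the real teardown, and the step that blocks new opens/references is
    omitted:

    #include <linux/delay.h>

    void dm_put(struct mapped_device *md)
    {
            /* safe in any context now: only drops a reference */
            atomic_dec(&md->holders);
    }

    void dm_destroy(struct mapped_device *md)
    {
            /* called only from ioctl (process) context */
            while (atomic_read(&md->holders))
                    msleep(1);      /* deletion is not performance-critical */

            free_dev(md);
    }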