01 Feb, 2013

1 commit


31 Jan, 2013

2 commits

  • When processing WRITE SAME requests, fix dm to send the configured
    number of WRITE SAME requests to the target rather than the number
    configured for discards, which is not necessarily the same.

    Device-mapper WRITE SAME support was introduced by commit
    23508a96cd2e857d57044a2ed7d305f2d9daf441 ("dm: add WRITE SAME support").
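
    The shape of the fix, sketched against the 3.8-era dm core (the helper
    names follow that code base, but treat this as a sketch rather than the
    exact diff):

        static int __clone_and_map_write_same(struct clone_info *ci)
        {
                /* was get_num_discard_requests, which cloned one request
                 * per configured discard instead of per configured
                 * WRITE SAME */
                return __clone_and_map_changing_extent_only(ci,
                                get_num_write_same_requests, NULL);
        }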

    Signed-off-by: Alasdair G Kergon
    Acked-by: Mike Snitzer

    Alasdair G Kergon
     
  • thin_io_hints() is blindly copying the queue limits from the thin-pool,
    which can lead to incorrect limits being set. The fix here simply
    deletes the thin_io_hints() hook, leaving the existing stacking
    infrastructure to set the limits correctly.

    When a thin-pool uses an MD device for the data device a thin device
    from the thin-pool must respect MD's constraints about disallowing a bio
    from spanning multiple chunks. Otherwise errors result. For example,
    if the raid0 chunksize is 1152K and the thin-pool chunksize is 256K,
    the following md/raid0 error (with extra debug tracing added to
    thin_endio) appears when mkfs.xfs is executed against the thin device:

    md/raid0:md99: make_request bug: can't convert block across chunks or bigger than 1152k 6688 127
    device-mapper: thin: bio sector=2080 err=-5 bi_size=130560 bi_rw=17 bi_vcnt=32 bi_idx=0

    This extra DM debugging shows that the failing bio is spanning across
    the first and second logical 1152K chunk (sector 2080 + 255 takes the
    bio beyond the first chunk's boundary of sector 2304). So the bio
    splitting that DM is doing clearly isn't respecting the MD limits.

    max_hw_sectors_kb is 127 for both the thin-pool and thin device
    (queue_max_hw_sectors returns 255 so we'll excuse sysfs's lack of
    precision). So this explains why bi_size is 130560.

    But the thin device's max_hw_sectors_kb should be 4 (PAGE_SIZE), given
    that it has no .merge function (for bio_add_page to consult indirectly
    via dm_merge_bvec), yet the thin-pool sits above an MD device with a
    compulsory merge_bvec_fn. This scenario is exactly
    why DM must resort to sending single PAGE_SIZE bios to the underlying
    layer. Some additional context for this is available in the header for
    commit 8cbeb67a ("dm: avoid unsupported spanning of md stripe boundaries").

    Long story short, the reason a thin device doesn't properly get
    configured to have a max_hw_sectors_kb of 4 (PAGE_SIZE) is that
    thin_io_hints() is blindly copying the queue limits from the thin-pool
    device directly to the thin device's queue limits.

    Fix this by eliminating thin_io_hints. Doing so is safe because the
    block layer's queue limits stacking already enables the upper level thin
    device to inherit the thin-pool device's discard, minimum_io_size and
    optimal_io_size limits that get set in pool_io_hints. But avoiding the
    queue limits copy allows the thin and thin-pool limits to be different
    where it is important, namely max_hw_sectors_kb.
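
    For illustration, the kind of hook being deleted looked broadly like
    this (a sketch, not the verbatim 3.7 code):

        static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
        {
                struct thin_c *tc = ti->private;

                /* blind copy: clobbers limits (e.g. max_hw_sectors) that
                 * the block layer's stacking would otherwise derive
                 * correctly for the thin device */
                *limits = bdev_get_queue(tc->pool_dev->bdev)->limits;
        }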

    Reported-by: Daniel Browning
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     

24 Jan, 2013

1 commit

  • Before attempting to activate a RAID array, we check it for sufficient
    redundancy. That is, we make sure there are not so many failed devices -
    or devices specified for rebuild - as to undermine our ability to
    activate the array. The current code performs this check twice: once to
    ensure the user did not specify too many devices for rebuild
    ('validate_rebuild_devices') and again after possibly experiencing a
    failure to read the superblock ('analyse_superblocks'). Neither of these
    checks is sufficient. The first check is done properly but with
    insufficient information about the possible failure state of the devices
    to determine reliably whether the array can be activated. The second
    check is simply done wrong in the case of RAID10 because it doesn't
    account for the independence of the stripes (i.e. mirror sets). The
    solution is to use the properly written check
    ('validate_rebuild_devices'), but to perform it after the superblocks
    have been read, when we know which devices have failed. This gives us
    one check instead of two and performs it in a location where it can be
    done right.

    Only RAID10 was affected, in the following ways:
    - the code did not properly catch the condition where a user specified
    a device for rebuild that already had a failed device in the same mirror
    set. (This condition would, however, be caught at a deeper level in MD.)
    - the code triggered false positives, denying activation when devices in
    independent mirror sets had failed, because it counted the failures as
    though they were all in the same set (see the sketch below).
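
    A hedged sketch of the corrected accounting for RAID10's 'near' layout
    (identifiers loosely follow the dm-raid target and are illustrative):

        /* a device is unusable if its superblock could not be read or it
         * is not in-sync; fail activation only when every copy in a
         * single mirror set is unusable */
        rebuilds_per_set = 0;
        for (i = 0; i < rs->md.raid_disks; i++) {
                if (!(i % copies))
                        rebuilds_per_set = 0;   /* entering a new mirror set */
                if ((!rs->dev[i].rdev.sb_page ||
                     !test_bit(In_sync, &rs->dev[i].rdev.flags)) &&
                    ++rebuilds_per_set == copies)
                        return -EINVAL;         /* every copy in this set lost */
        }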

    This error was most likely introduced by (or this patch should have
    accompanied) commit 4ec1e369, which first appeared in v3.7-rc1.
    Consequently this fix should also go into v3.7.y; however, there is a
    small conflict on the .version in raid_target, so I'll submit a
    separate patch to -stable.

    Cc: stable@vger.kernel.org
    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     

22 Dec, 2012

36 commits

  • Pull dm update from Alasdair G Kergon:
    "Miscellaneous device-mapper fixes, cleanups and performance
    improvements.

    Of particular note:
    - Disable broken WRITE SAME support in all targets except linear and
    striped. Use it when kcopyd is zeroing blocks.
    - Remove several mempools from targets by moving the data into the
    bio's new front_pad area (which dm calls 'per_bio_data').
    - Fix a race in thin provisioning if discards are misused.
    - Prevent userspace from interfering with the ioctl parameters and
    use kmalloc for the data buffer if it's small instead of vmalloc.
    - Throttle some annoying error messages when I/O fails."

    * tag 'dm-3.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: (36 commits)
    dm stripe: add WRITE SAME support
    dm: remove map_info
    dm snapshot: do not use map_context
    dm thin: dont use map_context
    dm raid1: dont use map_context
    dm flakey: dont use map_context
    dm raid1: rename read_record to bio_record
    dm: move target request nr to dm_target_io
    dm snapshot: use per_bio_data
    dm verity: use per_bio_data
    dm raid1: use per_bio_data
    dm: introduce per_bio_data
    dm kcopyd: add WRITE SAME support to dm_kcopyd_zero
    dm linear: add WRITE SAME support
    dm: add WRITE SAME support
    dm: prepare to support WRITE SAME
    dm ioctl: use kmalloc if possible
    dm ioctl: remove PF_MEMALLOC
    dm persistent data: improve space map block alloc failure message
    dm thin: use DMERR_LIMIT for errors
    ...

    Linus Torvalds
     
  • Rename stripe_map_discard to stripe_map_range and reuse it for WRITE
    SAME bio processing.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • This patch removes map_info from bio-based device mapper targets.
    map_info is still used for request-based targets.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Eliminate struct map_info from dm-snap.

    map_info->ptr was used in dm-snap to indicate if the bio was tracked.
    If map_info->ptr was non-NULL, the bio was linked in tracked_chunk_hash.

    This patch removes the use of map_info->ptr. We now determine whether
    the bio was tracked based on hlist_unhashed(&c->node): if
    hlist_unhashed is true, the bio is not tracked; if false, it is.
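
    A minimal sketch of the new test (the helper name is hypothetical;
    hlist_unhashed and the node member are as described above):

        /* a bio is tracked iff its chunk is still linked into tracked_chunk_hash */
        static inline bool bio_is_tracked(struct dm_snap_tracked_chunk *c)
        {
                return !hlist_unhashed(&c->node);
        }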

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • This patch removes endio_hook_pool from dm-thin and uses per-bio data instead.

    This patch removes any use of map_info in preparation for the next patch
    that removes map_info from bio-based device mapper.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Don't use map_info any more in dm-raid1.

    map_info was used for writes to hold the region number. For this purpose
    we add a new field dm_bio_details to dm_raid1_bio_record.

    map_info was used for reads to hold a pointer to dm_raid1_bio_record (a
    non-NULL pointer meant bio details had been saved; a NULL pointer meant
    they had not). We now use dm_raid1_bio_record's details.bi_bdev for
    this purpose: if bi_bdev is NULL, details were not saved; if bi_bdev is
    non-NULL, they were.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Replace map_info with a per-bio structure "struct per_bio_data" in dm-flakey.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Rename struct read_record to bio_record in dm-raid1.

    In the following patch, the structure will be used for both read and
    write bios, so rename it.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • This patch moves target_request_nr from map_info to dm_target_io and
    makes it accessible with dm_bio_get_target_request_nr.

    This patch is a preparation for the next patch that removes map_info.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Replace tracked_chunk_pool with per_bio_data in dm-snap.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Replace io_mempool with per_bio_data in dm-verity.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Replace read_record_pool with per_bio_data in dm-raid1.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Introduce a field per_bio_data_size in struct dm_target.

    Targets can set this field in the constructor. If a target sets this
    field to a non-zero value, "per_bio_data_size" bytes of auxiliary data
    are allocated for each bio submitted to the target. The target can use
    this data for any purpose; doing so improves performance by removing
    some per-target mempools.

    Per-bio data is accessed with dm_per_bio_data. The
    argument data_size must be the same as the value per_bio_data_size in
    dm_target.

    If the target has a pointer to per_bio_data, it can get a pointer to
    the bio with the dm_bio_from_per_bio_data() function (data_size must be
    the same as the value passed to dm_per_bio_data).
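
    A hedged sketch of how a target might use this API (the target, its
    struct and the example_dev() helper are hypothetical; the dm calls are
    those described above):

        struct example_pb {
                unsigned long start_jiffies;    /* whatever the target needs per bio */
        };

        static int example_ctr(struct dm_target *ti, unsigned argc, char **argv)
        {
                /* ask dm to reserve space in every bio's front_pad */
                ti->per_bio_data_size = sizeof(struct example_pb);
                return 0;
        }

        static int example_map(struct dm_target *ti, struct bio *bio)
        {
                struct example_pb *pb =
                        dm_per_bio_data(bio, sizeof(struct example_pb));

                pb->start_jiffies = jiffies;            /* no mempool allocation needed */
                bio->bi_bdev = example_dev(ti)->bdev;   /* hypothetical remap helper */
                return DM_MAPIO_REMAPPED;
        }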

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Add WRITE SAME support to dm-io and make it accessible to
    dm_kcopyd_zero(). dm_kcopyd_zero() provides an asynchronous interface
    whereas the blkdev_issue_write_same() interface is synchronous.

    WRITE SAME is a SCSI command that can be leveraged for more efficient
    zeroing of a specified logical extent of a device which supports it.
    Only a single zeroed logical block is transferred to the target for each
    WRITE SAME and the target then writes that same block across the
    specified extent.

    The dm thin target uses this.
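
    For reference, the asynchronous entry point looks like this (signature
    per the 3.8-era dm-kcopyd header; treat as illustrative):

        int dm_kcopyd_zero(struct dm_kcopyd_client *kc, unsigned num_dests,
                           struct dm_io_region *dests, unsigned flags,
                           dm_kcopyd_notify_fn fn, void *context);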

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • The linear target can already support WRITE SAME requests so signal
    this by setting num_write_same_requests to 1.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • WRITE SAME bios have a payload that contains a single page. When
    cloning WRITE SAME bios DM has no need to modify the bi_io_vec
    attributes (and doing so would be detrimental). DM need only alter the
    start and end of the WRITE SAME bio accordingly.

    Rather than duplicate __clone_and_map_discard, factor out a common
    function that is also used by __clone_and_map_write_same.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Allow targets to opt in to WRITE SAME support by setting
    'num_write_same_requests' in the dm_target structure.

    A dm device will only advertise WRITE SAME support if all its
    targets and all its underlying devices support it.
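
    A minimal sketch of a target opting in from its constructor (the target
    name is hypothetical; the field is as described above):

        static int opt_in_ctr(struct dm_target *ti, unsigned argc, char **argv)
        {
                ti->num_write_same_requests = 1;        /* advertise WRITE SAME support */
                return 0;
        }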

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • If the parameter buffer is small enough, try to allocate it with kmalloc()
    rather than vmalloc().

    vmalloc is noticeably slower than kmalloc because it has to manipulate
    page tables.

    In my tests, this patch makes activation 13 times faster on PA-RISC
    and 5% faster on Opteron.

    This patch introduces a new function, free_params(), to free the
    parameters; it uses new flags that record whether vmalloc() was used
    and whether the input buffer must be wiped after use.
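
    A sketch of the allocation strategy described (the flag names follow
    the patch but are best treated as illustrative):

        /* sketch of copy_params(): prefer kmalloc for small buffers */
        size_t size = param_kernel.data_size;   /* validated fixed-size header */
        struct dm_ioctl *dmi = NULL;

        if (size <= KMALLOC_MAX_SIZE)
                dmi = kmalloc(size, GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN);

        if (dmi) {
                *param_flags |= DM_PARAMS_KMALLOC;      /* free_params() uses kfree() */
        } else {
                dmi = __vmalloc(size, GFP_NOIO | __GFP_HIGHMEM, PAGE_KERNEL);
                *param_flags |= DM_PARAMS_VMALLOC;      /* free_params() uses vfree() */
        }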

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • When allocating memory for the userspace ioctl data, set appropriate
    GFP flags directly instead of using PF_MEMALLOC.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Improve space map error message when unable to allocate a new
    metadata block.

    Signed-off-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • Throttle all errors logged from the IO path by dm thin.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Nearly all of persistent-data is in the IO path so throttle error
    messages with DMERR_LIMIT to limit the amount logged when
    something has gone wrong.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Reinstate a useful error message when the block manager buffer validator fails.
    This was mistakenly eliminated when the block manager was converted to use
    dm-bufio.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • If the user does not supply a bitmap region_size to the dm raid target,
    a reasonable size is computed automatically. If this is not a power of 2,
    the md code will report an error later.

    This patch catches the problem early and rounds the region_size to the
    next power of two.
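
    The rounding can use the kernel's existing log2 helpers, along these
    lines (a sketch):

        /* md requires a power-of-2 region_size; round the computed value up */
        if (!is_power_of_2(region_size))
                region_size = roundup_pow_of_two(region_size);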

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • Remove the unused @data_block parameter from cell_defer. Change
    thin_bio_map to return directly at each exit point rather than setting
    a variable.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • Rename cell_defer_except() to cell_defer_no_holder() which describes
    its function more clearly.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • track_chunk is always called with interrupts enabled. Consequently, we
    do not need to save and restore the interrupt state in a "flags"
    variable. This patch changes spin_lock_irqsave to spin_lock_irq and
    spin_unlock_irqrestore to spin_unlock_irq.
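
    The transformation, sketched against dm-snap's tracked-chunk lock
    (illustrative):

        /* before: saving interrupt state that is known to be "enabled" */
        spin_lock_irqsave(&s->tracked_chunk_lock, flags);
        hlist_del(&c->node);
        spin_unlock_irqrestore(&s->tracked_chunk_lock, flags);

        /* after: plain disable/enable, no flags needed */
        spin_lock_irq(&s->tracked_chunk_lock);
        hlist_del(&c->node);
        spin_unlock_irq(&s->tracked_chunk_lock);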

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Use a defined macro DM_ENDIO_INCOMPLETE instead of a numeric constant.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • mempool_alloc can't fail if __GFP_WAIT is specified, so the condition
    that tests if read_record is non-NULL is always true.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • If "ignore_discard" is specified when creating the thin pool device then
    discard support is disabled for that device. The pool device's status
    should reflect this fact rather than stating "no_discard_passdown"
    (which implies discards are enabled but passdown is disabled).

    Reported-by: Zdenek Kabelac
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • When deleting nested btrees, the code forgets to delete the innermost
    btree. The thin-metadata code serendipitously compensates for this by
    claiming there is one extra layer in the tree.

    This patch corrects both problems.

    Signed-off-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • When discards are prepared it is best to directly wake the worker that
    will process them. The worker will be woken anyway, via periodic
    commit, but there is no reason not to call wake_worker here.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • There is a race when discard bios and non-discard bios are issued
    simultaneously to the same block.

    Discard support is expensive for all thin devices precisely because you
    have to be careful to quiesce the area you're discarding. DM thin must
    handle this conflicting IO pattern (simultaneous non-discard vs discard)
    even though a sane application shouldn't be issuing such IO.

    The race manifests as follows:

    1. A non-discard bio is mapped in thin_bio_map.
    This doesn't lock out parallel activity to the same block.

    2. A discard bio is issued to the same block as the non-discard bio.

    3. The discard bio is locked in a dm_bio_prison_cell in process_discard
    to lock out parallel activity against the same block.

    4. The non-discard bio's mapping continues and its all_io_entry is
    incremented so the bio is accounted for in the thin pool's all_io_ds
    which is a dm_deferred_set used to track time locality of non-discard IO.

    5. The non-discard bio is finally locked in a dm_bio_prison_cell in
    process_bio.

    The race can result in deadlock, leaving the block layer hanging waiting
    for completion of a discard bio that never completes, e.g.:

    INFO: task ruby:15354 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    ruby D ffffffff8160f0e0 0 15354 15314 0x00000000
    ffff8802fb08bc58 0000000000000082 ffff8802fb08bfd8 0000000000012900
    ffff8802fb08a010 0000000000012900 0000000000012900 0000000000012900
    ffff8802fb08bfd8 0000000000012900 ffff8803324b9480 ffff88032c6f14c0
    Call Trace:
    [] schedule+0x29/0x70
    [] schedule_timeout+0x195/0x220
    [] ? _dm_request+0x111/0x160 [dm_mod]
    [] wait_for_common+0x11e/0x190
    [] ? try_to_wake_up+0x2b0/0x2b0
    [] wait_for_completion+0x1d/0x20
    [] blkdev_issue_discard+0x219/0x260
    [] blkdev_ioctl+0x6e9/0x7b0
    [] block_ioctl+0x3c/0x40
    [] do_vfs_ioctl+0x8c/0x340
    [] ? block_llseek+0x67/0xb0
    [] sys_ioctl+0xa1/0xb0
    [] ? sys_rt_sigprocmask+0x86/0xd0
    [] system_call_fastpath+0x16/0x1b

    The thinp-test-suite's test_discard_random_sectors reliably hits this
    deadlock on fast SSD storage.

    The fix for this race is that the all_io_entry for a bio must be
    incremented whilst the dm_bio_prison_cell is held for the bio's
    associated virtual and physical blocks. That cell locking wasn't
    occurring early enough in thin_bio_map; this patch corrects that.

    Care is taken to always call the new function inc_all_io_entry() with
    the relevant cells locked, but they are generally unlocked before
    calling issue() to try to avoid holding the cells locked across
    generic_make_request.

    Also, now that thin_bio_map may lock bios in a cell, process_bio() is no
    longer the only thread that will do so. Because of this we must be sure
    to use cell_defer_except() to release all non-holder entries added by
    the other thread, since they must be deferred.
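
    A sketch of the new helper, modelled on the 3.8-era dm-thin.c (the
    per-bio hook retrieval predates per_bio_data; treat details as
    illustrative):

        /* callers must hold the bio's cell(s); discards are accounted separately */
        static void inc_all_io_entry(struct pool *pool, struct bio *bio)
        {
                struct dm_thin_endio_hook *h;

                if (bio->bi_rw & REQ_DISCARD)
                        return;

                h = dm_get_mapinfo(bio)->ptr;
                h->all_io_entry = dm_deferred_entry_inc(pool->all_io_ds);
        }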

    This patch depends on "dm thin: replace dm_cell_release_singleton with
    cell_defer_except".

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Cc: stable@vger.kernel.org

    Joe Thornber
     
  • Change existing users of the function dm_cell_release_singleton to use
    cell_defer_except instead, and then remove the now-unused function.

    Everywhere that calls dm_cell_release_singleton, the bio in question
    is the holder of the cell.

    If there are no non-holder entries in the cell then cell_defer_except
    behaves exactly like dm_cell_release_singleton. Conversely, if there
    *are* non-holder entries then dm_cell_release_singleton must not be used
    because those entries would need to be deferred.

    Consequently, it is safe to replace use of dm_cell_release_singleton
    with cell_defer_except.

    This patch is a pre-requisite for "dm thin: fix race between
    simultaneous io and discards to same block".

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • WRITE SAME bios are not yet handled correctly by device-mapper, so
    disable their use on device-mapper devices by setting
    max_write_same_sectors to zero.

    As an example, a ciphertext device is incompatible because the data
    gets changed according to the location at which it is written, so the
    dm crypt target cannot support it.

    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org
    Cc: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Abort dm ioctl processing if userspace changes the data_size parameter
    after we validated it but before we finished copying the data buffer
    from userspace.

    The dm ioctl parameters are processed in the following sequence:
    1. ctl_ioctl() calls copy_params();
    2. copy_params() makes a first copy of the fixed-sized portion of the
    userspace parameters into the local variable "tmp";
    3. copy_params() then validates tmp.data_size and allocates a new
    structure big enough to hold the complete data and copies the whole
    userspace buffer there;
    4. ctl_ioctl() reads userspace data the second time and copies the whole
    buffer into the pointer "param";
    5. ctl_ioctl() reads param->data_size without any validation and stores it
    in the variable "input_param_size";
    6. "input_param_size" is further used as the authoritative size of the
    kernel buffer.

    The problem is that userspace code could change the contents of user
    memory between steps 2 and 4. In particular, the data_size parameter
    can be changed to an invalid value after the kernel has validated it.
    This lets userspace force the kernel to access invalid kernel memory.

    The fix is to ensure that the size has not changed at step 4.
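
    Sketched, the check amounts to the following (names illustrative):

        /* after the full second copy_from_user() of the buffer */
        if (dmi->data_size != param_kernel.data_size) {
                DMERR("rejecting ioctl: data size modified while processing parameters");
                goto bad;       /* abort rather than trust the changed value */
        }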

    This patch shouldn't have a security impact because CAP_SYS_ADMIN is
    required to run this code, but it should be fixed anyway.

    Reported-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon
    Cc: stable@kernel.org

    Alasdair G Kergon