30 Jan, 2015

1 commit

  • commit a59db67656021fa212e9b95a583f13c34eb67cd9 upstream.

    Introduce a new variable to count the number of allocated migration
    structures. The existing variable cache->nr_migrations became
    overloaded. It was used to:

    i) track the number of migrations in flight, for the purposes of
    quiescing during suspend.

    ii) estimate the amount of background IO occurring.

    Recent discard changes meant that REQ_DISCARD bios are processed with
    a migration. Discards are not background IO, so nr_migrations was not
    incremented; however, this could cause quiescing to complete early.

    (i) is now handled with a new variable cache->nr_allocated_migrations.
    cache->nr_migrations has been renamed cache->nr_io_migrations.
    cleanup_migration() is now called free_io_migration(), since it
    decrements that variable.

    Also, remove the unused cache->next_migration variable that was
    replaced with prealloc_structs a while ago.
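
    A minimal sketch of the resulting split (field and helper names follow
    the text above; the exact upstream code may differ):

      struct cache {
              ...
              atomic_t nr_allocated_migrations; /* (i) quiescing on suspend */
              atomic_t nr_io_migrations;        /* (ii) background IO estimate */
      };

      static void free_io_migration(struct dm_cache_migration *mg)
      {
              atomic_dec(&mg->cache->nr_io_migrations); /* background IO done */
              free_migration(mg); /* drops nr_allocated_migrations */
      }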

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Joe Thornber
     

09 Jan, 2015

3 commits


10 Sep, 2014

1 commit

  • When a writeback or a promotion of a block is completed, the cell of
    that block is removed from the prison, the block is marked as clean, and
    the clear_dirty() callback of the cache policy is called.

    Unfortunately, performing those actions in this order allows an
    incoming write bio for that block to arrive before the dirty status
    has been cleared, possibly causing one of these two scenarios:

    Scenario A:

    Thread 1                       Thread 2
    cell_defer()                   .
    - cell removed from prison     .
    - detained bios queued         .
    .                              incoming write bio
    .                              remapped to cache
    .                              set_dirty() called,
    .                                but block already dirty
    .                                => it does nothing
    clear_dirty()                  .
    - block marked clean           .
    - policy clear_dirty() called  .

    Result: Block is marked clean even though it is actually dirty. No
    writeback will occur.

    Scenario B:

    Thread 1                       Thread 2
    cell_defer()                   .
    - cell removed from prison     .
    - detained bios queued         .
    clear_dirty()                  .
    - block marked clean           .
    .                              incoming write bio
    .                              remapped to cache
    .                              set_dirty() called
    .                              - block marked dirty
    .                              - policy set_dirty() called
    - policy clear_dirty() called  .

    Result: Block is properly marked as dirty, but the policy thinks it is
    clean and therefore never asks us to write it back.

    This case is visible in the "dmsetup status" dirty block count (which
    normally decreases to 0 on a quiet device).

    Fix these issues by calling clear_dirty() before calling cell_defer().
    Incoming bios for that block will then be detained in the cell and
    released only after clear_dirty() has completed, so the race will not
    occur.
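
    In code terms the fix is an ordering swap; a hedged sketch using the
    function names mentioned above:

      /* Before (racy): bios are released while the block is still dirty. */
      cell_defer(cache, mg->old_ocell, false);
      clear_dirty(cache, mg->old_oblock, mg->cblock);

      /* After (fixed): bios stay detained until the dirty state is clear. */
      clear_dirty(cache, mg->old_oblock, mg->cblock);
      cell_defer(cache, mg->old_ocell, false);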

    Found by inspecting the code after noticing spurious dirty counts
    (scenario B).

    Signed-off-by: Anssi Hannula
    Acked-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org

    Anssi Hannula
     

02 Aug, 2014

5 commits

  • Before, if the block layer's limit stacking didn't establish an
    optimal_io_size that was compatible with the cache's data block size,
    we'd set optimal_io_size to the data block size and minimum_io_size
    to 0 (which the block layer adjusts to be physical_block_size).

    Update cache_io_hints() to set both minimum_io_size and optimal_io_size
    to the cache's data block size. This fixes an issue where mkfs.xfs
    would create more XFS Allocation Groups on cache volumes than on a
    normal linear LV of comparable size.
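
    A hedged sketch of the updated hint logic (the exact condition and
    field names are assumptions based on the description above):

      static void cache_io_hints(struct dm_target *ti, struct queue_limits *limits)
      {
              struct cache *cache = ti->private;
              uint64_t io_opt_sectors = limits->io_opt >> SECTOR_SHIFT;

              /*
               * If the stacked limits are already compatible with the
               * cache's data block size, leave them alone; otherwise set
               * both hints to the data block size.
               */
              if (io_opt_sectors < cache->sectors_per_block ||
                  do_div(io_opt_sectors, cache->sectors_per_block)) {
                      blk_limits_io_min(limits,
                                        cache->sectors_per_block << SECTOR_SHIFT);
                      blk_limits_io_opt(limits,
                                        cache->sectors_per_block << SECTOR_SHIFT);
              }
      }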

    Signed-off-by: Mike Snitzer

    Mike Snitzer
     
  • Commit 7d48935e cleaned up the persistent-data's space-map-metadata
    limits by elevating them to dm-space-map-metadata.h. Update
    dm-cache-metadata to use these same limits.

    The calculation for DM_CACHE_METADATA_MAX_SECTORS didn't account for
    the size of the disk_bitmap_header, so the supported maximum metadata
    size is a bit smaller (reduced from 33423360 to 33292800 sectors).
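
    The arithmetic, assuming 4KiB metadata blocks, 2 bits per bitmap entry
    and a 16-byte disk_bitmap_header:

      entries per bitmap block = (4096 - 16) * 8 / 2 = 16320
      max metadata blocks      = 255 * 16320        = 4161600
      max metadata sectors     = 4161600 * 8        = 33292800

    Ignoring the header gives 255 * (4096 * 8 / 2) * 8 = 33423360, the old
    over-estimate.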

    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber

    Mike Snitzer
     
  • Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Joe Thornber
     
  • Factor out inc_and_issue and inc_ds helpers to simplify deferred set
    reference count increments. Also cleanup cache_map to consistently call
    cell_defer and inc_ds when the bio is DM_MAPIO_REMAPPED.

    No functional change.
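
    A hedged sketch of the helpers (signatures inferred from the
    surrounding dm-cache code, so treat them as assumptions):

      static void inc_ds(struct cache *cache, struct bio *bio,
                         struct dm_bio_prison_cell *cell)
      {
              size_t pb_data_size = get_per_bio_data_size(cache);
              struct per_bio_data *pb = get_per_bio_data(bio, pb_data_size);

              BUG_ON(!cell);
              BUG_ON(pb->all_io_entry);

              pb->all_io_entry = dm_deferred_entry_inc(cache->all_io_ds);
      }

      static void inc_and_issue(struct cache *cache, struct bio *bio,
                                struct dm_bio_prison_cell *cell)
      {
              inc_ds(cache, bio, cell);
              issue(cache, bio);
      }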

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Joe Thornber
     
  • nr_dirty is updated without locking, causing it to drift so that it is
    non-zero (either a small positive integer, or a very large one when an
    underflow occurs) even when there are no actual dirty blocks. This was
    due to a race between the workqueue and map function accessing nr_dirty
    in parallel without proper protection.

    People were seeing underruns due to a race on increment/decrement of
    nr_dirty; see: https://lkml.org/lkml/2014/6/3/648

    Fix this by using an atomic_t for nr_dirty.
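
    A minimal sketch of the change (helper bodies are assumptions based on
    the description above):

      static void set_dirty(struct cache *cache, dm_oblock_t oblock,
                            dm_cblock_t cblock)
      {
              if (!test_and_set_bit(from_cblock(cblock), cache->dirty_bitset)) {
                      atomic_inc(&cache->nr_dirty); /* was an unprotected ++ */
                      policy_set_dirty(cache->policy, oblock);
              }
      }

      static void clear_dirty(struct cache *cache, dm_oblock_t oblock,
                              dm_cblock_t cblock)
      {
              if (test_and_clear_bit(from_cblock(cblock), cache->dirty_bitset)) {
                      policy_clear_dirty(cache->policy, oblock);
                      if (atomic_dec_return(&cache->nr_dirty) == 0)
                              dm_table_event(cache->ti->table);
              }
      }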

    Reported-by: roma1390@gmail.com
    Signed-off-by: Anssi Hannula
    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org

    Anssi Hannula
     

27 May, 2014

1 commit


02 May, 2014

1 commit

  • Commit 2ee57d58735 ("dm cache: add passthrough mode") inadvertently
    removed the deferred set reference that was taken in cache_map()'s
    writethrough mode support. Restore taking this reference.

    This issue was found with code inspection.

    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber
    Cc: stable@vger.kernel.org # 3.13+

    Mike Snitzer
     

05 Apr, 2014

1 commit

  • When suspending a cache the policy is walked and the individual policy
    hints written to the metadata via sync_metadata(). This led to this
    lock order:

    policy->lock
    cache_metadata->root_lock

    When loading the cache target the policy is populated while the metadata
    lock is held:

    cache_metadata->root_lock
    policy->lock

    Fix this potential lock-inversion (ABBA) deadlock in sync_metadata() by
    ensuring the cache_metadata root_lock is held whilst all the hints are
    written, rather than being repeatedly locked while policy->lock is held
    (as was the case with each callout that policy_walk_mappings() made to
    the old save_hint() method).

    Found by turning on the CONFIG_PROVE_LOCKING ("Lock debugging: prove
    locking correctness") build option. However, it is not clear how the
    LOCKDEP reported paths can lead to a deadlock since the two paths,
    suspending a target and loading a target, never occur at the same time.
    But that doesn't mean the same lock-inversion couldn't have occurred
    elsewhere.
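
    A hedged sketch of the fixed path: root_lock is taken once, up front,
    so both paths acquire the locks in the same order (function names are
    assumptions):

      int dm_cache_write_hints(struct dm_cache_metadata *cmd,
                               struct dm_cache_policy *policy)
      {
              int r;

              down_write(&cmd->root_lock);
              r = write_hints(cmd, policy); /* policy->lock inside root_lock */
              up_write(&cmd->root_lock);

              return r;
      }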

    Reported-by: Marian Csontos
    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org

    Joe Thornber
     

28 Mar, 2014

2 commits

  • Discard block size not being equal to cache block size causes data
    corruption by erroneously avoiding migrations in issue_copy() because
    the discard state is being cleared for a group of cache blocks when it
    should not.

    Completely remove all code that enabled a distinction between the
    cache block size and discard block size.

    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Mike Snitzer

    Heinz Mauelshagen
     
  • If the discard block size is larger than the cache block size we will
    not properly quiesce IO to a region that is about to be discarded. This
    results in a race between a cache migration where no copy is needed, and
    a write to an adjacent cache block that's within the same large discard
    block.

    Work around this by limiting the discard_block_size to
    cache_block_size. Also limit the max_discard_sectors to
    cache_block_size.
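
    A sketch of the workaround (field names per the description above; the
    exact code is an assumption):

      static void set_discard_limits(struct cache *cache,
                                     struct queue_limits *limits)
      {
              limits->max_discard_sectors = cache->sectors_per_block;
              limits->discard_granularity =
                      cache->sectors_per_block << SECTOR_SHIFT;
      }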

    A more comprehensive fix that introduces range locking support in the
    bio_prison and proper quiescing of a discard range that spans multiple
    cache blocks is already in development.

    Reported-by: Morgan Mears
    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber
    Acked-by: Heinz Mauelshagen
    Cc: stable@vger.kernel.org

    Mike Snitzer
     

13 Mar, 2014

2 commits

  • In order to avoid wasting cache space a partial block at the end of the
    origin device is not cached. Unfortunately, the check for such a
    partial block at the end of the origin device was flawed.

    Fix accesses beyond the end of the origin device that occurred due to
    attempted promotion of an undetected partial block by:

    - initializing the per bio data struct to allow cache_end_io to work properly
    - recognizing access to the partial block at the end of the origin device
    - avoiding out of bounds access to the discard bitset

    Otherwise, users can experience errors like the following:

    attempt to access beyond end of device
    dm-5: rw=0, want=20971520, limit=20971456
    ...
    device-mapper: cache: promotion failed; couldn't copy block
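
    A hedged sketch of the recognition step in cache_map() (condition per
    the description; the surrounding code is assumed):

      dm_oblock_t block = get_bio_block(cache, bio);

      if (unlikely(from_oblock(block) >= from_oblock(cache->origin_blocks))) {
              /*
               * This can only occur if the io goes to a partial block at
               * the end of the origin device.  We don't cache these.
               * Just remap to the origin and carry on.
               */
              remap_to_origin(cache, bio);
              return DM_MAPIO_REMAPPED;
      }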

    Signed-off-by: Heinz Mauelshagen
    Acked-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org

    Heinz Mauelshagen
     
  • During demotion or promotion to a cache's >2TB fast device, we must
    not truncate the cache block's associated sector to 32 bits. The
    32-bit temporary result of from_cblock() caused a 32-bit
    multiplication when calculating the sector of the fast device in
    issue_copy_real().

    Use an intermediate 64-bit type to store the 32-bit from_cblock()
    result to allow for a proper 64-bit multiplication.
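
    A minimal sketch of the widened arithmetic in issue_copy_real()
    (assumed, per the description above):

      uint64_t cblock = from_cblock(mg->cblock); /* widen the 32-bit value */
      struct dm_io_region c_region;

      c_region.bdev = cache->cache_dev->bdev;
      c_region.sector = cblock * cache->sectors_per_block; /* 64-bit multiply */
      c_region.count = cache->sectors_per_block;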

    Here is an example of how this bug manifests on an ext4 filesystem:

    EXT4-fs error (device dm-0): ext4_mb_generate_buddy:756: group 17136, 32768 clusters in bitmap, 30688 in gd; block bitmap corrupt.
    JBD2: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

    Signed-off-by: Heinz Mauelshagen
    Acked-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org

    Heinz Mauelshagen
     

28 Feb, 2014

1 commit

  • When remapping a block to the cache's fast device that is larger than
    2TB, we must not truncate the destination sector to 32 bits. The
    32-bit temporary result of from_cblock() was being overflowed in
    remap_to_cache() due to the logical left shift.

    Use an intermediate 64-bit type to store the 32-bit from_cblock()
    result to fix the overflow.
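
    The remap_to_cache() counterpart, sketched under the same assumption
    (power-of-two block size branch only):

      sector_t block = from_cblock(cblock); /* widen the 32-bit cblock */
      sector_t bi_sector = bio->bi_iter.bi_sector;

      bio->bi_iter.bi_sector =
              (block << cache->sectors_per_block_shift) |
              (bi_sector & (cache->sectors_per_block - 1));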

    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org

    Heinz Mauelshagen
     

18 Feb, 2014

2 commits

  • When completing an overwrite bio, in overwrite_endio(), the associated
    migration should not be added to the 'completed_migrations' until the
    bio's fields are restored with dm_unhook_bio().

    Otherwise, do_worker() can race to process 'completed_migrations' before
    dm_unhook_bio() -- so the bio's bi_end_io is incorrect. This is
    unlikely to cause any problems given the current code but should be
    fixed on the basis of correctness.

    Also, the cache's spinlock only needs to be held when manipulating the
    'completed_migrations' list -- other changes don't need protection.
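
    A hedged sketch of the corrected completion (based on the description;
    not necessarily the exact upstream code):

      static void overwrite_endio(struct bio *bio, int err)
      {
              struct dm_cache_migration *mg = bio->bi_private;
              struct cache *cache = mg->cache;
              size_t pb_data_size = get_per_bio_data_size(cache);
              struct per_bio_data *pb = get_per_bio_data(bio, pb_data_size);
              unsigned long flags;

              dm_unhook_bio(&pb->hook_info, bio); /* restore bi_end_io first */

              if (err)
                      mg->err = true;

              mg->requeue_holder = false;

              /* only the list manipulation needs the spinlock */
              spin_lock_irqsave(&cache->lock, flags);
              list_add_tail(&mg->list, &cache->completed_migrations);
              spin_unlock_irqrestore(&cache->lock, flags);

              wake_worker(cache);
      }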

    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber

    Mike Snitzer
     
  • Commit c9d28d5d ("dm cache: promotion optimisation for writes")
    incorrectly placed the 'hook_info' member in the writethrough-only
    portion of the per_bio_data structure.

    Given that the overwrite optimization may be used for writeback the
    'hook_info' member must be placed above the 'cache' member of the
    per_bio_data structure. Any members above 'cache' are available from
    both writeback and writethrough modes' per_bio_data structure.
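
    A sketch of the resulting layout (comments added; treat the exact
    member list as an assumption):

      struct per_bio_data {
              bool tick:1;
              unsigned req_nr:2;
              struct dm_deferred_entry *all_io_entry;
              struct dm_hook_info hook_info; /* above 'cache': both modes */

              /*
               * writethrough fields.  These MUST remain at the end of this
               * structure and the 'cache' member must be the first as it
               * is used to determine the offset of the writethrough fields.
               */
              struct cache *cache;
              dm_cblock_t cblock;
              struct dm_bio_details bio_details;
      };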

    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber
    Cc: stable@vger.kernel.org # 3.13+

    Mike Snitzer
     

31 Jan, 2014

1 commit

  • Pull core block IO changes from Jens Axboe:
    "The major piece in here is the immutable bio_ve series from Kent, the
    rest is fairly minor. It was supposed to go in last round, but
    various issues pushed it to this release instead. The pull request
    contains:

    - Various smaller blk-mq fixes from different folks. Nothing major
    here, just minor fixes and cleanups.

    - Fix for a memory leak in the error path in the block ioctl code
    from Christian Engelmayer.

    - Header export fix from CaiZhiyong.

    - Finally the immutable biovec changes from Kent Overstreet. This
    enables some nice future work on making arbitrarily sized bios
    possible, and splitting more efficient. Related fixes to immutable
    bio_vecs:

    - dm-cache immutable fixup from Mike Snitzer.
    - btrfs immutable fixup from Muthu Kumar.

    - bio-integrity fix from Nic Bellinger, which is also going to stable"

    * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
    xtensa: fixup simdisk driver to work with immutable bio_vecs
    block/blk-mq-cpu.c: use hotcpu_notifier()
    blk-mq: for_each_* macro correctness
    block: Fix memory leak in rw_copy_check_uvector() handling
    bio-integrity: Fix bio_integrity_verify segment start bug
    block: remove unrelated header files and export symbol
    blk-mq: uses page->list incorrectly
    blk-mq: use __smp_call_function_single directly
    btrfs: fix missing increment of bi_remaining
    Revert "block: Warn and free bio if bi_end_io is not set"
    block: Warn and free bio if bi_end_io is not set
    blk-mq: fix initializing request's start time
    block: blk-mq: don't export blk_mq_free_queue()
    block: blk-mq: make blk_sync_queue support mq
    block: blk-mq: support draining mq queue
    dm cache: increment bi_remaining when bi_end_io is restored
    block: fixup for generic bio chaining
    block: Really silence spurious compiler warnings
    block: Silence spurious compiler warnings
    block: Kill bio_pair_split()
    ...

    Linus Torvalds
     

17 Jan, 2014

1 commit

  • The cache's policy may have been established using the "default" alias,
    which is currently the "mq" policy but the default policy may change in
    the future. It is useful to know exactly which policy is being used.

    Add a 'real' member to the dm_cache_policy_type structure and have the
    "default" dm_cache_policy_type point to the real "mq"
    dm_cache_policy_type. Update dm_cache_policy_get_name() to check if
    real is set, if so report the name of the real policy (not the alias).
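
    A sketch of the lookup (hedged; the surrounding structure is assumed):

      const char *dm_cache_policy_get_name(struct dm_cache_policy *p)
      {
              struct dm_cache_policy_type *t = p->private;

              /* if t->real is set then an alias was used (e.g. "default") */
              if (t->real)
                      return t->real->name;

              return t->name;
      }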

    Requested-by: Jonathan Brassow
    Signed-off-by: Mike Snitzer

    Mike Snitzer
     

10 Jan, 2014

1 commit

  • Improve cache_status to emit:

      <metadata block size> <#used metadata blocks>/<#total metadata blocks>
      <cache block size> <#used cache blocks>/<#total cache blocks>
      ...

    Adding the block sizes allows for easier calculation of the overall
    size of both the metadata and cache devices. Adding <#total cache
    blocks> provides useful context for how much of the cache is used.

    Unfortunately these additions to the status will require updates to
    users' scripts that monitor the cache status. But these changes help
    provide more comprehensive information about the cache device and will
    simplify tools that are being developed to manage dm-cache devices --
    because they won't need to issue 3 operations to cobble together the
    information that we can easily provide via a single status ioctl.

    While updating the status documentation in cache.txt, spaces were
    converted to tabs.
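
    A hypothetical status line in the new format (all field values are
    invented for illustration):

      # dmsetup status my_cache
      0 409600 cache 8 27/2048 512 128/4096 143 282 18 6 0 5 4 ...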

    Requested-by: Jonathan Brassow
    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber

    Mike Snitzer
     

01 Jan, 2014

1 commit

  • Needed to bring blk-mq up to date, since changes have been going in
    since for-3.14/core was established.

    Fixup merge issues related to the immutable biovec changes.

    Signed-off-by: Jens Axboe

    Conflicts:
    block/blk-flush.c
    fs/btrfs/check-integrity.c
    fs/btrfs/extent_io.c
    fs/btrfs/scrub.c
    fs/logfs/dev_bdev.c

    Jens Axboe
     

11 Dec, 2013

1 commit

  • Commit f494a9c6b1b6dd9a9f21bbb75d9210d478eeb498 ("dm cache: cache
    shrinking support") broke cache resizing support.

    dm_cache_resize() is called with cache->cache_size before it gets
    updated to new_size, so it is a no-op. But the dm-cache superblock is
    updated with the new_size even though the backing dm-array is not
    resized. Fix this by passing the new_size to dm_cache_resize().
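
    A minimal sketch of the fix (function name assumed):

      static bool resize_cache_dev(struct cache *cache, dm_cblock_t new_size)
      {
              int r;

              /* was dm_cache_resize(cache->cmd, cache->cache_size): a no-op */
              r = dm_cache_resize(cache->cmd, new_size);
              if (r) {
                      DMERR("could not resize cache metadata");
                      return false;
              }

              cache->cache_size = new_size;
              return true;
      }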

    Signed-off-by: Vincent Pelletier
    Acked-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Vincent Pelletier
     

04 Dec, 2013

1 commit


24 Nov, 2013

2 commits

  • This adds a generic mechanism for chaining bio completions. This is
    going to be used for a bio_split() replacement, and it turns out to be
    very useful in a fair amount of driver code - a fair number of drivers
    were implementing this in their own roundabout ways, often painfully.

    Note that this means it's no longer possible to call bio_endio() more
    than once on the same bio! This can cause problems for drivers that
    save/restore bi_end_io. Arguably they shouldn't be saving/restoring
    bi_end_io at all
    - in all but the simplest cases they'd be better off just cloning the
    bio, and immutable biovecs is making bio cloning cheaper. But for now,
    we add a bio_endio_nodec() for these cases.
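
    A hedged usage sketch of the chaining primitive this adds:

      struct bio *split = bio_alloc(GFP_NOIO, 0);

      bio_chain(split, parent); /* parent won't complete until split does */
      submit_bio(rw, split);
      /* parent's bi_end_io now runs only after both bios complete */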

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe

    Kent Overstreet
     
  • Immutable biovecs are going to require an explicit iterator. To
    implement immutable bvecs, a later patch is going to add a bi_bvec_done
    member to this struct; for now, this patch effectively just renames
    things.
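
    The rename in sketch form (bi_bvec_done is the member a later patch
    adds):

      struct bvec_iter {
              sector_t        bi_sector; /* was bio->bi_sector */
              unsigned int    bi_size;   /* was bio->bi_size */
              unsigned int    bi_idx;    /* was bio->bi_idx */
      };

    so callers now write bio->bi_iter.bi_sector instead of bio->bi_sector.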

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Geert Uytterhoeven
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Ed L. Cashin"
    Cc: Nick Piggin
    Cc: Lars Ellenberg
    Cc: Jiri Kosina
    Cc: Matthew Wilcox
    Cc: Geoff Levand
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Joshua Morris
    Cc: Philip Kelleher
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux390@de.ibm.com
    Cc: Boaz Harrosh
    Cc: Benny Halevy
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: "Nicholas A. Bellinger"
    Cc: Alexander Viro
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: Prasad Joshi
    Cc: Trond Myklebust
    Cc: KONISHI Ryusuke
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: xfs@oss.sgi.com
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Len Brown
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Herton Ronaldo Krzesinski
    Cc: Ben Hutchings
    Cc: Andrew Morton
    Cc: Guo Chao
    Cc: Tejun Heo
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Wei Yongjun
    Cc: "Roger Pau Monné"
    Cc: Jan Beulich
    Cc: Stefano Stabellini
    Cc: Ian Campbell
    Cc: Sebastian Ott
    Cc: Christian Borntraeger
    Cc: Minchan Kim
    Cc: Jiang Liu
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Joe Perches
    Cc: Peng Tao
    Cc: Andy Adamson
    Cc: fanchaoting
    Cc: Jie Liu
    Cc: Sunil Mushran
    Cc: "Martin K. Petersen"
    Cc: Namjae Jeon
    Cc: Pankaj Kumar
    Cc: Dan Magenheimer
    Cc: Mel Gorman

    Kent Overstreet
     

13 Nov, 2013

1 commit


12 Nov, 2013

3 commits

  • Cache block invalidation is removing an entry from the cache without
    writing it back. Cache blocks can be invalidated via the
    'invalidate_cblocks' message, which takes an arbitrary number of cblock
    ranges:
    invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*

    E.g.
    dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Joe Thornber
     
  • "Passthrough" is a dm-cache operating mode (like writethrough or
    writeback) which is intended to be used when the cache contents are not
    known to be coherent with the origin device. It behaves as follows:

    * All reads are served from the origin device (all reads miss the cache)
    * All writes are forwarded to the origin device; additionally, write
    hits cause cache block invalidates

    This mode decouples cache coherency checks from cache device creation,
    largely to avoid having to perform coherency checks while booting. Boot
    scripts can create cache devices in passthrough mode and put them into
    service (mount cached filesystems, for example) without having to worry
    about coherency. Coherency that exists is maintained, although the
    cache will gradually cool as writes take place.

    Later, applications can perform coherency checks, the nature of which
    will depend on the type of the underlying storage. If coherency can be
    verified, the cache device can be transitioned to writethrough or
    writeback mode while still warm; otherwise, the cache contents can be
    discarded prior to transitioning to the desired operating mode.
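
    A hypothetical table line illustrating passthrough at creation time
    (device names and sizes invented):

      dmsetup create my_cache --table \
        "0 409600 cache /dev/mapper/meta /dev/mapper/fast /dev/mapper/slow \
         512 1 passthrough default 0"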

    Signed-off-by: Joe Thornber
    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Morgan Mears
    Signed-off-by: Mike Snitzer

    Joe Thornber
     
  • Allow a cache to shrink if the blocks being removed from the cache are
    not dirty.
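
    A hedged sketch of the check this implies (names are assumptions):

      static bool can_resize(struct cache *cache, dm_cblock_t new_size)
      {
              while (from_cblock(new_size) < from_cblock(cache->cache_size)) {
                      if (is_dirty(cache, new_size)) {
                              DMERR("unable to shrink cache; cache block %llu is dirty",
                                    (unsigned long long) from_cblock(new_size));
                              return false;
                      }
                      new_size = to_cblock(from_cblock(new_size) + 1);
              }

              return true;
      }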

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Joe Thornber
     

10 Nov, 2013

8 commits