21 Dec, 2018

1 commit

  • commit 687cf4412a343a63928a5c9d91bdc0f522939d43 upstream.

    Verify the cache has blocks in blocks_are_clean_separate_dirty() before
    calling dm_bitset_cursor_begin(); otherwise it returns -ENODATA. Other
    calls to dm_bitset_cursor_begin() have similar negative checks.

    Fixes inability to create a cache in passthrough mode (even though doing
    so makes no sense).
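
    A sketch of the shape of the check in blocks_are_clean_separate_dirty()
    (field names follow the dm-cache metadata2 code; abbreviated, not the
    verbatim patch):

        if (from_cblock(cmd->cache_blocks) == 0)
            /* Nothing to do: an empty cache has no dirty bits to scan. */
            return 0;

        r = dm_bitset_cursor_begin(&cmd->dirty_info, cmd->dirty_root,
                                   from_cblock(cmd->cache_blocks),
                                   &cmd->dirty_cursor);
        if (r) {
            DMERR("%s: dm_bitset_cursor_begin for dirty failed", __func__);
            return r;
        }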

    Fixes: 0d963b6e65 ("dm cache metadata: fix metadata2 format's blocks_are_clean_separate_dirty")
    Cc: stable@vger.kernel.org
    Reported-by: David Teigland
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

13 Oct, 2018

1 commit

  • commit 4561ffca88c546f96367f94b8f1e4715a9c62314 upstream.

    Commit fd2fa9541 ("dm cache metadata: save in-core policy_hint_size to
    on-disk superblock") enabled previously written policy hints to be
    used after a cache is reactivated. But in doing so the cache
    metadata's hint array was left exposed to out-of-bounds access because
    on resize the metadata's on-disk hint array was never extended.

    Fix this by ignoring that there are no on-disk hints associated with the
    newly added cache blocks. An expanded on-disk hint array is later
    rewritten upon the next clean shutdown of the cache.
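
    The essence of the fix, sketched against the cursor-based loading loop
    in __load_mappings() (illustrative rather than the verbatim diff):

        if (hints_valid) {
            r = dm_array_cursor_next(&cmd->hint_cursor);
            if (r) {
                /*
                 * Ran off the end of a hint array that predates the
                 * resize: stop consuming hints instead of erroring out.
                 */
                dm_array_cursor_end(&cmd->hint_cursor);
                hints_valid = false;
            }
        }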

    Fixes: fd2fa9541 ("dm cache metadata: save in-core policy_hint_size to on-disk superblock")
    Cc: stable@vger.kernel.org
    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Joe Thornber
     

10 Sep, 2018

2 commits

  • commit 5b1fe7bec8a8d0cc547a22e7ddc2bd59acd67de4 upstream.

    Quoting Documentation/device-mapper/cache.txt:

    The 'dirty' state for a cache block changes far too frequently for us
    to keep updating it on the fly. So we treat it as a hint. In normal
    operation it will be written when the dm device is suspended. If the
    system crashes all cache blocks will be assumed dirty when restarted.

    This got broken in commit f177940a8091 ("dm cache metadata: switch to
    using the new cursor api for loading metadata") in 4.9, which removed
    the code that consulted cmd->clean_when_opened (CLEAN_SHUTDOWN on-disk
    flag) when loading cache blocks. This results in data corruption on an
    unclean shutdown with dirty cache blocks on the fast device. After the
    crash those blocks are considered clean and may get evicted from the
    cache at any time. This can be demonstrated by doing a lot of reads
    to trigger individual evictions, but uncache is more predictable:

    ### Disable auto-activation in lvm.conf to be able to do uncache in
    ### time (i.e. see uncache doing flushing) when the fix is applied.

    # xfs_io -d -c 'pwrite -b 4M -S 0xaa 0 1G' /dev/vdb
    # vgcreate vg_cache /dev/vdb /dev/vdc
    # lvcreate -L 1G -n lv_slowdev vg_cache /dev/vdb
    # lvcreate -L 512M -n lv_cachedev vg_cache /dev/vdc
    # lvcreate -L 256M -n lv_metadev vg_cache /dev/vdc
    # lvconvert --type cache-pool --cachemode writeback vg_cache/lv_cachedev --poolmetadata vg_cache/lv_metadev
    # lvconvert --type cache vg_cache/lv_slowdev --cachepool vg_cache/lv_cachedev
    # xfs_io -d -c 'pwrite -b 4M -S 0xbb 0 512M' /dev/mapper/vg_cache-lv_slowdev
    # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
    0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    # dmsetup status vg_cache-lv_slowdev
    0 2097152 cache 8 27/65536 128 8192/8192 1 100 0 0 0 8192 7065 2 metadata2 writeback 2 migration_threshold 2048 smq 0 rw -
    ^^^^
    7065 * 64k = 441M yet to be written to the slow device
    # echo b >/proc/sysrq-trigger

    # vgchange -ay vg_cache
    # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
    0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    # lvconvert --uncache vg_cache/lv_slowdev
    Flushing 0 blocks for cache vg_cache/lv_slowdev.
    Logical volume "lv_cachedev" successfully removed
    Logical volume vg_cache/lv_slowdev is not cached.
    # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
    0fe00000: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ................
    0fe00010: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ................

    This is the case with both v1 and v2 cache pool metadata formats.

    After applying this patch:

    # vgchange -ay vg_cache
    # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
    0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    # lvconvert --uncache vg_cache/lv_slowdev
    Flushing 3724 blocks for cache vg_cache/lv_slowdev.
    ...
    Flushing 71 blocks for cache vg_cache/lv_slowdev.
    Logical volume "lv_cachedev" successfully removed
    Logical volume vg_cache/lv_slowdev is not cached.
    # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
    0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
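
    The substance of the fix, roughly: when loading mappings, only trust the
    per-block dirty flag if the CLEAN_SHUTDOWN flag was set when the metadata
    was opened (a sketch, not the verbatim patch):

        /* in __load_mapping_v1(), for each on-disk mapping: */
        bool dirty = true;    /* after a crash, assume the block is dirty */

        if (cmd->clean_when_opened)
            dirty = flags & M_DIRTY;    /* dirty bits are only meaningful
                                           after a clean shutdown */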

    Cc: stable@vger.kernel.org
    Fixes: f177940a8091 ("dm cache metadata: switch to using the new cursor api for loading metadata")
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     
  • commit fd2fa95416188a767a63979296fa3e169a9ef5ec upstream.

    policy_hint_size starts as 0 during __write_initial_superblock(). It
    isn't until the policy is loaded that policy_hint_size is set in-core
    (cmd->policy_hint_size). But it never got recorded in the on-disk
    superblock because __commit_transaction() didn't transfer the in-core
    cmd->policy_hint_size to the on-disk superblock.

    The in-core cmd->policy_hint_size gets initialized by metadata_open()'s
    __begin_transaction_flags() which re-reads all superblock fields.
    Because the superblock's policy_hint_size was never properly stored when
    the cache was created, hints_array_available() would always return false
    when re-activating a previously created cache. This means
    __load_mappings() always considered the hints invalid and never made use
    of them (hints exist to optimize the policy after reactivation).

    Another detrimental side-effect of this oversight was that the cache_check
    utility would fail with: "invalid hint width: 0"
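
    The fix is essentially a one-liner in __commit_transaction(), alongside
    the other superblock fields being serialized (a sketch based on the
    surrounding dm-cache-metadata code):

        disk_super->policy_version[2] = cpu_to_le32(cmd->policy_version[2]);
        disk_super->policy_hint_size = cpu_to_le32(cmd->policy_hint_size);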

    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

06 May, 2017

1 commit


04 May, 2017

1 commit

  • …/device-mapper/linux-dm

    Pull device mapper updates from Mike Snitzer:

    - A major update for DM cache that reduces the latency for deciding
    whether blocks should migrate to/from the cache. The bio-prison-v2
    interface supports this improvement by enabling direct dispatch of
    work to workqueues rather than having to delay the actual work
    dispatch to the DM cache core. So the dm-cache policies are much more
    nimble by being able to drive IO as they see fit. One immediate
    benefit from the improved latency is a cache that should be much more
    adaptive to changing workloads.

    - Add a new DM integrity target that emulates a block device that has
    additional per-sector tags that can be used for storing integrity
    information.

    - Add a new authenticated encryption feature to the DM crypt target
    that builds on the capabilities provided by the DM integrity target.

    - Add MD interface for switching the raid4/5/6 journal mode and update
    the DM raid target to use it to enable raid4/5/6 journal write-back
    support.

    - Switch the DM verity target over to using the asynchronous hash
    crypto API (this helps work better with architectures that have
    access to off-CPU algorithm providers, which should reduce CPU
    utilization).

    - Various request-based DM and DM multipath fixes and improvements from
    Bart and Christoph.

    - A DM thinp target fix for a bio structure leak that occurs for each
    discard IFF discard passdown is enabled.

    - A fix for a possible deadlock in DM bufio and a fix to re-check the
    new buffer allocation watermark in the face of competing admin
    changes to the 'max_cache_size_bytes' tunable.

    - A couple DM core cleanups.

    * tag 'for-4.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (50 commits)
    dm bufio: check new buffer allocation watermark every 30 seconds
    dm bufio: avoid a possible ABBA deadlock
    dm mpath: make it easier to detect unintended I/O request flushes
    dm mpath: cleanup QUEUE_IF_NO_PATH bit manipulation by introducing assign_bit()
    dm mpath: micro-optimize the hot path relative to MPATHF_QUEUE_IF_NO_PATH
    dm: introduce enum dm_queue_mode to cleanup related code
    dm mpath: verify __pg_init_all_paths locking assumptions at runtime
    dm: verify suspend_locking assumptions at runtime
    dm block manager: remove an unused argument from dm_block_manager_create()
    dm rq: check blk_mq_register_dev() return value in dm_mq_init_request_queue()
    dm mpath: delay requeuing while path initialization is in progress
    dm mpath: avoid that path removal can trigger an infinite loop
    dm mpath: split and rename activate_path() to prepare for its expanded use
    dm ioctl: prevent stack leak in dm ioctl call
    dm integrity: use previously calculated log2 of sectors_per_block
    dm integrity: use hex2bin instead of open-coded variant
    dm crypt: replace custom implementation of hex2bin()
    dm crypt: remove obsolete references to per-CPU state
    dm verity: switch to using asynchronous hash crypto API
    dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
    ...

    Linus Torvalds
     

28 Apr, 2017

1 commit


21 Mar, 2017

1 commit


17 Feb, 2017

5 commits


21 Nov, 2016

1 commit


22 Sep, 2016

2 commits


17 Apr, 2016

1 commit

  • Commit 9567366fefdd ("dm cache metadata: fix READ_LOCK macros and
    cleanup WRITE_LOCK macros") uses down_write() instead of down_read() in
    cmd_read_lock(), yet up_read() is used to release the lock in
    READ_UNLOCK(). Fix it.
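
    The fixed helper, as a sketch (matching the macros quoted above):

        static bool cmd_read_lock(struct dm_cache_metadata *cmd)
        {
            down_read(&cmd->root_lock);    /* was mistakenly down_write() */
            if (cmd->fail_io) {
                up_read(&cmd->root_lock);  /* now pairs with down_read() */
                return false;
            }
            return true;
        }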

    Fixes: 9567366fefdd ("dm cache metadata: fix READ_LOCK macros and cleanup WRITE_LOCK macros")
    Cc: stable@vger.kernel.org
    Signed-off-by: Ahmed Samy
    Signed-off-by: Mike Snitzer

    Ahmed Samy
     

15 Apr, 2016

1 commit

  • The READ_LOCK macro was incorrectly returning -EINVAL if
    dm_bm_is_read_only() was true -- it will always be true once the cache
    metadata transitions to read-only via dm_cache_metadata_set_read_only().

    Wrap READ_LOCK and WRITE_LOCK multi-statement macros in do {} while(0).
    Also, all accesses of the 'cmd' argument passed to these related macros
    are now encapsulated in parentheses.
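
    For illustration, the resulting shape of the macro (a sketch; the
    WRITE_LOCK variants follow the same pattern with a write semaphore):

        #define READ_LOCK(cmd) \
            do { \
                if (!cmd_read_lock((cmd))) \
                    return -EINVAL; \
            } while (0)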

    A follow-up patch can be developed to eliminate the use of macros in
    favor of pure C code. Avoiding that now given that this needs to apply
    to stable@.

    Reported-by: Ben Hutchings
    Signed-off-by: Mike Snitzer
    Fixes: d14fcf3dd79 ("dm cache: make sure every metadata function checks fail_io")
    Cc: stable@vger.kernel.org

    Mike Snitzer
     

11 Mar, 2016

1 commit


05 Nov, 2015

1 commit

  • Pull device mapper updates from Mike Snitzer:
    "Smaller set of DM changes for this merge. I've based these changes on
    Jens' for-4.4/reservations branch because the associated DM changes
    required it.

    - Revert a dm-multipath change that caused a regression for
    unprivileged users (e.g. kvm guests) that issued ioctls when a
    multipath device had no available paths.

    - Include Christoph's refactoring of DM's ioctl handling and add
    support for passing through persistent reservations with DM
    multipath.

    - All other changes are very simple cleanups"

    * tag 'dm-4.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dm switch: simplify conditional in alloc_region_table()
    dm delay: document that offsets are specified in sectors
    dm delay: capitalize the start of an delay_ctr() error message
    dm delay: Use DM_MAPIO macros instead of open-coded equivalents
    dm linear: remove redundant target name from error messages
    dm persistent data: eliminate unnecessary return values
    dm: eliminate unused "bioset" process for each bio-based DM device
    dm: convert ffs to __ffs
    dm: drop NULL test before kmem_cache_destroy() and mempool_destroy()
    dm: add support for passing through persistent reservations
    dm: refactor ioctl handling
    Revert "dm mpath: fix stalls when handling invalid ioctls"
    dm: initialize non-blk-mq queue data before queue is used

    Linus Torvalds
     

01 Nov, 2015

1 commit


24 Oct, 2015

1 commit

  • If the CLEAN_SHUTDOWN flag is not set when a cache is loaded then all cache
    blocks are marked as dirty and a full writeback occurs.

    __commit_transaction() is responsible for setting/clearing
    CLEAN_SHUTDOWN (based on the flags_mutator that is passed in).

    Fix this issue of the cache's on-disk flags being wrong by making sure
    __commit_transaction() does not reset the flags after the mutator has
    altered the flags in preparation for them being serialized to disk.

    before:

    sb_flags = mutator(le32_to_cpu(disk_super->flags));
    disk_super->flags = cpu_to_le32(sb_flags);
    disk_super->flags = cpu_to_le32(cmd->flags);

    after:

    disk_super->flags = cpu_to_le32(cmd->flags);
    sb_flags = mutator(le32_to_cpu(disk_super->flags));
    disk_super->flags = cpu_to_le32(sb_flags);

    Reported-by: Bogdan Vasiliev
    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org

    Joe Thornber
     

12 Jun, 2015

1 commit

  • If a cache metadata operation fails (e.g. transaction commit) the
    cache's metadata device will abort the current transaction, set a new
    needs_check flag, and the cache will transition to "read-only" mode. If
    aborting the transaction or setting the needs_check flag fails the cache
    will transition to "fail-io" mode.

    Once needs_check is set the cache device will not be allowed to
    activate. Activation requires write access to metadata. Future work is
    needed to add proper support for running the cache in read-only mode.

    Once in fail-io mode the cache will report a status of "Fail".

    Also, add a commit() wrapper that disallows commits while in read-only
    or fail mode.
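
    A sketch of such a wrapper (get_cache_mode() and the CM_* mode enum are
    assumed from dm-cache-target; illustrative, not the verbatim patch):

        static int commit(struct cache *cache, bool clean_shutdown)
        {
            if (get_cache_mode(cache) >= CM_READ_ONLY)
                return -EINVAL;    /* no commits in read-only or fail mode */

            return dm_cache_commit(cache->cmd, clean_shutdown);
        }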

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Joe Thornber
     

28 Jan, 2015

1 commit

  • Commit 9b1cc9f251 ("dm cache: share cache-metadata object across
    inactive and active DM tables") mistakenly ignored the use of ERR_PTR
    returns. Restore missing IS_ERR checks and ERR_PTR returns where
    appropriate.
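
    The pattern being restored, sketched (argument names illustrative;
    lookup_or_open() is the internal helper that can hand back an ERR_PTR):

        cmd = lookup_or_open(bdev, data_block_size, may_format_device,
                             policy_hint_size);
        if (IS_ERR(cmd))
            return cmd;    /* propagate the ERR_PTR, never dereference it */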

    Reported-by: Dan Carpenter
    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org

    Joe Thornber
     

23 Jan, 2015

1 commit

  • If a DM table is reloaded with an inactive table when the device is not
    suspended (normal procedure for LVM2), then there will be two dm-bufio
    objects that can diverge. This can lead to a situation where the
    inactive table uses bufio to read metadata at the same time the active
    table writes metadata -- resulting in the inactive table having stale
    metadata buffers once it is promoted to the active table slot.

    Fix this by using reference counting and a global list of cache metadata
    objects to ensure there is only one metadata object per metadata device.
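
    A sketch of the lookup side of that scheme (a mutex-protected global
    list plus a per-object reference count; abbreviated):

        static DEFINE_MUTEX(table_lock);
        static LIST_HEAD(table);

        /* Caller must hold table_lock. */
        static struct dm_cache_metadata *lookup(struct block_device *bdev)
        {
            struct dm_cache_metadata *cmd;

            list_for_each_entry(cmd, &table, list)
                if (cmd->bdev == bdev) {
                    atomic_inc(&cmd->ref_count);
                    return cmd;
                }

            return NULL;    /* caller opens a fresh object and lists it */
        }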

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org

    Joe Thornber
     

11 Nov, 2014

1 commit


02 Aug, 2014

1 commit

  • Commit 7d48935e cleaned up the persistent-data's space-map-metadata
    limits by elevating them to dm-space-map-metadata.h. Update
    dm-cache-metadata to use these same limits.

    The calculation for DM_CACHE_METADATA_MAX_SECTORS didn't account for the
    size of the disk_bitmap_header, so the supported maximum metadata size
    is slightly smaller (reduced from 33423360 to 33292800 sectors).
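
    The arithmetic, assuming the limits as defined in
    dm-space-map-metadata.h:

        DM_SM_METADATA_BLOCK_SIZE     = 4096 bytes = 8 sectors
        DM_SM_METADATA_MAX_BLOCKS     = 255 * ((1 << 14) - 64) = 4161600
        DM_CACHE_METADATA_MAX_SECTORS = 4161600 * 8 = 33292800 sectors

        versus the old limit of 255 * (1 << 14) * 8 = 33423360 sectors.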

    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber

    Mike Snitzer
     

16 Jul, 2014

1 commit


05 Apr, 2014

1 commit

  • When suspending a cache the policy is walked and the individual policy
    hints written to the metadata via sync_metadata(). This led to this
    lock order:

    policy->lock
    cache_metadata->root_lock

    When loading the cache target the policy is populated while the metadata
    lock is held:

    cache_metadata->root_lock
    policy->lock

    Fix this potential lock-inversion (ABBA) deadlock in sync_metadata() by
    ensuring the cache_metadata root_lock is held whilst all the hints are
    written, rather than being repeatedly locked while policy->lock is held
    (as was the case with each callout that policy_walk_mappings() made to
    the old save_hint() method).
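
    A sketch of the corrected shape of the hint dump (helper names are
    hypothetical; the point is the single outer root_lock):

        int dm_cache_write_hints(struct dm_cache_metadata *cmd,
                                 struct dm_cache_policy *policy)
        {
            int r;

            down_write(&cmd->root_lock);    /* taken once, around all hints */
            /*
             * __write_hints() may take policy->lock internally, so the
             * order is root_lock -> policy->lock, matching the load path.
             */
            r = __write_hints(cmd, policy);
            up_write(&cmd->root_lock);

            return r;
        }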

    Found by turning on the CONFIG_PROVE_LOCKING ("Lock debugging: prove
    locking correctness") build option. However, it is not clear how the
    LOCKDEP reported paths can lead to a deadlock since the two paths,
    suspending a target and loading a target, never occur at the same time.
    But that doesn't mean the same lock-inversion couldn't have occurred
    elsewhere.

    Reported-by: Marian Csontos
    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org

    Joe Thornber
     

28 Mar, 2014

3 commits

  • In theory copying the space map root can fail, but in practice it never
    does because we're careful to check what size buffer is needed.

    But make certain we're able to copy the space map roots before
    locking the superblock.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org # drop dm-era and dm-cache changes as needed

    Joe Thornber
     
  • The persistent-data library used by dm-thin, dm-cache, etc is
    transactional. If anything goes wrong, such as an io error when writing
    new metadata or a power failure, then we roll back to the last
    transaction.

    Atomicity when committing a transaction is achieved by:

    a) Never overwriting data from the previous transaction.
    b) Writing the superblock last, after all other metadata has hit the
    disk.

    This commit and the following commit ("dm: take care to copy the space
    map roots before locking the superblock") fix a bug associated with (b).
    When committing it was possible for the superblock to still be written
    in spite of an io error occurring during the preceding metadata flush.
    With these commits we're careful not to take the write lock out on the
    superblock until after the metadata flush has completed.

    Change the transaction manager's semantics for dm_tm_commit() to assume
    all data has been flushed _before_ the single superblock that is passed
    in.

    As a prerequisite, split the block manager's block unlocking and
    flushing by simplifying dm_bm_flush_and_unlock() to dm_bm_flush(). Now
    the unlocking must be done separately.
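
    In outline, a commit under the new semantics looks like this (a sketch
    using the persistent-data API names mentioned above; superblock_lock()
    is assumed from the dm-cache metadata code):

        r = dm_tm_pre_commit(cmd->tm);    /* flushes all metadata except
                                             the superblock */
        if (r)
            return r;

        r = superblock_lock(cmd, &sblock);    /* only now take the write
                                                 lock on the superblock */
        if (r)
            return r;

        /* ... fill in the superblock fields ... */

        return dm_tm_commit(cmd->tm, sblock);    /* writes and flushes the
                                                    superblock, then unlocks */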

    This issue was discovered by forcing io errors at the crucial time
    using dm-flakey.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org

    Joe Thornber
     
  • Discard block size not being equal to cache block size causes data
    corruption by erroneously avoiding migrations in issue_copy() because
    the discard state is being cleared for a group of cache blocks when it
    should not be.

    Completely remove all code that enabled a distinction between the
    cache block size and discard block size.

    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Mike Snitzer

    Heinz Mauelshagen
     

12 Nov, 2013

3 commits

  • Need to check the version to verify on-disk metadata is supported.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Joe Thornber
     
  • "Passthrough" is a dm-cache operating mode (like writethrough or
    writeback) which is intended to be used when the cache contents are not
    known to be coherent with the origin device. It behaves as follows:

    * All reads are served from the origin device (all reads miss the cache)
    * All writes are forwarded to the origin device; additionally, write
    hits cause cache block invalidates

    This mode decouples cache coherency checks from cache device creation,
    largely to avoid having to perform coherency checks while booting. Boot
    scripts can create cache devices in passthrough mode and put them into
    service (mount cached filesystems, for example) without having to worry
    about coherency. Coherency that exists is maintained, although the
    cache will gradually cool as writes take place.

    Later, applications can perform coherency checks, the nature of which
    will depend on the type of the underlying storage. If coherency can be
    verified, the cache device can be transitioned to writethrough or
    writeback mode while still warm; otherwise, the cache contents can be
    discarded prior to transitioning to the desired operating mode.
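
    Sketched as bio-remapping logic (helper names here are hypothetical and
    only illustrate the mode's behaviour):

        if (passthrough_mode(cache)) {
            if (bio_data_dir(bio) == WRITE && cache_hit) {
                /* write hit: drop the stale cached copy first */
                invalidate_cblock(cache, cblock);
            }
            /* every read misses; all io is served by the origin */
            remap_to_origin(cache, bio);
        }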

    Signed-off-by: Joe Thornber
    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Morgan Mears
    Signed-off-by: Mike Snitzer

    Joe Thornber
     
  • Allow a cache to shrink if the blocks being removed from the cache are
    not dirty.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Joe Thornber
     

10 Nov, 2013

1 commit


10 May, 2013

1 commit


21 Mar, 2013

3 commits

  • When reading the dm cache metadata from disk, ignore the policy hints
    unless they were generated by the same major version number of the same
    policy module.

    The hints are considered to be private data belonging to the specific
    module that generated them and there is no requirement for them to make
    sense to different versions of the policy that generated them.
    Policy modules are all required to work fine if no previous hints are
    supplied (or if existing hints are lost).
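
    A sketch of the gating check, close to how dm-cache-metadata decides
    whether stored hints may be reused (only the major number must match):

        static bool policy_unchanged(struct dm_cache_metadata *cmd,
                                     struct dm_cache_policy *policy)
        {
            const char *name = dm_cache_policy_get_name(policy);
            const unsigned *version = dm_cache_policy_get_version(policy);

            if (strncmp(cmd->policy_name, name, sizeof(cmd->policy_name)))
                return false;

            /* only the major version number need match */
            if (cmd->policy_version[0] != version[0])
                return false;

            return true;
        }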

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Separate dm cache policy version string into 3 unsigned numbers
    corresponding to major, minor and patchlevel and store them at the end
    of the on-disk metadata so we know which version of the policy generated
    the hints in case a future version wants to use them differently.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • When writing the dirty bitset to the metadata device on a clean
    shutdown, clear the dirty bits. Previously they were left indicating
    the cache was dirty. This led to confusion about whether there really
    was dirty data in the cache or not. (This was a harmless bug.)

    Reported-by: Darrick J. Wong
    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Joe Thornber