01 Oct, 2020

3 commits

  • commit ee1dfad5325ff1cfb2239e564cd411b3bfe8667a upstream.

    dm_queue_split() is removed because __split_and_process_bio() _must_
    handle splitting bios to ensure proper bio submission and completion
    ordering as a bio is split.

    Otherwise, multiple recursive calls to ->submit_bio will cause multiple
    split bios to be allocated from the same ->bio_split mempool at the same
    time. This would result in deadlock in low memory conditions because no
    progress could be made (only one bio is available in ->bio_split
    mempool).
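
    For context, a minimal sketch of the split-once-and-resubmit pattern
    this relies on (illustrative, not the verbatim DM code; 'q' and
    'max_sectors' are assumptions):

    struct bio *split = bio_split(bio, max_sectors, GFP_NOIO, &q->bio_split);

    if (split) {
            bio_chain(split, bio);    /* original 'bio' completes only after 'split' */
            submit_bio_noacct(bio);   /* remainder re-enters the submission path */
            bio = split;              /* continue processing only the front split */
    }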

    This fix has been verified to preserve the performance gain, from
    avoiding excess splitting, that commit 120c9257f5f1 provided.

    Fixes: 120c9257f5f1 ("Revert "dm: always call blk_queue_split() in dm_process_bio()"")
    Cc: stable@vger.kernel.org # 5.0+, requires custom backport due to 5.9 changes
    Reported-by: Ming Lei
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     
  • [ Upstream commit 34cf78bf34d48dddddfeeadb44f9841d7864997a ]

    This patch fixes a lost wake-up problem caused by the race between
    mca_cannibalize_lock and bch_cannibalize_unlock.

    Consider two processes, A and B. Process A is executing
    mca_cannibalize_lock while process B holds c->btree_cache_alloc_lock
    and is executing bch_cannibalize_unlock. Process A executes cmpxchg
    and is about to call prepare_to_wait. In this timeslice process B
    executes wake_up; only then does process A call prepare_to_wait and
    set its state to TASK_INTERRUPTIBLE. Process A then goes to sleep,
    and no one will ever wake it up. This problem may cause the bcache
    device to hang.
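
    As a sketch, the lost-wakeup-safe ordering looks like this (bcache
    field names quoted from memory, not from the patch itself):

    DEFINE_WAIT(wait);

    /* Publish the waiter before the final lock attempt, so that a
     * concurrent wake_up() cannot slip between the check and the sleep. */
    prepare_to_wait(&c->btree_cache_wait, &wait, TASK_INTERRUPTIBLE);
    if (cmpxchg(&c->btree_cache_alloc_lock, NULL, current) != NULL)
            schedule();
    finish_wait(&c->btree_cache_wait, &wait);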

    Signed-off-by: Guoju Fang
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Guoju Fang
     
  • [ Upstream commit 6ba01df72b4b63a26b4977790f58d8f775d2992c ]

    Partitioned request-based devices cannot be used as underlying devices
    for request-based DM because no partition offsets are added to each
    incoming request. As such, until now, stacking on partitioned devices
    would _always_ result in data corruption (e.g. wiping the partition
    table, writing to other partitions, etc). Fix this by disallowing
    request-based stacking on partitions.
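
    A sketch of such a check in the dm-table device iterator (helper name
    follows upstream; details abridged):

    static int device_is_rq_stackable(struct dm_target *ti, struct dm_dev *dev,
                                      sector_t start, sector_t len, void *data)
    {
            struct block_device *bdev = dev->bdev;

            /* request-based DM adds no partition offset: reject partitions */
            if (bdev->bd_partno)
                    return false;

            return queue_is_mq(bdev_get_queue(bdev));
    }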

    While at it, since all .request_fn support has been removed from block
    core, remove legacy dm-table code that differentiated between blk-mq and
    .request_fn request-based.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Mike Snitzer
     

23 Sep, 2020

2 commits

  • commit e2ec5128254518cae320d5dc631b71b94160f663 upstream.

    DM was calling generic_fsdax_supported() to determine whether a device
    referenced in the DM table supports DAX. However, that is a helper for
    "leaf" device drivers, so that they don't have to duplicate common
    generic checks. High-level code should call the dax_supported() helper,
    which dispatches to the appropriate helper for the particular device.
    This problem manifested itself as kernel messages:

    dm-3: error: dax access failed (-95)

    when lvm2-testsuite run in cases where a DM device was stacked on top of
    another DM device.

    Fixes: 7bf7eac8d648 ("dax: Arrange for dax_supported check to span multiple devices")
    Tested-by: Adrian Huang
    Signed-off-by: Jan Kara
    Acked-by: Mike Snitzer
    Reported-by: kernel test robot
    Link: https://lore.kernel.org/r/160061715195.13131.5503173247632041975.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 02186d8897d49b0afd3c80b6cf23437d91024065 upstream.

    A recent fix to the dm_dax_supported() flow uncovered a latent bug. When
    dm_get_live_table() fails it is still required to drop the
    srcu_read_lock(). Without this change the lvm2 test-suite triggers this
    warning:

    # lvm2-testsuite --only pvmove-abort-all.sh

    WARNING: lock held when returning to user space!
    5.9.0-rc5+ #251 Tainted: G OE
    ------------------------------------------------
    lvm/1318 is leaving the kernel with locks still held!
    1 lock held by lvm/1318:
    #0: ffff9372abb5a340 (&md->io_barrier){....}-{0:0}, at: dm_get_live_table+0x5/0xb0 [dm_mod]

    ...and later on this hang signature:

    INFO: task lvm:1344 blocked for more than 122 seconds.
    Tainted: G OE 5.9.0-rc5+ #251
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    task:lvm state:D stack: 0 pid: 1344 ppid: 1 flags:0x00004000
    Call Trace:
    __schedule+0x45f/0xa80
    ? finish_task_switch+0x249/0x2c0
    ? wait_for_completion+0x86/0x110
    schedule+0x5f/0xd0
    schedule_timeout+0x212/0x2a0
    ? __schedule+0x467/0xa80
    ? wait_for_completion+0x86/0x110
    wait_for_completion+0xb0/0x110
    __synchronize_srcu+0xd1/0x160
    ? __bpf_trace_rcu_utilization+0x10/0x10
    __dm_suspend+0x6d/0x210 [dm_mod]
    dm_suspend+0xf6/0x140 [dm_mod]
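
    The required pattern, as a sketch (dm_get_live_table() and
    dm_put_live_table() are the real DM helpers; the function body is
    abridged):

    map = dm_get_live_table(md, &srcu_idx);
    if (!map)
            goto out;                       /* previously returned with the lock held */

    ret = dm_table_supports_dax(map, blocksize);    /* args abridged */
    out:
    dm_put_live_table(md, srcu_idx);        /* drop the SRCU read lock on every path */
    return ret;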

    Fixes: 7bf7eac8d648 ("dax: Arrange for dax_supported check to span multiple devices")
    Cc: Jan Kara
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Reported-by: Adrian Huang
    Reviewed-by: Ira Weiny
    Tested-by: Adrian Huang
    Link: https://lore.kernel.org/r/160045867590.25663.7548541079217827340.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

10 Sep, 2020

7 commits

  • commit 3a653b205f29b3f9827a01a0c88bfbcb0d169494 upstream.

    The following error occurred when testing disk online/offline:

    [ 301.798344] device-mapper: thin: 253:5: aborting current metadata transaction
    [ 301.848441] device-mapper: thin: 253:5: failed to abort metadata transaction
    [ 301.849206] Aborting journal on device dm-26-8.
    [ 301.850489] EXT4-fs error (device dm-26) in __ext4_new_inode:943: Journal has aborted
    [ 301.851095] EXT4-fs (dm-26): Delayed block allocation failed for inode 398742 at logical offset 181 with max blocks 19 with error 30
    [ 301.854476] BUG: KASAN: use-after-free in dm_bm_set_read_only+0x3a/0x40 [dm_persistent_data]

    Reason is:

    metadata_operation_failed
      abort_transaction
        dm_pool_abort_metadata
          __create_persistent_data_objects
            r = __open_or_format_metadata
            if (r) --> on failure pmd->bm is freed but not set to NULL
              dm_block_manager_destroy(pmd->bm);
    set_pool_mode
      dm_pool_metadata_read_only(pool->pmd);
        dm_bm_set_read_only(pmd->bm); --> use-after-free

    Add checks to see if pmd->bm is NULL in dm_bm_set_read_only and
    dm_bm_set_read_write functions. If bm is NULL it means creating the
    bm failed and so dm_bm_is_read_only must return true.
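
    A sketch of the guarded helpers (struct internals abridged; the
    read_only field is an assumption):

    void dm_bm_set_read_only(struct dm_block_manager *bm)
    {
            if (bm)                 /* bm may be NULL if creation failed */
                    bm->read_only = true;
    }

    bool dm_bm_is_read_only(struct dm_block_manager *bm)
    {
            /* a missing block manager must be reported as read-only */
            return !bm || bm->read_only;
    }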

    Signed-off-by: Ye Bin
    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Ye Bin
     
  • commit 219403d7e56f9b716ad80ab87db85d29547ee73e upstream.

    A caller of __create_persistent_data_objects() may use the PTR_ERR
    value left behind as a valid pointer, which leads to undefined
    behavior.
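
    A sketch of the hazard and the fix (names follow dm-thin-metadata;
    abridged):

    pmd->bm = dm_block_manager_create(pmd->bdev,
                                      THIN_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
                                      THIN_MAX_CONCURRENT_LOCKS);
    if (IS_ERR(pmd->bm)) {
            r = PTR_ERR(pmd->bm);
            pmd->bm = NULL;    /* don't leave an error value where a pointer is expected */
            return r;
    }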

    Signed-off-by: Ye Bin
    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Ye Bin
     
  • commit d16ff19e69ab57e08bf908faaacbceaf660249de upstream.

    Same issue as above: a caller of __create_persistent_data_objects()
    may use the PTR_ERR value left behind as a valid pointer, which leads
    to undefined behavior (see the sketch under the previous entry).

    Signed-off-by: Ye Bin
    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Ye Bin
     
  • commit 7785a9e4c228db6d01086a52d5685cd7336a08b7 upstream.

    Use the DECLARE_CRYPTO_WAIT() macro to properly initialize the crypto
    wait structures declared on stack before their use with
    crypto_wait_req().
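
    For reference, the canonical on-stack usage (generic crypto API, not
    the dm-crypt code verbatim):

    DECLARE_CRYPTO_WAIT(wait);    /* initializes the completion and err fields */

    skcipher_request_set_callback(req,
                                  CRYPTO_TFM_REQ_MAY_BACKLOG | CRYPTO_TFM_REQ_MAY_SLEEP,
                                  crypto_req_done, &wait);
    err = crypto_wait_req(crypto_skcipher_encrypt(req), &wait);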

    Fixes: 39d13a1ac41d ("dm crypt: reuse eboiv skcipher for IV generation")
    Fixes: bbb1658461ac ("dm crypt: Implement Elephant diffuser for Bitlocker compatibility")
    Cc: stable@vger.kernel.org
    Signed-off-by: Damien Le Moal
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Damien Le Moal
     
  • commit e27fec66f0a94e35a35548bd0b29ae616e62ec62 upstream.

    The dm-integrity target did not report errors in bitmap mode just after
    creation. The reason is that the function integrity_recalc didn't clean up
    ic->recalc_bitmap as it proceeded with recalculation.

    Fix this by updating the bitmap accordingly -- the double shift serves
    to round down.

    Signed-off-by: Mikulas Patocka
    Fixes: 468dfca38b1a ("dm integrity: add a bitmap mode")
    Cc: stable@vger.kernel.org # v5.2+
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit c322ee9320eaa4013ca3620b1130992916b19b31 upstream.

    Commit 935fcc56abc3 ("dm mpath: only flush workqueue when needed")
    changed flush_multipath_work() to avoid needless flushing of a global
    multipath workqueue. But that change overlooked that the surrounding
    flush_multipath_work() code should also only run if
    'pg_init_in_progress' is set.

    Fix this by only doing all of flush_multipath_work()'s PG init related
    work if 'pg_init_in_progress' is set.

    Otherwise multipath_wait_for_pg_init_completion() will run
    unconditionally but the preceding flush_workqueue(kmpath_handlerd)
    may not. This could lead to deadlock (though only if kmpath_handlerd
    never runs a corresponding work item to decrement
    'pg_init_in_progress').

    It could also be, though highly unlikely, that the kmpath_handlerd
    work that does PG init completes before 'pg_init_in_progress' is set,
    and then an intervening DM table reload's multipath_postsuspend()
    triggers flush_multipath_work().
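
    The shape of the fix, as a sketch (locking elided; not the verbatim
    patch):

    static void flush_multipath_work(struct multipath *m)
    {
            if (m->hw_handler_name && atomic_read(&m->pg_init_in_progress)) {
                    /* PG init work can only be pending while pg_init is in progress */
                    flush_workqueue(kmpath_handlerd);
                    multipath_wait_for_pg_init_completion(m);
            }
            flush_workqueue(kmultipathd);
            flush_work(&m->trigger_event);
    }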

    Fixes: 935fcc56abc3 ("dm mpath: only flush workqueue when needed")
    Cc: stable@vger.kernel.org
    Reported-by: Ben Marzinski
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     
  • commit f9e040efcc28309e5c592f7e79085a9a52e31f58 upstream.

    The function dax_direct_access doesn't take partitions into account,
    it always maps pages from the beginning of the device. Therefore,
    persistent_memory_claim() must get the partition offset using
    get_start_sect() and add it to the page offsets passed to
    dax_direct_access().
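
    A sketch of the offset handling (writecache field names assumed from
    context; the real patch also rejects partition starts that are not
    page aligned):

    sector_t start = get_start_sect(wc->ssd_dev->bdev);
    pgoff_t offset = start >> (PAGE_SHIFT - SECTOR_SHIFT);

    /* dax_direct_access() addresses the whole device, not the partition */
    da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p,
                           &wc->memory_map, &pfn);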

    Signed-off-by: Mikulas Patocka
    Fixes: 48debafe4f2f ("dm: add writecache target")
    Cc: stable@vger.kernel.org # 4.18+
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     

26 Aug, 2020

1 commit

  • [ Upstream commit 65f0f017e7be8c70330372df23bcb2a407ecf02d ]

    For some block devices with large capacity (e.g. 8TB) but a small
    io_opt size (e.g. 8 sectors), in bcache_device_init() the number of
    stripes calculated by
    DIV_ROUND_UP_ULL(sectors, d->stripe_size);
    might overflow the unsigned int bcache_device->nr_stripes.

    This patch uses a uint64_t variable to store the DIV_ROUND_UP_ULL()
    result, and only after the value is checked to fit in the unsigned
    int range does it set bcache_device->nr_stripes. The overflow is thus
    avoided.
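
    The shape of the fix, as a sketch (the exact bound and error code used
    upstream are abridged):

    uint64_t n = DIV_ROUND_UP_ULL(sectors, d->stripe_size);

    if (!n || n > UINT_MAX)    /* reject values an unsigned int can't hold */
            return -ENOMEM;
    d->nr_stripes = n;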

    Reported-and-tested-by: Ken Raeburn
    Signed-off-by: Coly Li
    Cc: stable@vger.kernel.org
    Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Coly Li
     

21 Aug, 2020

5 commits

  • [ Upstream commit e8abe1de43dac658dacbd04a4543e0c988a8d386 ]

    The error handling calls md_bitmap_free(bitmap) which checks for NULL
    but will Oops if we pass an error pointer. Let's set "bitmap" to NULL
    on this error path.
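
    The shape of the fix (helper name from md-cluster; sketch only):

    bitmap = get_bitmap_from_slot(mddev, i);
    if (IS_ERR(bitmap)) {
            pr_err("md-cluster: could not get bitmap from slot %d\n", i);
            bitmap = NULL;    /* md_bitmap_free(NULL) is a no-op; an ERR_PTR would Oops */
            goto out;
    }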

    Fixes: afd756286083 ("md-cluster/raid10: resize all the bitmaps before start reshape")
    Signed-off-by: Dan Carpenter
    Reviewed-by: Guoqing Jiang
    Signed-off-by: Song Liu
    Signed-off-by: Sasha Levin

    Dan Carpenter
     
  • [ Upstream commit e766668c6cd49d741cfb49eaeb38998ba34d27bc ]

    dm_stop_queue() only uses blk_mq_quiesce_queue() so it doesn't
    formally stop the blk-mq queue; therefore there is no point making the
    blk_mq_queue_stopped() check -- it will never be stopped.

    In addition, even though dm_stop_queue() actually tries to quiesce hw
    queues via blk_mq_quiesce_queue(), checking with blk_queue_quiesced()
    to avoid an unnecessary queue quiesce isn't reliable, because the
    QUEUE_FLAG_QUIESCED flag is set before synchronize_rcu(), and
    dm_stop_queue() may be called while synchronize_rcu() from another
    blk_mq_quiesce_queue() is in progress.
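
    The resulting helper is essentially one line (sketch):

    void dm_stop_queue(struct request_queue *q)
    {
            /* Quiesce unconditionally: QUEUE_FLAG_QUIESCED may already be
             * set while another quiesce's synchronize_rcu() is still in
             * flight, so testing the flag first would be racy. */
            blk_mq_quiesce_queue(q);
    }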

    Fixes: 7b17c2f7292ba ("dm: Fix a race condition related to stopping and starting queues")
    Signed-off-by: Ming Lei
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Ming Lei
     
  • commit 7a1481267999c02abf4a624515c1b5c7c1fccbd6 upstream.

    offset_to_stripe() returns the stripe number (as unsigned int) for an
    offset (of type uint64_t) via the following calculation,
    do_div(offset, d->stripe_size);
    For a large-capacity backing device (e.g. 18TB) with a small stripe
    size (e.g. 4KB), the result is 4831838208 and exceeds UINT_MAX. The
    actual value the caller receives is 536870912, due to the overflow.

    Indeed in bcache_device_init(), bcache_device->nr_stripes is limited in
    range [1, INT_MAX]. Therefore all valid stripe numbers in bcache are
    in range [0, bcache_dev->nr_stripes - 1].

    This patch adds an upper limit check in offset_to_stripe(): the max
    valid stripe number must be less than bcache_device->nr_stripes. If
    the stripe number calculated by do_div() is equal to or larger than
    bcache_device->nr_stripes, -EINVAL is returned. (Normally nr_stripes
    is less than INT_MAX; exceeding the upper limit doesn't imply an
    overflow, therefore -EOVERFLOW is not used as the error code.)

    This patch also changes nr_stripes' type of struct bcache_device from
    'unsigned int' to 'int', and return value type of offset_to_stripe()
    from 'unsigned int' to 'int', to match their exact data ranges.

    All locations where bcache_device->nr_stripes and offset_to_stripe() are
    referenced also get updated for the above type change.
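
    A sketch of the range-checked helper after the type change:

    static inline int offset_to_stripe(struct bcache_device *d, uint64_t offset)
    {
            do_div(offset, d->stripe_size);

            /* valid stripe numbers are [0, d->nr_stripes - 1] */
            if (unlikely(offset >= d->nr_stripes))
                    return -EINVAL;
            return (int)offset;
    }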

    Reported-and-tested-by: Ken Raeburn
    Signed-off-by: Coly Li
    Cc: stable@vger.kernel.org
    Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Coly Li
     
  • commit 5fe48867856367142d91a82f2cbf7a57a24cbb70 upstream.

    Some bcache metadata is allocated from multiple pages and used as bio
    bv_page for I/Os to the cache device, for example cache_set->uuids,
    cache->disk_buckets, journal_write->data and bset_tree->data.

    For such metadata, all the allocated pages should be treated as a
    single memory block. Then the memory management and underlying I/O
    code can handle them more cleanly.

    This patch adds the __GFP_COMP flag to all locations allocating >0
    order pages for the above metadata, so their pages are now treated
    as compound pages.
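
    The change pattern, as a sketch for one representative allocation site
    ('order' is illustrative):

    /* >0 order pages later used as bio bv_page must form one compound page */
    buf = (void *)__get_free_pages(GFP_KERNEL | __GFP_COMP, order);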

    Signed-off-by: Coly Li
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Coly Li
     
  • commit a1c6ae3d9f3dd6aa5981a332a6f700cf1c25edef upstream.

    In degraded raid5, we need to read parity to do reconstruct-write when
    data disks fail. However, we can not read parity from
    handle_stripe_dirtying() in force reconstruct-write mode.

    Reproducible Steps:

    1. Create degraded raid5
    mdadm -C /dev/md2 --assume-clean -l5 -n3 /dev/sda2 /dev/sdb2 missing
    2. Set rmw_level to 0
    echo 0 > /sys/block/md2/md/rmw_level
    3. IO to raid5

    Now some I/O may get stuck in raid5. We can use handle_stripe_fill()
    to read the parity in this situation.

    Cc: stable@vger.kernel.org # v4.4+
    Reviewed-by: Alex Wu
    Reviewed-by: BingJing Chang
    Reviewed-by: Danny Shih
    Signed-off-by: ChangSyun Peng
    Signed-off-by: Song Liu
    Signed-off-by: Greg Kroah-Hartman

    ChangSyun Peng
     

19 Aug, 2020

3 commits

  • [ Upstream commit 117f636ea695270fe492d0c0c9dfadc7a662af47 ]

    In register_cache_set(), c is a pointer to struct cache_set and ca is
    a pointer to struct cache. If ca->sb.seq > c->sb.seq, the registering
    cache has an up-to-date version and other members, so the in-memory
    version and other members should be updated to the newer values.

    But the current implementation only allows a cache set to have a
    single cache device, so the above assumption works well except for
    one special case. The exception is when a cache device is newly
    created and both ca->sb.seq and c->sb.seq are 0, because the super
    block has never been flushed out yet. In the location of the
    following if() check,
    2156         if (ca->sb.seq > c->sb.seq) {
    2157                 c->sb.version = ca->sb.version;
    2158                 memcpy(c->sb.set_uuid, ca->sb.set_uuid, 16);
    2159                 c->sb.flags = ca->sb.flags;
    2160                 c->sb.seq = ca->sb.seq;
    2161                 pr_debug("set version = %llu\n", c->sb.version);
    2162         }
    c->sb.version is not initialized yet and has value 0. When ca->sb.seq
    is 0, the if() check fails (because both values are 0), and the cache
    set's version, set_uuid, flags and seq won't be updated.

    The above problem is hidden in the current code, because the bucket
    size is compatible among different super block versions. The next
    time the cache set runs, ca->sb.seq will be larger than 0 and the
    cache set super block version will be updated properly.

    But if the large bucket feature is enabled, sb->bucket_size holds
    only the low 16 bits of the bucket size. For a power-of-2 value, when
    the actual bucket size exceeds the 16-bit width, sb->bucket_size will
    always be 0. Then read_super_common() will fail because its
    is_power_of_2(sb->bucket_size) check is false. This is how the
    long-hidden bug is triggered.

    This patch modifies the if() check as follows,
    2156         if (ca->sb.seq > c->sb.seq || c->sb.seq == 0) {
    Then the cache set's version, set_uuid, flags and seq will always be
    updated correctly, including for a newly created cache device.

    Signed-off-by: Coly Li
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Coly Li
     
  • [ Upstream commit 60f80d6f2d07a6d8aee485a1d1252327eeee0c81 ]

    reproduction steps:
    ```
    node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda
    /dev/sdb
    node2 # mdadm -A /dev/md0 /dev/sda /dev/sdb
    node1 # mdadm -G /dev/md0 -b none
    mdadm: failed to remove clustered bitmap.
    node1 # mdadm -S --scan
    ^C
    Signed-off-by: Song Liu
    Signed-off-by: Sasha Levin

    Zhao Heming
     
  • [ Upstream commit 9a5a85972c073f720d81a7ebd08bfe278e6e16db ]

    Pointer mddev is dereferenced in a test_bit call before mddev is
    null-checked; this may cause a null pointer dereference. Fix this by
    moving the null pointer checks so that mddev is sanity-checked before
    it is dereferenced.
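
    The shape of the fix, as a sketch (the surrounding md request path is
    abridged):

    /* sanity check mddev before any dereference */
    if (mddev == NULL || mddev->pers == NULL) {
            bio_io_error(bio);
            return BLK_QC_T_NONE;
    }
    if (unlikely(test_bit(MD_BROKEN, &mddev->flags)) && bio_data_dir(bio) == WRITE) {
            bio_io_error(bio);
            return BLK_QC_T_NONE;
    }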

    Addresses-Coverity: ("Dereference before null check")
    Fixes: 62f7b1989c02 ("md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone")
    Signed-off-by: Colin Ian King
    Reviewed-by: Guilherme G. Piccoli
    Signed-off-by: Song Liu
    Signed-off-by: Sasha Levin

    Colin Ian King
     

29 Jul, 2020

2 commits

  • commit 5df96f2b9f58a5d2dc1f30fe7de75e197f2c25f2 upstream.

    Commit adc0daad366b62ca1bce3e2958a40b0b71a8b8b3 ("dm: report suspended
    device during destroy") broke integrity recalculation.

    The problem is dm_suspended() returns true not only during suspend,
    but also during resume. So this race condition could occur:
    1. dm_integrity_resume calls queue_work(ic->recalc_wq, &ic->recalc_work)
    2. integrity_recalc (&ic->recalc_work) preempts the current thread
    3. integrity_recalc calls if (unlikely(dm_suspended(ic->ti))) goto unlock_ret;
    4. integrity_recalc exits and no recalculating is done.

    To fix this race condition, add a function dm_post_suspending that is
    only true during the postsuspend phase and use it instead of
    dm_suspended().
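
    Usage sketch in integrity_recalc(), per the description above:

    /* Bail out only while the device is actually being torn down;
     * unlike dm_suspended(), dm_post_suspending() stays false during a
     * normal resume. */
    if (unlikely(dm_post_suspending(ic->ti)))
            goto unlock_ret;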

    Signed-off-by: Mikulas Patocka
    Fixes: adc0daad366b ("dm: report suspended device during destroy")
    Cc: stable@vger.kernel.org # v4.18+
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • [ Upstream commit 382761dc6312965a11f82f2217e16ec421bf17ae ]

    bio_uninit is the proper API to clean up a BIO that has been allocated
    on stack or inside a structure that doesn't come from the BIO allocator.
    Switch dm to use that instead of bio_disassociate_blkg, which really is
    an implementation detail. Note that the bio_uninit calls are also moved
    to the two callers of __send_empty_flush, so that they better pair with
    the bio_init calls used to initialize them.
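
    The pairing, as a sketch (on-stack flush bio; setup abridged):

    struct bio flush_bio;

    bio_init(&flush_bio, NULL, 0);    /* on-stack, not from the bio allocator */
    /* ... set up and issue the flush, wait for completion ... */
    bio_uninit(&flush_bio);           /* proper teardown for non-allocated bios */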

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Christoph Hellwig
     

16 Jul, 2020

2 commits

  • commit a46624580376a3a0beb218d94cbc7f258696e29f upstream.

    DM writecache does not handle asynchronous pmem. Reject it when
    supplied as cache.

    Link: https://lore.kernel.org/linux-nvdimm/87lfk5hahc.fsf@linux.ibm.com/
    Fixes: 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver")
    Signed-off-by: Michal Suchanek
    Acked-by: Mikulas Patocka
    Cc: stable@vger.kernel.org # 5.3+
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Michal Suchanek
     
  • commit 6958c1c640af8c3f40fa8a2eee3b5b905d95b677 upstream.

    kobject_uevent may allocate memory and it may be called while there are dm
    devices suspended. The allocation may recurse into a suspended device,
    causing a deadlock. We must set the noio flag when sending a uevent.
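
    The fix pattern, using the memalloc scope API (sketch):

    unsigned int noio_flag;

    noio_flag = memalloc_noio_save();    /* allocations below become GFP_NOIO */
    r = kobject_uevent(&disk_to_dev(md->disk)->kobj, action);
    memalloc_noio_restore(noio_flag);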

    The observed deadlock was reported here:
    https://www.redhat.com/archives/dm-devel/2020-March/msg00025.html

    Reported-by: Khazhismel Kumykov
    Reported-by: Tahsin Erdogan
    Reported-by: Gabriel Krisman Bertazi
    Signed-off-by: Mikulas Patocka
    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     

09 Jul, 2020

1 commit

  • commit 7b2377486767503d47265e4d487a63c651f6b55d upstream.

    The unit of max_io_len is sectors instead of bytes (spotted through
    code review), so fix it.
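
    The shape of the fix (field names per dm-zoned; sketch):

    /* max_io_len is counted in 512-byte sectors, not bytes */
    ti->max_io_len = dev->zone_nr_sectors;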

    Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
    Cc: stable@vger.kernel.org
    Signed-off-by: Hou Tao
    Reviewed-by: Damien Le Moal
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Hou Tao
     

01 Jul, 2020

2 commits


24 Jun, 2020

3 commits

  • [ Upstream commit be23e837333a914df3f24bf0b32e87b0331ab8d1 ]

    coccicheck reports:
    drivers/md//bcache/btree.c:1538:1-7: preceding lock on line 1417

    In the btree_gc_coalesce function, if the coalescing process fails, we
    jump directly to the out_nocoalesce tag without releasing
    new_nodes[i]->write_lock. This then causes a deadlock when trying to
    acquire new_nodes[i]->write_lock in order to free new_nodes[i] before
    returning.

    btree_gc_coalesce details, in outline:

        if alloc new_nodes[i] fails:
                goto out_nocoalesce;

        // obtain new_nodes[i]->write_lock
        mutex_lock(&new_nodes[i]->write_lock)

        // main coalescing process
        for (i = nodes - 1; i > 0; --i)
                [snipped]
                if coalescing process fails:
                        // here, jumping directly to the out_nocoalesce
                        // tag will cause a deadlock
                        goto out_nocoalesce;
                [snipped]

        // release new_nodes[i]->write_lock
        mutex_unlock(&new_nodes[i]->write_lock)

        // coalescing succeeded, return
        return;

    out_nocoalesce:
        btree_node_free(new_nodes[i])   // free new_nodes[i]
        // obtain new_nodes[i]->write_lock
        mutex_lock(&new_nodes[i]->write_lock);
        // set flag for reuse
        clear_bit(BTREE_NODE_dirty, &new_nodes[i]->flags);
        // release new_nodes[i]->write_lock
        mutex_unlock(&new_nodes[i]->write_lock);

    To fix the problem, we add a new tag 'out_unlock_nocoalesce' that
    releases new_nodes[i]->write_lock before the out_nocoalesce tag. If
    the coalescing process fails, we go to out_unlock_nocoalesce to
    release new_nodes[i]->write_lock before freeing new_nodes[i] at the
    out_nocoalesce tag.
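
    The shape of the fix, as a sketch:

    out_unlock_nocoalesce:
            for (i = 0; i < nodes; i++)
                    mutex_unlock(&new_nodes[i]->write_lock);
    out_nocoalesce:
            /* existing cleanup: free new_nodes[i], clear flags, ... */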

    (Coly Li helps to clean up commit log format.)

    Fixes: 2a285686c109816 ("bcache: btree locking rework")
    Signed-off-by: Zhiqiang Liu
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Zhiqiang Liu
     
  • [ Upstream commit 489dc0f06a5837f87482c0ce61d830d24e17082e ]

    The only case where dmz_get_zone_for_reclaim() cannot return a zone is
    if the respective lists are empty. So we should just return a simple
    NULL value here as we really don't have an error code which would make
    sense.

    Signed-off-by: Hannes Reinecke
    Reviewed-by: Damien Le Moal
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Hannes Reinecke
     
  • [ Upstream commit 2361ae595352dec015d14292f1b539242d8446d6 ]

    SCSI LUN passthrough code such as qemu's "scsi-block" device model
    passes every IO to the host via SG_IO ioctls. Currently, dm-multipath
    calls choose_pgpath() only in the block IO code path, not in the ioctl
    code path (unless current_pgpath is NULL). The effect is that no path
    switching, and thus no load balancing, is done for SCSI-passthrough
    IO unless the active path fails.

    Fix this by using the same logic in multipath_prepare_ioctl() as in
    multipath_clone_and_map().

    Note: The allegedly best path selection algorithm, service-time,
    still wouldn't work perfectly, because the io size of the current
    request is always set to 0. Changing that for the IO passthrough
    case would require the ioctl cmd and arg to be passed to dm's
    prepare_ioctl() method.
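
    The borrowed logic, as a sketch (flag and helper names from dm-mpath):

    /* Match multipath_clone_and_map(): let the path selector re-choose on
     * every ioctl instead of only when there is no current path. */
    pgpath = READ_ONCE(m->current_pgpath);
    if (!pgpath || !test_bit(MPATHF_QUEUE_IO, &m->flags))
            pgpath = choose_pgpath(m, 0);    /* io size unknown for ioctls, hence 0 */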

    Signed-off-by: Martin Wilck
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Martin Wilck
     

22 Jun, 2020

4 commits

  • commit 64611a15ca9da91ff532982429c44686f4593b5f upstream.

    queue_limits::logical_block_size got changed from unsigned short to
    unsigned int, but crypt_io_hints() was not updated to use the new
    type. Fix it.
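
    The one-line type fix, as a sketch:

    /* was max_t(unsigned short, ...), which truncated the 32-bit value */
    limits->logical_block_size =
            max_t(unsigned int, limits->logical_block_size, cc->sector_size);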

    Fixes: ad6bf88a6c19 ("block: fix an integer overflow in logical block size")
    Cc: stable@vger.kernel.org
    Signed-off-by: Eric Biggers
    Reviewed-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • [ Upstream commit 86da9f736740eba602389908574dfbb0f517baa5 ]

    The problematic code piece in bcache_device_free() is,

    785 static void bcache_device_free(struct bcache_device *d)
    786 {
    787         struct gendisk *disk = d->disk;
        [snipped]
    799         if (disk) {
    800                 if (disk->flags & GENHD_FL_UP)
    801                         del_gendisk(disk);
    802
    803                 if (disk->queue)
    804                         blk_cleanup_queue(disk->queue);
    805
    806                 ida_simple_remove(&bcache_device_idx,
    807                                   first_minor_to_idx(disk->first_minor));
    808                 put_disk(disk);
    809         }
        [snipped]
    816 }

    At line 808, put_disk(disk) may underflow the kobject refcount of
    'disk'.

    Here is how to reproduce the issue,
    - Attach the backing device to a cache device and do random writes to
    make the cache dirty.
    - Stop the bcache device while the cache device holds dirty data of
    the backing device.
    - Register only the backing device back, NOT the cache device.
    - The bcache device node /dev/bcache0 won't show up, because the
    backing device waits for the cache device holding the missing dirty
    data to show up.
    - Now echo 1 into /sys/fs/bcache/pendings_cleanup to stop the pending
    backing device.
    - After the pending backing device has stopped, check the kernel
    messages with 'dmesg': a use-after-free warning from KASAN reports
    that the refcount of the kobject linked to the 'disk' underflowed.

    The refcount dropped at line 808 in the code above is taken by
    add_disk(d->disk) in bch_cached_dev_run(). But in the condition above
    the cache device is not registered, so bch_cached_dev_run() never
    gets called and the refcount is never taken. Calling put_disk() on a
    gendisk kobject whose refcount was never taken triggers the underflow
    warning.

    This patch checks whether GENHD_FL_UP is set in disk->flags; if it is
    not set then the bcache device was never added, so don't call
    put_disk() and the underflow issue is avoided.
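
    The shape of the fix, based on the snippet quoted above (sketch):

    if (disk) {
            bool disk_added = (disk->flags & GENHD_FL_UP) != 0;

            if (disk_added)
                    del_gendisk(disk);
            if (disk->queue)
                    blk_cleanup_queue(disk->queue);
            ida_simple_remove(&bcache_device_idx,
                              first_minor_to_idx(disk->first_minor));
            /* put_disk() pairs with add_disk(); skip it if the disk was never added */
            if (disk_added)
                    put_disk(disk);
    }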

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Coly Li
     
  • [ Upstream commit ba54d4d4d2844c234f1b4692bd8c9e0f833c8a54 ]

    Using the GFP_NOIO flag to call scribble_alloc() from resize_chunk()
    does not have the expected behavior: kvmalloc_array() inside
    scribble_alloc(), when given the GFP_NOIO flag, will eventually call
    kmalloc_node() to allocate physically contiguous pages.

    Now that we have the memalloc scope APIs in
    mddev_suspend()/mddev_resume() to prevent memory-reclaim I/O during
    the raid array suspend context, calling kvmalloc_array() with the
    GFP_KERNEL flag avoids the recursive-I/O deadlock as expected.

    This patch removes the now-useless gfp flags from the parameter list
    of scribble_alloc() and calls kvmalloc_array() with the GFP_KERNEL
    flag. The incorrect GFP_NOIO flag no longer exists.
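
    The core of the change, as a sketch (variable names illustrative):

    /* mddev_suspend()/mddev_resume() already enter a memalloc_noio scope,
     * so GFP_KERNEL is safe here and lets kvmalloc fall back to vmalloc. */
    scribble = kvmalloc_array(cnt, obj_size, GFP_KERNEL);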

    Fixes: b330e6a49dc3 ("md: convert to kvmalloc")
    Suggested-by: Michal Hocko
    Signed-off-by: Coly Li
    Signed-off-by: Song Liu
    Signed-off-by: Sasha Levin

    Coly Li
     
  • [ Upstream commit f6766ff6afff70e2aaf39e1511e16d471de7c3ae ]

    We need to check mddev->del_work before flushing the workqueue, since
    the purpose of the flush is to ensure the previous md device has
    disappeared. Otherwise a similar deadlock appears when LOCKDEP is
    enabled, because md_open holds bdev->bd_mutex before flushing the
    workqueue.

    kernel: [ 154.522645] ======================================================
    kernel: [ 154.522647] WARNING: possible circular locking dependency detected
    kernel: [ 154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G O
    kernel: [ 154.522651] ------------------------------------------------------
    kernel: [ 154.522653] mdadm/2482 is trying to acquire lock:
    kernel: [ 154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0
    kernel: [ 154.522673]
    kernel: [ 154.522673] but task is already holding lock:
    kernel: [ 154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
    kernel: [ 154.522691]
    kernel: [ 154.522691] which lock already depends on the new lock.
    kernel: [ 154.522691]
    kernel: [ 154.522694]
    kernel: [ 154.522694] the existing dependency chain (in reverse order) is:
    kernel: [ 154.522696]
    kernel: [ 154.522696] -> #4 (&bdev->bd_mutex){+.+.}:
    kernel: [ 154.522704] __mutex_lock+0x87/0x950
    kernel: [ 154.522706] __blkdev_get+0x79/0x590
    kernel: [ 154.522708] blkdev_get+0x65/0x140
    kernel: [ 154.522709] blkdev_get_by_dev+0x2f/0x40
    kernel: [ 154.522716] lock_rdev+0x3d/0x90 [md_mod]
    kernel: [ 154.522719] md_import_device+0xd6/0x1b0 [md_mod]
    kernel: [ 154.522723] new_dev_store+0x15e/0x210 [md_mod]
    kernel: [ 154.522728] md_attr_store+0x7a/0xc0 [md_mod]
    kernel: [ 154.522732] kernfs_fop_write+0x117/0x1b0
    kernel: [ 154.522735] vfs_write+0xad/0x1a0
    kernel: [ 154.522737] ksys_write+0xa4/0xe0
    kernel: [ 154.522745] do_syscall_64+0x64/0x2b0
    kernel: [ 154.522748] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    kernel: [ 154.522749]
    kernel: [ 154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}:
    kernel: [ 154.522752] __mutex_lock+0x87/0x950
    kernel: [ 154.522756] new_dev_store+0xc9/0x210 [md_mod]
    kernel: [ 154.522759] md_attr_store+0x7a/0xc0 [md_mod]
    kernel: [ 154.522761] kernfs_fop_write+0x117/0x1b0
    kernel: [ 154.522763] vfs_write+0xad/0x1a0
    kernel: [ 154.522765] ksys_write+0xa4/0xe0
    kernel: [ 154.522767] do_syscall_64+0x64/0x2b0
    kernel: [ 154.522769] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    kernel: [ 154.522770]
    kernel: [ 154.522770] -> #2 (kn->count#253){++++}:
    kernel: [ 154.522775] __kernfs_remove+0x253/0x2c0
    kernel: [ 154.522778] kernfs_remove+0x1f/0x30
    kernel: [ 154.522780] kobject_del+0x28/0x60
    kernel: [ 154.522783] mddev_delayed_delete+0x24/0x30 [md_mod]
    kernel: [ 154.522786] process_one_work+0x2a7/0x5f0
    kernel: [ 154.522788] worker_thread+0x2d/0x3d0
    kernel: [ 154.522793] kthread+0x117/0x130
    kernel: [ 154.522795] ret_from_fork+0x3a/0x50
    kernel: [ 154.522796]
    kernel: [ 154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}:
    kernel: [ 154.522800] process_one_work+0x27e/0x5f0
    kernel: [ 154.522802] worker_thread+0x2d/0x3d0
    kernel: [ 154.522804] kthread+0x117/0x130
    kernel: [ 154.522806] ret_from_fork+0x3a/0x50
    kernel: [ 154.522807]
    kernel: [ 154.522807] -> #0 ((wq_completion)md_misc){+.+.}:
    kernel: [ 154.522813] __lock_acquire+0x1392/0x1690
    kernel: [ 154.522816] lock_acquire+0xb4/0x1a0
    kernel: [ 154.522818] flush_workqueue+0xab/0x4b0
    kernel: [ 154.522821] md_open+0xb6/0xc0 [md_mod]
    kernel: [ 154.522823] __blkdev_get+0xea/0x590
    kernel: [ 154.522825] blkdev_get+0x65/0x140
    kernel: [ 154.522828] do_dentry_open+0x1d1/0x380
    kernel: [ 154.522831] path_openat+0x567/0xcc0
    kernel: [ 154.522834] do_filp_open+0x9b/0x110
    kernel: [ 154.522836] do_sys_openat2+0x201/0x2a0
    kernel: [ 154.522838] do_sys_open+0x57/0x80
    kernel: [ 154.522840] do_syscall_64+0x64/0x2b0
    kernel: [ 154.522842] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    kernel: [ 154.522844]
    kernel: [ 154.522844] other info that might help us debug this:
    kernel: [ 154.522844]
    kernel: [ 154.522846] Chain exists of:
    kernel: [ 154.522846] (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex
    kernel: [ 154.522846]
    kernel: [ 154.522850] Possible unsafe locking scenario:
    kernel: [ 154.522850]
    kernel: [ 154.522852] CPU0 CPU1
    kernel: [ 154.522853] ---- ----
    kernel: [ 154.522854] lock(&bdev->bd_mutex);
    kernel: [ 154.522856] lock(&mddev->reconfig_mutex);
    kernel: [ 154.522858] lock(&bdev->bd_mutex);
    kernel: [ 154.522860] lock((wq_completion)md_misc);
    kernel: [ 154.522861]
    kernel: [ 154.522861] *** DEADLOCK ***
    kernel: [ 154.522861]
    kernel: [ 154.522864] 1 lock held by mdadm/2482:
    kernel: [ 154.522865] #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
    kernel: [ 154.522868]
    kernel: [ 154.522868] stack backtrace:
    kernel: [ 154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G O 5.6.0-rc7-lp151.27-default #25
    kernel: [ 154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
    kernel: [ 154.522878] Call Trace:
    kernel: [ 154.522881] dump_stack+0x8f/0xcb
    kernel: [ 154.522884] check_noncircular+0x194/0x1b0
    kernel: [ 154.522888] ? __lock_acquire+0x1392/0x1690
    kernel: [ 154.522890] __lock_acquire+0x1392/0x1690
    kernel: [ 154.522893] lock_acquire+0xb4/0x1a0
    kernel: [ 154.522895] ? flush_workqueue+0x84/0x4b0
    kernel: [ 154.522898] flush_workqueue+0xab/0x4b0
    kernel: [ 154.522900] ? flush_workqueue+0x84/0x4b0
    kernel: [ 154.522905] ? md_open+0xb6/0xc0 [md_mod]
    kernel: [ 154.522908] md_open+0xb6/0xc0 [md_mod]
    kernel: [ 154.522910] __blkdev_get+0xea/0x590
    kernel: [ 154.522912] ? bd_acquire+0xc0/0xc0
    kernel: [ 154.522914] blkdev_get+0x65/0x140
    kernel: [ 154.522916] ? bd_acquire+0xc0/0xc0
    kernel: [ 154.522918] do_dentry_open+0x1d1/0x380
    kernel: [ 154.522921] path_openat+0x567/0xcc0
    kernel: [ 154.522923] ? __lock_acquire+0x380/0x1690
    kernel: [ 154.522926] do_filp_open+0x9b/0x110
    kernel: [ 154.522929] ? __alloc_fd+0xe5/0x1f0
    kernel: [ 154.522935] ? kmem_cache_alloc+0x28c/0x630
    kernel: [ 154.522939] ? do_sys_openat2+0x201/0x2a0
    kernel: [ 154.522941] do_sys_openat2+0x201/0x2a0
    kernel: [ 154.522944] do_sys_open+0x57/0x80
    kernel: [ 154.522946] do_syscall_64+0x64/0x2b0
    kernel: [ 154.522948] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    kernel: [ 154.522951] RIP: 0033:0x7f98d279d9ae

    md_alloc also flushes the same workqueue, but the situation is
    different there: none of the paths calling md_alloc hold
    bdev->bd_mutex, and the flush is necessary to avoid a race condition,
    so leave it as it is.
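
    The md_open() side of the fix, as a sketch:

    /* Flush only when a delayed removal is actually pending; taking
     * (wq_completion)md_misc under bd_mutex unconditionally is what
     * closes the lockdep cycle shown above. */
    if (work_pending(&mddev->del_work))
            flush_workqueue(md_misc_wq);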

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Song Liu
    Signed-off-by: Sasha Levin

    Guoqing Jiang
     

06 May, 2020

3 commits

  • commit 5686dee34dbfe0238c0274e0454fa0174ac0a57a upstream.

    When adding devices that don't have a scsi_dh on a BIO-based
    multipath, I was able to consistently hit the warning below and lock
    up the system.

    The problem is that __map_bio reads the flag before it is potentially
    modified by choose_pgpath, and ends up using the older value.

    The WARN_ON below is not trivially linked to the issue. It goes like
    this: The activate_path delayed_work is not initialized for non-scsi_dh
    devices, but we always set MPATHF_QUEUE_IO, asking for initialization.
    That is fine, since MPATHF_QUEUE_IO would be cleared in choose_pgpath.
    Nevertheless, only for BIO-based mpath, we cache the flag before calling
    choose_pgpath, and use the older version when deciding if we should
    initialize the path. Therefore, we end up trying to initialize the
    paths, and calling the non-initialized activate_path work.
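
    The corrected ordering, as a sketch (BIO-based map path; abridged):

    pgpath = READ_ONCE(m->current_pgpath);
    if (!pgpath || !test_bit(MPATHF_QUEUE_IO, &m->flags))
            pgpath = choose_pgpath(m, bio->bi_iter.bi_size);

    /* Re-read the flag here: choose_pgpath() may just have cleared it,
     * and a value cached before the call would wrongly trigger pg_init. */
    if (pgpath && test_bit(MPATHF_QUEUE_IO, &m->flags)) {
            /* queue the bio and kick off pg_init */
    }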

    [ 82.437100] ------------[ cut here ]------------
    [ 82.437659] WARNING: CPU: 3 PID: 602 at kernel/workqueue.c:1624
    __queue_delayed_work+0x71/0x90
    [ 82.438436] Modules linked in:
    [ 82.438911] CPU: 3 PID: 602 Comm: systemd-udevd Not tainted 5.6.0-rc6+ #339
    [ 82.439680] RIP: 0010:__queue_delayed_work+0x71/0x90
    [ 82.440287] Code: c1 48 89 4a 50 81 ff 00 02 00 00 75 2a 4c 89 cf e9
    94 d6 07 00 e9 7f e9 ff ff 0f 0b eb c7 0f 0b 48 81 7a 58 40 74 a8 94 74
    a7 0b 48 83 7a 48 00 74 a5 0f 0b eb a1 89 fe 4c 89 cf e9 c8 c4 07
    [ 82.441719] RSP: 0018:ffffb738803977c0 EFLAGS: 00010007
    [ 82.442121] RAX: ffffa086389f9740 RBX: 0000000000000002 RCX: 0000000000000000
    [ 82.442718] RDX: ffffa086350dd930 RSI: ffffa0863d76f600 RDI: 0000000000000200
    [ 82.443484] RBP: 0000000000000200 R08: 0000000000000000 R09: ffffa086350dd970
    [ 82.444128] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa086350dd930
    [ 82.444773] R13: ffffa0863d76f600 R14: 0000000000000000 R15: ffffa08636738008
    [ 82.445427] FS: 00007f6abfe9dd40(0000) GS:ffffa0863dd80000(0000) knlGS:00000
    [ 82.446040] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 82.446478] CR2: 0000557d288db4e8 CR3: 0000000078b36000 CR4: 00000000000006e0
    [ 82.447104] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 82.447561] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 82.448012] Call Trace:
    [ 82.448164] queue_delayed_work_on+0x6d/0x80
    [ 82.448472] __pg_init_all_paths+0x7b/0xf0
    [ 82.448714] pg_init_all_paths+0x26/0x40
    [ 82.448980] __multipath_map_bio.isra.0+0x84/0x210
    [ 82.449267] __map_bio+0x3c/0x1f0
    [ 82.449468] __split_and_process_non_flush+0x14a/0x1b0
    [ 82.449775] __split_and_process_bio+0xde/0x340
    [ 82.450045] ? dm_get_live_table+0x5/0xb0
    [ 82.450278] dm_process_bio+0x98/0x290
    [ 82.450518] dm_make_request+0x54/0x120
    [ 82.450778] generic_make_request+0xd2/0x3e0
    [ 82.451038] ? submit_bio+0x3c/0x150
    [ 82.451278] submit_bio+0x3c/0x150
    [ 82.451492] mpage_readpages+0x129/0x160
    [ 82.451756] ? bdev_evict_inode+0x1d0/0x1d0
    [ 82.452033] read_pages+0x72/0x170
    [ 82.452260] __do_page_cache_readahead+0x1ba/0x1d0
    [ 82.452624] force_page_cache_readahead+0x96/0x110
    [ 82.452903] generic_file_read_iter+0x84f/0xae0
    [ 82.453192] ? __seccomp_filter+0x7c/0x670
    [ 82.453547] new_sync_read+0x10e/0x190
    [ 82.453883] vfs_read+0x9d/0x150
    [ 82.454172] ksys_read+0x65/0xe0
    [ 82.454466] do_syscall_64+0x4e/0x210
    [ 82.454828] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [...]
    [ 82.462501] ---[ end trace bb39975e9cf45daa ]---

    Cc: stable@vger.kernel.org
    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Gabriel Krisman Bertazi
     
  • commit 31b22120194b5c0d460f59e0c98504de1d3f1f14 upstream.

    The dm-writecache reads metadata in the target constructor. However, when
    we reload the target, there could be another active instance running on
    the same device. This is the sequence of operations when doing a reload:

    1. construct new target
    2. suspend old target
    3. resume new target
    4. destroy old target

    Metadata written by the old target between steps 1 and 2 would not be
    visible to the new target.

    Fix the data corruption by loading the metadata in the resume handler.

    Also, validate block_size is at least as large as both the devices'
    logical block size and only read 1 block from the metadata during
    target constructor -- no need to read entirety of metadata now that it
    is done during resume.

    Fixes: 48debafe4f2f ("dm: add writecache target")
    Cc: stable@vger.kernel.org # v4.18+
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit ad4e80a639fc61d5ecebb03caa5cdbfb91fcebfc upstream.

    The error correction data is computed as if data and hash blocks were
    concatenated. But hash block numbers start from v->hash_start, so we
    have to calculate the hash block number based on that.

    Fixes: a739ff3f543af ("dm verity: add support for forward error correction")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sunwook Eom
    Reviewed-by: Sami Tolvanen
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Sunwook Eom
     

17 Apr, 2020

2 commits

  • [ Upstream commit 9fc06ff56845cc5ccafec52f545fc2e08d22f849 ]

    Add missing casts when converting from regions to sectors.

    In case BITS_PER_LONG == 32, the lack of the appropriate casts can lead
    to overflows and miscalculation of the device sector.

    As a result, we could end up discarding and/or copying the wrong parts
    of the device, thus corrupting the device's data.
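
    The overflow pattern being fixed, as a sketch ('region_nr' is
    illustrative):

    /* On 32-bit, 'unsigned long << region_shift' can wrap before the
     * result is widened to sector_t, so cast first. */
    sector_t s = (sector_t)region_nr << clone->region_shift;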

    Fixes: 7431b7835f55 ("dm: add clone target")
    Cc: stable@vger.kernel.org # v5.4+
    Signed-off-by: Nikos Tsironis
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Nikos Tsironis
     
  • [ Upstream commit 4b5142905d4ff58a4b93f7c8eaa7ba829c0a53c9 ]

    There is a bug in the way dm-clone handles discards, which can lead to
    discarding the wrong blocks or trying to discard blocks beyond the end
    of the device.

    This could lead to data corruption, if the destination device indeed
    discards the underlying blocks, i.e., if the discard operation results
    in the original contents of a block to be lost.

    The root of the problem is the code that calculates the range of regions
    covered by a discard request and decides which regions to discard.

    Since dm-clone handles the device in units of regions, we don't discard
    parts of a region, only whole regions.

    The range is calculated as:

    rs = dm_sector_div_up(bio->bi_iter.bi_sector, clone->region_size);
    re = bio_end_sector(bio) >> clone->region_shift;

    , where 'rs' is the first region to discard and (re - rs) is the number
    of regions to discard.

    The bug manifests when we try to discard part of a single region, i.e.,
    when we try to discard a block with size < region_size, and the discard
    request both starts at an offset with respect to the beginning of that
    region and ends before the end of the region.

    The root cause is the following comparison:

    if (rs == re)
    // skip discard and complete original bio immediately

    , which doesn't take into account that 'rs' might be greater than 're'.

    Thus, we then issue a discard request for the wrong blocks, instead of
    skipping the discard all together.

    Fix the check to also take into account the above case, so we don't end
    up discarding the wrong blocks.
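
    The corrected check, as a sketch:

    if (rs >= re) {
            /* The discard covers no whole region (rs > re is the
             * sub-region case described above): complete the bio without
             * discarding anything. */
            bio_endio(bio);
            return;
    }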

    Also, add some range checks to dm_clone_set_region_hydrated() and
    dm_clone_cond_set_range(), which update dm-clone's region bitmap.

    Note that the aforementioned bug doesn't cause invalid memory accesses,
    because dm_clone_is_range_hydrated() returns True for this case, so the
    checks are just precautionary.

    Fixes: 7431b7835f55 ("dm: add clone target")
    Cc: stable@vger.kernel.org # v5.4+
    Signed-off-by: Nikos Tsironis
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Nikos Tsironis