29 Jul, 2020

2 commits

  • commit 5df96f2b9f58a5d2dc1f30fe7de75e197f2c25f2 upstream.

    Commit adc0daad366b62ca1bce3e2958a40b0b71a8b8b3 ("dm: report suspended
    device during destroy") broke integrity recalculation.

    The problem is dm_suspended() returns true not only during suspend,
    but also during resume. So this race condition could occur:
    1. dm_integrity_resume calls queue_work(ic->recalc_wq, &ic->recalc_work)
    2. integrity_recalc (&ic->recalc_work) preempts the current thread
    3. integrity_recalc calls if (unlikely(dm_suspended(ic->ti))) goto unlock_ret;
    4. integrity_recalc exits and no recalculating is done.

    To fix this race condition, add a function dm_post_suspending that is
    only true during the postsuspend phase and use it instead of
    dm_suspended().

    Signed-off-by: Mikulas Patocka
    Fixes: adc0daad366b ("dm: report suspended device during destroy")
Cc: stable@vger.kernel.org # v4.18+
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • [ Upstream commit 382761dc6312965a11f82f2217e16ec421bf17ae ]

    bio_uninit is the proper API to clean up a BIO that has been allocated
    on stack or inside a structure that doesn't come from the BIO allocator.
    Switch dm to use that instead of bio_disassociate_blkg, which really is
    an implementation detail. Note that the bio_uninit calls are also moved
    to the two callers of __send_empty_flush, so that they better pair with
    the bio_init calls used to initialize them.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Christoph Hellwig
     

16 Jul, 2020

2 commits

  • commit a46624580376a3a0beb218d94cbc7f258696e29f upstream.

    DM writecache does not handle asynchronous pmem. Reject it when
    supplied as cache.

    Link: https://lore.kernel.org/linux-nvdimm/87lfk5hahc.fsf@linux.ibm.com/
    Fixes: 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver")
    Signed-off-by: Michal Suchanek
    Acked-by: Mikulas Patocka
    Cc: stable@vger.kernel.org # 5.3+
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Michal Suchanek
     
  • commit 6958c1c640af8c3f40fa8a2eee3b5b905d95b677 upstream.

kobject_uevent may allocate memory, and it may be called while dm
devices are suspended. The allocation may recurse into a suspended
device, causing a deadlock. We must set the noio flag when sending a
uevent.

    The observed deadlock was reported here:
    https://www.redhat.com/archives/dm-devel/2020-March/msg00025.html

    Reported-by: Khazhismel Kumykov
    Reported-by: Tahsin Erdogan
    Reported-by: Gabriel Krisman Bertazi
    Signed-off-by: Mikulas Patocka
    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     

09 Jul, 2020

1 commit

  • commit 7b2377486767503d47265e4d487a63c651f6b55d upstream.

The unit of max_io_len is sectors, not bytes (spotted through code
review), so fix it.

    Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
    Cc: stable@vger.kernel.org
    Signed-off-by: Hou Tao
    Reviewed-by: Damien Le Moal
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Hou Tao
     

01 Jul, 2020

2 commits


24 Jun, 2020

3 commits

  • [ Upstream commit be23e837333a914df3f24bf0b32e87b0331ab8d1 ]

    coccicheck reports:
    drivers/md//bcache/btree.c:1538:1-7: preceding lock on line 1417

In the btree_gc_coalesce func, if the coalescing process fails, we goto
the out_nocoalesce tag directly without releasing new_nodes[i]->write_lock.
This then deadlocks when we try to acquire new_nodes[i]->write_lock
again to free new_nodes[i] before returning.

btree_gc_coalesce func details as follows:

    if alloc new_nodes[i] fails:
        goto out_nocoalesce;
    // obtain new_nodes[i]->write_lock
    mutex_lock(&new_nodes[i]->write_lock)
    // main coalescing process
    for (i = nodes - 1; i > 0; --i)
        [snipped]
        if coalescing process fails:
            // Here, directly goto out_nocoalesce
            // tag will cause a deadlock
            goto out_nocoalesce;
        [snipped]
    // release new_nodes[i]->write_lock
    mutex_unlock(&new_nodes[i]->write_lock)
    // coalescing succeeded, return
    return;
out_nocoalesce:
    btree_node_free(new_nodes[i])  // free new_nodes[i]
    // obtain new_nodes[i]->write_lock
    mutex_lock(&new_nodes[i]->write_lock);
    // set flag for reuse
    clear_bit(BTREE_NODE_dirty, &new_nodes[i]->flags);
    // release new_nodes[i]->write_lock
    mutex_unlock(&new_nodes[i]->write_lock);

To fix the problem, we add a new tag 'out_unlock_nocoalesce' that
releases new_nodes[i]->write_lock before the out_nocoalesce tag. If the
coalescing process fails, we go to the out_unlock_nocoalesce tag to
release new_nodes[i]->write_lock before freeing new_nodes[i] at the
out_nocoalesce tag.

    (Coly Li helps to clean up commit log format.)

    Fixes: 2a285686c109816 ("bcache: btree locking rework")
    Signed-off-by: Zhiqiang Liu
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Zhiqiang Liu
     
  • [ Upstream commit 489dc0f06a5837f87482c0ce61d830d24e17082e ]

    The only case where dmz_get_zone_for_reclaim() cannot return a zone is
    if the respective lists are empty. So we should just return a simple
    NULL value here as we really don't have an error code which would make
    sense.

    Signed-off-by: Hannes Reinecke
    Reviewed-by: Damien Le Moal
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Hannes Reinecke
     
  • [ Upstream commit 2361ae595352dec015d14292f1b539242d8446d6 ]

SCSI LUN passthrough code such as qemu's "scsi-block" device model
passes every IO to the host via SG_IO ioctls. Currently, dm-multipath
    calls choose_pgpath() only in the block IO code path, not in the ioctl
    code path (unless current_pgpath is NULL). This has the effect that no
    path switching and thus no load balancing is done for SCSI-passthrough
    IO, unless the active path fails.

    Fix this by using the same logic in multipath_prepare_ioctl() as in
    multipath_clone_and_map().

    Note: The allegedly best path selection algorithm, service-time,
    still wouldn't work perfectly, because the io size of the current
    request is always set to 0. Changing that for the IO passthrough
    case would require the ioctl cmd and arg to be passed to dm's
    prepare_ioctl() method.

    Signed-off-by: Martin Wilck
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Martin Wilck
     

22 Jun, 2020

4 commits

  • commit 64611a15ca9da91ff532982429c44686f4593b5f upstream.

queue_limits::logical_block_size got changed from unsigned short to
unsigned int, but crypt_io_hints() was not updated to use the new
type. Fix it.

    Fixes: ad6bf88a6c19 ("block: fix an integer overflow in logical block size")
    Cc: stable@vger.kernel.org
    Signed-off-by: Eric Biggers
    Reviewed-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • [ Upstream commit 86da9f736740eba602389908574dfbb0f517baa5 ]

    The problematic code piece in bcache_device_free() is,

785 static void bcache_device_free(struct bcache_device *d)
786 {
787         struct gendisk *disk = d->disk;
    [snipped]
799         if (disk) {
800                 if (disk->flags & GENHD_FL_UP)
801                         del_gendisk(disk);
802
803                 if (disk->queue)
804                         blk_cleanup_queue(disk->queue);
805
806                 ida_simple_remove(&bcache_device_idx,
807                                   first_minor_to_idx(disk->first_minor));
808                 put_disk(disk);
809         }
    [snipped]
816 }

At line 808, put_disk(disk) may underflow the kobject refcount of
'disk'.

Here is how to reproduce the issue,
- Attach the backing device to a cache device and do random writes to
make the cache dirty.
- Stop the bcache device while the cache device has dirty data of the
backing device.
- Only register the backing device back, NOT the cache device.
- The bcache device node /dev/bcache0 won't show up, because the
backing device waits for the cache device to show up with the missing
dirty data.
- Now echo 1 into /sys/fs/bcache/pendings_cleanup, to stop the pending
backing device.
- After the pending backing device stops, check the kernel messages
with 'dmesg': a use-after-free warning from KASAN reports that the
refcount of the kobject linked to the 'disk' has underflowed.

The refcount dropped at line 808 in the above code piece was added by
add_disk(d->disk) in bch_cached_dev_run(). But in the above condition
the cache device is not registered, so bch_cached_dev_run() has no
chance to be called and the refcount is never added. The put_disk() for
a never-added gendisk kobject refcount triggers an underflow warning.

This patch checks whether GENHD_FL_UP is set in disk->flags; if it is
not set, the bcache device was never added, so don't call put_disk()
and the underflow issue is avoided.

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Coly Li
     
  • [ Upstream commit ba54d4d4d2844c234f1b4692bd8c9e0f833c8a54 ]

Using the GFP_NOIO flag to call scribble_alloc() from resize_chunk()
does not have the expected behavior: kvmalloc_array() inside
scribble_alloc(), which receives the GFP_NOIO flag, will eventually
call kmalloc_node() to allocate physically contiguous pages.

Now that we have the memalloc scope APIs in mddev_suspend()/
mddev_resume() to prevent memory-reclaim I/O during the raid array
suspend context, calling kvmalloc_array() with the GFP_KERNEL flag
avoids the deadlock of recursive I/O as expected.

This patch removes the now-useless gfp flags from the parameter list of
scribble_alloc() and calls kvmalloc_array() with the GFP_KERNEL flag,
so the incorrect GFP_NOIO flag no longer exists.

    Fixes: b330e6a49dc3 ("md: convert to kvmalloc")
    Suggested-by: Michal Hocko
    Signed-off-by: Coly Li
    Signed-off-by: Song Liu
    Signed-off-by: Sasha Levin

    Coly Li
     
  • [ Upstream commit f6766ff6afff70e2aaf39e1511e16d471de7c3ae ]

We need to check mddev->del_work before flushing the workqueue, since
the purpose of the flush is to ensure the previous md has disappeared.
Otherwise a similar deadlock appears when LOCKDEP is enabled, because
md_open holds bdev->bd_mutex before flushing the workqueue.

    kernel: [ 154.522645] ======================================================
    kernel: [ 154.522647] WARNING: possible circular locking dependency detected
    kernel: [ 154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G O
    kernel: [ 154.522651] ------------------------------------------------------
    kernel: [ 154.522653] mdadm/2482 is trying to acquire lock:
    kernel: [ 154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0
    kernel: [ 154.522673]
    kernel: [ 154.522673] but task is already holding lock:
    kernel: [ 154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
    kernel: [ 154.522691]
    kernel: [ 154.522691] which lock already depends on the new lock.
    kernel: [ 154.522691]
    kernel: [ 154.522694]
    kernel: [ 154.522694] the existing dependency chain (in reverse order) is:
    kernel: [ 154.522696]
    kernel: [ 154.522696] -> #4 (&bdev->bd_mutex){+.+.}:
    kernel: [ 154.522704] __mutex_lock+0x87/0x950
    kernel: [ 154.522706] __blkdev_get+0x79/0x590
    kernel: [ 154.522708] blkdev_get+0x65/0x140
    kernel: [ 154.522709] blkdev_get_by_dev+0x2f/0x40
    kernel: [ 154.522716] lock_rdev+0x3d/0x90 [md_mod]
    kernel: [ 154.522719] md_import_device+0xd6/0x1b0 [md_mod]
    kernel: [ 154.522723] new_dev_store+0x15e/0x210 [md_mod]
    kernel: [ 154.522728] md_attr_store+0x7a/0xc0 [md_mod]
    kernel: [ 154.522732] kernfs_fop_write+0x117/0x1b0
    kernel: [ 154.522735] vfs_write+0xad/0x1a0
    kernel: [ 154.522737] ksys_write+0xa4/0xe0
    kernel: [ 154.522745] do_syscall_64+0x64/0x2b0
    kernel: [ 154.522748] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    kernel: [ 154.522749]
    kernel: [ 154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}:
    kernel: [ 154.522752] __mutex_lock+0x87/0x950
    kernel: [ 154.522756] new_dev_store+0xc9/0x210 [md_mod]
    kernel: [ 154.522759] md_attr_store+0x7a/0xc0 [md_mod]
    kernel: [ 154.522761] kernfs_fop_write+0x117/0x1b0
    kernel: [ 154.522763] vfs_write+0xad/0x1a0
    kernel: [ 154.522765] ksys_write+0xa4/0xe0
    kernel: [ 154.522767] do_syscall_64+0x64/0x2b0
    kernel: [ 154.522769] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    kernel: [ 154.522770]
    kernel: [ 154.522770] -> #2 (kn->count#253){++++}:
    kernel: [ 154.522775] __kernfs_remove+0x253/0x2c0
    kernel: [ 154.522778] kernfs_remove+0x1f/0x30
    kernel: [ 154.522780] kobject_del+0x28/0x60
    kernel: [ 154.522783] mddev_delayed_delete+0x24/0x30 [md_mod]
    kernel: [ 154.522786] process_one_work+0x2a7/0x5f0
    kernel: [ 154.522788] worker_thread+0x2d/0x3d0
    kernel: [ 154.522793] kthread+0x117/0x130
    kernel: [ 154.522795] ret_from_fork+0x3a/0x50
    kernel: [ 154.522796]
    kernel: [ 154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}:
    kernel: [ 154.522800] process_one_work+0x27e/0x5f0
    kernel: [ 154.522802] worker_thread+0x2d/0x3d0
    kernel: [ 154.522804] kthread+0x117/0x130
    kernel: [ 154.522806] ret_from_fork+0x3a/0x50
    kernel: [ 154.522807]
    kernel: [ 154.522807] -> #0 ((wq_completion)md_misc){+.+.}:
    kernel: [ 154.522813] __lock_acquire+0x1392/0x1690
    kernel: [ 154.522816] lock_acquire+0xb4/0x1a0
    kernel: [ 154.522818] flush_workqueue+0xab/0x4b0
    kernel: [ 154.522821] md_open+0xb6/0xc0 [md_mod]
    kernel: [ 154.522823] __blkdev_get+0xea/0x590
    kernel: [ 154.522825] blkdev_get+0x65/0x140
    kernel: [ 154.522828] do_dentry_open+0x1d1/0x380
    kernel: [ 154.522831] path_openat+0x567/0xcc0
    kernel: [ 154.522834] do_filp_open+0x9b/0x110
    kernel: [ 154.522836] do_sys_openat2+0x201/0x2a0
    kernel: [ 154.522838] do_sys_open+0x57/0x80
    kernel: [ 154.522840] do_syscall_64+0x64/0x2b0
    kernel: [ 154.522842] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    kernel: [ 154.522844]
    kernel: [ 154.522844] other info that might help us debug this:
    kernel: [ 154.522844]
    kernel: [ 154.522846] Chain exists of:
    kernel: [ 154.522846] (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex
    kernel: [ 154.522846]
    kernel: [ 154.522850] Possible unsafe locking scenario:
    kernel: [ 154.522850]
    kernel: [ 154.522852] CPU0 CPU1
    kernel: [ 154.522853] ---- ----
    kernel: [ 154.522854] lock(&bdev->bd_mutex);
    kernel: [ 154.522856] lock(&mddev->reconfig_mutex);
    kernel: [ 154.522858] lock(&bdev->bd_mutex);
    kernel: [ 154.522860] lock((wq_completion)md_misc);
    kernel: [ 154.522861]
    kernel: [ 154.522861] *** DEADLOCK ***
    kernel: [ 154.522861]
    kernel: [ 154.522864] 1 lock held by mdadm/2482:
    kernel: [ 154.522865] #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
    kernel: [ 154.522868]
    kernel: [ 154.522868] stack backtrace:
    kernel: [ 154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G O 5.6.0-rc7-lp151.27-default #25
    kernel: [ 154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
    kernel: [ 154.522878] Call Trace:
    kernel: [ 154.522881] dump_stack+0x8f/0xcb
    kernel: [ 154.522884] check_noncircular+0x194/0x1b0
    kernel: [ 154.522888] ? __lock_acquire+0x1392/0x1690
    kernel: [ 154.522890] __lock_acquire+0x1392/0x1690
    kernel: [ 154.522893] lock_acquire+0xb4/0x1a0
    kernel: [ 154.522895] ? flush_workqueue+0x84/0x4b0
    kernel: [ 154.522898] flush_workqueue+0xab/0x4b0
    kernel: [ 154.522900] ? flush_workqueue+0x84/0x4b0
    kernel: [ 154.522905] ? md_open+0xb6/0xc0 [md_mod]
    kernel: [ 154.522908] md_open+0xb6/0xc0 [md_mod]
    kernel: [ 154.522910] __blkdev_get+0xea/0x590
    kernel: [ 154.522912] ? bd_acquire+0xc0/0xc0
    kernel: [ 154.522914] blkdev_get+0x65/0x140
    kernel: [ 154.522916] ? bd_acquire+0xc0/0xc0
    kernel: [ 154.522918] do_dentry_open+0x1d1/0x380
    kernel: [ 154.522921] path_openat+0x567/0xcc0
    kernel: [ 154.522923] ? __lock_acquire+0x380/0x1690
    kernel: [ 154.522926] do_filp_open+0x9b/0x110
    kernel: [ 154.522929] ? __alloc_fd+0xe5/0x1f0
    kernel: [ 154.522935] ? kmem_cache_alloc+0x28c/0x630
    kernel: [ 154.522939] ? do_sys_openat2+0x201/0x2a0
    kernel: [ 154.522941] do_sys_openat2+0x201/0x2a0
    kernel: [ 154.522944] do_sys_open+0x57/0x80
    kernel: [ 154.522946] do_syscall_64+0x64/0x2b0
    kernel: [ 154.522948] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    kernel: [ 154.522951] RIP: 0033:0x7f98d279d9ae

md_alloc also flushes the same workqueue, but the situation is
different there: none of the paths that call md_alloc hold
bdev->bd_mutex, and the flush is necessary to avoid a race condition,
so leave it as it is.

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Song Liu
    Signed-off-by: Sasha Levin

    Guoqing Jiang
     

06 May, 2020

3 commits

  • commit 5686dee34dbfe0238c0274e0454fa0174ac0a57a upstream.

    When adding devices that don't have a scsi_dh on a BIO based multipath,
    I was able to consistently hit the warning below and lock-up the system.

The problem is that __map_bio reads the flag before it is potentially
modified by choose_pgpath, and ends up using the older value.

    The WARN_ON below is not trivially linked to the issue. It goes like
    this: The activate_path delayed_work is not initialized for non-scsi_dh
    devices, but we always set MPATHF_QUEUE_IO, asking for initialization.
    That is fine, since MPATHF_QUEUE_IO would be cleared in choose_pgpath.
    Nevertheless, only for BIO-based mpath, we cache the flag before calling
    choose_pgpath, and use the older version when deciding if we should
    initialize the path. Therefore, we end up trying to initialize the
    paths, and calling the non-initialized activate_path work.

    [ 82.437100] ------------[ cut here ]------------
    [ 82.437659] WARNING: CPU: 3 PID: 602 at kernel/workqueue.c:1624
    __queue_delayed_work+0x71/0x90
    [ 82.438436] Modules linked in:
    [ 82.438911] CPU: 3 PID: 602 Comm: systemd-udevd Not tainted 5.6.0-rc6+ #339
    [ 82.439680] RIP: 0010:__queue_delayed_work+0x71/0x90
    [ 82.440287] Code: c1 48 89 4a 50 81 ff 00 02 00 00 75 2a 4c 89 cf e9
    94 d6 07 00 e9 7f e9 ff ff 0f 0b eb c7 0f 0b 48 81 7a 58 40 74 a8 94 74
    a7 0b 48 83 7a 48 00 74 a5 0f 0b eb a1 89 fe 4c 89 cf e9 c8 c4 07
    [ 82.441719] RSP: 0018:ffffb738803977c0 EFLAGS: 00010007
    [ 82.442121] RAX: ffffa086389f9740 RBX: 0000000000000002 RCX: 0000000000000000
    [ 82.442718] RDX: ffffa086350dd930 RSI: ffffa0863d76f600 RDI: 0000000000000200
    [ 82.443484] RBP: 0000000000000200 R08: 0000000000000000 R09: ffffa086350dd970
    [ 82.444128] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa086350dd930
    [ 82.444773] R13: ffffa0863d76f600 R14: 0000000000000000 R15: ffffa08636738008
    [ 82.445427] FS: 00007f6abfe9dd40(0000) GS:ffffa0863dd80000(0000) knlGS:00000
    [ 82.446040] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 82.446478] CR2: 0000557d288db4e8 CR3: 0000000078b36000 CR4: 00000000000006e0
    [ 82.447104] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 82.447561] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 82.448012] Call Trace:
    [ 82.448164] queue_delayed_work_on+0x6d/0x80
    [ 82.448472] __pg_init_all_paths+0x7b/0xf0
    [ 82.448714] pg_init_all_paths+0x26/0x40
    [ 82.448980] __multipath_map_bio.isra.0+0x84/0x210
    [ 82.449267] __map_bio+0x3c/0x1f0
    [ 82.449468] __split_and_process_non_flush+0x14a/0x1b0
    [ 82.449775] __split_and_process_bio+0xde/0x340
    [ 82.450045] ? dm_get_live_table+0x5/0xb0
    [ 82.450278] dm_process_bio+0x98/0x290
    [ 82.450518] dm_make_request+0x54/0x120
    [ 82.450778] generic_make_request+0xd2/0x3e0
    [ 82.451038] ? submit_bio+0x3c/0x150
    [ 82.451278] submit_bio+0x3c/0x150
    [ 82.451492] mpage_readpages+0x129/0x160
    [ 82.451756] ? bdev_evict_inode+0x1d0/0x1d0
    [ 82.452033] read_pages+0x72/0x170
    [ 82.452260] __do_page_cache_readahead+0x1ba/0x1d0
    [ 82.452624] force_page_cache_readahead+0x96/0x110
    [ 82.452903] generic_file_read_iter+0x84f/0xae0
    [ 82.453192] ? __seccomp_filter+0x7c/0x670
    [ 82.453547] new_sync_read+0x10e/0x190
    [ 82.453883] vfs_read+0x9d/0x150
    [ 82.454172] ksys_read+0x65/0xe0
    [ 82.454466] do_syscall_64+0x4e/0x210
    [ 82.454828] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [...]
    [ 82.462501] ---[ end trace bb39975e9cf45daa ]---

    Cc: stable@vger.kernel.org
    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Gabriel Krisman Bertazi
     
  • commit 31b22120194b5c0d460f59e0c98504de1d3f1f14 upstream.

    The dm-writecache reads metadata in the target constructor. However, when
    we reload the target, there could be another active instance running on
    the same device. This is the sequence of operations when doing a reload:

    1. construct new target
    2. suspend old target
    3. resume new target
    4. destroy old target

    Metadata that were written by the old target between steps 1 and 2 would
    not be visible by the new target.

    Fix the data corruption by loading the metadata in the resume handler.

Also, validate that block_size is at least as large as both devices'
logical block size, and only read 1 block of metadata in the target
constructor -- there is no need to read the entire metadata there now
that it is done during resume.

    Fixes: 48debafe4f2f ("dm: add writecache target")
    Cc: stable@vger.kernel.org # v4.18+
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit ad4e80a639fc61d5ecebb03caa5cdbfb91fcebfc upstream.

    The error correction data is computed as if data and hash blocks
    were concatenated. But hash block number starts from v->hash_start.
    So, we have to calculate hash block number based on that.

    Fixes: a739ff3f543af ("dm verity: add support for forward error correction")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sunwook Eom
    Reviewed-by: Sami Tolvanen
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Sunwook Eom
     

17 Apr, 2020

10 commits

  • [ Upstream commit 9fc06ff56845cc5ccafec52f545fc2e08d22f849 ]

    Add missing casts when converting from regions to sectors.

    In case BITS_PER_LONG == 32, the lack of the appropriate casts can lead
    to overflows and miscalculation of the device sector.

    As a result, we could end up discarding and/or copying the wrong parts
    of the device, thus corrupting the device's data.

    Fixes: 7431b7835f55 ("dm: add clone target")
    Cc: stable@vger.kernel.org # v5.4+
    Signed-off-by: Nikos Tsironis
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Nikos Tsironis
     
  • [ Upstream commit 4b5142905d4ff58a4b93f7c8eaa7ba829c0a53c9 ]

    There is a bug in the way dm-clone handles discards, which can lead to
    discarding the wrong blocks or trying to discard blocks beyond the end
    of the device.

This could lead to data corruption if the destination device indeed
discards the underlying blocks, i.e., if the discard operation results
in the original contents of a block being lost.

    The root of the problem is the code that calculates the range of regions
    covered by a discard request and decides which regions to discard.

    Since dm-clone handles the device in units of regions, we don't discard
    parts of a region, only whole regions.

    The range is calculated as:

    rs = dm_sector_div_up(bio->bi_iter.bi_sector, clone->region_size);
    re = bio_end_sector(bio) >> clone->region_shift;

    , where 'rs' is the first region to discard and (re - rs) is the number
    of regions to discard.

    The bug manifests when we try to discard part of a single region, i.e.,
    when we try to discard a block with size < region_size, and the discard
    request both starts at an offset with respect to the beginning of that
    region and ends before the end of the region.

    The root cause is the following comparison:

    if (rs == re)
    // skip discard and complete original bio immediately

    , which doesn't take into account that 'rs' might be greater than 're'.

    Thus, we then issue a discard request for the wrong blocks, instead of
    skipping the discard all together.

    Fix the check to also take into account the above case, so we don't end
    up discarding the wrong blocks.

    Also, add some range checks to dm_clone_set_region_hydrated() and
    dm_clone_cond_set_range(), which update dm-clone's region bitmap.

    Note that the aforementioned bug doesn't cause invalid memory accesses,
    because dm_clone_is_range_hydrated() returns True for this case, so the
    checks are just precautionary.

    Fixes: 7431b7835f55 ("dm: add clone target")
    Cc: stable@vger.kernel.org # v5.4+
    Signed-off-by: Nikos Tsironis
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Nikos Tsironis
     
  • [ Upstream commit 6ca43ed8376a51afec790dd484a51804ade4352a ]

    If we are in a place where it is known that interrupts are enabled,
    functions spin_lock_irq/spin_unlock_irq should be used instead of
    spin_lock_irqsave/spin_unlock_irqrestore.

    spin_lock_irq and spin_unlock_irq are faster because they don't need to
    push and pop the flags register.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Nikos Tsironis
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Mikulas Patocka
     
  • [ Upstream commit b8fdd090376a7a46d17db316638fe54b965c2fb0 ]

zmd->nr_rnd_zones was increased twice by mistake. The increase in
dmz_init_zone() is the only one needed:

1131         zmd->nr_useable_zones++;
1132         if (dmz_is_rnd(zone)) {
1133                 zmd->nr_rnd_zones++;
                     ^^^
    Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
    Cc: stable@vger.kernel.org
    Signed-off-by: Bob Liu
    Reviewed-by: Damien Le Moal
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Bob Liu
     
  • commit 81d5553d1288c2ec0390f02f84d71ca0f0f9f137 upstream.

    dm_clone_nr_of_hydrated_regions() returns the number of regions that
    have been hydrated so far. In order to do so it employs bitmap_weight().

    Until now, the return type of dm_clone_nr_of_hydrated_regions() was
    unsigned long.

    Because bitmap_weight() returns an int, in case BITS_PER_LONG == 64 and
    the return value of bitmap_weight() is 2^31 (the maximum allowed number
    of regions for a device), the result is sign extended from 32 bits to 64
    bits and an incorrect value is displayed, in the status output of
    dm-clone, as the number of hydrated regions.

    Fix this by having dm_clone_nr_of_hydrated_regions() return an unsigned
    int.

    Fixes: 7431b7835f55 ("dm: add clone target")
    Cc: stable@vger.kernel.org # v5.4+
    Signed-off-by: Nikos Tsironis
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Nikos Tsironis
     
  • commit cd481c12269b4d276f1a52eda0ebd419079bfe3a upstream.

    Add overflow check for clone->nr_regions variable, which holds the
    number of regions of the target.

    The overflow can occur with sufficiently large devices, if BITS_PER_LONG
    == 32. E.g., if the region size is 8 sectors (4K), the overflow would
    occur for device sizes > 34359738360 sectors (~16TB).

    This could result in multiple device sectors wrongly mapping to the same
    region number, due to the truncation from 64 bits to 32 bits, which
    would lead to data corruption.

    Fixes: 7431b7835f55 ("dm: add clone target")
    Cc: stable@vger.kernel.org # v5.4+
    Signed-off-by: Nikos Tsironis
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Nikos Tsironis
     
  • commit 75fa601934fda23d2f15bf44b09c2401942d8e15 upstream.

Fix the kmemleak below, detected in verity_fec_ctr: output_pool is
allocated for each dm-verity-fec device but is not freed when the
dm table for the verity target is removed. Hence free the output
mempool in the destructor function verity_fec_dtr.

    unreferenced object 0xffffffffa574d000 (size 4096):
    comm "init", pid 1667, jiffies 4294894890 (age 307.168s)
    hex dump (first 32 bytes):
    8e 36 00 98 66 a8 0b 9b 00 00 00 00 00 00 00 00 .6..f...........
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] __kmalloc+0x2b4/0x340
    [] mempool_kmalloc+0x18/0x20
    [] mempool_init_node+0x98/0x118
    [] mempool_init+0x14/0x20
    [] verity_fec_ctr+0x388/0x3b0
    [] verity_ctr+0x87c/0x8d0
    [] dm_table_add_target+0x174/0x348
    [] table_load+0xe4/0x328
    [] dm_ctl_ioctl+0x3b4/0x5a0
    [] do_vfs_ioctl+0x5dc/0x928
    [] __arm64_sys_ioctl+0x70/0x98
    [] el0_svc_common+0xa0/0x158
    [] el0_svc_handler+0x6c/0x88
    [] el0_svc+0x8/0xc
    [] 0xffffffffffffffff

    Fixes: a739ff3f543af ("dm verity: add support for forward error correction")
    Depends-on: 6f1c819c219f7 ("dm: convert to bioset_init()/mempool_init()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Harshini Shetty
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Shetty, Harshini X (EXT-Sony Mobile)
     
  • commit b93b6643e9b5a7f260b931e97f56ffa3fa65e26d upstream.

    If the user specifies tag size larger than HASH_MAX_DIGESTSIZE,
    there's a crash in integrity_metadata().

    Cc: stable@vger.kernel.org
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit 1edaa447d958bec24c6a79685a5790d98976fd16 upstream.

    Initializing a dm-writecache device can take a long time when the
    persistent memory device is large. Add cond_resched() to a few loops
    to avoid warnings that the CPU is stuck.

    Cc: stable@vger.kernel.org # v4.18+
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • [ Upstream commit 6b40bec3b13278d21fa6c1ae7a0bdf2e550eed5f ]

Don't call quiesce(1) and quiesce(0) if the array is already suspended;
otherwise, in level_store, the array is writable after mddev_detach in
the sequence below, though the intention is that it only become
writable after resume.

    mddev_suspend(mddev);
    mddev_detach(mddev);
    ...
    mddev_resume(mddev);

    And it also causes calltrace as follows in [1].

    [48005.653834] WARNING: CPU: 1 PID: 45380 at kernel/kthread.c:510 kthread_park+0x77/0x90
    [...]
    [48005.653976] CPU: 1 PID: 45380 Comm: mdadm Tainted: G OE 5.4.10-arch1-1 #1
    [48005.653979] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4105-ITX, BIOS P1.40 08/06/2018
    [48005.653984] RIP: 0010:kthread_park+0x77/0x90
    [48005.654015] Call Trace:
    [48005.654039] r5l_quiesce+0x3c/0x70 [raid456]
    [48005.654052] raid5_quiesce+0x228/0x2e0 [raid456]
    [48005.654073] mddev_detach+0x30/0x70 [md_mod]
    [48005.654090] level_store+0x202/0x670 [md_mod]
    [48005.654099] ? security_capable+0x40/0x60
    [48005.654114] md_attr_store+0x7b/0xc0 [md_mod]
    [48005.654123] kernfs_fop_write+0xce/0x1b0
    [48005.654132] vfs_write+0xb6/0x1a0
    [48005.654138] ksys_write+0x67/0xe0
    [48005.654146] do_syscall_64+0x4e/0x140
    [48005.654155] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [48005.654161] RIP: 0033:0x7fa0c8737497

    [1]: https://bugzilla.kernel.org/show_bug.cgi?id=206161

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Song Liu
    Signed-off-by: Sasha Levin

    Guoqing Jiang
     

08 Apr, 2020

1 commit

  • commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 upstream.

    This reverts commit effd58c95f277744f75d6e08819ac859dbcbd351.

    blk_queue_split() is causing excessive IO splitting -- because
    blk_max_size_offset() depends on 'chunk_sectors' limit being set and
    if it isn't (as is the case for DM targets!) it falls back to
    splitting on a 'max_sectors' boundary regardless of offset.

    "Fix" this by reverting back to _not_ using blk_queue_split() in
    dm_process_bio() for normal IO (reads and writes). Long-term fix is
    still TBD but it should focus on training blk_max_size_offset() to
    call into a DM provided hook (to call DM's max_io_len()).

    Test results from simple misaligned IO test on 4-way dm-striped device
    with chunksize of 128K and stripesize of 512K:

    xfs_io -d -c 'pread -b 2m 224s 4072s' /dev/mapper/stripe_dev

    before this revert:

    253,0 21 1 0.000000000 2206 Q R 224 + 4072 [xfs_io]
    253,0 21 2 0.000008267 2206 X R 224 / 480 [xfs_io]
    253,0 21 3 0.000010530 2206 X R 224 / 256 [xfs_io]
    253,0 21 4 0.000027022 2206 X R 480 / 736 [xfs_io]
    253,0 21 5 0.000028751 2206 X R 480 / 512 [xfs_io]
    253,0 21 6 0.000033323 2206 X R 736 / 992 [xfs_io]
    253,0 21 7 0.000035130 2206 X R 736 / 768 [xfs_io]
    253,0 21 8 0.000039146 2206 X R 992 / 1248 [xfs_io]
    253,0 21 9 0.000040734 2206 X R 992 / 1024 [xfs_io]
    253,0 21 10 0.000044694 2206 X R 1248 / 1504 [xfs_io]
    253,0 21 11 0.000046422 2206 X R 1248 / 1280 [xfs_io]
    253,0 21 12 0.000050376 2206 X R 1504 / 1760 [xfs_io]
    253,0 21 13 0.000051974 2206 X R 1504 / 1536 [xfs_io]
    253,0 21 14 0.000055881 2206 X R 1760 / 2016 [xfs_io]
    253,0 21 15 0.000057462 2206 X R 1760 / 1792 [xfs_io]
    253,0 21 16 0.000060999 2206 X R 2016 / 2272 [xfs_io]
    253,0 21 17 0.000062489 2206 X R 2016 / 2048 [xfs_io]
    253,0 21 18 0.000066133 2206 X R 2272 / 2528 [xfs_io]
    253,0 21 19 0.000067507 2206 X R 2272 / 2304 [xfs_io]
    253,0 21 20 0.000071136 2206 X R 2528 / 2784 [xfs_io]
    253,0 21 21 0.000072764 2206 X R 2528 / 2560 [xfs_io]
    253,0 21 22 0.000076185 2206 X R 2784 / 3040 [xfs_io]
    253,0 21 23 0.000077486 2206 X R 2784 / 2816 [xfs_io]
    253,0 21 24 0.000080885 2206 X R 3040 / 3296 [xfs_io]
    253,0 21 25 0.000082316 2206 X R 3040 / 3072 [xfs_io]
    253,0 21 26 0.000085788 2206 X R 3296 / 3552 [xfs_io]
    253,0 21 27 0.000087096 2206 X R 3296 / 3328 [xfs_io]
    253,0 21 28 0.000093469 2206 X R 3552 / 3808 [xfs_io]
    253,0 21 29 0.000095186 2206 X R 3552 / 3584 [xfs_io]
    253,0 21 30 0.000099228 2206 X R 3808 / 4064 [xfs_io]
    253,0 21 31 0.000101062 2206 X R 3808 / 3840 [xfs_io]
    253,0 21 32 0.000104956 2206 X R 4064 / 4096 [xfs_io]
    253,0 21 33 0.001138823 0 C R 4096 + 200 [0]

    after this revert:

    253,0 18 1 0.000000000 4430 Q R 224 + 3896 [xfs_io]
    253,0 18 2 0.000018359 4430 X R 224 / 256 [xfs_io]
    253,0 18 3 0.000028898 4430 X R 256 / 512 [xfs_io]
    253,0 18 4 0.000033535 4430 X R 512 / 768 [xfs_io]
    253,0 18 5 0.000065684 4430 X R 768 / 1024 [xfs_io]
    253,0 18 6 0.000091695 4430 X R 1024 / 1280 [xfs_io]
    253,0 18 7 0.000098494 4430 X R 1280 / 1536 [xfs_io]
    253,0 18 8 0.000114069 4430 X R 1536 / 1792 [xfs_io]
    253,0 18 9 0.000129483 4430 X R 1792 / 2048 [xfs_io]
    253,0 18 10 0.000136759 4430 X R 2048 / 2304 [xfs_io]
    253,0 18 11 0.000152412 4430 X R 2304 / 2560 [xfs_io]
    253,0 18 12 0.000160758 4430 X R 2560 / 2816 [xfs_io]
    253,0 18 13 0.000183385 4430 X R 2816 / 3072 [xfs_io]
    253,0 18 14 0.000190797 4430 X R 3072 / 3328 [xfs_io]
    253,0 18 15 0.000197667 4430 X R 3328 / 3584 [xfs_io]
    253,0 18 16 0.000218751 4430 X R 3584 / 3840 [xfs_io]
    253,0 18 17 0.000226005 4430 X R 3840 / 4096 [xfs_io]
    253,0 18 18 0.000250404 4430 Q R 4120 + 176 [xfs_io]
    253,0 18 19 0.000847708 0 C R 4096 + 24 [0]
    253,0 18 20 0.000855783 0 C R 4120 + 176 [0]

    Fixes: effd58c95f27774 ("dm: always call blk_queue_split() in dm_process_bio()")
    Cc: stable@vger.kernel.org
    Reported-by: Andreas Gruenbacher
    Tested-by: Barry Marson
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

25 Mar, 2020

2 commits

  • [ Upstream commit 248aa2645aa7fc9175d1107c2593cc90d4af5a4e ]

    In cases where dec_in_flight() has to requeue the integrity_bio_wait
    work to transfer the rest of the data, the bio's __bi_remaining might
    already have been decremented to 0, e.g.: if bio passed to underlying
    data device was split via blk_queue_split().

    Use dm_bio_{record,restore} rather than effectively open-coding them in
    dm-integrity -- these methods now manage __bi_remaining too.

    Depends-on: f7f0b057a9c1 ("dm bio record: save/restore bi_end_io and bi_integrity")
    Reported-by: Daniel Glöckner
    Suggested-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Mike Snitzer
     
  • [ Upstream commit 1b17159e52bb31f982f82a6278acd7fab1d3f67b ]

    Also, save/restore __bi_remaining in case the bio was used in a
    BIO_CHAIN (e.g. due to blk_queue_split).

    Suggested-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Mike Snitzer
     

12 Mar, 2020

9 commits

  • commit 974f51e8633f0f3f33e8f86bbb5ae66758aa63c7 upstream.

    We neither assign congested_fn for request-based blk-mq devices nor
    implement it correctly. So fix both.

    Also, remove incorrect comment from dm_init_normal_md_queue and rename
    it to dm_init_congested_fn.

    Fixes: 4aa9c692e052 ("bdi: separate out congested state into a separate struct")
    Cc: stable@vger.kernel.org
    Signed-off-by: Hou Tao
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Hou Tao
     
  • commit ee63634bae02e13c8c0df1209a6a0ca5326f3189 upstream.

    Dm-zoned initializes the reference counters of new chunk works to zero,
    and refcount_inc() is called to increment each counter. However,
    refcount_inc() treats addition to zero as an error and triggers the
    following warning:

    refcount_t: addition on 0; use-after-free.
    WARNING: CPU: 7 PID: 1506 at lib/refcount.c:25 refcount_warn_saturate+0x68/0xf0
    ...
    CPU: 7 PID: 1506 Comm: systemd-udevd Not tainted 5.4.0+ #134
    ...
    Call Trace:
    dmz_map+0x2d2/0x350 [dm_zoned]
    __map_bio+0x42/0x1a0
    __split_and_process_non_flush+0x14a/0x1b0
    __split_and_process_bio+0x83/0x240
    ? kmem_cache_alloc+0x165/0x220
    dm_process_bio+0x90/0x230
    ? generic_make_request_checks+0x2e7/0x680
    dm_make_request+0x3e/0xb0
    generic_make_request+0xcf/0x320
    ? memcg_drain_all_list_lrus+0x1c0/0x1c0
    submit_bio+0x3c/0x160
    ? guard_bio_eod+0x2c/0x130
    mpage_readpages+0x182/0x1d0
    ? bdev_evict_inode+0xf0/0xf0
    read_pages+0x6b/0x1b0
    __do_page_cache_readahead+0x1ba/0x1d0
    force_page_cache_readahead+0x93/0x100
    generic_file_read_iter+0x83a/0xe40
    ? __seccomp_filter+0x7b/0x670
    new_sync_read+0x12a/0x1c0
    vfs_read+0x9d/0x150
    ksys_read+0x5f/0xe0
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    ...

    After this warning, subsequent refcount API calls on the counter all
    fail to change its value.

    Fix this by initializing the reference counter of new chunk works to
    one instead of zero, and by not calling refcount_inc() via
    dmz_get_chunk_work() for those newly created works.

    The failure was observed with linux version 5.4 with CONFIG_REFCOUNT_FULL
    enabled. Refcount rework was merged to linux version 5.5 by the
    commit 168829ad09ca ("Merge branch 'locking-core-for-linus' of
    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip"). After this
    commit, CONFIG_REFCOUNT_FULL was removed and the failure was observed
    regardless of kernel configuration.

    Linux version 4.20 merged the commit 092b5648760a ("dm zoned: target: use
    refcount_t for dm zoned reference counters"). Before this commit, dm
    zoned used atomic_t APIs which does not check addition to zero, then this
    fix is not necessary.

    Fixes: 092b5648760a ("dm zoned: target: use refcount_t for dm zoned reference counters")
    Cc: stable@vger.kernel.org # 5.4+
    Signed-off-by: Shin'ichiro Kawasaki
    Reviewed-by: Damien Le Moal
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Shin'ichiro Kawasaki
     
  • commit 41c526c5af46d4c4dab7f72c99000b7fac0b9702 upstream.

    Verify the watermark upon resume - so that if the target is reloaded
    with lower watermark, it will start the cleanup process immediately.

    Fixes: 48debafe4f2f ("dm: add writecache target")
    Cc: stable@vger.kernel.org # 4.18+
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit adc0daad366b62ca1bce3e2958a40b0b71a8b8b3 upstream.

    The function dm_suspended returns true if the target is suspended.
    However, when the target is being suspended during unload, it returns
    false.

    An example where this is a problem: the test "!dm_suspended(wc->ti)" in
    writecache_writeback is not sufficient, because dm_suspended returns
    zero while writecache_suspend is in progress. As is, without an
    enhanced dm_suspended, simply switching from flush_workqueue to
    drain_workqueue still emits warnings:
    workqueue writecache-writeback: drain_workqueue() isn't complete after 10 tries
    workqueue writecache-writeback: drain_workqueue() isn't complete after 100 tries
    workqueue writecache-writeback: drain_workqueue() isn't complete after 200 tries
    workqueue writecache-writeback: drain_workqueue() isn't complete after 300 tries
    workqueue writecache-writeback: drain_workqueue() isn't complete after 400 tries

    writecache_suspend calls flush_workqueue(wc->writeback_wq) - this function
    flushes the current work. However, the workqueue may re-queue itself and
    flush_workqueue doesn't wait for re-queued works to finish. Because of
    this - the function writecache_writeback continues execution after the
    device was suspended and then concurrently with writecache_dtr, causing
    a crash in writecache_writeback.

    We must use drain_workqueue - that waits until the work and all re-queued
    works finish.

    As a prereq for switching to drain_workqueue, this commit fixes
    dm_suspended to return true after the presuspend hook and before the
    postsuspend hook - just like during a normal suspend. It allows
    simplifying the dm-integrity and dm-writecache targets so that they
    don't have to maintain suspended flags on their own.

    With this change, drain_workqueue() can be used effectively. This
    change was tested with the lvm2 and cryptsetup testsuites and there
    are no regressions.

    Fixes: 48debafe4f2f ("dm: add writecache target")
    Cc: stable@vger.kernel.org # 4.18+
    Reported-by: Corey Marthaler
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit 7cdf6a0aae1cccf5167f3f04ecddcf648b78e289 upstream.

    The crash can be reproduced by running the lvm2 testsuite test
    lvconvert-thin-external-cache.sh for several minutes, e.g.:
    while :; do make check T=shell/lvconvert-thin-external-cache.sh; done

    The crash happens in this call chain:
    do_waker -> policy_tick -> smq_tick -> end_hotspot_period -> clear_bitset
    -> memset -> __memset -- which accesses an invalid pointer in the vmalloc
    area.

    The work entry on the workqueue is executed even after the bitmap was
    freed. The problem is that cancel_delayed_work doesn't wait for the
    running work item to finish, so the work item can continue running and
    re-submitting itself even after cache_postsuspend. In order to make sure
    that the work item won't be running, we must use cancel_delayed_work_sync.

    Also, change flush_workqueue to drain_workqueue, so that if some work item
    submits itself or another work item, we are properly waiting for both of
    them.

    Fixes: c6b4fcbad044 ("dm: add cache target")
    Cc: stable@vger.kernel.org # v3.9
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit 7fc2e47f40dd77ab1fcbda6db89614a0173d89c7 upstream.

    If the flag SB_FLAG_RECALCULATE is present in the superblock but was
    not specified on the command line (i.e. ic->recalculate_flag is false),
    dm-integrity would return an invalid table line - the reported number
    of arguments would not match the real number.

    Fixes: 468dfca38b1a ("dm integrity: add a bitmap mode")
    Cc: stable@vger.kernel.org # v5.2+
    Reported-by: Ondrej Kozina
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit 53770f0ec5fd417429775ba006bc4abe14002335 upstream.

    If we need to perform synchronous I/O in dm_integrity_map_continue(),
    we must make sure that we are not in the map function - in order to
    avoid the deadlock due to bio queuing in generic_make_request. To
    avoid the deadlock, we offload the request to metadata_wq.

    However, metadata_wq also processes metadata updates for write requests.
    If there are too many requests that get offloaded to metadata_wq at the
    beginning of dm_integrity_map_continue, the workqueue metadata_wq
    becomes clogged and the system is incapable of processing any metadata
    updates.

    This causes a deadlock because all the requests that need to do metadata
    updates wait for metadata_wq to proceed and metadata_wq waits inside
    wait_and_add_new_range until some existing request releases its range
    lock (which doesn't happen because the range lock is released after
    metadata update).

    In order to fix the deadlock, we create a new workqueue offload_wq and
    offload requests to it - so that processing of offload_wq is independent
    from processing of metadata_wq.

    Fixes: 7eada909bfd7 ("dm: add integrity target")
    Cc: stable@vger.kernel.org # v4.12+
    Reported-by: Heinz Mauelshagen
    Tested-by: Heinz Mauelshagen
    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit d5bdf66108419cdb39da361b58ded661c29ff66e upstream.

    If we resume a device in bitmap mode and the on-disk format is in journal
    mode, we must recalculate anything above ic->sb->recalc_sector. Otherwise,
    there would be non-recalculated blocks which would cause I/O errors.

    Fixes: 468dfca38b1a ("dm integrity: add a bitmap mode")
    Cc: stable@vger.kernel.org # v5.2+
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • [ Upstream commit 3918e0667bbac99400b44fa5aef3f8be2eeada4a ]

    [ 3934.173244] ======================================================
    [ 3934.179572] WARNING: possible circular locking dependency detected
    [ 3934.185884] 5.4.21-xfstests #1 Not tainted
    [ 3934.190151] ------------------------------------------------------
    [ 3934.196673] dmsetup/8897 is trying to acquire lock:
    [ 3934.201688] ffffffffbce82b18 (shrinker_rwsem){++++}, at: unregister_shrinker+0x22/0x80
    [ 3934.210268]
    but task is already holding lock:
    [ 3934.216489] ffff92a10cc5e1d0 (&pmd->root_lock){++++}, at: dm_pool_metadata_close+0xba/0x120
    [ 3934.225083]
    which lock already depends on the new lock.

    [ 3934.564165] Chain exists of:
    shrinker_rwsem --> &journal->j_checkpoint_mutex --> &pmd->root_lock

    For a more detailed lockdep report, please see:

    https://lore.kernel.org/r/20200220234519.GA620489@mit.edu

    We shouldn't need to hold the lock while we are just tearing down and
    freeing the whole metadata pool structure.

    Fixes: 44d8ebf436399a4 ("dm thin metadata: use pool locking at end of dm_pool_metadata_close")
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Theodore Ts'o
     

24 Feb, 2020

1 commit

  • [ Upstream commit 29cda393bcaad160c4bf3676ddd99855adafc72f ]

    The patch "bcache: rework error unwinding in register_bcache" from
    Christoph Hellwig leaves the local variables 'path' and 'err' in an
    undefined initial state. If the code in register_bcache() jumps to the
    label 'out:' or 'out_module_put:' via goto, these two variables might
    be referenced with undefined values by the following line,

    out_module_put:
    module_put(THIS_MODULE);
    out:
    pr_info("error %s: %s", path, err);
    return ret;

    Therefore this patch initializes these two local variables properly in
    register_bcache() to avoid such an issue.

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Coly Li