18 Oct, 2018

4 commits

  • commit 118aa47c7072bce05fc39bd40a1c0a90caed72ab upstream.

    The dm-linear target is independent of the dm-zoned target. For code
    requiring support for zoned block devices, use CONFIG_BLK_DEV_ZONED
    instead of CONFIG_DM_ZONED.

    While at it, similarly to dm linear, also enable the DM_TARGET_ZONED_HM
    feature in dm-flakey only if CONFIG_BLK_DEV_ZONED is defined.

    Fixes: beb9caac211c1 ("dm linear: eliminate linear_end_io call if CONFIG_DM_ZONED disabled")
    Fixes: 0be12c1c7fce7 ("dm linear: add support for zoned block devices")
    Cc: stable@vger.kernel.org
    Signed-off-by: Damien Le Moal
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

  • commit beb9caac211c1be1bc118bb62d5cf09c4107e6a5 upstream.

    It is best to avoid any extra overhead associated with bio completion.
    DM core will indirectly call a DM target's .end_io if it is defined.
    In the case of DM linear, there is no need to do so (for every bio that
    completes) if CONFIG_DM_ZONED is not enabled.

    Avoiding an extra indirect call for every bio completion is very
    important for ensuring DM linear doesn't incur more overhead that
    further widens the performance gap between dm-linear and raw block
    devices.

    Fixes: 0be12c1c7fce7 ("dm linear: add support for zoned block devices")
    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

  • commit 9864cd5dc54cade89fd4b0954c2e522841aa247c upstream.

    If dm-linear or dm-flakey are layered on top of a partition of a zoned
    block device, remapping of the start sector and write pointer position
    of the zones reported by a report zones BIO must be modified to account
    for the target table entry mapping (start offset within the device and
    entry mapping with the dm device). If the target's backing device is a
    partition of a whole disk, the start sector on the physical device of
    the partition must also be accounted for when modifying the zone
    information. However, dm_remap_zone_report() was not considering this
    last case, resulting in incorrect zone information remapping with
    targets using disk partitions.

    Fix this by calculating the target backing device start sector using
    the position of the completed report zones BIO and the unchanged
    position and size of the original report zone BIO. With this value
    calculated, the start sector and write pointer position of the target
    zones can be correctly remapped.

    Fixes: 10999307c14e ("dm: introduce dm_remap_zone_report()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Damien Le Moal
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

  • commit c7cd55504a5b0fc826a2cd9540845979d24ae542 upstream.

    Commit 7e6358d244e47 ("dm: fix various targets to dm_register_target
    after module __init resources created") inadvertently introduced this
    bug when it moved dm_register_target() after the call to KMEM_CACHE().

    Fixes: 7e6358d244e47 ("dm: fix various targets to dm_register_target after module __init resources created")
    Cc: stable@vger.kernel.org
    Signed-off-by: Shenghui Wang
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman


13 Oct, 2018

2 commits

  • commit 5d07384a666d4b2f781dc056bfeec2c27fbdf383 upstream.

    A reload of the cache's DM table is needed during resize because
    otherwise a crash will occur when attempting to access smq policy
    entries associated with the portion of the cache that was recently
    extended.

    The reason is that cache-size based data structures in the policy will not
    be resized; the only way to safely extend the cache is to allow for proper
    cache policy initialization, which occurs when the cache table is loaded.
    For example, the smq policy's space_init(), init_allocator() and
    calc_hotspot_params() must be sized based on the extended cache size.

    The fix for this is to disallow cache resizes of this pattern:
    1) suspend "cache" target's device
    2) resize the fast device used for the cache
    3) resume "cache" target's device

    Instead, the last step must be a full reload of the cache's DM table.

    Fixes: 66a636356 ("dm cache: add stochastic-multi-queue (smq) policy")
    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

  • commit 4561ffca88c546f96367f94b8f1e4715a9c62314 upstream.

    Commit fd2fa9541 ("dm cache metadata: save in-core policy_hint_size to
    on-disk superblock") enabled previously written policy hints to be
    used after a cache is reactivated. But in doing so the cache
    metadata's hint array was left exposed to out of bounds access because
    on resize the metadata's on-disk hint array wasn't ever extended.

    Fix this by ignoring that there are no on-disk hints associated with the
    newly added cache blocks. An expanded on-disk hint array is later
    rewritten upon the next clean shutdown of the cache.

    Fixes: fd2fa9541 ("dm cache metadata: save in-core policy_hint_size to on-disk superblock")
    Cc: stable@vger.kernel.org
    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman


10 Oct, 2018

5 commits

  • commit 013ad043906b2befd4a9bfb06219ed9fedd92716 upstream.

    sector_div() is only viable for use with sector_t.
    dm_block_t is typedef'd to uint64_t -- so use div_u64() instead.

    Fixes: 3ab918281 ("dm thin metadata: try to avoid ever aborting transactions")
    Signed-off-by: Mike Snitzer
    Cc: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

  • [ Upstream commit 3ab91828166895600efd9cdc3a0eb32001f7204a ]

    Committing a transaction can consume some metadata of its own, so we now
    reserve a small amount of metadata to cover this. Free metadata
    reported by the kernel will not include this reserve.

    If any of the reserve has been used after a commit we enter a new
    internal state PM_OUT_OF_METADATA_SPACE. This is reported as
    PM_READ_ONLY, so no userland changes are needed. If the metadata
    device is resized the pool will move back to PM_WRITE.

    These changes mean we never need to abort and roll back a transaction due
    to running out of metadata space. This is particularly important
    because there have been a handful of reports of data corruption against
    DM thin-provisioning that can all be attributed to the thin-pool having
    run out of metadata space.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

  • [ Upstream commit c44a5ee803d2b7ed8c2e6ce24a5c4dd60778886e ]

    Update superblock when particular devices are requested via rebuild
    (e.g. lvconvert --replace ...) to avoid spurious failure with the "New
    device injected into existing raid set without 'delta_disks' or
    'rebuild' parameter specified" error message.

    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

  • [ Upstream commit 1d0ffd264204eba1861865560f1f7f7a92919384 ]

    In raid10's reshape_request, max_sectors is obtained in read_balance. If the
    underlying disks have bad blocks, max_sectors can be less than last, so the
    code takes the goto read_more path many times. Each iteration calls
    raise_barrier(conf, sectors_done != 0), and in this condition sectors_done
    is not 0, so the value passed as the force argument of raise_barrier is
    true.

    When force is true, raise_barrier checks conf->barrier; if force is true
    and conf->barrier is 0, it panics. In this case reshape_request has
    submitted bios to the underlying disks, and the bio completion callback
    calls lower_barrier. If a bio finishes before raise_barrier is called
    again, it can trigger the BUG_ON.

    Add one pair of raise_barrier/lower_barrier to fix this bug.

    Signed-off-by: Xiao Ni
    Suggested-by: Neil Brown
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

  • [ Upstream commit e254de6bcf3f5b6e78a92ac95fb91acef8adfe1a ]

    We don't support reshape yet if an array supports a log device. Previously
    we determined this by checking ->log. However, ->log can be NULL after a
    log device is removed while the array is still marked as supporting a log
    device, so don't allow reshape in that case either. Users can disable log
    device support by setting 'consistency_policy' to 'resync' and then
    reshape.

    Reported-by: Xiao Ni
    Tested-by: Xiao Ni
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman


04 Oct, 2018

1 commit

  • [ Upstream commit 010228e4a932ca1e8365e3b58c8e1e44c16ff793 ]

    When one node leaves the cluster or stops resyncing
    (resync or recovery) the array, the other nodes need to
    call recover_bitmaps to continue the unfinished task.

    But we need to clear suspend_area only after the other
    nodes have copied the resync information to their bitmaps
    (by calling bitmap_copy_from_slot). Otherwise, all nodes
    could write to the suspend_area even though the suspend_area
    is not handled by any node, because area_resyncing
    returns 0 at the beginning of raid1_write_request.
    That means one node could write to the suspend_area while
    another node is resyncing the same area, leaving the data
    inconsistent.

    So clear suspend_area later, under the protection of the
    bm lock, to avoid the above issue. It is also straightforward
    to clear suspend_area after the nodes have copied the resync
    info to their bitmaps.

    Signed-off-by: Guoqing Jiang
    Reviewed-by: NeilBrown
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman


20 Sep, 2018

2 commits

  • [ Upstream commit af9313c32c0fa2a0ac3b113669273833d60cc9de ]

    More than one io_mode feature can be requested when creating a dm cache
    device (as is, the last one wins). The io_mode selections are incompatible
    with one another, so we should force them to be selected exclusively. Add
    a counter to check for more than one io_mode selection.

    Fixes: 629d0a8a1a10 ("dm cache metadata: add "metadata2" feature")
    Signed-off-by: John Pittman
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

  • [ Upstream commit d63e2fc804c46e50eee825c5d3a7228e07048b47 ]

    During raid5 replacement, the stripes can be marked with R5_NeedReplace
    flag. Data can be read from being-replaced devices and written to
    replacing spares without reading all other devices. (It's 'replace'
    mode. s.replacing = 1) If a being-replaced device is dropped, the
    replacement progress will be interrupted and resumed with pure recovery
    mode. However, existing stripes before being interrupted cannot read
    from the dropped device anymore. This prints lots of WARN_ON messages
    and results in data corruption, because existing stripes write
    problematic data to their replacement devices and update the progress.

    # Erase disks (1MB + 2GB)
    dd if=/dev/zero of=/dev/sda bs=1MB count=2049
    dd if=/dev/zero of=/dev/sdb bs=1MB count=2049
    dd if=/dev/zero of=/dev/sdc bs=1MB count=2049
    dd if=/dev/zero of=/dev/sdd bs=1MB count=2049
    mdadm -C /dev/md0 -amd -R -l5 -n3 -x0 /dev/sd[abc] -z 2097152
    # Ensure array stores non-zero data
    dd if=/root/data_4GB.iso of=/dev/md0 bs=1MB
    # Start replacement
    mdadm /dev/md0 -a /dev/sdd
    mdadm /dev/md0 --replace /dev/sda

    Then, hot-plug out /dev/sda during recovery and wait for recovery to finish.
    echo check > /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt # it will be greater than 0.

    Soon after you hot-plug out /dev/sda, you will see many WARN_ON
    messages. The replacement recovery will be interrupted shortly. After
    the recovery finishes, it will result in data corruption.

    Actually, this is just an unhandled case of replacement. Per commit
    ("md/raid5: fix interaction of 'replace' and 'recovery'."), if a
    NeedReplace device is not UPTODATE then that is an error; yet the code
    merely prints a WARN_ON and still marks these corrupted stripes with
    R5_WantReplace (i.e. treats them as ready for writes).

    To fix this case, we can leverage 'sync and replace' mode mentioned in
    commit (md/raid5: detect and handle replacements during
    recovery.). We can add logics to detect and use 'sync and replace' mode
    for these stripes.

    Reported-by: Alex Chen
    Reviewed-by: Alex Wu
    Reviewed-by: Chung-Chiang Cheng
    Signed-off-by: BingJing Chang
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman


15 Sep, 2018

1 commit

  • [ Upstream commit 784c9a29e99eb40b842c29ecf1cc3a79e00fb629 ]

    It was reported that softlockups occur when using dm-snapshot on top of
    slow (rbd) storage. E.g.:

    [ 4047.990647] watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [kworker/10:23:26177]
    ...
    [ 4048.034151] Workqueue: kcopyd do_work [dm_mod]
    [ 4048.034156] RIP: 0010:copy_callback+0x41/0x160 [dm_snapshot]
    ...
    [ 4048.034190] Call Trace:
    [ 4048.034196] ? __chunk_is_tracked+0x70/0x70 [dm_snapshot]
    [ 4048.034200] run_complete_job+0x5f/0xb0 [dm_mod]
    [ 4048.034205] process_jobs+0x91/0x220 [dm_mod]
    [ 4048.034210] ? kcopyd_put_pages+0x40/0x40 [dm_mod]
    [ 4048.034214] do_work+0x46/0xa0 [dm_mod]
    [ 4048.034219] process_one_work+0x171/0x370
    [ 4048.034221] worker_thread+0x1fc/0x3f0
    [ 4048.034224] kthread+0xf8/0x130
    [ 4048.034226] ? max_active_store+0x80/0x80
    [ 4048.034227] ? kthread_bind+0x10/0x10
    [ 4048.034231] ret_from_fork+0x35/0x40
    [ 4048.034233] Kernel panic - not syncing: softlockup: hung tasks

    Fix this by calling cond_resched() after run_complete_job()'s callout to
    the dm_kcopyd_notify_fn (which is dm-snap.c:copy_callback in the above
    trace).

    Signed-off-by: John Pittman
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman


10 Sep, 2018

6 commits

  • commit 3943b040f11ed0cc6d4585fd286a623ca8634547 upstream.

    The writeback thread would exit with a lock held when the cache device
    is detached via sysfs interface, fix it by releasing the held lock
    before exiting the while-loop.

    Fixes: fadd94e05c02 (bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set)
    Signed-off-by: Shan Hai
    Signed-off-by: Coly Li
    Tested-by: Shenghui Wang
    Cc: stable@vger.kernel.org #4.17+
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

  • commit bc9e9cf0401f18e33b78d4c8a518661b8346baf7 upstream.

    dm-crypt should only increase device limits; it should not decrease them.

    This fixes a bug where the user could create a crypt device with a
    1024-byte sector size on top of a SCSI device that had a 4096-byte logical
    block size. The 4096 limit would be lost and the user could incorrectly
    send 1024-byte I/Os to the crypt device.

    Cc: stable@vger.kernel.org
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

  • commit 5b1fe7bec8a8d0cc547a22e7ddc2bd59acd67de4 upstream.

    Quoting Documentation/device-mapper/cache.txt:

    The 'dirty' state for a cache block changes far too frequently for us
    to keep updating it on the fly. So we treat it as a hint. In normal
    operation it will be written when the dm device is suspended. If the
    system crashes all cache blocks will be assumed dirty when restarted.

    This got broken in commit f177940a8091 ("dm cache metadata: switch to
    using the new cursor api for loading metadata") in 4.9, which removed
    the code that consulted cmd->clean_when_opened (CLEAN_SHUTDOWN on-disk
    flag) when loading cache blocks. This results in data corruption on an
    unclean shutdown with dirty cache blocks on the fast device. After the
    crash those blocks are considered clean and may get evicted from the
    cache at any time. This can be demonstrated by doing a lot of reads
    to trigger individual evictions, but uncache is more predictable:

    ### Disable auto-activation in lvm.conf to be able to do uncache in
    ### time (i.e. see uncache doing flushing) when the fix is applied.

    # xfs_io -d -c 'pwrite -b 4M -S 0xaa 0 1G' /dev/vdb
    # vgcreate vg_cache /dev/vdb /dev/vdc
    # lvcreate -L 1G -n lv_slowdev vg_cache /dev/vdb
    # lvcreate -L 512M -n lv_cachedev vg_cache /dev/vdc
    # lvcreate -L 256M -n lv_metadev vg_cache /dev/vdc
    # lvconvert --type cache-pool --cachemode writeback vg_cache/lv_cachedev --poolmetadata vg_cache/lv_metadev
    # lvconvert --type cache vg_cache/lv_slowdev --cachepool vg_cache/lv_cachedev
    # xfs_io -d -c 'pwrite -b 4M -S 0xbb 0 512M' /dev/mapper/vg_cache-lv_slowdev
    # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
    0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    # dmsetup status vg_cache-lv_slowdev
    0 2097152 cache 8 27/65536 128 8192/8192 1 100 0 0 0 8192 7065 2 metadata2 writeback 2 migration_threshold 2048 smq 0 rw -
    ^^^^
    7065 * 64k = 441M yet to be written to the slow device
    # echo b >/proc/sysrq-trigger

    # vgchange -ay vg_cache
    # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
    0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    # lvconvert --uncache vg_cache/lv_slowdev
    Flushing 0 blocks for cache vg_cache/lv_slowdev.
    Logical volume "lv_cachedev" successfully removed
    Logical volume vg_cache/lv_slowdev is not cached.
    # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
    0fe00000: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ................
    0fe00010: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ................

    This is the case with both v1 and v2 cache pool metatata formats.

    After applying this patch:

    # vgchange -ay vg_cache
    # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
    0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    # lvconvert --uncache vg_cache/lv_slowdev
    Flushing 3724 blocks for cache vg_cache/lv_slowdev.
    ...
    Flushing 71 blocks for cache vg_cache/lv_slowdev.
    Logical volume "lv_cachedev" successfully removed
    Logical volume vg_cache/lv_slowdev is not cached.
    # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
    0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................

    Cc: stable@vger.kernel.org
    Fixes: f177940a8091 ("dm cache metadata: switch to using the new cursor api for loading metadata")
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

  • commit fd2fa95416188a767a63979296fa3e169a9ef5ec upstream.

    policy_hint_size starts as 0 during __write_initial_superblock(). It
    isn't until the policy is loaded that policy_hint_size is set in-core
    (cmd->policy_hint_size). But it never got recorded in the on-disk
    superblock because __commit_transaction() didn't deal with transferring
    the in-core cmd->policy_hint_size to the on-disk superblock.

    The in-core cmd->policy_hint_size gets initialized by metadata_open()'s
    __begin_transaction_flags() which re-reads all superblock fields.
    Because the superblock's policy_hint_size was never properly stored, when
    the cache was created, hints_array_available() would always return false
    when re-activating a previously created cache. This means
    __load_mappings() always considered the hints invalid and never made use
    of the hints (these hints served to optimize).

    Another detrimental side effect of this oversight is that the cache_check
    utility would fail with: "invalid hint width: 0"

    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

  • commit 75294442d896f2767be34f75aca7cc2b0d01301f upstream.

    Now both check_for_space() and do_no_space_timeout() will read & write
    pool->pf.error_if_no_space. If these functions run concurrently, as
    shown in the following case, the default setting of "queue_if_no_space"
    can get lost.

    precondition:
    * error_if_no_space = false (aka "queue_if_no_space")
    * pool is in Out-of-Data-Space (OODS) mode
    * no_space_timeout worker has been queued

    CPU 0:                                    CPU 1:
    // delete a thin device
    process_delete_mesg()
    // check_for_space() invoked by commit()
    set_pool_mode(pool, PM_WRITE)
        pool->pf.error_if_no_space = \
            pt->requested_pf.error_if_no_space

                                              // timeout, pool still in OODS mode
                                              do_no_space_timeout
                                                  // "queue_if_no_space" is lost
                                                  pool->pf.error_if_no_space = true

        pool->pf.mode = new_mode

    Fix it by stopping no_space_timeout worker when switching to write mode.

    Fixes: bcc696fac11f ("dm thin: stay in out-of-data-space mode once no_space_timeout expires")
    Cc: stable@vger.kernel.org
    Signed-off-by: Hou Tao
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

  • commit c21b16392701543d61e366dca84e15fe7f0cf0cf upstream.

    Early alpha processors can't write a byte or short atomically - they
    read 8 bytes, modify the byte or two bytes in registers and write back
    8 bytes.

    The modification of the variable "suspending" may race with
    modification of the variable "failed". Fix this by changing
    "suspending" to an int.

    Cc: stable@vger.kernel.org
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman


24 Aug, 2018

1 commit

  • [ Upstream commit bda3153998f3eb2cafa4a6311971143628eacdbc ]

    During assemble, the spare marked for replacement is not checked, so
    conf->fullsync cannot be updated to 1. As a result, recovery will
    treat it as a clean array. All recovering sectors are skipped. The
    original device is replaced with the not-yet-recovered spare.

    mdadm -C /dev/md0 -l10 -n4 -pn2 /dev/loop[0123]
    mdadm /dev/md0 -a /dev/loop4
    mdadm /dev/md0 --replace /dev/loop0
    mdadm -S /dev/md0 # stop array during recovery

    mdadm -A /dev/md0 /dev/loop[01234]

    After reassembly, you can see recovery start, but it completes
    immediately; in fact, recovery is not actually performed.

    To solve this problem, we just add the missing logic for replacement
    spares. (raid1.c and raid5.c already have this check.)

    Reported-by: Alex Chen
    Reviewed-by: Alex Wu
    Reviewed-by: Chung-Chiang Cheng
    Signed-off-by: BingJing Chang
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman


03 Aug, 2018

2 commits

  • [ Upstream commit c42a0e2675721e1444f56e6132a07b7b1ec169ac ]

    We met a NULL pointer BUG as follows:

    [ 151.760358] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
    [ 151.761340] PGD 80000001011eb067 P4D 80000001011eb067 PUD 1011ea067 PMD 0
    [ 151.762039] Oops: 0000 [#1] SMP PTI
    [ 151.762406] Modules linked in:
    [ 151.762723] CPU: 2 PID: 3561 Comm: mdadm-test Kdump: loaded Not tainted 4.17.0-rc1+ #238
    [ 151.763542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
    [ 151.764432] RIP: 0010:remove_and_add_spares.part.56+0x13c/0x3a0
    [ 151.765061] RSP: 0018:ffffc90001d7fcd8 EFLAGS: 00010246
    [ 151.765590] RAX: 0000000000000000 RBX: ffff88013601d600 RCX: 0000000000000000
    [ 151.766306] RDX: 0000000000000000 RSI: ffff88013601d600 RDI: ffff880136187000
    [ 151.767014] RBP: ffff880136187018 R08: 0000000000000003 R09: 0000000000000051
    [ 151.767728] R10: ffffc90001d7fed8 R11: 0000000000000000 R12: ffff88013601d600
    [ 151.768447] R13: ffff8801298b1300 R14: ffff880136187000 R15: 0000000000000000
    [ 151.769160] FS: 00007f2624276700(0000) GS:ffff88013ae80000(0000) knlGS:0000000000000000
    [ 151.769971] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 151.770554] CR2: 0000000000000060 CR3: 0000000111aac000 CR4: 00000000000006e0
    [ 151.771272] Call Trace:
    [ 151.771542] md_ioctl+0x1df2/0x1e10
    [ 151.771906] ? __switch_to+0x129/0x440
    [ 151.772295] ? __schedule+0x244/0x850
    [ 151.772672] blkdev_ioctl+0x4bd/0x970
    [ 151.773048] block_ioctl+0x39/0x40
    [ 151.773402] do_vfs_ioctl+0xa4/0x610
    [ 151.773770] ? dput.part.23+0x87/0x100
    [ 151.774151] ksys_ioctl+0x70/0x80
    [ 151.774493] __x64_sys_ioctl+0x16/0x20
    [ 151.774877] do_syscall_64+0x5b/0x180
    [ 151.775258] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    For raid6, when two disks of the array are offline, two spare disks can
    be added into the array. Before the spare disks' recovery completes, the
    system reboots and mdadm thinks it is ok to restart the degraded
    array via md_ioctl(). Since the disks in the raid6 array do not satisfy
    only_parity(), raid5_run() will abort when there is no PPL feature and the
    'start_dirty_degraded' parameter is not set. Therefore, mddev->pers is
    NULL.

    But mddev->raid_disks has been set and is not cleared when raid5_run()
    aborts. md_ioctl() can then execute the 'HOT_REMOVE_DISK' cmd to remove a
    disk via mdadm, which finally causes the NULL pointer dereference in
    remove_and_add_spares().

    Signed-off-by: Yufen Yu
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

  • [ Upstream commit b33d10624fdc15cdf1495f3f00481afccec76783 ]

    The current handle_read_error() function calls fix_read_error()
    only if the md device is RW and the rdev does not have the FailFast
    flag set. It does not handle a read error from an RW device that has
    the FailFast flag.

    I am not sure whether this is intended, but I found that a write IO
    error sets the rdev faulty. The md module should handle read IO errors
    and write IO errors equally, so I think a read IO error should also set
    the rdev faulty.

    Signed-off-by: Gioh Kim
    Reviewed-by: Jack Wang
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman


11 Jul, 2018

2 commits

  • commit dbc626597c39b24cefce09fbd8e9dea85869a801 upstream.

    Currently device_supports_dax() just checks to see if the QUEUE_FLAG_DAX
    flag is set on the device's request queue to decide whether or not the
    device supports filesystem DAX. Really we should be using
    bdev_dax_supported() like filesystems do at mount time. This performs
    other tests like checking to make sure the dax_direct_access() path works.

    We also explicitly clear QUEUE_FLAG_DAX on the DM device's request queue if
    any of the underlying devices do not support DAX. This makes the handling
    of QUEUE_FLAG_DAX consistent with the setting/clearing of most other flags
    in dm_table_set_restrictions().

    Now that bdev_dax_supported() explicitly checks for QUEUE_FLAG_DAX, this
    will ensure that filesystems built upon DM devices will only be able to
    mount with DAX if all underlying devices also support DAX.

    Signed-off-by: Ross Zwisler
    Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support")
    Cc: stable@vger.kernel.org
    Acked-by: Dan Williams
    Reviewed-by: Toshi Kani
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

  • commit ad3793fc3945173f64d82d05d3ecde41f6c0435c upstream.

    Set QUEUE_FLAG_DAX in dm_table_set_restrictions(), rather than having
    DAX support be unique by setting it based on table type in
    dm_setup_md_queue().

    Signed-off-by: Mike Snitzer
    Signed-off-by: Ross Zwisler
    Signed-off-by: Greg Kroah-Hartman


08 Jul, 2018

6 commits

  • commit b03e0ccb5ab9df3efbe51c87843a1ffbecbafa1f upstream.

    The '2' argument means "wake up anything that is waiting".
    This is an inelegant part of the design and was added
    to help support management of the suspend_lo/suspend_hi settings.
    Now that suspend_lo/hi is managed in mddev_suspend/resume,
    that need is gone.
    There are still a couple of places where we call 'quiesce'
    with an argument of '2', but they can safely be changed to
    call ->quiesce(.., 1); ->quiesce(.., 0), which
    achieves the same result at the small cost of pausing IO
    briefly.

    This removes a small "optimization" from suspend_{hi,lo}_store,
    but it isn't clear that the optimization served a useful purpose.
    The code is now a lot clearer.

    Suggested-by: Shaohua Li
    Signed-off-by: NeilBrown
    Signed-off-by: Shaohua Li
    Signed-off-by: Jack Wang
    Signed-off-by: Greg Kroah-Hartman

  • commit 35bfc52187f6df8779d0f1cebdb52b7f797baf4e upstream.

    There are various deadlocks that can occur
    when a thread holds reconfig_mutex and calls
    ->quiesce(mddev, 1).
    As some write requests block waiting for
    metadata to be updated (e.g. to record a device
    failure), and as the md thread updates the metadata
    while the reconfig mutex is held, holding the mutex
    can stop write requests from completing, and this prevents
    ->quiesce(mddev, 1) from completing.

    ->quiesce() is now usually called from mddev_suspend(),
    and it is always called with reconfig_mutex held. So
    at this time it is safe for the thread to update metadata
    without explicitly taking the lock.

    So add 2 new flags, one which says the unlocked updates are
    allowed, and one which says they are happening. Then allow them
    while the quiesce completes, and then wait for them to finish.

    Reported-and-tested-by: Xiao Ni
    Signed-off-by: NeilBrown
    Signed-off-by: Shaohua Li
    Signed-off-by: Jack Wang
    Signed-off-by: Greg Kroah-Hartman

  • commit 9e1cc0a54556a6c63dc0cfb7cd7d60d43337bba6 upstream.

    mddev_suspend() is a more general interface than
    calling ->quiesce() and is so more extensible. A
    future patch will make use of this.

    Signed-off-by: NeilBrown
    Signed-off-by: Shaohua Li
    Signed-off-by: Jack Wang
    Signed-off-by: Greg Kroah-Hartman

  • commit b3143b9a38d5039bcd1f2d1c94039651bfba8043 upstream.

    Responding to ->suspend_lo and ->suspend_hi is similar
    to responding to ->suspended. It is best to wait in
    the common core code without incrementing ->active_io.
    This allows mddev_suspend()/mddev_resume() to work while
    requests are waiting for suspend_lo/hi to change.
    This will be important after a subsequent patch
    which uses mddev_suspend() to synchronize updating of
    suspend_lo/hi.

    So move the code for testing suspend_lo/hi out of raid1.c
    and raid5.c, and place it in md.c.

    Signed-off-by: NeilBrown
    Signed-off-by: Shaohua Li
    Signed-off-by: Jack Wang
    Signed-off-by: Greg Kroah-Hartman

  • commit 52a0d49de3d592a3118e13f35985e3d99eaf43df upstream.

    bitmap_create() allocates memory with GFP_KERNEL and
    so can wait for IO.
    If called while the array is quiesced, it could wait indefinitely
    for write out to the array - deadlock.
    So call bitmap_create() before quiescing the array.

    Signed-off-by: NeilBrown
    Signed-off-by: Shaohua Li
    Signed-off-by: Jack Wang
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     
  • commit 4d5324f760aacaefeb721b172aa14bf66045c332 upstream.

    Most often mddev_suspend() is called with
    reconfig_mutex held. Make this a requirement in
    preparation for a subsequent patch. Also require
    reconfig_mutex to be held for mddev_resume(),
    partly for symmetry and partly to guarantee
    no races with incr/decr of mddev->suspend.

    Taking the mutex in r5c_disable_writeback_async() is
    a little tricky as this is called from a work queue
    via log->disable_writeback_work, and flush_work()
    is called on that while holding ->reconfig_mutex.
    If the work item hasn't run before flush_work()
    is called, the work function will not be able to
    get the mutex.

    So we use mddev_trylock() inside the wait_event() call, and have that
    abort when conf->log is set to NULL, which happens before
    flush_work() is called.
    We wait in mddev->sb_wait and ensure this is woken
    when any of the conditions change. This requires
    waking mddev->sb_wait in mddev_unlock(). This is only
    likely to trigger extra wake-ups of threads that needn't
    be woken when metadata is being written, and that
    doesn't happen often enough for the cost to be
    noticeable.

    Signed-off-by: NeilBrown
    Signed-off-by: Shaohua Li
    Signed-off-by: Jack Wang
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     

03 Jul, 2018

3 commits

  • commit a685557fbbc3122ed11e8ad3fa63a11ebc5de8c3 upstream.

    Discards issued to a DM thin device can complete to userspace (via
    fstrim) _before_ the metadata changes associated with the discards are
    reflected in the thinp superblock (e.g. free blocks). As such, if a
    user constructs a test that loops repeatedly over these steps, block
    allocation can fail due to discards not having completed yet:
    1) fill thin device via filesystem file
    2) remove file
    3) fstrim

    From initial report, here:
    https://www.redhat.com/archives/dm-devel/2018-April/msg00022.html

    "The root cause of this issue is that dm-thin will first remove
    mapping and increase corresponding blocks' reference count to prevent
    them from being reused before DISCARD bios get processed by the
    underlying layers. However. increasing blocks' reference count could
    also increase the nr_allocated_this_transaction in struct sm_disk
    which makes smd->old_ll.nr_allocated +
    smd->nr_allocated_this_transaction bigger than smd->old_ll.nr_blocks.
    In this case, alloc_data_block() will never commit metadata to reset
    the begin pointer of struct sm_disk, because sm_disk_get_nr_free()
    always return an underflow value."

    While there is room for improvement to the space-map accounting that
    thinp is making use of, the reality is this test is inherently racy and
    will result in the previous iteration's fstrim's discard(s) racing with
    concurrent block allocation, via dd, in the next iteration of the
    loop.

    No amount of space map accounting improvements will be able to allow
    users to use a block before a discard of that block has completed.

    So the best we can really do is allow DM thinp to gracefully handle such
    aggressive use of all the pool's data by degrading the pool into
    out-of-data-space (OODS) mode. We _should_ get that behaviour already
    (if space map accounting didn't falsely cause alloc_data_block() to
    believe free space was available), but short of that we handle the
    current reality that dm_pool_alloc_data_block() can return -ENOSPC.

    Reported-by: Dennis Yang
    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     
  • commit 2d0b2d64d325e22939d9db3ba784f1236459ed98 upstream.

    This patch prevents lockdep from reporting the following:

    ======================================================
    WARNING: possible circular locking dependency detected
    4.18.0-rc1 #62 Not tainted
    ------------------------------------------------------
    kswapd0/84 is trying to acquire lock:
    00000000c313516d (&xfs_nondir_ilock_class){++++}, at: xfs_free_eofblocks+0xa2/0x1e0

    but task is already holding lock:
    00000000591c83ae (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (fs_reclaim){+.+.}:
    kmem_cache_alloc+0x2c/0x2b0
    radix_tree_node_alloc.constprop.19+0x3d/0xc0
    __radix_tree_create+0x161/0x1c0
    __radix_tree_insert+0x45/0x210
    dmz_map+0x245/0x2d0 [dm_zoned]
    __map_bio+0x40/0x260
    __split_and_process_non_flush+0x116/0x220
    __split_and_process_bio+0x81/0x180
    __dm_make_request.isra.32+0x5a/0x100
    generic_make_request+0x36e/0x690
    submit_bio+0x6c/0x140
    mpage_readpages+0x19e/0x1f0
    read_pages+0x6d/0x1b0
    __do_page_cache_readahead+0x21b/0x2d0
    force_page_cache_readahead+0xc4/0x100
    generic_file_read_iter+0x7c6/0xd20
    __vfs_read+0x102/0x180
    vfs_read+0x9b/0x140
    ksys_read+0x55/0xc0
    do_syscall_64+0x5a/0x1f0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #1 (&dmz->chunk_lock){+.+.}:
    dmz_map+0x133/0x2d0 [dm_zoned]
    __map_bio+0x40/0x260
    __split_and_process_non_flush+0x116/0x220
    __split_and_process_bio+0x81/0x180
    __dm_make_request.isra.32+0x5a/0x100
    generic_make_request+0x36e/0x690
    submit_bio+0x6c/0x140
    _xfs_buf_ioapply+0x31c/0x590
    xfs_buf_submit_wait+0x73/0x520
    xfs_buf_read_map+0x134/0x2f0
    xfs_trans_read_buf_map+0xc3/0x580
    xfs_read_agf+0xa5/0x1e0
    xfs_alloc_read_agf+0x59/0x2b0
    xfs_alloc_pagf_init+0x27/0x60
    xfs_bmap_longest_free_extent+0x43/0xb0
    xfs_bmap_btalloc_nullfb+0x7f/0xf0
    xfs_bmap_btalloc+0x428/0x7c0
    xfs_bmapi_write+0x598/0xcc0
    xfs_iomap_write_allocate+0x15a/0x330
    xfs_map_blocks+0x1cf/0x3f0
    xfs_do_writepage+0x15f/0x7b0
    write_cache_pages+0x1ca/0x540
    xfs_vm_writepages+0x65/0xa0
    do_writepages+0x48/0xf0
    __writeback_single_inode+0x58/0x730
    writeback_sb_inodes+0x249/0x5c0
    wb_writeback+0x11e/0x550
    wb_workfn+0xa3/0x670
    process_one_work+0x228/0x670
    worker_thread+0x3c/0x390
    kthread+0x11c/0x140
    ret_from_fork+0x3a/0x50

    -> #0 (&xfs_nondir_ilock_class){++++}:
    down_read_nested+0x43/0x70
    xfs_free_eofblocks+0xa2/0x1e0
    xfs_fs_destroy_inode+0xac/0x270
    dispose_list+0x51/0x80
    prune_icache_sb+0x52/0x70
    super_cache_scan+0x127/0x1a0
    shrink_slab.part.47+0x1bd/0x590
    shrink_node+0x3b5/0x470
    balance_pgdat+0x158/0x3b0
    kswapd+0x1ba/0x600
    kthread+0x11c/0x140
    ret_from_fork+0x3a/0x50

    other info that might help us debug this:

    Chain exists of:
    &xfs_nondir_ilock_class --> &dmz->chunk_lock --> fs_reclaim

    Possible unsafe locking scenario:

    CPU0                              CPU1
    ----                              ----
    lock(fs_reclaim);
                                      lock(&dmz->chunk_lock);
                                      lock(fs_reclaim);
    lock(&xfs_nondir_ilock_class);

    Bart Van Assche
     
  • commit 011abdc9df559ec75779bb7c53a744c69b2a94c6 upstream.

    If "re-add" is written to the "state" file for a device
    which is faulty, this has an effect similar to removing
    and re-adding the device. It should take up the
    same slot in the array that it previously had, and
    an accelerated (e.g. bitmap-based) rebuild should happen.

    The slot that "it previously had" is determined by
    rdev->saved_raid_disk.
    However this is not set when a device fails (only when a device
    is added), and it is cleared when resync completes.
    This means that "re-add" will normally work once, but may not work a
    second time.

    This patch includes two fixes.
    1/ when a device fails, record the ->raid_disk value in
    ->saved_raid_disk before clearing ->raid_disk
    2/ when "re-add" is written to a device for which
    ->saved_raid_disk is not set, fail.

    I think this is suitable for stable as it can
    cause re-adding a device to be forced to do a full
    resync which takes a lot longer and so puts data at
    more risk.

    Cc: (v4.1)
    Fixes: 97f6cd39da22 ("md-cluster: re-add capabilities")
    Signed-off-by: NeilBrown
    Reviewed-by: Goldwyn Rodrigues
    Signed-off-by: Shaohua Li
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     

30 May, 2018

5 commits

  • [ Upstream commit fadd94e05c02afec7b70b0b14915624f1782f578 ]

    In patch "bcache: fix cached_dev->count usage for bch_cache_set_error()",
    cached_dev_get() is called when creating dc->writeback_thread, and
    cached_dev_put() is called when exiting dc->writeback_thread. This
    modification works well unless people detach the bcache device manually by
    'echo 1 > /sys/block/bcache<N>/bcache/detach'.
    Because this sysfs interface only calls bch_cached_dev_detach() which wakes
    up dc->writeback_thread but does not stop it. The reason is, before patch
    "bcache: fix cached_dev->count usage for bch_cache_set_error()", inside
    bch_writeback_thread(), if cache is not dirty after writeback,
    cached_dev_put() will be called here. And in cached_dev_make_request() when
    a new write request makes cache from clean to dirty, cached_dev_get() will
    be called there. Since we don't operate dc->count in these locations,
    refcount d->count cannot be dropped after cache becomes clean, and
    cached_dev_detach_finish() won't be called to detach bcache device.

    This patch fixes the issue by checking whether BCACHE_DEV_DETACHING is
    set inside bch_writeback_thread(). If this bit is set and cache is clean
    (no existing writeback_keys), break the while-loop, call cached_dev_put()
    and quit the writeback thread.

    Please note if cache is still dirty, even BCACHE_DEV_DETACHING is set the
    writeback thread should continue to perform writeback, this is the original
    design of manually detach.

    It is safe to do the following check without locking, let me explain why,
    + if (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
    + (!atomic_read(&dc->has_dirty) || !dc->writeback_running)) {

    If the kernel thread does not sleep and continues to run because the
    conditions are not updated in time on the running CPU core, it just
    consumes more CPU cycles and does no harm. This should-sleep-but-run
    case is safe here. We just focus on the should-run-but-sleep condition,
    which means the writeback thread goes to sleep by mistake while it
    should continue to run.
    1, First of all, no matter whether the writeback thread is hung or not,
    kthread_stop() from cached_dev_detach_finish() will wake it up and
    terminate it by making kthread_should_stop() return true. And in normal
    run time, bit on index BCACHE_DEV_DETACHING is always cleared, the
    condition
    !test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags)
    is always true and can be ignored as constant value.
    2, If one of the following conditions is true, the writeback thread should
    go to sleep:
    "!atomic_read(&dc->has_dirty)" or "!dc->writeback_running"
    Each of them independently controls whether the writeback thread should
    sleep; let's analyse them one by one.
    2.1 condition "!atomic_read(&dc->has_dirty)"
    If dc->has_dirty is set from 0 to 1 on another CPU core, bcache will
    call bch_writeback_queue() immediately or call bch_writeback_add() which
    indirectly calls bch_writeback_queue() too. In bch_writeback_queue(),
    wake_up_process(dc->writeback_thread) is called. It sets writeback
    thread's task state to TASK_RUNNING and following an implicit memory
    barrier, then tries to wake up the writeback thread.
    In writeback thread, its task state is set to TASK_INTERRUPTIBLE before
    doing the condition check. If other CPU core sets the TASK_RUNNING state
    after writeback thread setting TASK_INTERRUPTIBLE, the writeback thread
    will be scheduled to run very soon because its state is not
    TASK_INTERRUPTIBLE. If other CPU core sets the TASK_RUNNING state before
    writeback thread setting TASK_INTERRUPTIBLE, the implicit memory barrier
    of wake_up_process() will make sure modification of dc->has_dirty on
    other CPU core is updated and observed on the CPU core of writeback
    thread. Therefore the condition check will correctly be false, and
    continue writeback code without sleeping.
    2.2 condition "!dc->writeback_running"
    dc->writeback_running can be changed via a sysfs file; every time it is
    modified, a following bch_writeback_queue() is always called. So the
    change is always observed on the CPU core of the writeback thread. If
    dc->writeback_running is changed from 0 to 1 on another CPU core, this
    condition check will observe the modification and allow the writeback
    thread to continue to run without sleeping.
    Now we can see that, even without locking protection, the multiple-condition
    check is safe here; no deadlock or process hang-up will happen.

    I composed a separate patch because the patch "bcache: fix cached_dev->count
    usage for bch_cache_set_error()" already got a "Reviewed-by:" from Hannes
    Reinecke. Also this fix is not trivial and is better as a separate patch.

    Signed-off-by: Coly Li
    Reviewed-by: Michael Lyle
    Cc: Hannes Reinecke
    Cc: Huijun Tang
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Coly Li
     
  • [ Upstream commit 60eb34ec5526e264c2bbaea4f7512d714d791caf ]

    The kernel crashed when running fio on a RAID5-backed bcache device; the
    call trace is below:
    [ 440.012034] kernel BUG at block/blk-ioc.c:146!
    [ 440.012696] invalid opcode: 0000 [#1] SMP NOPTI
    [ 440.026537] CPU: 2 PID: 2205 Comm: md127_raid5 Not tainted 4.15.0 #8
    [ 440.027441] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 07/16/2015
    [ 440.028615] RIP: 0010:put_io_context+0x8b/0x90
    [ 440.029246] RSP: 0018:ffffa8c882b43af8 EFLAGS: 00010246
    [ 440.029990] RAX: 0000000000000000 RBX: ffffa8c88294fca0 RCX: 00000000000f4240
    [ 440.031006] RDX: 0000000000000004 RSI: 0000000000000286 RDI: ffffa8c88294fca0
    [ 440.032030] RBP: ffffa8c882b43b10 R08: 0000000000000003 R09: ffff949cb80c1700
    [ 440.033206] R10: 0000000000000104 R11: 000000000000b71c R12: 0000000000001000
    [ 440.034222] R13: 0000000000000000 R14: ffff949cad84db70 R15: ffff949cb11bd1e0
    [ 440.035239] FS: 0000000000000000(0000) GS:ffff949cba280000(0000) knlGS:0000000000000000
    [ 440.060190] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 440.084967] CR2: 00007ff0493ef000 CR3: 00000002f1e0a002 CR4: 00000000001606e0
    [ 440.110498] Call Trace:
    [ 440.135443] bio_disassociate_task+0x1b/0x60
    [ 440.160355] bio_free+0x1b/0x60
    [ 440.184666] bio_put+0x23/0x30
    [ 440.208272] search_free+0x23/0x40 [bcache]
    [ 440.231448] cached_dev_write_complete+0x31/0x70 [bcache]
    [ 440.254468] closure_put+0xb6/0xd0 [bcache]
    [ 440.277087] request_endio+0x30/0x40 [bcache]
    [ 440.298703] bio_endio+0xa1/0x120
    [ 440.319644] handle_stripe+0x418/0x2270 [raid456]
    [ 440.340614] ? load_balance+0x17b/0x9c0
    [ 440.360506] handle_active_stripes.isra.58+0x387/0x5a0 [raid456]
    [ 440.380675] ? __release_stripe+0x15/0x20 [raid456]
    [ 440.400132] raid5d+0x3ed/0x5d0 [raid456]
    [ 440.419193] ? schedule+0x36/0x80
    [ 440.437932] ? schedule_timeout+0x1d2/0x2f0
    [ 440.456136] md_thread+0x122/0x150
    [ 440.473687] ? wait_woken+0x80/0x80
    [ 440.491411] kthread+0x102/0x140
    [ 440.508636] ? find_pers+0x70/0x70
    [ 440.524927] ? kthread_associate_blkcg+0xa0/0xa0
    [ 440.541791] ret_from_fork+0x35/0x40
    [ 440.558020] Code: c2 48 00 5b 41 5c 41 5d 5d c3 48 89 c6 4c 89 e7 e8 bb c2 48 00 48 8b 3d bc 36 4b 01 48 89 de e8 7c f7 e0 ff 5b 41 5c 41 5d 5d c3 0b 0f 1f 00 0f 1f 44 00 00 55 48 8d 47 b8 48 89 e5 41 57 41
    [ 440.610020] RIP: put_io_context+0x8b/0x90 RSP: ffffa8c882b43af8
    [ 440.628575] ---[ end trace a1fd79d85643a73e ]---

    The crash happens when a bypass IO comes in; in this scenario
    s->iop.bio points to s->orig_bio. In search_free(), it finishes the
    s->orig_bio by calling bio_complete(), and after that s->iop.bio becomes
    invalid, so the kernel crashes when calling bio_put(). Maybe it is the
    upper layer's fault, since the bio should not be freed before we call
    bio_put(), but we had better call bio_put() first before calling
    bio_complete() to notify the upper layer that this bio is ended.

    This patch moves bio_complete() after bio_put() to avoid the kernel crash.

    [mlyle: fixed commit subject for character limits]

    Reported-by: Matthias Ferdinand
    Tested-by: Matthias Ferdinand
    Signed-off-by: Tang Junhui
    Reviewed-by: Michael Lyle
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Tang Junhui
     
  • [ Upstream commit 3de59bb9d551428cbdc76a9ea57883f82e350b4d ]

    In handle_write_finished(), if r1_bio->bios[m] != NULL, it assumes
    the corresponding conf->mirrors[m].rdev is also not NULL. But that
    is not always true.

    Even if some IO holds the replacement rdev (i.e. rdev->nr_pending.count > 0),
    raid1_remove_disk() can still set the rdev to NULL. That means
    bios[m] != NULL but mirrors[m].rdev is NULL, resulting in a NULL
    pointer dereference in handle_write_finished and sync_request_write.

    This patch can fix BUGs as follows:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000140
    IP: [] raid1d+0x2bd/0xfc0
    PGD 12ab52067 PUD 12f587067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 2008 Comm: md3_raid1 Not tainted 4.1.44+ #130
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
    Call Trace:
    ? schedule+0x37/0x90
    ? prepare_to_wait_event+0x83/0xf0
    md_thread+0x144/0x150
    ? wake_atomic_t_function+0x70/0x70
    ? md_start_sync+0xf0/0xf0
    kthread+0xd8/0xf0
    ? kthread_worker_fn+0x160/0x160
    ret_from_fork+0x42/0x70
    ? kthread_worker_fn+0x160/0x160

    BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
    IP: sync_request_write+0x9e/0x980
    PGD 800000007c518067 P4D 800000007c518067 PUD 8002b067 PMD 0
    Oops: 0000 [#1] SMP PTI
    CPU: 24 PID: 2549 Comm: md3_raid1 Not tainted 4.15.0+ #118
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
    Call Trace:
    ? sched_clock+0x5/0x10
    ? sched_clock_cpu+0xc/0xb0
    ? flush_pending_writes+0x3a/0xd0
    ? pick_next_task_fair+0x4d5/0x5f0
    ? __switch_to+0xa2/0x430
    raid1d+0x65a/0x870
    ? find_pers+0x70/0x70
    ? find_pers+0x70/0x70
    ? md_thread+0x11c/0x160
    md_thread+0x11c/0x160
    ? finish_wait+0x80/0x80
    kthread+0x111/0x130
    ? kthread_create_worker_on_cpu+0x70/0x70
    ? do_syscall_64+0x6f/0x190
    ? SyS_exit_group+0x10/0x10
    ret_from_fork+0x35/0x40

    Reviewed-by: NeilBrown
    Signed-off-by: Yufen Yu
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Yufen Yu
     
  • [ Upstream commit 8876391e440ba615b10eef729576e111f0315f87 ]

    There is a potential deadlock if mount/umount happens when
    raid5_finish_reshape() tries to grow the size of the emulated disk.

    How does the deadlock happen?
    1) The raid5 resync thread finished reshape (expanding array).
    2) The mount or umount thread holds the VFS sb->s_umount lock and tries to
    write critical data through to the raid5 emulated block device, so it
    waits for the raid5 kernel thread to handle stripes in order to finish
    its I/Os.
    3) In the routine of raid5 kernel thread, md_check_recovery() will be
    called first in order to reap the raid5 resync thread. That is,
    raid5_finish_reshape() will be called. In this function, it will try
    to update conf and call VFS revalidate_disk() to grow the raid5
    emulated block device. It will try to acquire VFS sb->s_umount lock.
    The raid5 kernel thread cannot continue, so no one can handle mount/
    umount I/Os (stripes). Once the write-through I/Os cannot be finished,
    mount/umount will not release sb->s_umount lock. The deadlock happens.

    The raid5 kernel thread serves an emulated block device. It is responsible
    for handling I/Os (stripes) from upper layers. The emulated block device
    should not request any I/Os on itself; that is, it should not call VFS
    layer functions. (If it did, it would try to acquire VFS locks to
    guarantee the I/O sequence.) So we have the resync thread send
    resync I/O requests and wait for the results.

    For solving this potential deadlock, we can put the size growth of the
    emulated block device as the final step of reshape thread.

    2017/12/29:
    Thanks to Guoqing Jiang,
    we confirmed that there is the same deadlock issue in raid10. It is
    reproducible and can be fixed by this patch. For raid10.c, we can remove
    the similar deadlock-prevention code as well, since it has been called
    before.

    Reported-by: Alex Wu
    Reviewed-by: Alex Wu
    Reviewed-by: Chung-Chiang Cheng
    Signed-off-by: BingJing Chang
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    BingJing Chang
     
  • [ Upstream commit 53b8d89ddbdbb0e4625a46d2cdbb6f106c52f801 ]

    gcc warns about a possible overflow of the kmem_cache string, when adding
    four characters to a string of the same length:

    drivers/md/raid5.c: In function 'setup_conf':
    drivers/md/raid5.c:2207:34: error: '-alt' directive writing 4 bytes into a region of size between 1 and 32 [-Werror=format-overflow=]
    sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
    ^~~~
    drivers/md/raid5.c:2207:2: note: 'sprintf' output between 5 and 36 bytes into a destination of size 32
    sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    If I'm counting correctly, we need 11 characters for the fixed part
    of the string and 18 characters for a 64-bit pointer (when no gendisk
    is used), so that leaves three characters for conf->level, which should
    always be sufficient.

    This makes the code use snprintf() with the correct length, to
    make the code more robust against changes, and to get the compiler
    to shut up.

    In commit f4be6b43f1ac ("md/raid5: ensure we create a unique name for
    kmem_cache when mddev has no gendisk") from 2010, Neil said that
    the pointer could be removed "shortly" once devices without gendisk
    are disallowed. I have no idea if that happened, but if it did, that
    should probably be changed as well.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann