17 Apr, 2019

4 commits

  • commit 4ed319c6ac08e9a28fca7ac188181ac122f4de84 upstream.

    dm-integrity will deadlock if overlapping I/O is issued to it; the bug
    was introduced by commit 724376a04d1a ("dm integrity: implement fair
    range locks"). Users rarely issue overlapping I/O, so this bug went
    undetected until now.

    Fix this bug by correcting likely cut-and-paste typos in
    ranges_overlap() and by removing a flawed ranges_overlap() check in
    remove_range_unlocked(); that condition could leave unprocessed bios
    hanging on wait_list forever.

    Cc: stable@vger.kernel.org # v4.19+
    Fixes: 724376a04d1a ("dm integrity: implement fair range locks")
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit eb40c0acdc342b815d4d03ae6abb09e80c0f2988 upstream.

    Some devices don't use blk_integrity but still want stable pages
    because they do their own checksumming. Examples include rbd and iSCSI
    when data digests are negotiated. Stacking DM (and thus LVM) on top of
    these devices results in sporadic checksum errors.

    Set BDI_CAP_STABLE_WRITES if any underlying device has it set.

    Cc: stable@vger.kernel.org
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     
  • commit 75ae193626de3238ca5fb895868ec91c94e63b1b upstream.

    The limit was already incorporated into dm-crypt with commit 4e870e948fba
    ("dm crypt: fix error with too large bios"), so we don't need to apply
    it globally to all targets. The quantity BIO_MAX_PAGES * PAGE_SIZE is
    wrong anyway, because the variable ti->max_io_len is supposed to be in
    units of 512-byte sectors, not bytes.

    Reduction of the limit to 1048576 sectors could even cause data
    corruption in rare cases - suppose that we have a dm-striped device with
    stripe size 768MiB. The target will call dm_set_target_max_io_len with
    the value 1572864. The buggy code would reduce it to 1048576. Now, the
    dm-core will erroneously split the bios on a 1048576-sector boundary
    instead of a 1572864-sector boundary and pass these stripe-crossing bios
    to the striped target.

    Cc: stable@vger.kernel.org # v4.16+
    Fixes: 8f50e358153d ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE")
    Signed-off-by: Mikulas Patocka
    Acked-by: Ming Lei
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit 0d74e6a3b6421d98eeafbed26f29156d469bc0b5 upstream.

    If the string opt_string is small, the function memcmp can access bytes
    that are beyond the terminating nul character. In theory, it could cause
    a segfault if opt_string were located just below some unmapped memory.

    Change from memcmp to strncmp so that we don't read bytes beyond the end
    of the string.
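
    As a rough illustration of the difference (a standalone userspace
    sketch using one dm-integrity-style option string; this is not the
    actual dm-integrity parser code):

      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
              /* short, nul-terminated user input in an oversized buffer */
              char opt_string[32] = "j";

              /* memcmp() always compares n bytes; with a buffer that were
               * only strlen(opt_string) + 1 bytes long, it would read past
               * the terminating nul. */
              if (!memcmp(opt_string, "journal_crypt:", 14))
                      printf("matched (memcmp)\n");

              /* strncmp() stops at the first difference or nul, so it
               * never reads beyond the end of either string. */
              if (!strncmp(opt_string, "journal_crypt:", 14))
                      printf("matched (strncmp)\n");

              return 0;
      }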

    Cc: stable@vger.kernel.org # v4.12+
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     

06 Apr, 2019

6 commits

  • [ Upstream commit 5b5fd3c94eef69dcfaa8648198e54c92e5687d6d ]

    The current code already uses d_strtoul_nonzero() to convert the input
    string to an unsigned integer, to make sure writeback_rate_p_term_inverse
    won't be zero. But an overflow may happen during that conversion, so
    dc->writeback_rate_p_term_inverse can still be set to 0 even if the
    sysfs input value is not zero, e.g. 4294967296 (a.k.a. UINT_MAX+1).

    If dc->writeback_rate_p_term_inverse is set to 0, it might cause a
    div-zero error in the following code from __update_writeback_rate():

        int64_t proportional_scaled =
                div_s64(error, dc->writeback_rate_p_term_inverse);

    This patch replaces d_strtoul_nonzero() by sysfs_strtoul_clamp() and
    limits the value range to [1, UINT_MAX]. Then the unsigned integer
    overflow and div-zero error are avoided.
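
    As a rough userspace sketch of the truncation (assuming a 64-bit
    unsigned long; this is not bcache's actual helper code):

      #include <limits.h>
      #include <stdio.h>
      #include <stdlib.h>

      int main(void)
      {
              /* "4294967296" is UINT_MAX + 1 */
              unsigned long parsed = strtoul("4294967296", NULL, 10);
              unsigned int inverse = (unsigned int)parsed;  /* becomes 0 */

              printf("truncated: %u\n", inverse);

              /* clamping in the wider type first, which is what
               * sysfs_strtoul_clamp() is described to do, avoids this */
              if (parsed < 1)
                      parsed = 1;
              if (parsed > UINT_MAX)
                      parsed = UINT_MAX;
              inverse = (unsigned int)parsed;

              printf("clamped:   %u\n", inverse);
              return 0;
      }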

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Coly Li
     
  • [ Upstream commit 596b5a5dd1bc2fa019fdaaae522ef331deef927f ]

    Currently sysfs_strtoul_clamp() is defined as:

        #define sysfs_strtoul_clamp(file, var, min, max)                \
        do {                                                            \
                if (attr == &sysfs_ ## file)                            \
                        return strtoul_safe_clamp(buf, var, min, max)   \
                                ?: (ssize_t) size;                      \
        } while (0)

    The problem is, if the bit width of var is less than unsigned long, min
    and max may not protect var from integer overflow, because the overflow
    happens in strtoul_safe_clamp() before min and max are checked.

    To fix such overflow in sysfs_strtoul_clamp() and make min and max take
    effect, this patch adds an intermediate unsigned long variable, uses
    strtoul_safe_clamp() to convert the input into that variable within the
    range defined by [min, max], and only then assigns the value to var.
    With this method, if the bit width of var is less than unsigned long,
    the integer overflow cannot happen before min and max are checked.

    Now sysfs_strtoul_clamp() can properly handle smaller data types like
    unsigned int; of course min and max should then be defined within the
    range of unsigned int too.

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Coly Li
     
  • [ Upstream commit c3b75a2199cdbfc1c335155fe143d842604b1baa ]

    dc->writeback_rate_i_term_inverse can be set via the sysfs interface. It
    is of type unsigned int and is converted from the input string by
    d_strtoul(). The problem is that d_strtoul() does not check the valid
    range of the input: if 4294967296 is written into the sysfs file
    writeback_rate_i_term_inverse, an unsigned integer overflow happens and
    dc->writeback_rate_i_term_inverse is set to 0.

    In writeback.c:__update_writeback_rate(), there are the following lines
    of code:

        integral_scaled = div_s64(dc->writeback_rate_integral,
                                  dc->writeback_rate_i_term_inverse);

    If dc->writeback_rate_i_term_inverse is set to 0 via the sysfs
    interface, a div-zero error might be triggered in the above code.

    Therefore we need to add a range limitation in the sysfs interface,
    which is what this patch does: use sysfs_strtoul_clamp() to replace
    d_strtoul() and restrict the input range to [1, UINT_MAX].

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Coly Li
     
  • [ Upstream commit 8c27a3953e92eb0b22dbb03d599f543a05f9574e ]

    People may set sequential_cutoff of a cached device via its sysfs file,
    but the current code does not check the input value for overflow. E.g.
    if the value 4294967295 (UINT_MAX) is written to the file
    sequential_cutoff, its value is 4GB, but if 4294967296 (UINT_MAX + 1) is
    written instead, its value becomes 0. This is unexpected behavior.

    This patch replaces d_strtoi_h() by sysfs_strtoul_clamp() to convert the
    input string to an unsigned integer value and limits its range to
    [0, UINT_MAX]. Then the input overflow is fixed.

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Coly Li
     
  • [ Upstream commit a91fbda49f746119828f7e8ad0f0aa2ab0578f65 ]

    The cache set sysfs entry io_error_halflife is used to set
    c->error_decay. c->error_decay is of type unsigned int, and it is
    converted by strtoul_or_return(); therefore an overflow of
    c->error_decay is possible for a large input value.

    This patch fixes the overflow by using strtoul_safe_clamp() to convert
    the input string to an unsigned long value in the range [0, UINT_MAX],
    then dividing it by 88 and assigning the result to c->error_decay.
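
    A sketch of what the described fix looks like in the sysfs store path
    (following the description above; not necessarily the verbatim
    upstream hunk):

      if (attr == &sysfs_io_error_halflife) {
              unsigned long v = 0;
              ssize_t ret;

              /* parse into a wide type, clamped to [0, UINT_MAX] ... */
              ret = strtoul_safe_clamp(buf, v, 0, UINT_MAX);
              if (!ret) {
                      /* ... then scale and store in the narrower field */
                      c->error_decay = v / 88;
                      return size;
              }
              return ret;
      }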

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Coly Li
     
  • [ Upstream commit 70de2cbda8a5d788284469e755f8b097d339c240 ]

    Invoking dm_get_device() twice on the same device path with different
    modes is dangerous: in that case, upgrade_mode() allocates a new
    'dm_dev' and frees the old one, which may still be referenced by a
    previous caller. Dereferencing the dangling pointer triggers a kernel
    NULL pointer dereference.

    The following two cases can reproduce this issue. They are actually
    invalid setups that must be disallowed, e.g.:

    1. Creating a thin-pool with read_only mode, and the same device as
    both metadata and data.

    dmsetup create thinp --table \
    "0 41943040 thin-pool /dev/vdb /dev/vdb 128 0 1 read_only"

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
    ...
    Call Trace:
    new_read+0xfb/0x110 [dm_bufio]
    dm_bm_read_lock+0x43/0x190 [dm_persistent_data]
    ? kmem_cache_alloc_trace+0x15c/0x1e0
    __create_persistent_data_objects+0x65/0x3e0 [dm_thin_pool]
    dm_pool_metadata_open+0x8c/0xf0 [dm_thin_pool]
    pool_ctr.cold.79+0x213/0x913 [dm_thin_pool]
    ? realloc_argv+0x50/0x70 [dm_mod]
    dm_table_add_target+0x14e/0x330 [dm_mod]
    table_load+0x122/0x2e0 [dm_mod]
    ? dev_status+0x40/0x40 [dm_mod]
    ctl_ioctl+0x1aa/0x3e0 [dm_mod]
    dm_ctl_ioctl+0xa/0x10 [dm_mod]
    do_vfs_ioctl+0xa2/0x600
    ? handle_mm_fault+0xda/0x200
    ? __do_page_fault+0x26c/0x4f0
    ksys_ioctl+0x60/0x90
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x55/0x150
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    2. Creating an external snapshot using the same thin-pool device.

    dmsetup create thinp --table \
    "0 41943040 thin-pool /dev/vdc /dev/vdb 128 0 2 ignore_discard"
    dmsetup message /dev/mapper/thinp 0 "create_thin 0"
    dmsetup create snap --table \
    "0 204800 thin /dev/mapper/thinp 0 /dev/mapper/thinp"

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
    ...
    Call Trace:
    ? __alloc_pages_nodemask+0x13c/0x2e0
    retrieve_status+0xa5/0x1f0 [dm_mod]
    ? dm_get_live_or_inactive_table.isra.7+0x20/0x20 [dm_mod]
    table_status+0x61/0xa0 [dm_mod]
    ctl_ioctl+0x1aa/0x3e0 [dm_mod]
    dm_ctl_ioctl+0xa/0x10 [dm_mod]
    do_vfs_ioctl+0xa2/0x600
    ksys_ioctl+0x60/0x90
    ? ksys_write+0x4f/0xb0
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x55/0x150
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Signed-off-by: Jason Cai (Xiang Feng)
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Jason Cai (Xiang Feng)
     

24 Mar, 2019

4 commits

  • commit dc7292a5bcb4c878b076fca2ac3fc22f81b8f8df upstream.

    In commit 752f66a75aba ("bcache: use REQ_PRIO to indicate bio for
    metadata"), REQ_META was replaced by REQ_PRIO to indicate a metadata
    bio. This assumption is not always correct: e.g. XFS marks metadata bios
    with REQ_META rather than REQ_PRIO, which is why Nix noticed that bcache
    stopped caching metadata for XFS after the above commit.

    Thanks to Dave Chinner for explaining the difference between REQ_META
    and REQ_PRIO from a file system developer's point of view. Quoting part
    of his explanation from the mailing list:

        REQ_META is used for metadata. REQ_PRIO is used to communicate to
        the lower layers that the submitter considers this IO to be more
        important than non REQ_PRIO IO and so dispatch should be expedited.

        IOWs, if the filesystem considers metadata IO to be more important
        than user data IO, then it will use REQ_PRIO | REQ_META rather than
        just REQ_META.

    It therefore seems that bios with REQ_META or REQ_PRIO should both be
    cached for performance optimization, because they are all likely
    low-latency I/O demanded by the upper layer (e.g. a file system).

    So in this patch, when we want to decide whether to bypass the cache,
    REQ_META and REQ_PRIO are both checked. Then both metadata and
    high priority I/O requests will be handled properly.
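
    The resulting decision boils down to treating a bio as cache-worthy
    when either flag is set, roughly (an illustrative helper, not the
    literal check_should_bypass() hunk):

      /* cache the bio rather than bypassing it when it is marked as
       * metadata or as high-priority I/O */
      static bool bio_is_meta_or_prio(struct bio *bio)
      {
              return bio->bi_opf & (REQ_META | REQ_PRIO);
      }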

    Reported-by: Nix
    Signed-off-by: Coly Li
    Reviewed-by: Andre Noll
    Tested-by: Nix
    Cc: stable@vger.kernel.org
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Coly Li
     
  • commit e406f12dde1a8375d77ea02d91f313fb1a9c6aec upstream.

    mddev->sync_thread can be set to NULL on kzalloc failure downstream.
    The patch checks for such a scenario and frees allocated resources.

    Committer note:

    Added similar fix to raid5.c, as suggested by Guoqing.
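
    The check amounts to something like the following in the raid10/raid5
    setup paths (a sketch; the "reshape" name and the error label are
    placeholders, not necessarily the upstream code):

      mddev->sync_thread = md_register_thread(md_do_sync, mddev, "reshape");
      if (!mddev->sync_thread)
              goto out_free_conf;  /* unwind the resources allocated so far */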

    Cc: stable@vger.kernel.org # v3.16+
    Acked-by: Guoqing Jiang
    Signed-off-by: Aditya Pakki
    Signed-off-by: Song Liu
    Signed-off-by: Greg Kroah-Hartman

    Aditya Pakki
     
  • commit 9951379b0ca88c95876ad9778b9099e19a95d566 upstream.

    Some users see panics like the following when performing fstrim on a
    bcached volume:

    [ 529.803060] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    [ 530.183928] #PF error: [normal kernel read fault]
    [ 530.412392] PGD 8000001f42163067 P4D 8000001f42163067 PUD 1f42168067 PMD 0
    [ 530.750887] Oops: 0000 [#1] SMP PTI
    [ 530.920869] CPU: 10 PID: 4167 Comm: fstrim Kdump: loaded Not tainted 5.0.0-rc1+ #3
    [ 531.290204] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
    [ 531.693137] RIP: 0010:blk_queue_split+0x148/0x620
    [ 531.922205] Code: 60 38 89 55 a0 45 31 db 45 31 f6 45 31 c9 31 ff 89 4d 98 85 db 0f 84 7f 04 00 00 44 8b 6d 98 4c 89 ee 48 c1 e6 04 49 03 70 78 46 08 44 8b 56 0c 48
    8b 16 44 29 e0 39 d8 48 89 55 a8 0f 47 c3
    [ 532.838634] RSP: 0018:ffffb9b708df39b0 EFLAGS: 00010246
    [ 533.093571] RAX: 00000000ffffffff RBX: 0000000000046000 RCX: 0000000000000000
    [ 533.441865] RDX: 0000000000000200 RSI: 0000000000000000 RDI: 0000000000000000
    [ 533.789922] RBP: ffffb9b708df3a48 R08: ffff940d3b3fdd20 R09: 0000000000000000
    [ 534.137512] R10: ffffb9b708df3958 R11: 0000000000000000 R12: 0000000000000000
    [ 534.485329] R13: 0000000000000000 R14: 0000000000000000 R15: ffff940d39212020
    [ 534.833319] FS: 00007efec26e3840(0000) GS:ffff940d1f480000(0000) knlGS:0000000000000000
    [ 535.224098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 535.504318] CR2: 0000000000000008 CR3: 0000001f4e256004 CR4: 00000000001606e0
    [ 535.851759] Call Trace:
    [ 535.970308] ? mempool_alloc_slab+0x15/0x20
    [ 536.174152] ? bch_data_insert+0x42/0xd0 [bcache]
    [ 536.403399] blk_mq_make_request+0x97/0x4f0
    [ 536.607036] generic_make_request+0x1e2/0x410
    [ 536.819164] submit_bio+0x73/0x150
    [ 536.980168] ? submit_bio+0x73/0x150
    [ 537.149731] ? bio_associate_blkg_from_css+0x3b/0x60
    [ 537.391595] ? _cond_resched+0x1a/0x50
    [ 537.573774] submit_bio_wait+0x59/0x90
    [ 537.756105] blkdev_issue_discard+0x80/0xd0
    [ 537.959590] ext4_trim_fs+0x4a9/0x9e0
    [ 538.137636] ? ext4_trim_fs+0x4a9/0x9e0
    [ 538.324087] ext4_ioctl+0xea4/0x1530
    [ 538.497712] ? _copy_to_user+0x2a/0x40
    [ 538.679632] do_vfs_ioctl+0xa6/0x600
    [ 538.853127] ? __do_sys_newfstat+0x44/0x70
    [ 539.051951] ksys_ioctl+0x6d/0x80
    [ 539.212785] __x64_sys_ioctl+0x1a/0x20
    [ 539.394918] do_syscall_64+0x5a/0x110
    [ 539.568674] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    We have observed it where both:
    1) LVM/devmapper is involved (bcache backing device is LVM volume) and
    2) writeback cache is involved (bcache cache_mode is writeback)

    On one machine, we can reliably reproduce it with:

    # echo writeback > /sys/block/bcache0/bcache/cache_mode
    (not sure whether above line is required)
    # mount /dev/bcache0 /test
    # for i in {0..10}; do
    file="$(mktemp /test/zero.XXX)"
    dd if=/dev/zero of="$file" bs=1M count=256
    sync
    rm $file
    done
    # fstrim -v /test

    Observing this with tracepoints on, we see the following writes:

    fstrim-18019 [022] .... 91107.302026: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 4260112 + 196352 hit 0 bypass 1
    fstrim-18019 [022] .... 91107.302050: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 4456464 + 262144 hit 0 bypass 1
    fstrim-18019 [022] .... 91107.302075: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 4718608 + 81920 hit 0 bypass 1
    fstrim-18019 [022] .... 91107.302094: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 5324816 + 180224 hit 0 bypass 1
    fstrim-18019 [022] .... 91107.302121: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 5505040 + 262144 hit 0 bypass 1
    fstrim-18019 [022] .... 91107.302145: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 5767184 + 81920 hit 0 bypass 1
    fstrim-18019 [022] .... 91107.308777: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 6373392 + 180224 hit 1 bypass 0

    Note the final one has different hit/bypass flags.

    This is because in should_writeback(), we were hitting a case where
    the partial stripe condition was returning true and so
    should_writeback() was returning true early.

    If that hadn't been the case, it would have hit the would_skip test, and
    as would_skip == s->iop.bypass == true, should_writeback() would have
    returned false.

    Looking at the git history from 'commit 72c270612bd3 ("bcache: Write out
    full stripes")', it looks like the idea was to optimise for raid5/6:

    * If a stripe is already dirty, force writes to that stripe to
    writeback mode - to help build up full stripes of dirty data

    To fix this issue, make sure that should_writeback() on a discard op
    never returns true.
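
    In code terms the fix is essentially an early bail-out for discards in
    should_writeback(), roughly (not a verbatim hunk):

      if (bio_op(bio) == REQ_OP_DISCARD)
              return false;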

    More details of debugging:
    https://www.spinics.net/lists/linux-bcache/msg06996.html

    Previous reports:
    - https://bugzilla.kernel.org/show_bug.cgi?id=201051
    - https://bugzilla.kernel.org/show_bug.cgi?id=196103
    - https://www.spinics.net/lists/linux-bcache/msg06885.html

    (Coly Li: minor modification to follow maximum 75 chars per line rule)

    Cc: Kent Overstreet
    Cc: stable@vger.kernel.org
    Fixes: 72c270612bd3 ("bcache: Write out full stripes")
    Signed-off-by: Daniel Axtens
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Daniel Axtens
     
  • commit 225557446856448039a9e495da37b72c20071ef2 upstream.

    When using dm-integrity underneath md-raid, some tests with raid
    auto-correction trigger large numbers of integrity failures - and all
    these failures print an error message. These messages can bring the
    system to a halt if it is using a serial console.

    Fix this by limiting the rate of error messages - it improves the speed
    of raid recovery and avoids the hang.
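
    Device-mapper already provides a rate-limited logging macro,
    DMERR_LIMIT(), so the change is essentially of the following shape
    (the exact messages and mechanism used by the upstream patch are an
    assumption here):

      /* rate-limited instead of unconditional error reporting */
      DMERR_LIMIT("Checksum failed at sector 0x%llx",
                  (unsigned long long)sector);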

    Fixes: 7eada909bfd7a ("dm: add integrity target")
    Cc: stable@vger.kernel.org # v4.12+
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     

19 Mar, 2019

1 commit

  • commit b761dcf1217760a42f7897c31dcb649f59b2333e upstream.

    reshape_request already adds len to sector_nr; it's wrong to add len to
    sector_nr again after adding pages to the bio. If there is a bad block,
    it can't copy one chunk at a time and needs to goto read_more; sector_nr
    then ends up wrong, which can cause data corruption.

    Cc: stable@vger.kernel.org # v3.16+
    Signed-off-by: Xiao Ni
    Signed-off-by: Song Liu
    Signed-off-by: Greg Kroah-Hartman

    Xiao Ni
     

20 Feb, 2019

3 commits

  • commit 4ae280b4ee3463fa57bbe6eede26b97daff8a0f1 upstream.

    When provisioning a new data block for a virtual block, either because
    the block was previously unallocated or because we are breaking sharing,
    if the whole block of data is being overwritten the bio that triggered
    the provisioning is issued immediately, skipping copying or zeroing of
    the data block.

    When this bio completes, the new mapping is inserted into the pool's
    metadata by process_prepared_mapping(), where the bio completion is
    signaled to the upper layers.

    This completion is signaled without first committing the metadata. If
    the bio in question has the REQ_FUA flag set and the system crashes
    right after its completion and before the next metadata commit, then the
    write is lost despite the REQ_FUA flag requiring that I/O completion for
    this request must only be signaled after the data has been committed to
    non-volatile storage.

    Fix this by deferring the completion of overwrite bios, with the REQ_FUA
    flag set, until after the metadata has been committed.

    Cc: stable@vger.kernel.org
    Signed-off-by: Nikos Tsironis
    Acked-by: Joe Thornber
    Acked-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Nikos Tsironis
     
  • commit ff0c129d3b5ecb3df7c8f5e2236582bf745b6c5f upstream.

    bio_sectors() returns the value in units of 512-byte sectors (regardless
    of the real sector size of the device). dm-crypt multiplies
    bio_sectors() by on_disk_tag_size to calculate the space allocated for
    integrity tags. If dm-crypt is running with a sector size larger than
    512 bytes, it allocates more space than is needed.

    Device Mapper trims the extra space when passing the bio to
    dm-integrity, so this bug didn't result in any visible misbehavior.
    But it must be fixed to avoid wasteful memory allocation for the block
    integrity payload.
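
    The corrected sizing scales the 512-byte sector count down to the
    number of crypt sectors before multiplying by the per-sector tag size,
    roughly as follows (the crypt_config field names are an assumption,
    not a verbatim hunk):

      unsigned int tag_len = cc->on_disk_tag_size *
                             (bio_sectors(bio) >> cc->sector_shift);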

    Fixes: ef43aa38063a6 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)")
    Cc: stable@vger.kernel.org # 4.12+
    Reported-by: Milan Broz
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit dfcc34c99f3ebc16b787b118763bf9cb6b1efc7a upstream.

    sync_request_write no longer submits writes to a Faulty device. This has
    the unfortunate side effect that bitmap bits can be incorrectly cleared
    if a recovery is interrupted (previously, end_sync_write would have
    prevented this). This means the next recovery may not copy everything
    it should, potentially corrupting data.

    Add a function for doing the proper md_bitmap_end_sync, called from
    end_sync_write and the Faulty case in sync_request_write.

    backport note to 4.14: s/md_bitmap_end_sync/bitmap_end_sync
    Cc: stable@vger.kernel.org 4.14+
    Fixes: 0c9d5b127f69 ("md/raid1: avoid reusing a resync bio after error handling.")
    Reviewed-by: Jack Wang
    Tested-by: Jack Wang
    Signed-off-by: Nate Dailey
    Signed-off-by: Song Liu
    Signed-off-by: Greg Kroah-Hartman

    Nate Dailey
     

13 Feb, 2019

1 commit

  • [ Upstream commit e820d55cb99dd93ac2dc949cf486bb187e5cd70d ]

    When regular IO and resync IO happen at the same time, and the regular
    IO also needs to be split, tasks can hang due to the barrier.

    1. resync thread
    [ 1463.757205] INFO: task md1_resync:5215 blocked for more than 480 seconds.
    [ 1463.757207] Not tainted 4.19.5-1-default #1
    [ 1463.757209] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 1463.757212] md1_resync D 0 5215 2 0x80000000
    [ 1463.757216] Call Trace:
    [ 1463.757223] ? __schedule+0x29a/0x880
    [ 1463.757231] ? raise_barrier+0x8d/0x140 [raid10]
    [ 1463.757236] schedule+0x78/0x110
    [ 1463.757243] raise_barrier+0x8d/0x140 [raid10]
    [ 1463.757248] ? wait_woken+0x80/0x80
    [ 1463.757257] raid10_sync_request+0x1f6/0x1e30 [raid10]
    [ 1463.757265] ? _raw_spin_unlock_irq+0x22/0x40
    [ 1463.757284] ? is_mddev_idle+0x125/0x137 [md_mod]
    [ 1463.757302] md_do_sync.cold.78+0x404/0x969 [md_mod]
    [ 1463.757311] ? wait_woken+0x80/0x80
    [ 1463.757336] ? md_rdev_init+0xb0/0xb0 [md_mod]
    [ 1463.757351] md_thread+0xe9/0x140 [md_mod]
    [ 1463.757358] ? _raw_spin_unlock_irqrestore+0x2e/0x60
    [ 1463.757364] ? __kthread_parkme+0x4c/0x70
    [ 1463.757369] kthread+0x112/0x130
    [ 1463.757374] ? kthread_create_worker_on_cpu+0x40/0x40
    [ 1463.757380] ret_from_fork+0x3a/0x50

    2. regular IO
    [ 1463.760679] INFO: task kworker/0:8:5367 blocked for more than 480 seconds.
    [ 1463.760683] Not tainted 4.19.5-1-default #1
    [ 1463.760684] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 1463.760687] kworker/0:8 D 0 5367 2 0x80000000
    [ 1463.760718] Workqueue: md submit_flushes [md_mod]
    [ 1463.760721] Call Trace:
    [ 1463.760731] ? __schedule+0x29a/0x880
    [ 1463.760741] ? wait_barrier+0xdd/0x170 [raid10]
    [ 1463.760746] schedule+0x78/0x110
    [ 1463.760753] wait_barrier+0xdd/0x170 [raid10]
    [ 1463.760761] ? wait_woken+0x80/0x80
    [ 1463.760768] raid10_write_request+0xf2/0x900 [raid10]
    [ 1463.760774] ? wait_woken+0x80/0x80
    [ 1463.760778] ? mempool_alloc+0x55/0x160
    [ 1463.760795] ? md_write_start+0xa9/0x270 [md_mod]
    [ 1463.760801] ? try_to_wake_up+0x44/0x470
    [ 1463.760810] raid10_make_request+0xc1/0x120 [raid10]
    [ 1463.760816] ? wait_woken+0x80/0x80
    [ 1463.760831] md_handle_request+0x121/0x190 [md_mod]
    [ 1463.760851] md_make_request+0x78/0x190 [md_mod]
    [ 1463.760860] generic_make_request+0x1c6/0x470
    [ 1463.760870] raid10_write_request+0x77a/0x900 [raid10]
    [ 1463.760875] ? wait_woken+0x80/0x80
    [ 1463.760879] ? mempool_alloc+0x55/0x160
    [ 1463.760895] ? md_write_start+0xa9/0x270 [md_mod]
    [ 1463.760904] raid10_make_request+0xc1/0x120 [raid10]
    [ 1463.760910] ? wait_woken+0x80/0x80
    [ 1463.760926] md_handle_request+0x121/0x190 [md_mod]
    [ 1463.760931] ? _raw_spin_unlock_irq+0x22/0x40
    [ 1463.760936] ? finish_task_switch+0x74/0x260
    [ 1463.760954] submit_flushes+0x21/0x40 [md_mod]

    So resync io is waiting for regular write io to complete in order to
    decrease nr_pending (conf->barrier++ is called before waiting). The
    regular write io splits off another bio after calling wait_barrier
    (which increments nr_pending); the split bio then continues with
    raid10_write_request -> wait_barrier, so the split bio has to wait for
    the barrier to be zero, and the deadlock happens as follows.

        resync io                   regular io

        raise_barrier
                                    wait_barrier
                                    generic_make_request
                                    wait_barrier

    To resolve the issue, we need to call allow_barrier to decrease
    nr_pending before generic_make_request, since the regular IO is not yet
    issued to the underlying devices, and wait_barrier is called again
    afterwards to ensure no internal IO is happening.
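
    In code, the fix is essentially the following pattern around the
    resubmission of the split bio in the raid10 request path (a sketch,
    not a verbatim hunk):

      allow_barrier(conf);        /* drop nr_pending before re-entering */
      generic_make_request(bio);  /* submit the split-off bio */
      wait_barrier(conf);         /* and take it again before continuing */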

    Fixes: fc9977dd069e ("md/raid10: simplify the splitting of requests.")
    Reported-and-tested-by: Siniša Bandin
    Signed-off-by: Guoqing Jiang
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin

    Guoqing Jiang
     

07 Feb, 2019

1 commit

  • commit 483cbbeddd5fe2c80fd4141ff0748fa06c4ff146 upstream.

    This fixes the case when md array assembly fails because raid cache
    recovery is unable to allocate a stripe, despite attempts to replay
    stripes and increase the cache size. This happens because stripes
    released by r5c_recovery_replay_stripes and raid5_set_cache_size don't
    become available for allocation immediately. Released stripes are first
    placed on the conf->released_stripes list and require the md thread to
    merge them onto conf->inactive_list before they can be allocated.

    The patch allows the final allocation attempt during cache recovery to
    wait for new stripes to become available for allocation.

    Cc: linux-raid@vger.kernel.org
    Cc: Shaohua Li
    Cc: linux-stable # 4.10+
    Fixes: b4c625c67362 ("md/r5cache: r5cache recovery: part 1")
    Signed-off-by: Alexei Naberezhnov
    Signed-off-by: Song Liu
    Signed-off-by: Greg Kroah-Hartman

    Alexei Naberezhnov
     

31 Jan, 2019

2 commits

  • commit 1856b9f7bcc8e9bdcccc360aabb56fbd4dd6c565 upstream.

    The dm-crypt cipher specification in a mapping table is defined as:
    cipher[:keycount]-chainmode-ivmode[:ivopts]
    or (new crypt API format):
    capi:cipher_api_spec-ivmode[:ivopts]

    For ESSIV, the parameter includes a hash specification, for example:
    aes-cbc-essiv:sha256

    The implementation expected the additional IV option to never include
    another dash '-' character.

    But, with SHA3, there are names like sha3-256; so the mapping table
    parser fails:

    dmsetup create test --table "0 8 crypt aes-cbc-essiv:sha3-256 9c1185a5c5e9fc54612808977ee8f5b9e 0 /dev/sdb 0"
    or (new crypt API format)
    dmsetup create test --table "0 8 crypt capi:cbc(aes)-essiv:sha3-256 9c1185a5c5e9fc54612808977ee8f5b9e 0 /dev/sdb 0"

    device-mapper: crypt: Ignoring unexpected additional cipher options
    device-mapper: table: 253:0: crypt: Error creating IV
    device-mapper: ioctl: error adding target to table

    Fix the dm-crypt constructor to ignore additional dash in IV options and
    also remove a bogus warning (that is ignored anyway).

    Cc: stable@vger.kernel.org # 4.12+
    Signed-off-by: Milan Broz
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Milan Broz
     
  • commit d445bd9cec1a850c2100fcf53684c13b3fd934f2 upstream.

    Commit 00a0ea33b495 ("dm thin: do not queue freed thin mapping for next
    stage processing") changed process_prepared_discard_passdown_pt1() to
    increment all the blocks being discarded until after the passdown had
    completed to avoid them being prematurely reused.

    IO issued to a thin device that breaks sharing with a snapshot, followed
    by a discard issued to snapshot(s) that previously shared the block(s),
    results in passdown_double_checking_shared_status() being called to
    iterate through the blocks, double checking that their reference count
    is zero and issuing the passdown if so. So a side effect of commit
    00a0ea33b495 is that passdown_double_checking_shared_status() was broken.

    Fix this by checking if the block reference count is greater than 1.
    Also, rename dm_pool_block_is_used() to dm_pool_block_is_shared().

    Fixes: 00a0ea33b495 ("dm thin: do not queue freed thin mapping for next stage processing")
    Cc: stable@vger.kernel.org # 4.9+
    Reported-by: ryan.p.norwood@gmail.com
    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Joe Thornber
     

26 Jan, 2019

4 commits

  • [ Upstream commit ef87bfc24f9b8da82c89aff493df20f078bc9cb1 ]

    A reference to a device in a device-mapper table contains an offset in
    sectors.

    If sector_t is a 32-bit integer (CONFIG_LBDAF is not set), then several
    device-mapper targets can overflow this offset; the validity check is
    then performed on a wrong offset and a wrong table is activated.

    See for example (on 32bit without CONFIG_LBDAF) this overflow:

    # dmsetup create test --table "0 2048 linear /dev/sdg 4294967297"
    # dmsetup table test
    0 2048 linear 8:96 1

    This patch adds an explicit check for overflow if the offset is stored
    in a sector_t.
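
    The added check is of the following shape in a target constructor
    (dm-linear-style argument parsing shown as an example, not a verbatim
    hunk):

      unsigned long long tmp;
      char dummy;

      if (sscanf(argv[1], "%llu%c", &tmp, &dummy) != 1 ||
          tmp != (sector_t)tmp) {
              ti->error = "Invalid device sector";
              return -EINVAL;
      }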

    Signed-off-by: Milan Broz
    Reviewed-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Milan Broz
     
  • [ Upstream commit 721b1d98fb517ae99ab3b757021cf81db41e67be ]

    kcopyd has no upper limit to the number of jobs one can allocate and
    issue. Under certain workloads this can lead to excessive memory usage
    and workqueue stalls. For example, when creating multiple dm-snapshot
    targets with a 4K chunk size and then writing to the origin through the
    page cache. Syncing the page cache causes a large number of BIOs to be
    issued to the dm-snapshot origin target, which itself issues an even
    larger (because of the BIO splitting taking place) number of kcopyd
    jobs.

    Running the following test, from the device mapper test suite [1],

    dmtest run --suite snapshot -n many_snapshots_of_same_volume_N

    , with 8 active snapshots, results in the kcopyd job slab cache growing
    to 10G. Depending on the available system RAM this can lead to the OOM
    killer killing user processes:

    [463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP),
    nodemask=(null), order=1, oom_score_adj=0
    [463.492894] kthreadd cpuset=/ mems_allowed=0
    [463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3
    [463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    [463.492952] Call Trace:
    [463.492964] dump_stack+0x7d/0xbb
    [463.492973] dump_header+0x6b/0x2fc
    [463.492987] ? lockdep_hardirqs_on+0xee/0x190
    [463.493012] oom_kill_process+0x302/0x370
    [463.493021] out_of_memory+0x113/0x560
    [463.493030] __alloc_pages_slowpath+0xf40/0x1020
    [463.493055] __alloc_pages_nodemask+0x348/0x3c0
    [463.493067] cache_grow_begin+0x81/0x8b0
    [463.493072] ? cache_grow_begin+0x874/0x8b0
    [463.493078] fallback_alloc+0x1e4/0x280
    [463.493092] kmem_cache_alloc_node+0xd6/0x370
    [463.493098] ? copy_process.part.31+0x1c5/0x20d0
    [463.493105] copy_process.part.31+0x1c5/0x20d0
    [463.493115] ? __lock_acquire+0x3cc/0x1550
    [463.493121] ? __switch_to_asm+0x34/0x70
    [463.493129] ? kthread_create_worker_on_cpu+0x70/0x70
    [463.493135] ? finish_task_switch+0x90/0x280
    [463.493165] _do_fork+0xe0/0x6d0
    [463.493191] ? kthreadd+0x19f/0x220
    [463.493233] kernel_thread+0x25/0x30
    [463.493235] kthreadd+0x1bf/0x220
    [463.493242] ? kthread_create_on_cpu+0x90/0x90
    [463.493248] ret_from_fork+0x3a/0x50
    [463.493279] Mem-Info:
    [463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0
    [463.493285] active_file:80216 inactive_file:80107 isolated_file:435
    [463.493285] unevictable:0 dirty:51266 writeback:109372 unstable:0
    [463.493285] slab_reclaimable:31191 slab_unreclaimable:3483521
    [463.493285] mapped:526 shmem:4903 pagetables:1759 bounce:0
    [463.493285] free:33623 free_pcp:2392 free_cma:0
    ...
    [463.493489] Unreclaimable slab info:
    [463.493513] Name Used Total
    [463.493522] bio-6 1028KB 1028KB
    [463.493525] bio-5 1028KB 1028KB
    [463.493528] dm_snap_pending_exception 236783KB 243789KB
    [463.493531] dm_exception 41KB 42KB
    [463.493534] bio-4 1216KB 1216KB
    [463.493537] bio-3 439396KB 439396KB
    [463.493539] kcopyd_job 6973427KB 6973427KB
    ...
    [463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child
    [463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB
    [463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    Moreover, issuing a large number of kcopyd jobs results in kcopyd
    hogging the CPU while processing them. As a result, processing of work
    items queued for execution on the same CPU as the currently running
    kcopyd thread is stalled for long periods of time, hurting performance.
    Running the aforementioned test we get, in dmesg, messages like the
    following:

    [67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s!
    [67501.195586] Showing busy workqueues and worker pools:
    [67501.195591] workqueue events: flags=0x0
    [67501.195597] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195611] pending: cache_reap
    [67501.195641] workqueue mm_percpu_wq: flags=0x8
    [67501.195645] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195656] pending: vmstat_update
    [67501.195682] workqueue kblockd: flags=0x18
    [67501.195687] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
    [67501.195698] pending: blk_timeout_work
    [67501.195753] workqueue kcopyd: flags=0x8
    [67501.195757] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195768] pending: do_work [dm_mod]
    [67501.195802] workqueue kcopyd: flags=0x8
    [67501.195806] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195817] pending: do_work [dm_mod]
    [67501.195834] workqueue kcopyd: flags=0x8
    [67501.195838] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195848] pending: do_work [dm_mod]
    [67501.195881] workqueue kcopyd: flags=0x8
    [67501.195885] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195896] pending: do_work [dm_mod]
    [67501.195920] workqueue kcopyd: flags=0x8
    [67501.195924] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256
    [67501.195935] in-flight: 67:do_work [dm_mod]
    [67501.195945] pending: do_work [dm_mod]
    [67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765

    The root cause for these issues is the way dm-snapshot uses kcopyd. In
    particular, the lack of an explicit or implicit limit to the maximum
    number of in-flight COW jobs. The merging path is not affected because
    it implicitly limits the in-flight kcopyd jobs to one.

    Fix these issues by using a semaphore to limit the maximum number of
    in-flight kcopyd jobs. We grab the semaphore before allocating a new
    kcopyd job in start_copy() and start_full_bio() and release it after the
    job finishes in copy_callback().

    The initial semaphore value is configurable through a module parameter,
    to allow fine tuning the maximum number of in-flight COW jobs. Setting
    this parameter to zero initializes the semaphore to INT_MAX.

    A default value of 2048 maximum in-flight kcopyd jobs was chosen. This
    value was decided experimentally as a trade-off between memory
    consumption, stalling the kernel's workqueues and maintaining a high
    enough throughput.
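
    A sketch of the throttling scheme described above (the symbol names
    are approximations of what the patch introduces, not necessarily the
    upstream identifiers):

      #include <linux/kernel.h>
      #include <linux/module.h>
      #include <linux/semaphore.h>

      static int cow_threshold = 2048;  /* default max in-flight COW jobs */
      module_param_named(snapshot_cow_threshold, cow_threshold, int, 0644);

      static struct semaphore cow_count;

      static void cow_throttle_init(void)
      {
              /* a threshold of 0 means "effectively unlimited" */
              sema_init(&cow_count, cow_threshold ? cow_threshold : INT_MAX);
      }

      /* called before allocating/issuing a kcopyd job */
      static void cow_throttle_get(void)
      {
              down(&cow_count);
      }

      /* called from the job's completion callback */
      static void cow_throttle_put(void)
      {
              up(&cow_count);
      }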

    Re-running the aforementioned test:

    * Workqueue stalls are eliminated
    * kcopyd's job slab cache uses a maximum of 130MB
    * The time taken by the test to write to the snapshot-origin target is
    reduced from 05m20.48s to 03m26.38s

    [1] https://github.com/jthornber/device-mapper-test-suite

    Signed-off-by: Nikos Tsironis
    Signed-off-by: Ilias Tsitsimpis
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Nikos Tsironis
     
  • [ Upstream commit d7e6b8dfc7bcb3f4f3a18313581f67486a725b52 ]

    When using kcopyd to run callbacks through dm_kcopyd_do_callback() or
    submitting copy jobs with a source size of 0, the jobs are pushed
    directly to the complete_jobs list, which could be under processing by
    the kcopyd thread. As a result, the kcopyd thread can continue running
    completed jobs indefinitely, without releasing the CPU, as long as
    someone keeps submitting new completed jobs through the aforementioned
    paths. Processing of work items, queued for execution on the same CPU as
    the currently running kcopyd thread, is thus stalled for excessive
    amounts of time, hurting performance.

    Running the following test, from the device mapper test suite [1],

    dmtest run --suite snapshot -n parallel_io_to_many_snaps_N

    , with 8 active snapshots, we get, in dmesg, messages like the
    following:

    [68899.948523] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 95s!
    [68899.949282] Showing busy workqueues and worker pools:
    [68899.949288] workqueue events: flags=0x0
    [68899.949295] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
    [68899.949306] pending: vmstat_shepherd, cache_reap
    [68899.949331] workqueue mm_percpu_wq: flags=0x8
    [68899.949337] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [68899.949345] pending: vmstat_update
    [68899.949387] workqueue dm_bufio_cache: flags=0x8
    [68899.949392] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
    [68899.949400] pending: work_fn [dm_bufio]
    [68899.949423] workqueue kcopyd: flags=0x8
    [68899.949429] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [68899.949437] pending: do_work [dm_mod]
    [68899.949452] workqueue kcopyd: flags=0x8
    [68899.949458] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
    [68899.949466] in-flight: 13:do_work [dm_mod]
    [68899.949474] pending: do_work [dm_mod]
    [68899.949487] workqueue kcopyd: flags=0x8
    [68899.949493] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [68899.949501] pending: do_work [dm_mod]
    [68899.949515] workqueue kcopyd: flags=0x8
    [68899.949521] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [68899.949529] pending: do_work [dm_mod]
    [68899.949541] workqueue kcopyd: flags=0x8
    [68899.949547] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [68899.949555] pending: do_work [dm_mod]
    [68899.949568] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=95s workers=4 idle: 27130 27223 1084

    Fix this by splitting the complete_jobs list into two parts: A user
    facing part, named callback_jobs, and one used internally by kcopyd,
    retaining the name complete_jobs. dm_kcopyd_do_callback() and
    dispatch_job() now push their jobs to the callback_jobs list, which is
    spliced to the complete_jobs list once, every time the kcopyd thread
    wakes up. This prevents kcopyd from hogging the CPU indefinitely and
    causing workqueue stalls.
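
    The splice described above happens once per wakeup at the top of the
    kcopyd worker, roughly (callback_jobs and complete_jobs follow the
    description; the lock name is an assumption):

      spin_lock_irq(&kc->job_lock);
      list_splice_tail_init(&kc->callback_jobs, &kc->complete_jobs);
      spin_unlock_irq(&kc->job_lock);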

    Re-running the aforementioned test:

    * Workqueue stalls are eliminated
    * The maximum writing time among all targets is reduced from 09m37.10s
    to 06m04.85s and the total run time of the test is reduced from
    10m43.591s to 7m19.199s

    [1] https://github.com/jthornber/device-mapper-test-suite

    Signed-off-by: Nikos Tsironis
    Signed-off-by: Ilias Tsitsimpis
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Nikos Tsironis
     
  • [ Upstream commit 8d683dcd65c037efc9fb38c696ec9b65b306e573 ]

    The iv_offset in the mapping table of the crypt target is a 64-bit
    number when the IV algorithm is plain64, plain64be, essiv or benbi. It
    is assigned to iv_offset of struct crypt_config, cc_sector of struct
    convert_context and iv_sector of struct dm_crypt_request. These
    structure members are defined as sector_t. But sector_t is 32 bits when
    CONFIG_LBDAF is not set in a 32-bit kernel; in this situation sector_t
    is not big enough to store the 64-bit iv_offset.

    Here is a reproducer.
    Prepare test image and device (loop is automatically allocated by cryptsetup):

    # dd if=/dev/zero of=tst.img bs=1M count=1
    # echo "tst"|cryptsetup open --type plain -c aes-xts-plain64 \
    --skip 500000000000000000 tst.img test

    On a 32-bit system (using an IV offset value that overflows to 64 bits;
    CONFIG_LBDAF is off), the table shows a truncated offset and the device
    checksum is wrong:

    # dmsetup table test --showkeys
    0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 3551657984 7:0 0

    # sha256sum /dev/mapper/test
    533e25c09176632b3794f35303488c4a8f3f965dffffa6ec2df347c168cb6c19 /dev/mapper/test

    On a 64-bit system (and on a 32-bit system with the patch), the table
    and checksum are now correct:

    # dmsetup table test --showkeys
    0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 500000000000000000 7:0 0

    # sha256sum /dev/mapper/test
    5d16160f9d5f8c33d8051e65fdb4f003cc31cd652b5abb08f03aa6fce0df75fc /dev/mapper/test

    Signed-off-by: AliOS system security
    Tested-and-Reviewed-by: Milan Broz
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    AliOS system security
     

20 Dec, 2018

4 commits

  • commit d57f9da890696af1484f4a47f7f123560197865a upstream.

    struct bioctx includes the ref refcount_t to track the number of I/O
    fragments used to process a target BIO as well as ensure that the zone
    of the BIO is kept in the active state throughout the lifetime of the
    BIO. However, since decrementing of this reference count is done in the
    target .end_io method, the function bio_endio() must be called multiple
    times for read and write target BIOs, which causes problems with the
    value of the __bi_remaining struct bio field for chained BIOs (e.g. the
    clone BIO passed by dm core is large and splits into fragments by the
    block layer), resulting in incorrect values and inconsistencies with the
    BIO_CHAIN flag setting. This is turn triggers the BUG_ON() call:

    BUG_ON(atomic_read(&bio->__bi_remaining)
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Damien Le Moal
     
  • commit 89f5fa47476eda56402e29fff3c5097f5c2a1e19 upstream.

    Otherwise the incoming bios, of various types, won't be shaped based on
    the DM device's advertised limits.

    Depends-on: af67c31fba ("blk: remove bio_set arg from blk_queue_split()")
    Fixes: 744889b7cb ("block: don't deal with discard limit in blkdev_issue_discard()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     
  • commit 687cf4412a343a63928a5c9d91bdc0f522939d43 upstream.

    Otherwise dm_bitset_cursor_begin() returns -ENODATA. Other calls to
    dm_bitset_cursor_begin() have similar negative checks.

    Fixes inability to create a cache in passthrough mode (even though doing
    so makes no sense).

    Fixes: 0d963b6e65 ("dm cache metadata: fix metadata2 format's blocks_are_clean_separate_dirty")
    Cc: stable@vger.kernel.org
    Reported-by: David Teigland
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     
  • commit f6c367585d0d851349d3a9e607c43e5bea993fa1 upstream.

    Sending a DM event before a thin-pool state change is about to happen is
    a bug. It wasn't realized until it became clear that userspace response
    to the event raced with the actual state change that the event was
    meant to notify about.

    Fix this by first updating internal thin-pool state to reflect what the
    DM event is being issued about. This fixes a long-standing racey/buggy
    userspace device-mapper-test-suite 'resize_io' test that would get an
    event but not find the state it was looking for -- so it would just go
    on to hang because no other events caused the test to reevaluate the
    thin-pool's state.

    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

14 Nov, 2018

10 commits

  • commit 9e753ba9b9b405e3902d9f08aec5f2ea58a0c317 upstream.

    Commit d595567dc4f0 (MD: fix invalid stored role for a disk) broke linear
    hotadd. Let's only fix the role for disks in raid1/10.
    Based on Guoqing's original patch.

    Reported-by: kernel test robot
    Cc: Gioh Kim
    Cc: Guoqing Jiang
    Signed-off-by: Shaohua Li
    Signed-off-by: Greg Kroah-Hartman

    Shaohua Li
     
  • commit 3d4e738311327bc4ba1d55fbe2f1da3de4a475f9 upstream.

    dmz_fetch_mblock() called from dmz_get_mblock() has a race since the
    allocation of the new metadata block descriptor and its insertion in
    the cache rbtree with the READING state is not atomic. Two different
    contexts requesting the same block may thus each add their own
    descriptor for the same block to the cache.

    Another problem for this function is that the BIO for processing the
    block read is allocated after the metadata block descriptor is inserted
    in the cache rbtree. If the BIO allocation fails, the metadata block
    descriptor is freed without first being removed from the rbtree.

    Fix the first problem by checking again if the requested block is not in
    the cache right before inserting the newly allocated descriptor,
    atomically under the mblk_lock spinlock. The second problem is fixed by
    simply allocating the BIO before inserting the new block in the cache.

    Finally, since dmz_fetch_mblock() also increments a block reference
    counter, rename the function to dmz_get_mblock_slow(). To be symmetric
    and clear, also rename dmz_lookup_mblock() to dmz_get_mblock_fast() and
    increment the block reference counter directly in that function rather
    than in dmz_get_mblock().
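
    The shape of the "check again under the lock" fix for the first
    problem is roughly as follows (dmz_free_mblock() is assumed as the
    matching release helper; not a verbatim hunk):

      spin_lock(&zmd->mblk_lock);

      /* did another context insert this block while we were allocating? */
      mblk = dmz_get_mblock_fast(zmd, mblk_no);
      if (mblk) {
              spin_unlock(&zmd->mblk_lock);
              dmz_free_mblock(zmd, new_mblk);  /* drop our duplicate */
              bio_put(bio);                    /* and the preallocated BIO */
              return mblk;
      }

      /* still missing: insert the descriptor allocated earlier and submit
       * the read BIO that was allocated before taking the lock */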

    Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
    Cc: stable@vger.kernel.org
    Signed-off-by: Damien Le Moal
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Damien Le Moal
     
  • commit 33c2865f8d011a2ca9f67124ddab9dc89382e9f1 upstream.

    Since the ref field of struct dmz_mblock is always used with the
    spinlock of struct dmz_metadata locked, there is no need to use an
    atomic_t type. Change the type of the ref field to an unsigned
    integer.

    Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
    Cc: stable@vger.kernel.org
    Signed-off-by: Damien Le Moal
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Damien Le Moal
     
  • commit 800a7340ab7dd667edf95e74d8e4f23a17e87076 upstream.

    In copy_params(), the struct 'dm_ioctl' is first copied from the user
    space buffer 'user' to 'param_kernel' and the field 'data_size' is
    checked against 'minimum_data_size' (size of 'struct dm_ioctl' payload
    up to its 'data' member). If the check fails, an error code EINVAL will be
    returned. Otherwise, param_kernel->data_size is used to do a second copy,
    which copies from the same user-space buffer to 'dmi'. After the second
    copy, only 'dmi->data_size' is checked against 'param_kernel->data_size'.
    Given that the buffer 'user' resides in the user space, a malicious
    user-space process can race to change the content in the buffer between
    the two copies. This way, the attacker can inject inconsistent data
    into 'dmi' (versus previously validated 'param_kernel').

    Fix redundant copying of 'minimum_data_size' from user-space buffer by
    using the first copy stored in 'param_kernel'. Also remove the
    'data_size' check after the second copy because it is now unnecessary.
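
    The hardened copy is roughly of the following shape (not a verbatim
    hunk): the validated header comes from param_kernel, and only the
    payload that follows it is copied from user space a second time.

      /* reuse the already-validated header instead of re-reading it */
      memcpy(dmi, param_kernel, minimum_data_size);

      /* copy only the payload that follows the header */
      if (copy_from_user(&dmi->data,
                         (char __user *)user + minimum_data_size,
                         param_kernel->data_size - minimum_data_size))
              goto bad;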

    Cc: stable@vger.kernel.org
    Signed-off-by: Wenwen Wang
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Wenwen Wang
     
  • [ Upstream commit d595567dc4f0c1d90685ec1e2e296e2cad2643ac ]

    If we change the number of the array's devices after a device has been
    removed from the array, and then add the device back, we can see that
    the device is added with an active role instead of the spare role we
    expected.

    Please see the below link for details:
    https://marc.info/?l=linux-raid&m=153736982015076&w=2

    This is caused by preferring to use the device's previous role, which is
    recorded in saved_raid_disk; instead we should respect the new value of
    conf->raid_disks, since it could have changed after the device was
    removed.

    Reported-by: Gioh Kim
    Tested-by: Gioh Kim
    Acked-by: Guoqing Jiang
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Shaohua Li
     
  • [ Upstream commit 6aaa58c994277647f8b05ffef3b9b225a2d08f36 ]

    I noticed kmemleak reporting a memory leak when running create/stop md
    in a loop; backtrace:
    [] mempool_create_node+0x86/0xd0
    [] md_run+0x1057/0x1410 [md_mod]
    [] do_md_run+0x15/0x130 [md_mod]
    [] md_ioctl+0x1f49/0x25d0 [md_mod]
    [] blkdev_ioctl+0x680/0xd00

    The root cause is that we allocate mddev->flush_pool and
    mddev->flush_bio_pool in md_run, but do_md_stop does not call into
    md_stop, only __md_stop; moving the mempool_destroy calls to __md_stop
    fixes the problem for me.

    The bug was introduced in 5a409b4f56d5; the fix should go to 4.18+.

    Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios")
    Signed-off-by: Jack Wang
    Reviewed-by: Xiao Ni
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jack Wang
     
  • [ Upstream commit af9b926de9c5986ab009e64917de87c9758bab10 ]

    flush_pool is leaked when the flush bio size is zero.

    Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios")
    Signed-off-by: David Jeffery
    Signed-off-by: Xiao Ni
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Xiao Ni
     
  • [ Upstream commit 7567c2a2ad9e80a2ce977eef535e64b61899633e ]

    Forgot to include the maintainers with my first email.

    Somewhere between Michael Lyle's original
    "bcache: PI controller for writeback rate V2" patch dated 07 Sep 2017
    and commit 1d316e6 ("bcache: implement PI controller for writeback
    rate"), the mapping of the writeback_rate_minimum attribute was dropped.

    Re-add the missing sysfs writeback_rate_minimum attribute mapping to
    "allow the user to specify a minimum rate at which dirty blocks are
    retired."

    Fixes: 1d316e6 ("bcache: implement PI controller for writeback rate")
    Signed-off-by: Ben Peddell
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ben Peddell
     
  • commit 2d6cb6edd2c7fb4f40998895bda45006281b1ac5 upstream.

    refill->end records the last key of writeback. For example, the first
    time around, keys (1,128K) to (1,1024K) are flushed to the backend
    device, but the end key (1,1024K) is not included, because of the code
    below:

        if (bkey_cmp(k, refill->end) >= 0) {
                ret = MAP_DONE;
                goto out;
        }

    The next time we refill the writeback keybuf, we search for keys
    starting from (1,1024K) and get a key bigger than it, so the key
    (1,1024K) is missed.

    This patch modifies the above code so that the end key is included in
    the writeback key buffer.
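
    Presumably the fix makes the comparison strict so that the end key
    itself is still picked up, i.e. roughly:

      /* ">" instead of ">=": keep processing the end key itself */
      if (bkey_cmp(k, refill->end) > 0) {
              ret = MAP_DONE;
              goto out;
      }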

    Signed-off-by: Tang Junhui
    Cc: stable@vger.kernel.org
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tang Junhui
     
  • commit 2e17a262a2371d38d2ec03614a2675a32cef9912 upstream.

    When a bcache device is clean, dirty keys may still exist after journal
    replay, so we need to count these dirty keys even when the device is in
    the clean state; otherwise, after writeback, the amount of dirty data
    would be incorrect.

    Signed-off-by: Tang Junhui
    Cc: stable@vger.kernel.org
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tang Junhui