Eric Lee / smarc-fsl-linux-kernel

19 Aug, 2018

1 commit

08b5fa819 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

Pull input updates from Dmitry Torokhov:

- a new driver for Rohm BU21029 touch controller

- new bitmap APIs: bitmap_alloc, bitmap_zalloc and bitmap_free

- updates to Atmel, eeti, pxrc and iforce drivers

- assorted driver cleanups and fixes.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (57 commits)
MAINTAINERS: Add PhoenixRC Flight Controller Adapter
Input: do not use WARN() in input_alloc_absinfo()
Input: mark expected switch fall-throughs
Input: raydium_i2c_ts - use true and false for boolean values
Input: evdev - switch to bitmap API
Input: gpio-keys - switch to bitmap_zalloc()
Input: elan_i2c_smbus - cast sizeof to int for comparison
bitmap: Add bitmap_alloc(), bitmap_zalloc() and bitmap_free()
md: Avoid namespace collision with bitmap API
dm: Avoid namespace collision with bitmap API
Input: pm8941-pwrkey - add resin entry
Input: pm8941-pwrkey - abstract register offsets and event code
Input: iforce - reorganize joystick configuration lists
Input: atmel_mxt_ts - move completion to after config crc is updated
Input: atmel_mxt_ts - don't report zero pressure from T9
Input: atmel_mxt_ts - zero terminate config firmware file
Input: atmel_mxt_ts - refactor config update code to add context struct
Input: atmel_mxt_ts - config CRC may start at T71
Input: atmel_mxt_ts - remove unnecessary debug on ENOMEM
Input: atmel_mxt_ts - remove duplicate setup of ABS_MT_PRESSURE
...

Linus Torvalds
2018-08-19 07:48:07 +0800

18 Aug, 2018

1 commit

b0e5c2942 Merge tag 'for-4.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

- A couple stable fixes for the DM writecache target.

- A stable fix for the DM cache target that fixes the potential for
data corruption after an unclean shutdown of a cache device using
writeback mode.

- Update DM integrity target to allow the metadata to be stored on a
separate device from data.

- Fix DM kcopyd and the snapshot target to cond_resched() where
appropriate and be more efficient with processing completed work.

- A few fixes and improvements for DM crypt.

- Add DM delay target feature to configure delay of flushes independent
of writes.

- Update DM thin-provisioning target to include metadata_low_watermark
threshold in pool status.

- Fix stale DM thin-provisioning Documentation.

* tag 'for-4.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (26 commits)
dm writecache: fix a crash due to reading past end of dirty_bitmap
dm crypt: don't decrease device limits
dm cache metadata: set dirty on all cache blocks after a crash
dm snapshot: remove stale FIXME in snapshot_map()
dm snapshot: improve performance by switching out_of_order_list to rbtree
dm kcopyd: avoid softlockup in run_complete_job
dm cache metadata: save in-core policy_hint_size to on-disk superblock
dm thin: stop no_space_timeout worker when switching to write-mode
dm kcopyd: return void from dm_kcopyd_copy()
dm thin: include metadata_low_watermark threshold in pool status
dm writecache: report start_sector in status line
dm crypt: convert essiv from ahash to shash
dm crypt: use wake_up_process() instead of a wait queue
dm integrity: recalculate checksums on creation
dm integrity: flush journal on suspend when using separate metadata device
dm integrity: use version 2 for separate metadata
dm integrity: allow separate metadata device
dm integrity: add ic->start in get_data_sector()
dm integrity: report provided data sectors in the status
dm integrity: implement fair range locks
...

Linus Torvalds
2018-08-18 00:52:15 +0800

17 Aug, 2018

1 commit

1e1132ea2 dm writecache: fix a crash due to reading past end of dirty_bitmap

wc->dirty_bitmap_size is in bytes so must multiply it by 8, not by
BITS_PER_LONG, to get number of bitmap_bits.

Fixes crash in find_next_bit() that was reported:
https://bugzilla.kernel.org/show_bug.cgi?id=200819
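
The unit mismatch can be illustrated with a small userspace sketch. The helper names below are hypothetical and are not taken from dm-writecache; only the arithmetic (bytes times 8 versus bytes times BITS_PER_LONG) follows the description above.

```c
#include <limits.h>
#include <stddef.h>

/* Bits per unsigned long, a userspace stand-in for the kernel's BITS_PER_LONG. */
#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)

/* Correct: a bitmap that is size_bytes bytes long holds size_bytes * 8 bits. */
static size_t bitmap_bits_correct(size_t size_bytes)
{
	return size_bytes * CHAR_BIT;
}

/*
 * Buggy: multiplying a byte count by the bits-per-long value overstates the
 * number of bits, so a bit search bounded by this value runs past the end.
 */
static size_t bitmap_bits_buggy(size_t size_bytes)
{
	return size_bytes * BITS_PER_LONG_SKETCH;
}
```

On a 64-bit build the buggy count is 8 times larger than the real number of bits in the bitmap.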

Reported-by: edo.rus@gmail.com
Fixes: 48debafe4f2f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18
Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer

Mikulas Patocka
2018-08-17 01:43:01 +0800

15 Aug, 2018

1 commit

b219a1d2d Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md

Pull MD updates from Shaohua Li:
"A few MD fixes for 4.19-rc1:

- several md-cluster fixes from Guoqing

- a data corruption fix from BingJing

- other cleanups"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md/raid5: fix data corruption of replacements after originals dropped
drivers/md/raid5: Do not disable irq on release_inactive_stripe_list() call
drivers/md/raid5: Use irqsave variant of atomic_dec_and_lock()
md/r5cache: remove redundant pointer bio
md-cluster: don't send msg if array is closing
md-cluster: show array's status more accurate
md-cluster: clear another node's suspend_area after the copy is finished

Linus Torvalds
2018-08-15 02:03:16 +0800

14 Aug, 2018

1 commit

bc9e9cf04 dm crypt: don't decrease device limits

dm-crypt should only increase device limits, it should not decrease them.

This fixes a bug where the user could create a crypt device with 1024
sector size on top of a SCSI device that had 4096 logical block size.
The limit 4096 would be lost and the user could incorrectly send
1024-I/Os to the crypt device.
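
The "only increase" rule amounts to keeping the larger of the stacked limit and the crypt sector size. A minimal sketch, assuming a hypothetical limits_sketch structure and helper name (neither is the kernel's queue_limits code), and covering only the logical block size mentioned above:

```c
/* Hypothetical stand-in for the limits of the underlying device. */
struct limits_sketch {
	unsigned int logical_block_size;
};

/*
 * Apply the crypt sector size to the stacked limits without ever lowering
 * them: a 4096-byte limit stays at 4096 even if sector_size is 1024.
 */
static void stack_sector_size(struct limits_sketch *limits,
			      unsigned int sector_size)
{
	if (sector_size > limits->logical_block_size)
		limits->logical_block_size = sector_size;
}
```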

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer

Mikulas Patocka
2018-08-14 03:28:41 +0800

11 Aug, 2018

1 commit

46451874c bcache: fix error setting writeback_rate through sysfs interface

Commit ea8c5356d390 ("bcache: set max writeback rate when I/O request
is idle") changes struct bch_ratelimit member rate from uint32_t to
atomic_long_t and uses atomic_long_set() in drivers/md/bcache/sysfs.c
to set new writeback rate, after the input is converted from memory
buf to long int by sysfs_strtoul_clamp().

The above change has a problem because there is an implicit return
inside sysfs_strtoul_clamp(), so the following atomic_long_set()
won't be called. This error was detected by the 0day system with the
following snipped smatch warnings:

drivers/md/bcache/sysfs.c:271 __cached_dev_store() error: uninitialized
symbol 'v'.
270 sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@271 atomic_long_set(&dc->writeback_rate.rate, v);

This patch fixes the above error by using strtoul_safe_clamp() to
convert the input buffer into a long int type result.
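
The hidden-return pitfall described above can be reproduced in userspace. This is only a sketch: parse_clamped(), PARSE_AND_RETURN() and rate_sketch are hypothetical names, not the bcache macros, and the clamp range is arbitrary.

```c
#include <errno.h>
#include <stdlib.h>

struct rate_sketch {
	long rate;
};

/* Parse a decimal number and clamp it into [min, max]; 0 on success. */
static int parse_clamped(const char *buf, long min, long max, long *out)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(buf, &end, 10);
	if (errno || end == buf)
		return -EINVAL;
	if (v < min)
		v = min;
	if (v > max)
		v = max;
	*out = v;
	return 0;
}

/* Problematic pattern: the macro returns from the *calling* function. */
#define PARSE_AND_RETURN(var, buf, min, max) \
	do { return parse_clamped(buf, min, max, &(var)); } while (0)

/* Buggy caller: the store after the macro is never reached. */
static int store_buggy(struct rate_sketch *r, const char *buf)
{
	long v = 0;

	PARSE_AND_RETURN(v, buf, 1, 1000);
	r->rate = v;	/* dead code because of the hidden return */
	return 0;
}

/* Fixed caller: convert with a plain function call, then store. */
static int store_fixed(struct rate_sketch *r, const char *buf)
{
	long v;
	int ret = parse_clamped(buf, 1, 1000, &v);

	if (ret)
		return ret;
	r->rate = v;
	return 0;
}
```

With store_buggy() the parse succeeds but the rate field keeps its old value; store_fixed() updates it.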

Fixes: ea8c5356d390 ("bcache: set max writeback rate when I/O request is idle")
Cc: Kai Krakow
Cc: Stefan Priebe
Signed-off-by: Coly Li
Signed-off-by: Jens Axboe

Coly Li
2018-08-11 02:18:47 +0800

10 Aug, 2018

1 commit

5b1fe7bec dm cache metadata: set dirty on all cache blocks after a crash

Quoting Documentation/device-mapper/cache.txt:

The 'dirty' state for a cache block changes far too frequently for us
to keep updating it on the fly. So we treat it as a hint. In normal
operation it will be written when the dm device is suspended. If the
system crashes all cache blocks will be assumed dirty when restarted.

This got broken in commit f177940a8091 ("dm cache metadata: switch to
using the new cursor api for loading metadata") in 4.9, which removed
the code that consulted cmd->clean_when_opened (CLEAN_SHUTDOWN on-disk
flag) when loading cache blocks. This results in data corruption on an
unclean shutdown with dirty cache blocks on the fast device. After the
crash those blocks are considered clean and may get evicted from the
cache at any time. This can be demonstrated by doing a lot of reads
to trigger individual evictions, but uncache is more predictable:

### Disable auto-activation in lvm.conf to be able to do uncache in
### time (i.e. see uncache doing flushing) when the fix is applied.

# xfs_io -d -c 'pwrite -b 4M -S 0xaa 0 1G' /dev/vdb
# vgcreate vg_cache /dev/vdb /dev/vdc
# lvcreate -L 1G -n lv_slowdev vg_cache /dev/vdb
# lvcreate -L 512M -n lv_cachedev vg_cache /dev/vdc
# lvcreate -L 256M -n lv_metadev vg_cache /dev/vdc
# lvconvert --type cache-pool --cachemode writeback vg_cache/lv_cachedev --poolmetadata vg_cache/lv_metadev
# lvconvert --type cache vg_cache/lv_slowdev --cachepool vg_cache/lv_cachedev
# xfs_io -d -c 'pwrite -b 4M -S 0xbb 0 512M' /dev/mapper/vg_cache-lv_slowdev
# xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
# dmsetup status vg_cache-lv_slowdev
0 2097152 cache 8 27/65536 128 8192/8192 1 100 0 0 0 8192 7065 2 metadata2 writeback 2 migration_threshold 2048 smq 0 rw -
^^^^
7065 * 64k = 441M yet to be written to the slow device
# echo b >/proc/sysrq-trigger

# vgchange -ay vg_cache
# xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
# lvconvert --uncache vg_cache/lv_slowdev
Flushing 0 blocks for cache vg_cache/lv_slowdev.
Logical volume "lv_cachedev" successfully removed
Logical volume vg_cache/lv_slowdev is not cached.
# xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
0fe00000: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ................
0fe00010: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ................

This is the case with both v1 and v2 cache pool metadata formats.

After applying this patch:

# vgchange -ay vg_cache
# xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
# lvconvert --uncache vg_cache/lv_slowdev
Flushing 3724 blocks for cache vg_cache/lv_slowdev.
...
Flushing 71 blocks for cache vg_cache/lv_slowdev.
Logical volume "lv_cachedev" successfully removed
Logical volume vg_cache/lv_slowdev is not cached.
# xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
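
The rule quoted from cache.txt can be written as a tiny predicate. This is a sketch only; the function name is hypothetical and the real loading code in dm-cache-metadata.c is more involved.

```c
#include <stdbool.h>

/*
 * Decide whether a cache block must be treated as dirty when it is loaded.
 * on_disk_dirty is the per-block hint; clean_when_opened says whether the
 * previous shutdown was clean. After an unclean shutdown the hint cannot be
 * trusted, so every block is assumed dirty.
 */
static bool block_dirty_on_load(bool on_disk_dirty, bool clean_when_opened)
{
	return clean_when_opened ? on_disk_dirty : true;
}
```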

Cc: stable@vger.kernel.org
Fixes: f177940a8091 ("dm cache metadata: switch to using the new cursor api for loading metadata")
Signed-off-by: Ilya Dryomov
Signed-off-by: Mike Snitzer

Ilya Dryomov
2018-08-10 00:14:32 +0800

09 Aug, 2018

11 commits

cbb751c06 bcache: trivial - remove tailing backslash in macro BTREE_FLAG

Remove the trailing backslash in macro BTREE_FLAG in btree.h

Signed-off-by: Shenghui Wang
Signed-off-by: Coly Li
Signed-off-by: Jens Axboe

Shenghui Wang
2018-08-09 22:21:19 +0800
e921efeb0 bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section

The pr_err statement in the code for sysfs_attatch section would run
for various error codes, which may be confusing.

E.g.,

Run the command twice:
echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
/sys/block/bcache0/bcache/attach
[the backing dev got attached on the first run]
echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
/sys/block/bcache0/bcache/attach

In dmesg, after the command run twice, we can get:
bcache: bch_cached_dev_attach() Can't attach sda6: already attached
bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
a8df5e8be891
: cache set not found
The first statement in the message was right, but the second was
confusing.

bch_cached_dev_attach has various pr_ statements for various error
codes, except ENOENT.

After the change, rerun above command twice:
echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
/sys/block/bcache0/bcache/attach
echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
/sys/block/bcache0/bcache/attach

In dmesg we only got:
bcache: bch_cached_dev_attach() Can't attach sda6: already attached
No confusing "cache set not found" message anymore.

And for a nonexistent SET-UUID:
echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be898 > \
/sys/block/bcache0/bcache/attach
In dmesg we can get:
bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
a8df5e8be898
: cache set not found

Signed-off-by: Shenghui Wang
Signed-off-by: Coly Li
Signed-off-by: Jens Axboe

Shenghui Wang
2018-08-09 22:21:17 +0800
ea8c5356d bcache: set max writeback rate when I/O request is idle

Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
allows the writeback rate to be faster if there is no I/O request on a
bcache device. It works well if there is only one bcache device attached
to the cache set. If there are many bcache devices attached to a cache
set, it may introduce a performance regression because multiple faster
writeback threads of the idle bcache devices will compete for the btree
level locks with the bcache devices that have I/O requests coming.

This patch fixes the above issue by only permitting fast writeback when
all bcache devices attached to the cache set are idle. If one of the
bcache devices has a new I/O request coming, it minimizes all writeback
throughput immediately and lets the PI controller
__update_writeback_rate() decide the upcoming writeback rate for each
bcache device.

Also, when all bcache devices are idle, limiting the writeback rate to a
small number is a waste of throughput, especially when the backing
devices are slower non-rotational devices (e.g. SATA SSD). This patch
sets a max writeback rate for each backing device if the whole cache set
is idle. A faster writeback rate in idle time means new I/Os may have
more available space for dirty data, and people may observe better write
performance then.

Please note bcache may change its cache mode in run time, and this patch
still works if the cache mode is switched from writeback mode and there
is still dirty data on cache.

Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
Cc: stable@vger.kernel.org #4.16+
Signed-off-by: Coly Li
Tested-by: Kai Krakow
Tested-by: Stefan Priebe
Cc: Michael Lyle
Signed-off-by: Jens Axboe

Coly Li
2018-08-09 22:21:15 +0800
b467a6ac0 bcache: add code comments for bset.c

This patch adds code comments to bset.c, to make some tricky code and
design more comprehensible. Most of the information in this patch
comes from a discussion between Kent and me; he offered very informative
details. If there is any mistake in the idea behind the code, no doubt
it comes from my misrepresentation.

Signed-off-by: Coly Li
Signed-off-by: Jens Axboe

Coly Li
2018-08-09 22:21:12 +0800
0cba2e711 bcache: fix mistaken comments in request.c

This patch updates the code comment in bch_keylist_realloc() by fixing
incorrect function names, to make the code more comprehensible.

Signed-off-by: Coly Li
Signed-off-by: Jens Axboe

Coly Li
2018-08-09 22:21:10 +0800
cb329dec1 bcache: fix mistaken code comments in bcache.h

This patch updates the code comment in struct cache with correct array
names, to make the code more comprehensible.

Signed-off-by: Coly Li
Signed-off-by: Jens Axboe

Coly Li
2018-08-09 22:21:09 +0800
e57fd7468 bcache: add a comment in super.c

This patch adds a line of code comment in super.c:register_bdev(), to
make the code more comprehensible.

Signed-off-by: Coly Li
Signed-off-by: Jens Axboe

Coly Li
2018-08-09 22:21:07 +0800
c2e8dcf7f bcache: avoid unncessary cache prefetch bch_btree_node_get()

In bch_btree_node_get() the read-in btree node will be partially
prefetched into L1 cache for the following bset iteration (if there is
one). But if the btree node read fails, the prefetch operations will
waste L1 cache space. This patch checks the result of the read operation
and only does cache prefetch when the read I/O succeeded.
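
The idea of guarding the prefetch on the read result can be sketched as follows. The node_sketch structure and the function name are hypothetical, and __builtin_prefetch() is the GCC/Clang compiler builtin, used here as a userspace stand-in for the kernel's prefetch() helper.

```c
#include <stddef.h>

/* Hypothetical stand-in for a btree node that has just been read in. */
struct node_sketch {
	int io_error;		/* non-zero if the read failed */
	const char *data;	/* node contents */
	size_t len;		/* size of the contents in bytes */
};

/*
 * Issue prefetches over the node only when the read succeeded.
 * Returns the number of prefetches issued, so callers can observe the effect.
 */
static size_t prefetch_if_read_ok(const struct node_sketch *n, size_t stride)
{
	size_t issued = 0;
	size_t off;

	if (n->io_error || stride == 0)
		return 0;	/* failed read: do not pollute the cache */

	for (off = 0; off < n->len; off += stride) {
		__builtin_prefetch(n->data + off);
		issued++;
	}
	return issued;
}
```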

Signed-off-by: Coly Li
Signed-off-by: Jens Axboe

Coly Li
2018-08-09 22:21:05 +0800
b4cb6efc1 bcache: display rate debug parameters to 0 when writeback is not running

When writeback is not running, the writeback rate should be 0; any other
value is misleading. The following dynamic writeback rate debug parameters
should be 0 too,
rate, proportional, integral, change
otherwise they are misleading when writeback is not running.

Signed-off-by: Coly Li
Signed-off-by: Jens Axboe

Coly Li
2018-08-09 22:21:03 +0800
78ac21071 bcache: do not check return value of debugfs_create_dir()

Greg KH suggests that normal code should not care about debugfs. Therefore,
whether debugfs_create_dir() succeeds or fails, it is unnecessary to check
its return value.

Two functions call debugfs_create_dir() and check the return value:
bch_debug_init() and closure_debug_init(). This patch changes these two
functions from int to void type, and ignores the return values of
debugfs_create_dir().

This patch does not fix an exact bug, it just makes things work as they should.

Signed-off-by: Coly Li
Suggested-by: Greg Kroah-Hartman
Cc: stable@vger.kernel.org
Cc: Kai Krakow
Cc: Kent Overstreet
Signed-off-by: Jens Axboe

Coly Li
2018-08-09 22:21:01 +0800
c9a5e6a96 dm snapshot: remove stale FIXME in snapshot_map()

Commit ae1093be ("dm snapshot: use mutex instead of rw_semaphore")
eliminated the need to worry about read vs write locking. So remove a
FIXME in snapshot_map() that is concerned about selectively taking a
write lock.

Signed-off-by: Mike Snitzer

Mike Snitzer
2018-08-09 08:50:58 +0800

08 Aug, 2018

4 commits

3db2776d9 dm snapshot: improve performance by switching out_of_order_list to rbtree

copy_complete()'s processing of out_of_order_list can result in
quadratic complexity in the worst case. As such, it consumed too much
CPU and caused a significant loss in performance.

Fix this by converting out_of_order_list to an rbtree. This improved
a dm-snapshot test copy workload from 32 seconds to 4 seconds.

Signed-off-by: David Jeffery
Signed-off-by: Mikulas Patocka
Tested-by: Brett Hull
Signed-off-by: Mike Snitzer

David Jeffery
2018-08-08 22:41:49 +0800
784c9a29e dm kcopyd: avoid softlockup in run_complete_job

It was reported that softlockups occur when using dm-snapshot on top of
slow (rbd) storage. E.g.:

[ 4047.990647] watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [kworker/10:23:26177]
...
[ 4048.034151] Workqueue: kcopyd do_work [dm_mod]
[ 4048.034156] RIP: 0010:copy_callback+0x41/0x160 [dm_snapshot]
...
[ 4048.034190] Call Trace:
[ 4048.034196] ? __chunk_is_tracked+0x70/0x70 [dm_snapshot]
[ 4048.034200] run_complete_job+0x5f/0xb0 [dm_mod]
[ 4048.034205] process_jobs+0x91/0x220 [dm_mod]
[ 4048.034210] ? kcopyd_put_pages+0x40/0x40 [dm_mod]
[ 4048.034214] do_work+0x46/0xa0 [dm_mod]
[ 4048.034219] process_one_work+0x171/0x370
[ 4048.034221] worker_thread+0x1fc/0x3f0
[ 4048.034224] kthread+0xf8/0x130
[ 4048.034226] ? max_active_store+0x80/0x80
[ 4048.034227] ? kthread_bind+0x10/0x10
[ 4048.034231] ret_from_fork+0x35/0x40
[ 4048.034233] Kernel panic - not syncing: softlockup: hung tasks

Fix this by calling cond_resched() after run_complete_job()'s callout to
the dm_kcopyd_notify_fn (which is dm-snap.c:copy_callback in the above
trace).

Signed-off-by: John Pittman
Signed-off-by: Mike Snitzer

John Pittman
2018-08-08 21:16:24 +0800
fd2fa9541 dm cache metadata: save in-core policy_hint_size to on-disk superblock

policy_hint_size starts as 0 during __write_initial_superblock(). It
isn't until the policy is loaded that policy_hint_size is set in-core
(cmd->policy_hint_size). But it never got recorded in the on-disk
superblock because __commit_transaction() didn't deal with transferring
the in-core cmd->policy_hint_size to the on-disk superblock.

The in-core cmd->policy_hint_size gets initialized by metadata_open()'s
__begin_transaction_flags() which re-reads all superblock fields.
Because the superblock's policy_hint_size was never properly stored, when
the cache was created, hints_array_available() would always return false
when re-activating a previously created cache. This means
__load_mappings() always considered the hints invalid and never made use
of the hints (these hints served to optimize).

Another detrimental side-effect of this oversight is that the cache_check
utility would fail with: "invalid hint width: 0"
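
The round trip described above can be modelled in a few lines. This is a simplified sketch: the structures and function names are hypothetical, endianness conversion is ignored, and the validity check is reduced to a plain size comparison.

```c
#include <stdbool.h>
#include <stdint.h>

struct incore_sketch {
	uint32_t policy_hint_size;	/* set once the policy is loaded */
};

struct disk_super_sketch {
	uint32_t policy_hint_size;	/* starts as 0 in the initial superblock */
};

/* Commit without the fix: the hint size is never copied to disk. */
static void commit_buggy(const struct incore_sketch *cmd,
			 struct disk_super_sketch *sb)
{
	(void)cmd;
	(void)sb;
}

/* Commit with the fix: transfer the in-core value to the superblock. */
static void commit_fixed(const struct incore_sketch *cmd,
			 struct disk_super_sketch *sb)
{
	sb->policy_hint_size = cmd->policy_hint_size;
}

/* Re-opening the metadata re-reads the superblock fields into memory. */
static void reopen(struct incore_sketch *cmd,
		   const struct disk_super_sketch *sb)
{
	cmd->policy_hint_size = sb->policy_hint_size;
}

/* Simplified stand-in for the hint validity check. */
static bool hints_usable(const struct incore_sketch *cmd,
			 uint32_t policy_hint_size)
{
	return cmd->policy_hint_size == policy_hint_size;
}
```

With commit_buggy() a re-opened cache always sees a hint size of 0 and the check fails; with commit_fixed() the stored value survives the round trip.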

Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer

Mike Snitzer
2018-08-08 02:30:30 +0800
75294442d dm thin: stop no_space_timeout worker when switching to write-mode

Now both check_for_space() and do_no_space_timeout() will read & write
pool->pf.error_if_no_space. If these functions run concurrently, as
shown in the following case, the default setting of "queue_if_no_space"
can get lost.

precondition:
* error_if_no_space = false (aka "queue_if_no_space")
* pool is in Out-of-Data-Space (OODS) mode
* no_space_timeout worker has been queued

CPU 0: CPU 1:
// delete a thin device
process_delete_mesg()
// check_for_space() invoked by commit()
set_pool_mode(pool, PM_WRITE)
pool->pf.error_if_no_space = \
pt->requested_pf.error_if_no_space

// timeout, pool is still in OODS mode
do_no_space_timeout
// "queue_if_no_space" config is lost
pool->pf.error_if_no_space = true
pool->pf.mode = new_mode

Fix it by stopping no_space_timeout worker when switching to write mode.

Fixes: bcc696fac11f ("dm thin: stay in out-of-data-space mode once no_space_timeout expires")
Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao
Signed-off-by: Mike Snitzer

Hou Tao
2018-08-08 02:30:29 +0800

06 Aug, 2018

1 commit

05b9ba4b5 Merge tag 'v4.18-rc6' into for-4.19/block2

Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.

Signed-off-by: Jens Axboe

Jens Axboe
2018-08-06 09:32:09 +0800

03 Aug, 2018

1 commit

d63e2fc80 md/raid5: fix data corruption of replacements after originals dropped

During raid5 replacement, the stripes can be marked with R5_NeedReplace
flag. Data can be read from being-replaced devices and written to
replacing spares without reading all other devices. (It's 'replace'
mode. s.replacing = 1) If a being-replaced device is dropped, the
replacement progress will be interrupted and resumed with pure recovery
mode. However, existing stripes before being interrupted cannot read
from the dropped device anymore. It prints lots of WARN_ON messages.
And it results in data corruption because existing stripes write
problematic data into its replacement device and update the progress.

# Erase disks (1MB + 2GB)
dd if=/dev/zero of=/dev/sda bs=1MB count=2049
dd if=/dev/zero of=/dev/sdb bs=1MB count=2049
dd if=/dev/zero of=/dev/sdc bs=1MB count=2049
dd if=/dev/zero of=/dev/sdd bs=1MB count=2049
mdadm -C /dev/md0 -amd -R -l5 -n3 -x0 /dev/sd[abc] -z 2097152
# Ensure array stores non-zero data
dd if=/root/data_4GB.iso of=/dev/md0 bs=1MB
# Start replacement
mdadm /dev/md0 -a /dev/sdd
mdadm /dev/md0 --replace /dev/sda

Then hot-plug out /dev/sda during recovery and wait for recovery to finish.
echo check > /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt # it will be greater than 0.

Soon after you hot-plug out /dev/sda, you will see many WARN_ON
messages. The replacement recovery will be interrupted shortly. After
the recovery finishes, it will result in data corruption.

Actually, it's just an unhandled case of replacement. In commit
(md/raid5: fix interaction of 'replace' and 'recovery'.),
if a NeedReplace device is not UPTODATE then that is an error. The
commit simply prints WARN_ON but still marks these corrupted stripes
with R5_WantReplace (meaning they are ready for writes).

To fix this case, we can leverage the 'sync and replace' mode mentioned in
commit (md/raid5: detect and handle replacements during
recovery.). We can add logic to detect and use 'sync and replace' mode
for these stripes.

Reported-by: Alex Chen
Reviewed-by: Alex Wu
Reviewed-by: Chung-Chiang Cheng
Signed-off-by: BingJing Chang
Signed-off-by: Shaohua Li

BingJing Chang
2018-08-03 02:22:06 +0800

02 Aug, 2018

2 commits

e64e4018d md: Avoid namespace collision with bitmap API

The bitmap API (include/linux/bitmap.h) has a 'bitmap' prefix for its methods.

On the other hand, the MD bitmap API is a special case.
Add an 'md' prefix to it to avoid a namespace collision.

No functional changes intended.

Signed-off-by: Andy Shevchenko
Acked-by: Shaohua Li
Signed-off-by: Dmitry Torokhov

Andy Shevchenko
2018-08-02 06:49:39 +0800
5cc9cdf63 dm: Avoid namespace collision with bitmap API

The bitmap API (include/linux/bitmap.h) has a 'bitmap' prefix for its methods.

On the other hand, the DM bitmap API is a special case.
Add a 'dm' prefix to it to avoid a potential namespace collision.

No functional changes intended.

Suggested-by: Mike Snitzer
Signed-off-by: Andy Shevchenko
Acked-by: Mike Snitzer
Signed-off-by: Dmitry Torokhov

Andy Shevchenko
2018-08-02 06:49:38 +0800

01 Aug, 2018

1 commit

7209049d4 dm kcopyd: return void from dm_kcopyd_copy()

dm_kcopyd_copy() only ever returns 0 so there is no need for callers to
account for possible failure. Same goes for dm_kcopyd_zero().

Signed-off-by: Mike Snitzer

Mike Snitzer
2018-08-01 05:33:21 +0800

30 Jul, 2018

1 commit

63c8ecb62 dm thin: include metadata_low_watermark threshold in pool status

The metadata low watermark threshold is set by the kernel. But the
kernel depends on userspace to extend the thinpool metadata device when
the threshold is crossed.

Since the metadata low watermark threshold is not visible to userspace,
upon receiving an event, userspace cannot tell that the kernel wants the
metadata device extended, instead of some other eventing condition.
Making it visible (but not settable) enables userspace to affirmatively
know the kernel is asking for a metadata device extension, by comparing
metadata_low_watermark against nr_free_blocks_metadata, also reported in
status.
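
The userspace comparison described above could look like the sketch below. The function name is hypothetical, the two values are assumed to have been parsed from the pool status line already, and treating "free blocks at or below the watermark" as the trigger is an assumption about the exact boundary.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when the reported free metadata blocks have reached the
 * kernel-set low watermark, i.e. the event most likely asks userspace to
 * extend the metadata device rather than signalling something else.
 */
static bool kernel_wants_metadata_extended(uint64_t nr_free_blocks_metadata,
					   uint64_t metadata_low_watermark)
{
	return nr_free_blocks_metadata <= metadata_low_watermark;
}
```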

Current solutions like dmeventd have their own thresholds for extending
the data and metadata devices, and both devices are checked against
their thresholds on each event. This lessens the value of the kernel-set
threshold, since userspace will either extend the metadata device sooner,
when receiving another event; or will receive the metadata lowater event
and do nothing, if dmeventd's threshold is less than the kernel's.
(This second case is dangerous. The metadata lowater event will not be
re-sent, so no further event will be generated before the metadata
device is out of space, unless some other event causes userspace to
recheck its thresholds.)

Signed-off-by: Andy Grover
Signed-off-by: Mike Snitzer

Andy Grover
2018-07-30 23:49:08 +0800

28 Jul, 2018

12 commits

9ff07e7d6 dm writecache: report start_sector in status line

Fixes: d284f8248c7 ("dm writecache: support optional offset for start of device")
Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer

Mikulas Patocka
2018-07-28 03:28:58 +0800
c07c88f54 dm crypt: convert essiv from ahash to shash

In preparing to remove all stack VLA usage from the kernel[1], remove
the discouraged use of AHASH_REQUEST_ON_STACK in favor of the smaller
SHASH_DESC_ON_STACK by converting from ahash-wrapped-shash to direct
shash. The stack allocation will be made a fixed size in a later patch
to the crypto subsystem.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Signed-off-by: Kees Cook
Reviewed-by: Eric Biggers
Signed-off-by: Mike Snitzer

Kees Cook
2018-07-28 03:24:28 +0800
c7329eff7 dm crypt: use wake_up_process() instead of a wait queue

This is a small simplification of dm-crypt - use wake_up_process()
instead of a wait queue in a case where only one process may be
waiting. dm-writecache uses a similar pattern.

Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer

Mikulas Patocka
2018-07-28 03:24:28 +0800
a3fcf7253 dm integrity: recalculate checksums on creation

When using external metadata device and internal hash, recalculate the
checksums when the device is created - so that dm-integrity doesn't
have to overwrite the device. The superblock stores the last position
when the recalculation ended, so that it is properly restarted.

Integrity tags that haven't been recalculated yet are ignored.

Also bump the target version.

Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer

Mikulas Patocka
2018-07-28 03:24:27 +0800
747829a8e dm integrity: flush journal on suspend when using separate metadata device

Flush the journal on suspend when using separate data and metadata devices,
so that the metadata device can be discarded and the table can be reloaded
with a linear target pointing to the data device.

NOTE: the journal is deliberately not flushed when using the same device
for metadata and data, so that the journal replay code is tested.

Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer

Mikulas Patocka
2018-07-28 03:24:26 +0800
1f9fc0b82 dm integrity: use version 2 for separate metadata

Use version "2" in the superblock when data and metadata devices are
separate, so that the device is not accidentally read by an older kernel
version.

Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer

Mikulas Patocka
2018-07-28 03:24:25 +0800
356d9d52e dm integrity: allow separate metadata device

Add the ability to store DM integrity metadata on a separate device.
This feature is activated with the option "meta_device:/dev/device".

Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer

Mikulas Patocka
2018-07-28 03:24:24 +0800
71e9ddbcb dm integrity: add ic->start in get_data_sector()

A small refactoring. Add the variable ic->start to the result
returned by get_data_sector() and not in the callers. This is a
prerequisite for the commit that adds the ability to use an external
metadata device.

Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer

Mikulas Patocka
2018-07-28 03:24:24 +0800
f84fd2c98 dm integrity: report provided data sectors in the status

Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer

Mikulas Patocka
2018-07-28 03:24:23 +0800
724376a04 dm integrity: implement fair range locks

dm-integrity locks a range of sectors to prevent concurrent I/O or journal
writeback. These locks were not fair - so that many small overlapping I/Os
could starve a large I/O indefinitely.

Fix this by making the range locks fair. The ranges that are waiting are
added to the list "wait_list". If a new I/O overlaps some of the waiting
I/Os, it is not dispatched, but it is also added to that wait list.
Entries on the wait list are processed in first-in-first-out order, so
that an I/O can't starve indefinitely.
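
The fairness rule can be sketched as a dispatch check that looks at waiting ranges as well as locked ones. This is an illustration only: range_sketch and both functions are hypothetical, plain arrays stand in for the driver's data structures, and overflow of start + n_sectors is ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct range_sketch {
	uint64_t start;		/* first sector of the range */
	uint64_t n_sectors;	/* length of the range in sectors */
};

/* Two half-open sector ranges overlap if each starts before the other ends. */
static bool ranges_overlap(const struct range_sketch *a,
			   const struct range_sketch *b)
{
	return a->start < b->start + b->n_sectors &&
	       b->start < a->start + a->n_sectors;
}

/*
 * Return true if new_range may be dispatched now. It must not overlap any
 * currently locked range and, for fairness, must not overlap any range that
 * is already waiting; otherwise the caller appends it to the FIFO wait list.
 */
static bool may_dispatch(const struct range_sketch *new_range,
			 const struct range_sketch *locked, size_t n_locked,
			 const struct range_sketch *waiting, size_t n_waiting)
{
	size_t i;

	for (i = 0; i < n_locked; i++)
		if (ranges_overlap(new_range, &locked[i]))
			return false;
	for (i = 0; i < n_waiting; i++)
		if (ranges_overlap(new_range, &waiting[i]))
			return false;
	return true;
}
```

A small I/O that does not touch any locked range still has to queue behind a large waiting I/O it overlaps, which is what prevents the large I/O from starving.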

Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer

Mikulas Patocka
2018-07-28 03:24:22 +0800
518748b1a dm integrity: decouple common code in dm_integrity_map_continue()

Decouple how dm_integrity_map_continue() responds to being out of free
sectors and when add_new_range() fails.

This has no functional change, but helps prepare for the next commit.

Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer

Mikulas Patocka
2018-07-28 03:24:21 +0800
c21b16392 dm integrity: change 'suspending' variable from bool to int

Early alpha processors can't write a byte or short atomically - they
read 8 bytes, modify the byte or two bytes in registers and write back
8 bytes.

The modification of the variable "suspending" may race with
modification of the variable "failed". Fix this by changing
"suspending" to an int.
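
The lost update can be shown with a deterministic userspace simulation of a machine that can only load and store whole 8-byte words. The layout (two byte-sized flags at offsets 0 and 1 of the same word) and all names are assumptions made for illustration, not the dm-integrity structure layout.

```c
#include <stdint.h>
#include <string.h>

/* Emulate a byte store as a modification of an already loaded 8-byte word. */
static uint64_t rmw_set_byte(uint64_t loaded_word, unsigned int byte_idx,
			     uint8_t val)
{
	uint8_t bytes[8];

	memcpy(bytes, &loaded_word, sizeof(bytes));
	bytes[byte_idx] = val;		/* caller keeps byte_idx below 8 */
	memcpy(&loaded_word, bytes, sizeof(bytes));
	return loaded_word;
}

/* Read one byte back out of the word. */
static uint8_t get_byte(uint64_t word, unsigned int byte_idx)
{
	uint8_t bytes[8];

	memcpy(bytes, &word, sizeof(bytes));
	return bytes[byte_idx];
}

/*
 * Worst-case interleaving: both CPUs load the shared word before either one
 * stores it back. CPU 0 sets the flag in byte 0, CPU 1 sets the flag in
 * byte 1, and the later store silently discards the earlier update.
 */
static uint64_t racy_interleaving(uint64_t mem)
{
	uint64_t cpu0 = mem;			/* CPU 0 loads 8 bytes */
	uint64_t cpu1 = mem;			/* CPU 1 loads 8 bytes */

	cpu0 = rmw_set_byte(cpu0, 0, 1);	/* e.g. suspending = true */
	cpu1 = rmw_set_byte(cpu1, 1, 1);	/* e.g. failed = 1 */

	mem = cpu0;				/* CPU 0 stores 8 bytes */
	mem = cpu1;				/* CPU 1 stores, byte 0 is lost */
	return mem;
}
```

Giving each variable its own naturally aligned int, as the commit does for "suspending", lets such hardware update it with a single store that does not touch its neighbours.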

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer

Mikulas Patocka
2018-07-28 03:24:20 +0800