Eric Lee / smarc-fsl-linux-kernel

14 Sep, 2019

1 commit

33f2c35a5 md: add feature flag MD_FEATURE_RAID0_LAYOUT

Due to a bug introduced in Linux 3.14 we cannot determine the
correct layout for a multi-zone RAID0 array - there are two
possibilities.

It is possible to tell the kernel which to choose using a module
parameter, but this can be clumsy to use. It would be best if
the choice were recorded in the metadata.
So add a feature flag for this purpose.
If it is set, then the 'layout' field of the superblock is used
to determine which layout to use.

If this flag is not set, then mddev->layout gets set to -1,
which causes the module parameter to be required.
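
The selection rule described above can be sketched as a toy model in Python.
This is not kernel code: the bit position given to MD_FEATURE_RAID0_LAYOUT
and the 'module_param' argument are assumptions made purely for illustration.

```python
# Toy model, not kernel code: picking the layout of a multi-zone RAID0 array.
MD_FEATURE_RAID0_LAYOUT = 1 << 12   # assumed bit position, illustration only
LAYOUT_UNSET = -1                   # mirrors "mddev->layout gets set to -1"


def effective_layout(feature_map, sb_layout, module_param=None):
    """Return the layout to use, or raise if nothing selects one."""
    if feature_map & MD_FEATURE_RAID0_LAYOUT:
        # The choice was recorded in the metadata, so trust the superblock.
        return sb_layout
    layout = LAYOUT_UNSET
    if module_param is not None:    # hypothetical stand-in for the module parameter
        layout = module_param
    if layout == LAYOUT_UNSET:
        raise ValueError("layout unknown: the module parameter is required")
    return layout
```

With the flag set the superblock value wins; without it, the call only succeeds
when a module parameter value is supplied.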

Acked-by: Guoqing Jiang
Signed-off-by: NeilBrown
Signed-off-by: Song Liu

NeilBrown
2019-09-14 04:10:06 +0800

04 Sep, 2019

1 commit

62f7b1989 md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone

Currently md raid0/linear are not provided with any mechanism to validate
if an array member got removed or failed. The driver keeps sending BIOs
regardless of the state of array members, and kernel shows state 'clean'
in the 'array_state' sysfs attribute. This leads to the following
situation: if a raid0/linear array member is removed and the array is
mounted, some user writing to this array won't realize that errors are
happening unless they check dmesg or perform one fsync per written file.
Despite udev signaling the member device is gone, 'mdadm' cannot issue the
STOP_ARRAY ioctl successfully, given the array is mounted.

In other words, no -EIO is returned and writes (except direct ones) appear
normal. The user might therefore think the written data is correctly stored
in the array, when in fact garbage was written, given that raid0 does
striping (and so requires all its members to be working in order not to
corrupt data). For md/linear, writes to the available members will work
fine, but writes that go to the missing member(s) cause file corruption,
since the portion of the writes destined for the missing devices is never
actually written.

This patch changes this behavior: we check if the block device's gendisk
is UP when submitting the BIO to the array member, and if it isn't, we flag
the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
request to the array requiring data from a valid member is still completed.
While flagging the device as MD_BROKEN, we also show a rate-limited warning
in the kernel log.

A new array state 'broken' was added too: it mimics the state 'clean' in
every aspect, being useful only to distinguish if the array has some member
missing. We rely on the MD_BROKEN flag to put the array in the 'broken'
state. This state cannot be written to 'array_state': it only indicates that
one or more members of the array are missing and otherwise acts like
'clean', so writing it would not make sense.
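
The behaviour described above can be modelled with a small Python sketch. It is
a simplification under stated assumptions, not the driver's code: the class and
method names are hypothetical, and a boolean stands in for the MD_BROKEN flag.

```python
import errno


class ToyArray:
    """Toy raid0/linear array: tracks which members are still present."""

    def __init__(self, members):
        self.present = {name: True for name in members}
        self.broken = False            # stands in for the MD_BROKEN flag

    def submit_bio(self, member):
        """Submit I/O aimed at one member; return 0 or a negative errno."""
        if not self.present[member]:
            self.broken = True         # flag the whole array as broken
            return -errno.EIO          # and fail the I/O instead of hiding it
        return 0                       # I/O to a valid member still completes

    def array_state(self):
        # 'broken' behaves like 'clean' but reveals that a member is missing.
        return "broken" if self.broken else "clean"
```

For example, after one member disappears, the first I/O routed to it returns
-EIO and the reported state changes from 'clean' to 'broken'.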

With this patch, the filesystem reacts much faster to a missing
array member: after some I/O errors, ext4 for instance aborts the journal
and prevents corruption. Without this change, we're able to keep writing
to the disk and after a machine reboot, e2fsck shows some severe fs errors
that demand fixing. This patch was tested in ext4 and xfs filesystems, and
requires a 'mdadm' counterpart to handle the 'broken' state.

Cc: Song Liu
Reviewed-by: NeilBrown
Signed-off-by: Guilherme G. Piccoli
Signed-off-by: Song Liu

Guilherme G. Piccoli
2019-09-04 05:49:28 +0800

28 Aug, 2019

2 commits

9d4b45d6a md: don't report active array_state until after revalidate_disk() completes.

Until revalidate_disk() has completed, the size of a new md array will
appear to be zero.
So we shouldn't report, through array_state, that the array is active
until that time.
udev rules check array_state to see if the array is ready. As soon as
it appears to be ready, fsck can be run. If it finds the size to be
zero, it will fail.

So add a new flag to provide an interlock between do_md_run() and
array_state_show(). This flag is set while do_md_run() is active and
it prevents array_state_show() from reporting that the array is
active.

Before do_md_run() is called, ->pers will be NULL so the array is
definitely not active.
After do_md_run() is called, revalidate_disk() will have run and the
array will be completely ready.

We also move various sysfs_notify*() calls out of md_run() into
do_md_run() after MD_NOT_READY is cleared. This ensures the
information is ready before the notification is sent.

Prior to v4.12, array_state_show() was called with the
mddev->reconfig_mutex held, which provided exclusion with do_md_run().

Note that MD_NOT_READY is cleared twice. This is deliberate to cover
both success and error paths with minimal noise.
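
The interlock can be illustrated with a toy Python model. It is only a sketch
under assumptions: the 'observe' callback is a hypothetical hook that plays the
role of a reader (such as udev) racing with array start-up, and the reported
states are reduced to 'active' and 'inactive'.

```python
class ToyMddev:
    """Toy model of the MD_NOT_READY interlock, not the kernel structure."""

    def __init__(self):
        self.pers = None          # NULL before do_md_run()
        self.not_ready = False    # stands in for MD_NOT_READY
        self.size = 0             # what a reader would see as the array size

    def array_state_show(self):
        if self.pers is not None and not self.not_ready:
            return "active"
        return "inactive"

    def do_md_run(self, size, observe=None):
        self.not_ready = True     # set for the whole duration of start-up
        self.pers = "raid1"       # the personality is attached first
        if observe is not None:
            observe(self)         # a reader looking at the array mid-start-up
        self.size = size          # size becomes visible (revalidate_disk())
        self.not_ready = False    # only now may 'active' be reported
```

A reader that looks in the middle of start-up sees 'inactive' together with a
zero size, and never 'active' with a zero size.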

Fixes: b7b17c9b67e5 ("md: remove mddev_lock() from md_attr_show()")
Cc: stable@vger.kernel.org (v4.12++)
Signed-off-by: NeilBrown
Signed-off-by: Song Liu

NeilBrown
2019-08-28 03:36:37 +0800
480523fea md: only call set_in_sync() when it is expected to succeed.

Since commit 4ad23a976413 ("MD: use per-cpu counter for
writes_pending"), set_in_sync() is substantially more expensive: it
can wait for a full RCU grace period which can be 10s of milliseconds.

So we should only call it when the cost is justified.

md_check_recovery() currently calls set_in_sync() every time it finds
anything to do (on non-external active arrays). For an array
performing resync or recovery, this will be quite often.
Each call will introduce a delay to the md thread, which can noticeably
affect IO submission latency.

In md_check_recovery() we only need to call set_in_sync() if
'safemode' was non-zero at entry, meaning that there has been no
recent IO. So we save this "safemode was nonzero" state, and only
call set_in_sync() if it was non-zero.

This measurably reduces mean and maximum IO submission latency during
resync/recovery.
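
A minimal Python sketch of the gating idea follows. It is a toy model with
hypothetical names; the call counter merely stands in for the cost of
set_in_sync(), and resetting safemode from 1 to 0 is an assumption about how
the state is consumed.

```python
class RecoveryModel:
    """Toy model: only pay for set_in_sync() when safemode was non-zero."""

    def __init__(self):
        self.safemode = 0
        self.set_in_sync_calls = 0

    def set_in_sync(self):
        # In the kernel this may wait for an RCU grace period; here we count.
        self.set_in_sync_calls += 1

    def md_check_recovery(self):
        try_set_sync = self.safemode != 0   # remember the state at entry
        if self.safemode == 1:
            self.safemode = 0               # assumed: the state is consumed
        # ... other recovery work would happen here ...
        if try_set_sync:
            self.set_in_sync()
```

Repeated calls during a resync, when safemode stays zero, never reach
set_in_sync(); a call made after safemode was raised reaches it once.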

Reported-and-tested-by: Jack Wang
Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
Cc: stable@vger.kernel.org (v4.12+)
Signed-off-by: NeilBrown
Signed-off-by: Song Liu

NeilBrown
2019-08-28 03:36:36 +0800

08 Aug, 2019

4 commits

0d8ed0e9b md: don't call spare_active in md_reap_sync_thread if all member devices can't work

When adding a disk to an array, md_reap_sync_thread is responsible
for activating the spare and setting the In_sync flag for the new
member in spare_active().

But suppose a raid1 array has one member disk A, and disk B is added to
the array. If we then offline A before all the data is synchronized from
A to B, B obviously doesn't have the latest data that A had, yet B is
still marked with the In_sync flag.

So let's not call spare_active under this condition; otherwise B is
still shown in the 'U' state, which is not correct.

Signed-off-by: Guoqing Jiang
Signed-off-by: Song Liu

Guoqing Jiang
2019-08-08 01:25:02 +0800
062f5b2ae md: don't set In_sync if array is frozen

When a disk is added to an array, the following path is called in mdadm.

Manage_subdevs -> sysfs_freeze_array
-> Manage_add
-> sysfs_set_str(&info, NULL, "sync_action","idle")

Then, on the kernel side, Manage_add invokes the path (add_new_disk ->
validate_super = super_1_validate) to set the In_sync flag.

In_sync means "device is in_sync with rest of array", but the newly
added disk needs the resync thread to synchronize its data first, and
md_reap_sync_thread will eventually call spare_active to set In_sync for
the newly added disk. So don't set In_sync if the array is frozen.

Signed-off-by: Guoqing Jiang
Signed-off-by: Song Liu

Guoqing Jiang
2019-08-08 01:25:02 +0800
9a567843f md: allow last device to be forcibly removed from RAID1/RAID10.

When the 'last' device in a RAID1 or RAID10 reports an error,
we do not mark it as failed. This would serve little purpose
as there is no risk of losing data beyond that which is obviously
lost (as there is with RAID5), and there could be other sectors
on the device which are readable, and only readable from this device.
This in general maximises access to data.

However the current implementation also stops an admin from removing
the last device by direct action. This is rarely useful, but in many
cases it is not harmful and can make automation easier by removing special
cases.

Also, if an attempt to write metadata fails the device must be marked
as faulty, else an infinite loop will result, attempting to update
the metadata on all non-faulty devices.

So add a 'fail_last_dev' member to 'struct mddev', so that we can bypass
the 'last disk' checks for RAID1 and RAID10, and control the behavior
per array by changing the sysfs node.
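
The decision can be sketched in Python as below. The function name and its
arguments are hypothetical; this is a toy restatement of the rule, not the
RAID1/RAID10 error handler itself.

```python
def should_mark_faulty(working_disks, fail_last_dev):
    """Decide whether a device reporting an error gets marked as failed.

    working_disks counts the devices still working, including the one
    that just reported the error.
    """
    is_last = working_disks == 1
    if is_last and not fail_last_dev:
        # Default: keep the last device so readable sectors stay reachable.
        return False
    # Either other devices remain, or the admin opted in via fail_last_dev.
    return True
```

With the default setting the last device is kept; once fail_last_dev is enabled
for the array, the same error marks it as failed.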

Signed-off-by: NeilBrown
[add sysfs node for fail_last_dev by Guoqing]
Signed-off-by: Guoqing Jiang
Signed-off-by: Song Liu

Guoqing Jiang
2019-08-08 01:25:02 +0800
cf8916079 md: Convert to use int_pow()

Instead of a linear approach to calculating powers of 10, use the generic
int_pow(), which does it better.
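
For comparison, here is a Python sketch of the two approaches: a linear loop and
exponentiation by squaring, the technique a generic integer power helper
typically uses. The function names are illustrative, and Python integers do not
overflow the way a fixed-width u64 would.

```python
def pow10_linear(exp):
    """Linear approach: one multiplication per unit of the exponent."""
    result = 1
    for _ in range(exp):
        result *= 10
    return result


def int_pow_sketch(base, exp):
    """Exponentiation by squaring: about log2(exp) loop iterations."""
    result = 1
    while exp:
        if exp & 1:          # this bit of the exponent contributes a factor
            result *= base
        exp >>= 1
        base *= base         # square the base for the next bit
    return result
```

Both functions agree on every exponent; the second simply needs fewer loop
iterations as the exponent grows.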

Signed-off-by: Andy Shevchenko
Signed-off-by: Song Liu

Andy Shevchenko
2019-08-08 01:25:02 +0800

15 Jul, 2019

1 commit

a1240cf74 Merge branch 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu

Pull percpu updates from Dennis Zhou:
"This includes changes to let percpu_ref release the backing percpu
memory earlier after it has been switched to atomic in cases where the
percpu ref is not revived.

This will help recycle percpu memory earlier in cases where the
refcounts are pinned for prolonged periods of time"

* 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
percpu_ref: release percpu memory early without PERCPU_REF_ALLOW_REINIT
md: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
io_uring: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
percpu_ref: introduce PERCPU_REF_ALLOW_REINIT flag

Linus Torvalds
2019-07-15 07:17:18 +0800

01 Jul, 2019

1 commit

5be1f9d82 Merge tag 'v5.2-rc6' into for-5.3/block

Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak
fix. Otherwise we end up having conflicts with future patches between
for-5.3/block and master that touch this area. In particular, it makes
the bio_full() fix hard to backport to stable.

* tag 'v5.2-rc6': (482 commits)
Linux 5.2-rc6
Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
Bluetooth: Fix regression with minimum encryption key size alignment
tcp: refine memory limit test in tcp_fragment()
x86/vdso: Prevent segfaults due to hoisted vclock reads
SUNRPC: Fix a credential refcount leak
Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
net :sunrpc :clnt :Fix xps refcount imbalance on the error path
NFS4: Only set creation opendata if O_CREAT
ARM: 8867/1: vdso: pass --be8 to linker if necessary
KVM: nVMX: reorganize initial steps of vmx_set_nested_state
KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
habanalabs: use u64_to_user_ptr() for reading user pointers
nfsd: replace Jeff by Chuck as nfsd co-maintainer
inet: clear num_timeout reqsk_alloc()
PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present
net: mvpp2: debugfs: Add pmap to fs dump
ipv6: Default fib6_type to RTN_UNICAST when not set
net: hns3: Fix inconsistent indenting
net/af_iucv: always register net_device notifier
...

Jens Axboe
2019-07-01 22:16:08 +0800

21 Jun, 2019

3 commits

d494549ac md: add bitmap_abort label in md_run

There are now two places that must clean up the bitmap on
failure, so move the common cleanup code between the
bitmap_abort and abort labels.

Reviewed-by: NeilBrown
Signed-off-by: Guoqing Jiang
Signed-off-by: Song Liu

Guoqing Jiang
2019-06-21 07:36:00 +0800
963c555e7 md: introduce mddev_create/destroy_wb_pool for the change of member device

Previously, we called rdev_init_wb to avoid potential data
inconsistency when an array is created.

Now we also need to call the function and create the mempool when a
device is added or is flagged as "writemostly". So
mddev_create_wb_pool is introduced and called accordingly.
For safety, we mark an implicit GFP_NOIO allocation scope around
the mempool creation, between mddev_suspend and mddev_resume.

Conversely, the mempool should be removed after a member device
or its "writemostly" flag is removed, which is done
by calling mddev_destroy_wb_pool.

Reviewed-by: NeilBrown
Signed-off-by: Guoqing Jiang
Signed-off-by: Song Liu

Guoqing Jiang
2019-06-21 07:36:00 +0800
3e148a320 md/raid1: fix potential data inconsistency issue with write behind device

In write-behind mode, we consider a write IO complete once it has
reached all the non-writemostly devices. This works fine for single
queue devices.

But for a multiqueue device, if lots of IOs come from the upper
layer, the write-behind device could issue those IOs to different
queues, depending on each queue's delay, so there is no guarantee
that those IOs arrive in order.

To address the issue, we need to check for collisions among
write-behind IOs: we can only continue if there is no collision, otherwise
we wait for the completion of the previous colliding IO.

WBCollision is introduced for multiqueue devices working
in write-behind mode.
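
The collision check can be sketched in Python as an overlap test on sector
ranges. The class and its method names are hypothetical, and a plain list
stands in for whatever per-device bookkeeping the driver keeps.

```python
class WriteBehindTracker:
    """Toy model: refuse a write-behind IO that overlaps one in flight."""

    def __init__(self):
        self.in_flight = []            # (lo, hi) sector ranges, hi exclusive

    def collides(self, lo, hi):
        # Two half-open ranges overlap when each starts before the other ends.
        return any(lo < b and a < hi for a, b in self.in_flight)

    def try_start(self, lo, hi):
        if self.collides(lo, hi):
            return False               # caller must wait for the earlier IO
        self.in_flight.append((lo, hi))
        return True

    def complete(self, lo, hi):
        self.in_flight.remove((lo, hi))
```

Adjacent ranges such as (0, 8) and (8, 16) do not collide, whereas (4, 12)
must wait until the IO covering (0, 8) has completed.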

This patch doesn't handle the cases below, which could also cause
data inconsistency; they will be handled in later
patches.

1. modifying max_write_behind via the write backlog node.
2. adding or removing the array's bitmap dynamically.
3. a change of member disks.

Reviewed-by: NeilBrown
Signed-off-by: Guoqing Jiang
Signed-off-by: Song Liu

Guoqing Jiang
2019-06-21 07:35:59 +0800

18 Jun, 2019

1 commit

9642fa73d md: fix for divide error in status_resync

Stopping external metadata arrays during resync/recovery causes
retries, a loop of interrupting and restarting reconstruction, until it
hits a good moment to stop completely. During these retries
curr_mark_cnt can be small, especially on HDD drives, so the subtraction
result can be smaller than 0. However, it is cast to uint without
checking. As a result, the status bar in /proc/mdstat behaves strangely
while stopping (it jumps between 0% and 99%).

The real problem occurs after commit 72deb455b5ec ("block: remove
CONFIG_LBDAF"). The sector_div() macro has been changed, and now the
divisor is cast to uint32. For db = -8 the divisor (db/32+1) becomes 0.

Check whether the db value can really be counted, and replace this macro
with the div64_u64() inline.
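
The arithmetic behind the divide error can be checked with a short Python
sketch. It assumes that db is held in a 64-bit unsigned variable and that the
divisor expression is db/32 + 1; the function name is hypothetical.

```python
U64_MASK = (1 << 64) - 1
U32_MASK = (1 << 32) - 1


def divisor_after_u32_cast(db):
    """Model the divisor db/32 + 1 once it is truncated to 32 bits."""
    db_unsigned = db & U64_MASK          # a negative db wraps to a huge value
    divisor = db_unsigned // 32 + 1      # computed in 64 bits
    return divisor & U32_MASK            # then truncated by the uint32 cast
```

For db = -8 the 64-bit divisor is 2**59, whose low 32 bits are all zero, so the
truncated divisor is 0 and the division faults. A small positive db such as 64
gives the expected divisor of 3.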

Signed-off-by: Mariusz Tkaczyk
Signed-off-by: Song Liu

Mariusz Tkaczyk
2019-06-18 23:02:25 +0800

15 Jun, 2019

2 commits

e5b521ee9 md: fix spelling typo and add necessary space

This patch fixes a spelling typo and adds a necessary space in the code.
In addition, the patch gets rid of an unnecessary 'if'.

Signed-off-by: Yufen Yu
Signed-off-by: Song Liu
Signed-off-by: Jens Axboe

Yufen Yu
2019-06-15 15:37:34 +0800
168b305b0 md: md.c: Return -ENODEV when mddev is NULL in rdev_attr_show

Commit c42d3240990814eec1e4b2b93fa0487fc4873aed
("md: return -ENODEV if rdev has no mddev assigned") changed
rdev_attr_store to return -ENODEV when rdev->mddev is NULL, now do the
same to rdev_attr_show.

Signed-off-by: Marcos Paulo de Souza
Signed-off-by: Song Liu
Signed-off-by: Jens Axboe

Marcos Paulo de Souza
2019-06-15 15:37:34 +0800

24 May, 2019

1 commit

af1a8899d treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 47

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 or at your option any
later version you should have received a copy of the gnu general
public license for example usr src linux copying if not write to the
free software foundation inc 675 mass ave cambridge ma 02139 usa

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 20 file(s).

Signed-off-by: Thomas Gleixner
Reviewed-by: Allison Randal
Reviewed-by: Kate Stewart
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170858.552543146@linutronix.de
Signed-off-by: Greg Kroah-Hartman

Thomas Gleixner
2019-05-24 23:27:13 +0800

10 May, 2019

1 commit

ddde2af74 md: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT

Percpu reference counters should now be initialized with the
PERCPU_REF_ALLOW_REINIT flag in order to allow switching them to
percpu mode from atomic mode.
To make the percpu_ref_switch_to_percpu() call in set_in_sync()
succeed, let's initialize percpu refcounters with the
PERCPU_REF_ALLOW_REINIT flag.

Signed-off-by: Roman Gushchin
Acked-by: Tejun Heo
Signed-off-by: Dennis Zhou

Roman Gushchin
2019-05-10 01:50:59 +0800

17 Apr, 2019

1 commit

c42d32409 md: return -ENODEV if rdev has no mddev assigned

Mdadm expects that setting a drive as faulty will fail with -EBUSY only if
this operation would cause the RAID to fail. If this happens, it will
try to stop the array. Currently -EBUSY might also be returned if rdev
is in the middle of the removal process - for example there is a race
with mdmon that already requested the drive to be failed/removed.

If rdev does not contain mddev, return -ENODEV instead, so the caller
can distinguish between those two cases and behave accordingly.

Reviewed-by: NeilBrown
Signed-off-by: Pawel Baldysiak
Signed-off-by: Song Liu

Pawel Baldysiak
2019-04-17 00:31:21 +0800

11 Apr, 2019

5 commits

2b598ee54 md: mark md_cluster_mod static

Sparse complains that it has no external declaration, and it turns out
that it is never even used outside of md.c. So just mark it static
and drop the export.

Acked-by: Guoqing Jiang
Signed-off-by: Christoph Hellwig
Signed-off-by: Song Liu

Christoph Hellwig
2019-04-11 06:26:09 +0800
ae50640be md: use correct type in super_1_sync

If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.

Signed-off-by: Christoph Hellwig
Signed-off-by: Song Liu

Christoph Hellwig
2019-04-11 06:26:09 +0800
00485d094 md: use correct type in super_1_load

If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.

Signed-off-by: Christoph Hellwig
Signed-off-by: Song Liu

Christoph Hellwig
2019-04-11 06:26:08 +0800
ed4d0a4ea md: add a missing endianness conversion in check_sb_changes

The on-disk value is little endian and we need to convert it to
native endian before storing the value in the in-core structure.
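
As a small illustration of why the conversion matters, the Python sketch below
decodes the same four on-disk bytes both ways. The helper name mirrors the
kernel's le32_to_cpu() only by analogy; the byte values are made up.

```python
def le32_to_cpu_sketch(raw):
    """Decode four little-endian on-disk bytes into a native integer."""
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    return int.from_bytes(raw, "little")


ON_DISK = bytes([0x04, 0x00, 0x00, 0x00])      # the value 4, stored little endian
correct = le32_to_cpu_sketch(ON_DISK)          # 4 on any host
misread = int.from_bytes(ON_DISK, "big")       # what a big-endian host would
                                               # see without the conversion
```

Here 'correct' is 4 while 'misread' is 0x04000000, which is the kind of silent
mismatch a missing conversion produces on big-endian machines.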

Fixes: 7564beda19b36 ("md-cluster/raid10: support add disk under grow mode")
Cc: stable@vger.kernel.org # 4.20+
Acked-by: Guoqing Jiang
Signed-off-by: Christoph Hellwig
Signed-off-by: Song Liu

Christoph Hellwig
2019-04-11 06:26:08 +0800
ee37e6219 md: add mddev->pers to avoid potential NULL pointer dereference

When doing re-add, we need to ensure rdev->mddev->pers is not NULL,
which avoids a potential NULL pointer dereference in the following
add_bound_rdev().

Fixes: a6da4ef85cef ("md: re-add a failed disk")
Cc: Xiao Ni
Cc: NeilBrown
Cc: stable@vger.kernel.org # 4.4+
Reviewed-by: NeilBrown
Signed-off-by: Yufen Yu
Signed-off-by: Song Liu

Yufen Yu
2019-04-11 06:26:08 +0800

07 Apr, 2019

1 commit

72deb455b block: remove CONFIG_LBDAF

Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
architectures. These types are required to support block device and/or
file sizes larger than 2 TiB, and have generally defaulted to on for
a long time. Enabling the option only increases the i386 tinyconfig
size by 145 bytes, and many data structures already always use
64-bit values for their in-core and on-disk data structures anyway,
so there should not be a large change in dynamic memory usage either.

Dropping this option removes a somewhat weird non-default config that
has caused various bugs or compiler warnings when actually used.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-04-07 00:48:35 +0800

02 Apr, 2019

2 commits

2bc13b83e md: batch flush requests.

Currently if many flush requests are submitted to an md device in quick
succession, they are serialized and can take a long time to process.
We don't really need to call flush all those times - a single flush call
can satisfy all requests submitted before it started.
So keep track of when the current flush started and when it finished, and
allow any pending flush that was requested before the flush started
to complete without waiting any more.
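
The batching rule can be sketched as a toy Python model. A logical clock
replaces real timestamps, and the class and method names are hypothetical; the
real code also has to deal with locking and with a flush that is still in
progress, which this sketch leaves out.

```python
import itertools


class FlushBatcher:
    """Toy model: one device flush satisfies every earlier flush request."""

    def __init__(self):
        self._clock = itertools.count(1)   # logical time, strictly increasing
        self.last_flush_start = 0          # start time of the last finished flush
        self.device_flushes = 0            # how many real flushes were issued

    def request(self):
        """Record when a caller asked for a flush."""
        return next(self._clock)

    def complete(self, requested_at):
        """Finish a request, issuing a real flush only if one is still needed."""
        if self.last_flush_start > requested_at:
            return                         # a later flush already covered it
        started = next(self._clock)
        self.device_flushes += 1           # issue one flush to the members
        self.last_flush_start = started
```

Three requests made before the first flush starts are all covered by that one
flush; only a request made afterwards triggers a second flush.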

Test results from Xiao:

The test is done on a raid10 device created from 4 SSDs. The tool is
dbench.

1. The latest linux stable kernel
Operation                    Count    AvgLat    MaxLat
------------------------------------------------------
Deltree                        768    10.509    78.305
Flush                      2078376     0.013    10.094
Close                     21787697     0.019    18.821
LockX                        96580     0.007     3.184
Mkdir                          384     0.008     0.062
Rename                     1255883     0.191    23.534
ReadX                     46495589     0.020    14.230
WriteX                    14790591     7.123    60.706
Unlink                     5989118     0.440    54.551
UnlockX                      96580     0.005     2.736
FIND_FIRST                10393845     0.042    12.079
SET_FILE_INFORMATION       2415558     0.129    10.088
QUERY_FILE_INFORMATION     4711725     0.005     8.462
QUERY_PATH_INFORMATION    26883327     0.032    21.715
QUERY_FS_INFORMATION       4929409     0.010     8.238
NTCreateX                 29660080     0.100    53.268

Throughput 1034.88 MB/sec (sync open) 128 clients 128 procs
max_latency=60.712 ms

2. With patch1 "Revert "MD: fix lock contention for flush bios""
Operation                    Count    AvgLat    MaxLat
------------------------------------------------------
Deltree                        256     8.326    36.761
Flush                       693291     3.974   180.269
Close                      7266404     0.009    36.929
LockX                        32160     0.006     0.840
Mkdir                          128     0.008     0.021
Rename                      418755     0.063    29.945
ReadX                     15498708     0.007     7.216
WriteX                     4932310    22.482   267.928
Unlink                     1997557     0.109    47.553
UnlockX                      32160     0.004     1.110
FIND_FIRST                 3465791     0.036     7.320
SET_FILE_INFORMATION        805825     0.015     1.561
QUERY_FILE_INFORMATION     1570950     0.005     2.403
QUERY_PATH_INFORMATION     8965483     0.013    14.277
QUERY_FS_INFORMATION       1643626     0.009     3.314
NTCreateX                  9892174     0.061    41.278

Throughput 345.009 MB/sec (sync open) 128 clients 128 procs
max_latency=267.939 ms

3. With patch1 and patch2
Operation                    Count    AvgLat    MaxLat
------------------------------------------------------
Deltree                        768     9.570    54.588
Flush                      2061354     0.666    15.102
Close                     21604811     0.012    25.697
LockX                        95770     0.007     1.424
Mkdir                          384     0.008     0.053
Rename                     1245411     0.096    12.263
ReadX                     46103198     0.011    12.116
WriteX                    14667988     7.375    60.069
Unlink                     5938936     0.173    30.905
UnlockX                      95770     0.005     4.147
FIND_FIRST                10306407     0.041    11.715
SET_FILE_INFORMATION       2395987     0.048     7.640
QUERY_FILE_INFORMATION     4672371     0.005     9.291
QUERY_PATH_INFORMATION    26656735     0.018    19.719
QUERY_FS_INFORMATION       4887940     0.010     7.654
NTCreateX                 29410811     0.059    28.551

Throughput 1026.21 MB/sec (sync open) 128 clients 128 procs
max_latency=60.075 ms

Cc: stable@vger.kernel.org # v4.19+
Tested-by: Xiao Ni
Signed-off-by: NeilBrown
Signed-off-by: Song Liu
Signed-off-by: Jens Axboe

NeilBrown
2019-04-02 02:11:48 +0800
4bc034d35 Revert "MD: fix lock contention for flush bios"

This reverts commit 5a409b4f56d50b212334f338cb8465d65550cd85.

This patch has two problems.

1/ It makes multiple calls to submit_bio() from inside a make_request_fn.
The bios thus submitted will be queued on current->bio_list and not
submitted immediately. As the bios are allocated from a mempool,
this can theoretically result in a deadlock - all the pool of requests
could be in various ->bio_list queues and a subsequent mempool_alloc
could block waiting for one of them to be released.

2/ It aims to handle a case when there are many concurrent flush requests.
It handles this by submitting many requests in parallel - all of which
are identical and so most of which do nothing useful.
It would be more efficient to just send one lower-level request, but
allow that to satisfy multiple upper-level requests.

Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios")
Cc: stable@vger.kernel.org # v4.19+
Tested-by: Xiao Ni
Signed-off-by: NeilBrown
Signed-off-by: Song Liu
Signed-off-by: Jens Axboe

NeilBrown
2019-04-02 02:11:48 +0800

14 Jan, 2019

1 commit

6251691a9 md: Make bio_alloc_mddev use bio_alloc_bioset

bio_alloc_bioset returns a bio pointer or NULL, so we can avoid storing
the returned data into a new variable.

Acked-by: Guoqing Jiang
Acked-by: Artur Paszkiewicz
Signed-off-by: Marcos Paulo de Souza
Signed-off-by: Jens Axboe

Marcos Paulo de Souza
2019-01-14 21:31:56 +0800

03 Jan, 2019

1 commit

dc629c211 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md into for-linus

Pull the pending 4.21 changes for md from Shaohua.

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md: fix raid10 hang issue caused by barrier
raid10: refactor common wait code from regular read/write request
md: remvoe redundant condition check
lib/raid6: add option to skip algo benchmarking
lib/raid6: sort algos in rough performance order
lib/raid6: check for assembler SSSE3 support
lib/raid6: avoid __attribute_const__ redefinition
lib/raid6: add missing include for raid6test
md: remove set but not used variable 'bi_rdev'

Jens Axboe
2019-01-03 23:21:02 +0800

21 Dec, 2018

2 commits

37b22c289 md: remvoe redundant condition check

mempool_destroy() can handle NULL pointer correctly,
so there is no need to check NULL pointer before calling
mempool_destroy().

Signed-off-by: Chengguang Xu
Signed-off-by: Shaohua Li

Chengguang Xu
2018-12-21 00:53:24 +0800
f91389c8d md: remove set but not used variable 'bi_rdev'

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/md/md.c: In function 'md_integrity_add_rdev':
drivers/md/md.c:2149:24: warning:
variable 'bi_rdev' set but not used [-Wunused-but-set-variable]

It is not used any more after commit
1501efadc524 ("md/raid: only permit hot-add of compatible integrity profiles")

Signed-off-by: Yue Haibing
Signed-off-by: Shaohua Li

Yue Haibing
2018-12-21 00:53:23 +0800

10 Dec, 2018

1 commit

112f158f6 block: stop passing 'cpu' to all percpu stats methods

All of part_stat_* and related methods are used with preempt disabled,
so there is no need to pass cpu around to all of them. Just call
smp_processor_id() as needed.

Suggested-by: Jens Axboe
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe

Mike Snitzer
2018-12-10 23:30:37 +0800

23 Oct, 2018

2 commits

af9b926de MD: Memory leak when flush bio size is zero

flush_pool is leaked when flush bio size is zero

Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios")
Signed-off-by: David Jeffery
Signed-off-by: Xiao Ni
Signed-off-by: Shaohua Li

Xiao Ni
2018-10-23 00:15:26 +0800
6aaa58c99 md: fix memleak for mempool

I noticed kmemleak reports a memory leak when creating/stopping
md in a loop; backtrace:
[] mempool_create_node+0x86/0xd0
[] md_run+0x1057/0x1410 [md_mod]
[] do_md_run+0x15/0x130 [md_mod]
[] md_ioctl+0x1f49/0x25d0 [md_mod]
[] blkdev_ioctl+0x680/0xd00

The root cause is that we allocate mddev->flush_pool and
mddev->flush_bio_pool in md_run, but do_md_stop does not
call into md_stop, only __md_stop. Moving the
mempool_destroy calls to __md_stop fixes the problem for me.

The bug was introduced in 5a409b4f56d5; the fix should go to
4.18+

Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios")
Signed-off-by: Jack Wang
Reviewed-by: Xiao Ni
Signed-off-by: Shaohua Li

Jack Wang
2018-10-23 00:12:38 +0800

19 Oct, 2018

4 commits

cb9ee1543 md-cluster: send BITMAP_NEEDS_SYNC message if reshaping is interrupted

We need to continue the reshaping if it was interrupted on the
original node. So the original node should call resync_bitmap
in case reshaping is aborted.

Then the BITMAP_NEEDS_SYNC message is broadcast to the other nodes, and
the node which continues the reshaping should restart the reshape from
mddev->reshape_position instead of from the very beginning.

Reviewed-by: NeilBrown
Signed-off-by: Guoqing Jiang
Signed-off-by: Shaohua Li

Guoqing Jiang
2018-10-19 00:40:48 +0800
ca1e98e04 md-cluster/raid10: don't call remove_and_add_spares during reshaping stage

remove_and_add_spares is not needed if reshape is
happening on another node, because raid10_add_disk,
called inside raid10_start_reshape, handles the
role changes of the disks. Plus, remove_and_add_spares
can't deal with the role change due to reshape.

Reviewed-by: NeilBrown
Signed-off-by: Guoqing Jiang
Signed-off-by: Shaohua Li

Guoqing Jiang
2018-10-19 00:39:10 +0800
aefb2e5fc md-cluster/raid10: call update_size in md_reap_sync_thread

We need to change the capacity on all nodes after one node
finishes reshape. As before, we can't change the
capacity directly in md_do_sync; instead, the capacity should
only be changed in update_size or on receipt of a CHANGE_CAPACITY
msg.

So the master node calls update_size after it completes reshape in
md_reap_sync_thread, but we need to skip ops->update_size if
MD_CLOSING is set, since the reshape could not have finished.

Reviewed-by: NeilBrown
Signed-off-by: Guoqing Jiang
Signed-off-by: Shaohua Li

Guoqing Jiang
2018-10-19 00:38:06 +0800
7564beda1 md-cluster/raid10: support add disk under grow mode

For the clustered raid10 scenario, we need to let all the nodes
know that a new disk has been added to the array. The
reshape caused by adding the new member only needs to happen on
one node, but the other nodes should know about the change.

Reshape means reading data from somewhere (which is already
used by the array) and writing data to an unused region. Obviously, it
is awful if one node is reading data from an address while another
node is writing to the same address. Since we have already
implemented suspending writes in the resyncing area, we can
just broadcast the reading address to the other nodes to avoid the
trouble.

The master node calls reshape_request and then updates the sb
during the reshape period. To avoid the above trouble, we call
resync_info_update to send a RESYNCING message in reshape_request.

Then, from a slave node's view, it receives two types of message:
1. RESYNCING message
The slave node adds the address (where the master node is reading data
from) to the suspend list.

2. METADATA_UPDATED message
Once slave nodes know the reshaping has started on the master node,
it is time to update the reshape position and call start_reshape to
follow the master node's step. After the reshape is done, only the reshape
position needs to be updated, so the majority of the reshaping work
happens on the master node.
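
The slave-side handling of the two message types can be sketched as a toy
Python model. The dictionary-based message format and its field names ('low',
'high', 'reshape_position') are hypothetical, and a boolean stands in for the
call to start_reshape.

```python
class SlaveNode:
    """Toy model of how a non-master node reacts to reshape messages."""

    def __init__(self):
        self.suspend_list = []         # ranges where local writes are held back
        self.reshape_started = False   # stands in for calling start_reshape
        self.reshape_position = None

    def handle(self, msg):
        if msg["type"] == "RESYNCING":
            # The master is reading this range: suspend local writes to it.
            self.suspend_list.append((msg["low"], msg["high"]))
        elif msg["type"] == "METADATA_UPDATED":
            if not self.reshape_started:
                self.reshape_started = True    # follow the master's step once
            # Afterwards only the position needs to be kept up to date.
            self.reshape_position = msg["reshape_position"]
```

The slave never moves data itself in this model; it only tracks the suspended
range and the reshape position announced by the master.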

Reviewed-by: NeilBrown
Signed-off-by: Guoqing Jiang
Signed-off-by: Shaohua Li

Guoqing Jiang
2018-10-19 00:34:56 +0800

15 Oct, 2018

1 commit

9e753ba9b MD: fix invalid stored role for a disk - try2

Commit d595567dc4f0 (MD: fix invalid stored role for a disk) broke linear
hotadd. Let's only fix the role for disks in raid1/10.
Based on Guoqing's original patch.

Reported-by: kernel test robot
Cc: Gioh Kim
Cc: Guoqing Jiang
Signed-off-by: Shaohua Li

Shaohua Li
2018-10-15 08:05:07 +0800

04 Oct, 2018

1 commit

059421e04 md: allow metadata updates while suspending an array - fix

Commit 35bfc52187f6 ("md: allow metadata update while suspending.")
added support for allowing md_check_recovery() to still perform
metadata updates while the array is entering the 'suspended' state.
This is needed to allow the process of entering the state to
complete.

Unfortunately, the patch doesn't really work. The test for
"mddev->suspended" at the start of md_check_recovery() means that the
function doesn't try to do anything at all while entering suspend.

This patch moves the code of updating the metadata while suspending to
*before* the test on mddev->suspended.
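
The effect of the reordering can be shown with a toy Python model that compares
the two orderings. All names are hypothetical and the superblock write is
reduced to a counter.

```python
class SuspendModel:
    """Toy model: metadata update before or after the 'suspended' test."""

    def __init__(self):
        self.suspended = False    # set while the array is entering suspend
        self.sb_dirty = False     # a metadata update is pending
        self.sb_writes = 0        # number of superblock writes performed

    def _update_metadata(self):
        if self.sb_dirty:
            self.sb_writes += 1
            self.sb_dirty = False

    def check_recovery_old(self):
        if self.suspended:
            return                # bails out before the metadata update
        self._update_metadata()

    def check_recovery_fixed(self):
        self._update_metadata()   # now runs even while suspending
        if self.suspended:
            return
        # ... the remaining recovery work would follow here ...
```

With the old ordering a pending update is never written while suspending, so
the suspend cannot make progress; with the fixed ordering it is written first.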

Reported-by: Jeff Mahoney
Fixes: 35bfc52187f6 ("md: allow metadata update while suspending.")
Signed-off-by: NeilBrown
Signed-off-by: Shaohua Li

NeilBrown
2018-10-04 00:52:08 +0800