14 Sep, 2019

1 commit

  • Due to a bug introduced in Linux 3.14 we cannot determine the
    correct layout for a multi-zone RAID0 array - there are two
    possibilities.

    It is possible to tell the kernel which to choose using a module
    parameter, but this can be clumsy to use. It would be best if
    the choice were recorded in the metadata.
    So add a feature flag for this purpose.
    If it is set, then the 'layout' field of the superblock is used
    to determine which layout to use.

    If this flag is not set, then mddev->layout gets set to -1,
    which causes the module parameter to be required.
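
    The decision described above can be sketched as follows. This is a
    minimal userspace model, not the kernel code: the function name, the
    flag's bit value and the LAYOUT_UNKNOWN macro are assumptions made for
    illustration.

```c
#include <stdint.h>

/* Hypothetical bit value for the new feature flag (illustration only). */
#define FEATURE_RAID0_LAYOUT (1u << 12)

/* Sentinel meaning "layout unknown, the module parameter is required". */
#define LAYOUT_UNKNOWN (-1)

/*
 * Pick the RAID0 layout from two superblock fields.
 * If the feature flag is set, trust the recorded 'layout' field;
 * otherwise report "unknown" so the caller falls back to the
 * module parameter.
 */
static int raid0_layout_from_sb(uint32_t feature_map, int32_t sb_layout)
{
    if (feature_map & FEATURE_RAID0_LAYOUT)
        return sb_layout;
    return LAYOUT_UNKNOWN;
}
```

    With the flag clear the recorded value is ignored entirely, which is
    what forces the administrator to make an explicit choice.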

    Acked-by: Guoqing Jiang
    Signed-off-by: NeilBrown
    Signed-off-by: Song Liu

    NeilBrown
     

04 Sep, 2019

1 commit

  • Currently md raid0/linear are not provided with any mechanism to validate
    if an array member got removed or failed. The driver keeps sending BIOs
    regardless of the state of array members, and kernel shows state 'clean'
    in the 'array_state' sysfs attribute. This leads to the following
    situation: if a raid0/linear array member is removed and the array is
    mounted, some user writing to this array won't realize that errors are
    happening unless they check dmesg or perform one fsync per written file.
    Despite udev signaling the member device is gone, 'mdadm' cannot issue the
    STOP_ARRAY ioctl successfully, given the array is mounted.

    In other words, no -EIO is returned and writes (except direct ones) appear
    normal. The user might therefore think the written data is correctly
    stored in the array, when in fact garbage was written, given that raid0
    does striping (and so requires all its members to be working in order not
    to corrupt data). For md/linear, writes to the available members will work
    fine, but writes that go to the missing member(s) cause file corruption,
    because the portion of the data destined for the missing devices is never
    actually written.

    This patch changes this behavior: we check if the block device's gendisk
    is UP when submitting the BIO to the array member, and if it isn't, we flag
    the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
    request to the array requiring data from a valid member is still completed.
    While flagging the device as MD_BROKEN, we also show a rate-limited warning
    in the kernel log.

    A new array state 'broken' was added too: it mimics the state 'clean' in
    every aspect, and is useful only to indicate that the array has some
    member missing. We rely on the MD_BROKEN flag to put the array in the
    'broken' state. This state cannot be written to 'array_state': since it
    only shows that one or more members of the array are missing and otherwise
    acts like 'clean', writing it would not make sense.

    With this patch, the filesystem reacts much faster to a missing array
    member: after some I/O errors, ext4 for instance aborts the journal
    and prevents corruption. Without this change, we're able to keep writing
    to the disk and after a machine reboot, e2fsck shows some severe fs errors
    that demand fixing. This patch was tested in ext4 and xfs filesystems, and
    requires a 'mdadm' counterpart to handle the 'broken' state.
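
    A minimal sketch of the behaviour described above, as a userspace
    model. The struct and function names are hypothetical; 'disk_up'
    stands in for the gendisk check and 'broken' for the MD_BROKEN flag.

```c
#include <errno.h>
#include <stdbool.h>

enum array_state { STATE_CLEAN, STATE_BROKEN };

struct member_model { bool disk_up; };  /* stand-in for the gendisk "up" check */
struct array_model  { bool broken;  };  /* stand-in for the MD_BROKEN flag */

/*
 * Submit I/O to one member. If the member is gone, flag the whole
 * array as broken and fail this I/O; I/O to members that are still
 * present keeps working.
 */
static int submit_to_member(struct array_model *a, const struct member_model *m)
{
    if (!m->disk_up) {
        a->broken = true;
        return -EIO;
    }
    return 0;
}

/* 'broken' is reported instead of 'clean' once the flag is set. */
static enum array_state array_state_of(const struct array_model *a)
{
    return a->broken ? STATE_BROKEN : STATE_CLEAN;
}
```

    In this model a read that only needs a valid member still succeeds
    after the array has been flagged, matching the description above.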

    Cc: Song Liu
    Reviewed-by: NeilBrown
    Signed-off-by: Guilherme G. Piccoli
    Signed-off-by: Song Liu

    Guilherme G. Piccoli
     

28 Aug, 2019

2 commits

  • Until revalidate_disk() has completed, the size of a new md array will
    appear to be zero.
    So we shouldn't report, through array_state, that the array is active
    until that time.
    udev rules check array_state to see if the array is ready. As soon as
    it appears to be ready, fsck can be run. If fsck finds the size to be
    zero, it will fail.

    So add a new flag to provide an interlock between do_md_run() and
    array_state_show(). This flag is set while do_md_run() is active and
    it prevents array_state_show() from reporting that the array is
    active.

    Before do_md_run() is called, ->pers will be NULL so array is
    definitely not active.
    After do_md_run() is called, revalidate_disk() will have run and the
    array will be completely ready.

    We also move various sysfs_notify*() calls out of md_run() into
    do_md_run() after MD_NOT_READY is cleared. This ensures the
    information is ready before the notification is sent.

    Prior to v4.12, array_state_show() was called with the
    mddev->reconfig_mutex held, which provided exclusion with do_md_run().

    Note that MD_NOT_READY is cleared twice. This is deliberate to cover
    both success and error paths with minimal noise.
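
    The interlock can be modelled roughly as below. This is a simplified
    userspace sketch with hypothetical names; the real array_state has
    more values than the two strings used here.

```c
#include <stdbool.h>
#include <stddef.h>

struct ready_model {
    const void *pers;   /* NULL until a personality is attached */
    bool not_ready;     /* stand-in for the MD_NOT_READY flag */
};

/* Report "active" only if a personality exists and the run has finished. */
static const char *array_state_model(const struct ready_model *m)
{
    if (m->pers != NULL && !m->not_ready)
        return "active";
    return "inactive";
}

/* The flag is set on entry and cleared on both exit paths. */
static int do_md_run_model(struct ready_model *m, const void *pers, bool fail)
{
    m->not_ready = true;
    if (fail) {
        m->not_ready = false;   /* error path */
        return -1;
    }
    m->pers = pers;
    /* the size would be revalidated here; readers still see "inactive" */
    m->not_ready = false;       /* success path */
    return 0;
}
```

    The two assignments that clear 'not_ready' correspond to the "cleared
    twice" remark: one for the error path and one for the success path.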

    Fixes: b7b17c9b67e5 ("md: remove mddev_lock() from md_attr_show()")
    Cc: stable@vger.kernel.org (v4.12++)
    Signed-off-by: NeilBrown
    Signed-off-by: Song Liu

    NeilBrown
     
  • Since commit 4ad23a976413 ("MD: use per-cpu counter for
    writes_pending"), set_in_sync() is substantially more expensive: it
    can wait for a full RCU grace period which can be 10s of milliseconds.

    So we should only call it when the cost is justified.

    md_check_recovery() currently calls set_in_sync() every time it finds
    anything to do (on non-external active arrays). For an array
    performing resync or recovery, this will be quite often.
    Each call will introduce a delay to the md thread, which can noticeably
    affect IO submission latency.

    In md_check_recovery() we only need to call set_in_sync() if
    'safemode' was non-zero at entry, meaning that there has been no
    recent IO. So we save this "safemode was nonzero" state, and only
    call set_in_sync() if it was non-zero.

    This measurably reduces mean and maximum IO submission latency during
    resync/recovery.
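
    The idea can be sketched like this. It is a userspace model with
    hypothetical names; a counter stands in for the expensive
    set_in_sync() call.

```c
#include <stdbool.h>

struct sync_model {
    int safemode;           /* non-zero: no recent IO */
    int set_in_sync_calls;  /* counts the expensive calls */
};

/* Stand-in for the expensive call (may wait for an RCU grace period). */
static void set_in_sync_model(struct sync_model *m)
{
    m->set_in_sync_calls++;
}

static void check_recovery_model(struct sync_model *m)
{
    /* Remember at entry whether safemode was non-zero. */
    bool try_set_sync = (m->safemode != 0);

    /* ... other recovery work would happen here ... */

    /* Only pay the cost when it is likely to be justified. */
    if (try_set_sync)
        set_in_sync_model(m);
}
```

    During a resync with ongoing IO, safemode stays zero, so repeated
    calls in this model never reach the expensive path.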

    Reported-and-tested-by: Jack Wang
    Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
    Cc: stable@vger.kernel.org (v4.12+)
    Signed-off-by: NeilBrown
    Signed-off-by: Song Liu

    NeilBrown
     

08 Aug, 2019

4 commits

  • When a disk is added to an array, md_reap_sync_thread is responsible
    for activating the spare and setting the In_sync flag for the new
    member in spare_active().

    But suppose a raid1 array has one member disk A, and disk B is added to
    the array. If we then offline A before all the data has been synchronized
    from A to B, B obviously doesn't have the latest data that A has, yet B
    is still marked with the In_sync flag.

    So let's not call spare_active() under that condition; otherwise B is
    still shown in the 'U' state, which is not correct.

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Song Liu

    Guoqing Jiang
     
  • When a disk is added to an array, the following path is called in mdadm:

    Manage_subdevs -> sysfs_freeze_array
    -> Manage_add
    -> sysfs_set_str(&info, NULL, "sync_action","idle")

    Then, on the kernel side, Manage_add invokes the path (add_new_disk ->
    validate_super = super_1_validate) to set the In_sync flag.

    In_sync means "device is in_sync with rest of array", but the newly
    added disk needs the resync thread to synchronize its data first.
    md_reap_sync_thread will eventually call spare_active to set In_sync
    for the newly added disk. So don't set In_sync if the array is frozen.

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Song Liu

    Guoqing Jiang
     
  • When the 'last' device in a RAID1 or RAID10 reports an error,
    we do not mark it as failed. This would serve little purpose
    as there is no risk of losing data beyond that which is obviously
    lost (as there is with RAID5), and there could be other sectors
    on the device which are readable, and only readable from this device.
    In general this maximises access to data.

    However the current implementation also stops an admin from removing
    the last device by direct action. This is rarely useful, but in many
    cases is not harmful and can make automation easier by removing special
    cases.

    Also, if an attempt to write metadata fails the device must be marked
    as faulty, else an infinite loop will result, attempting to update
    the metadata on all non-faulty devices.

    So add a 'fail_last_dev' member to 'struct mddev', so that we can bypass
    the 'last disk' checks for RAID1 and RAID10, and control the behavior
    per array by changing a sysfs node.
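
    A rough sketch of the bypass, as a userspace model. Only the
    'fail_last_dev' name comes from the text above; the struct and the
    function are hypothetical.

```c
#include <stdbool.h>

struct mirror_model {
    int  working_disks;
    bool fail_last_dev;   /* per-array switch described above */
};

/*
 * Handle an error on one device.
 * Returns true if the device was marked faulty.
 */
static bool handle_error_model(struct mirror_model *a)
{
    /* Default: keep the last device, it may still serve some reads. */
    if (a->working_disks == 1 && !a->fail_last_dev)
        return false;

    a->working_disks--;
    return true;
}
```

    Setting the switch lets the last device be failed like any other,
    which is the behaviour automation may want.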

    Signed-off-by: NeilBrown
    [add sysfs node for fail_last_dev by Guoqing]
    Signed-off-by: Guoqing Jiang
    Signed-off-by: Song Liu

    Guoqing Jiang
     
  • Instead of a linear approach to calculating a power of 10, use the
    generic int_pow(), which does it better.
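
    For comparison, here is a userspace sketch of both approaches. The
    function names are hypothetical; the first uses exponentiation by
    squaring, the usual technique for an integer power helper, and the
    second is the linear loop.

```c
#include <stdint.h>

/* Exponentiation by squaring: O(log exp) multiplications. */
static uint64_t int_pow_sketch(uint64_t base, unsigned int exp)
{
    uint64_t result = 1;

    while (exp) {
        if (exp & 1)
            result *= base;
        exp >>= 1;
        base *= base;   /* may wrap on the last round; the value is unused then */
    }
    return result;
}

/* Linear approach: one multiplication per step. */
static uint64_t pow10_linear(unsigned int exp)
{
    uint64_t result = 1;

    while (exp--)
        result *= 10;
    return result;
}
```

    Both return the same values for every power of 10 that fits in 64
    bits (exponents 0 to 19).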

    Signed-off-by: Andy Shevchenko
    Signed-off-by: Song Liu

    Andy Shevchenko
     

15 Jul, 2019

1 commit

  • Pull percpu updates from Dennis Zhou:
    "This includes changes to let percpu_ref release the backing percpu
    memory earlier after it has been switched to atomic in cases where the
    percpu ref is not revived.

    This will help recycle percpu memory earlier in cases where the
    refcounts are pinned for prolonged periods of time"

    * 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
    percpu_ref: release percpu memory early without PERCPU_REF_ALLOW_REINIT
    md: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
    io_uring: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
    percpu_ref: introduce PERCPU_REF_ALLOW_REINIT flag

    Linus Torvalds
     

01 Jul, 2019

1 commit

  • Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak
    fix. Otherwise we end up having conflicts with future patches between
    for-5.3/block and master that touch this area. In particular, it makes
    the bio_full() fix hard to backport to stable.

    * tag 'v5.2-rc6': (482 commits)
    Linux 5.2-rc6
    Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
    Bluetooth: Fix regression with minimum encryption key size alignment
    tcp: refine memory limit test in tcp_fragment()
    x86/vdso: Prevent segfaults due to hoisted vclock reads
    SUNRPC: Fix a credential refcount leak
    Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
    net :sunrpc :clnt :Fix xps refcount imbalance on the error path
    NFS4: Only set creation opendata if O_CREAT
    ARM: 8867/1: vdso: pass --be8 to linker if necessary
    KVM: nVMX: reorganize initial steps of vmx_set_nested_state
    KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
    habanalabs: use u64_to_user_ptr() for reading user pointers
    nfsd: replace Jeff by Chuck as nfsd co-maintainer
    inet: clear num_timeout reqsk_alloc()
    PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present
    net: mvpp2: debugfs: Add pmap to fs dump
    ipv6: Default fib6_type to RTN_UNICAST when not set
    net: hns3: Fix inconsistent indenting
    net/af_iucv: always register net_device notifier
    ...

    Jens Axboe
     

21 Jun, 2019

3 commits

  • There are now two places that need to deal with destroying the
    bitmap on failure, so move the common part to between the
    bitmap_abort and abort labels.

    Reviewed-by: NeilBrown
    Signed-off-by: Guoqing Jiang
    Signed-off-by: Song Liu

    Guoqing Jiang
     
  • Previously, we called rdev_init_wb to avoid potential data
    inconsistency when an array is created.

    Now, we need to call the function and create the mempool if a
    device is added or is just flagged as "writemostly". So
    mddev_create_wb_pool is introduced and called accordingly.
    For safety, we mark an implicit GFP_NOIO allocation scope
    for creating the mempool during mddev_suspend/mddev_resume.

    Conversely, the mempool should be removed after a member
    device or its "writemostly" flag is removed, which is done
    by calling mddev_destroy_wb_pool.

    Reviewed-by: NeilBrown
    Signed-off-by: Guoqing Jiang
    Signed-off-by: Song Liu

    Guoqing Jiang
     
  • For write-behind mode, we consider a write IO complete once it has
    reached all the non-writemostly devices. This works fine for single
    queue devices.

    But for a multiqueue device, if lots of IOs come from the upper
    layer, the write-behind device could issue those IOs to different
    queues, depending on each queue's delay, so there is no guarantee
    that those IOs arrive in order.

    To address the issue, we need to check for collisions among write
    behind IOs: we can only continue if there is no collision, otherwise
    we wait for the completion of the previous colliding IO.

    WBCollision is introduced for multiqueue devices working in
    write-behind mode.

    This patch doesn't handle the cases below, which could cause data
    inconsistency as well; these cases will be handled in later
    patches.

    1. modifying max_write_behind via the write backlog node.
    2. adding or removing the array's bitmap dynamically.
    3. changing a member disk.
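
    The collision check mentioned above amounts to an overlap test
    between the new write and the write-behind IOs still in flight. A
    minimal userspace sketch, with hypothetical names and half-open
    sector ranges:

```c
#include <stdbool.h>
#include <stdint.h>

/* A write covering sectors [lo, hi). */
struct wb_range {
    uint64_t lo;
    uint64_t hi;
};

/* Two half-open ranges overlap if each starts before the other ends. */
static bool wb_ranges_collide(struct wb_range a, struct wb_range b)
{
    return a.lo < b.hi && b.lo < a.hi;
}

/* True if 'io' overlaps any in-flight write-behind IO and so must wait. */
static bool wb_must_wait(const struct wb_range *inflight, int n,
                         struct wb_range io)
{
    for (int i = 0; i < n; i++)
        if (wb_ranges_collide(inflight[i], io))
            return true;
    return false;
}
```

    A write that merely touches the end of an in-flight range does not
    collide under this definition, so it can proceed without waiting.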

    Reviewed-by: NeilBrown
    Signed-off-by: Guoqing Jiang
    Signed-off-by: Song Liu

    Guoqing Jiang
     

18 Jun, 2019

1 commit

  • Stopping external metadata arrays during resync/recovery causes
    retries, a loop of interrupting and restarting reconstruction, until
    it hits a good moment to stop completely. During these retries
    curr_mark_cnt can be small, especially on HDD drives, so the
    subtraction result can be smaller than 0. However it is cast to uint
    without checking. As a result the status bar in /proc/mdstat looks
    strange while stopping (it jumps between 0% and 99%).

    The real problem occurs here after commit 72deb455b5ec ("block: remove
    CONFIG_LBDAF"). The sector_div() macro has been changed, and now the
    divisor is cast to uint32. For db = -8 the divisor (db/32+1) becomes 0.

    Check whether the db value can really be computed and replace this macro
    with the div64_u64() inline.
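
    The arithmetic can be reproduced in userspace. The sketch below uses
    hypothetical function names: the first shows how a negative db,
    treated as unsigned and then truncated to 32 bits, yields a zero
    divisor; the second shows a guarded 64-bit division in the spirit of
    the fix.

```c
#include <stdint.h>

/* db/32+1 computed in 64 bits, then truncated to a 32-bit divisor. */
static uint32_t truncated_divisor(int64_t db_signed)
{
    uint64_t db = (uint64_t)db_signed;   /* -8 becomes 2^64 - 8 */

    return (uint32_t)(db / 32 + 1);      /* 2^59 truncates to 0 */
}

/*
 * Guarded variant: only count progress when the subtraction cannot
 * go negative, and divide in full 64 bits so the divisor is never 0.
 */
static uint64_t remaining_estimate(uint64_t remaining,
                                   uint64_t curr_mark_cnt,
                                   uint64_t active_plus_mark)
{
    uint64_t db = 0;

    if (curr_mark_cnt >= active_plus_mark)
        db = curr_mark_cnt - active_plus_mark;
    return remaining / (db / 32 + 1);
}
```

    In the guarded variant the divisor is at least 1, so the division is
    always defined.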

    Signed-off-by: Mariusz Tkaczyk
    Signed-off-by: Song Liu

    Mariusz Tkaczyk
     

15 Jun, 2019

2 commits


24 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 or at your option any
    later version you should have received a copy of the gnu general
    public license for example usr src linux copying if not write to the
    free software foundation inc 675 mass ave cambridge ma 02139 usa

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 20 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Kate Stewart
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190520170858.552543146@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

10 May, 2019

1 commit

  • Percpu reference counters should now be initialized with the
    PERCPU_REF_ALLOW_REINIT flag in order to allow switching them to the
    percpu mode from the atomic mode.
    To make the percpu_ref_switch_to_percpu() call in set_in_sync()
    succeed, let's initialize percpu refcounters with the
    PERCPU_REF_ALLOW_REINIT flag.

    Signed-off-by: Roman Gushchin
    Acked-by: Tejun Heo
    Signed-off-by: Dennis Zhou

    Roman Gushchin
     

17 Apr, 2019

1 commit

  • Mdadm expects that setting a drive as faulty will fail with -EBUSY only
    if this operation would cause the RAID array to fail. If this happens,
    it will try to stop the array. Currently -EBUSY might also be returned
    if rdev is in the middle of the removal process - for example there is
    a race with mdmon that already requested the drive to be failed/removed.

    If rdev does not contain mddev, return -ENODEV instead, so the caller
    can distinguish between those two cases and behave accordingly.
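
    The two cases can be told apart as in the sketch below. It is a
    userspace model with hypothetical names; only the two error codes
    come from the text above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct rdev_model {
    const void *mddev;   /* NULL once removal of the device has started */
    bool last_working;   /* failing it would fail the whole array */
};

/* Try to mark a device faulty; returns 0 or a negative errno. */
static int set_faulty_model(const struct rdev_model *rdev)
{
    if (rdev->mddev == NULL)
        return -ENODEV;   /* already being removed: nothing to stop */
    if (rdev->last_working)
        return -EBUSY;    /* would fail the array: caller may stop it */
    return 0;
}
```

    A caller can then react to -EBUSY by stopping the array and treat
    -ENODEV as "someone else already handled this device".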

    Reviewed-by: NeilBrown
    Signed-off-by: Pawel Baldysiak
    Signed-off-by: Song Liu

    Pawel Baldysiak
     

11 Apr, 2019

5 commits


07 Apr, 2019

1 commit

  • Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
    architectures. These types are required to support block device and/or
    file sizes larger than 2 TiB, and have generally defaulted to on for
    a long time. Enabling the option only increases the i386 tinyconfig
    size by 145 bytes, and many data structures already always use
    64-bit values for their in-core and on-disk data structures anyway,
    so there should not be a large change in dynamic memory usage either.

    Dropping this option removes a somewhat weird non-default config that
    has caused various bugs or compiler warnings when actually used.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

02 Apr, 2019

2 commits

  • Currently if many flush requests are submitted to an md device in quick
    succession, they are serialized and can take a long time to process.
    We don't really need to call flush all those times - a single flush call
    can satisfy all requests submitted before it started.
    So keep track of when the current flush started and when it finished, and
    allow any pending flush that was requested before the flush started
    to complete without waiting any more.
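
    The bookkeeping can be sketched with a logical clock. This is a
    userspace model with hypothetical names, not the kernel code; the
    counter stands in for a monotonic timestamp.

```c
#include <stdbool.h>
#include <stdint.h>

struct flush_model {
    uint64_t clock;        /* logical time, stands in for a monotonic clock */
    uint64_t start_flush;  /* when the in-flight flush started */
    uint64_t last_flush;   /* start time of the last completed flush */
    unsigned issued;       /* flushes actually sent to the lower devices */
};

/* A flush request arrives; remember when. */
static uint64_t flush_request_arrive(struct flush_model *m)
{
    return ++m->clock;
}

/* Start a real flush on the lower devices. */
static void flush_start(struct flush_model *m)
{
    m->start_flush = ++m->clock;
    m->issued++;
}

/* The real flush finished: record when it had started. */
static void flush_finish(struct flush_model *m)
{
    m->last_flush = m->start_flush;
}

/* A request is satisfied by any completed flush that started after it. */
static bool flush_satisfied(const struct flush_model *m, uint64_t req_time)
{
    return m->last_flush > req_time;
}
```

    Requests that arrive while a flush is already running are not covered
    by it and need a later flush.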

    Test results from Xiao:

    The test was done on a raid10 device created from 4 SSDs. The tool is
    dbench.

    1. The latest linux stable kernel
    Operation                  Count   AvgLat   MaxLat
    --------------------------------------------------
    Deltree                      768   10.509   78.305
    Flush                    2078376    0.013   10.094
    Close                   21787697    0.019   18.821
    LockX                      96580    0.007    3.184
    Mkdir                        384    0.008    0.062
    Rename                   1255883    0.191   23.534
    ReadX                   46495589    0.020   14.230
    WriteX                  14790591    7.123   60.706
    Unlink                   5989118    0.440   54.551
    UnlockX                    96580    0.005    2.736
    FIND_FIRST              10393845    0.042   12.079
    SET_FILE_INFORMATION     2415558    0.129   10.088
    QUERY_FILE_INFORMATION   4711725    0.005    8.462
    QUERY_PATH_INFORMATION  26883327    0.032   21.715
    QUERY_FS_INFORMATION     4929409    0.010    8.238
    NTCreateX               29660080    0.100   53.268

    Throughput 1034.88 MB/sec (sync open) 128 clients 128 procs
    max_latency=60.712 ms

    2. With patch1 "Revert "MD: fix lock contention for flush bios""
    Operation                  Count   AvgLat   MaxLat
    --------------------------------------------------
    Deltree                      256    8.326   36.761
    Flush                     693291    3.974  180.269
    Close                    7266404    0.009   36.929
    LockX                      32160    0.006    0.840
    Mkdir                        128    0.008    0.021
    Rename                    418755    0.063   29.945
    ReadX                   15498708    0.007    7.216
    WriteX                   4932310   22.482  267.928
    Unlink                   1997557    0.109   47.553
    UnlockX                    32160    0.004    1.110
    FIND_FIRST               3465791    0.036    7.320
    SET_FILE_INFORMATION      805825    0.015    1.561
    QUERY_FILE_INFORMATION   1570950    0.005    2.403
    QUERY_PATH_INFORMATION   8965483    0.013   14.277
    QUERY_FS_INFORMATION     1643626    0.009    3.314
    NTCreateX                9892174    0.061   41.278

    Throughput 345.009 MB/sec (sync open) 128 clients 128 procs
    max_latency=267.939 ms

    3. With patch1 and patch2
    Operation                  Count   AvgLat   MaxLat
    --------------------------------------------------
    Deltree                      768    9.570   54.588
    Flush                    2061354    0.666   15.102
    Close                   21604811    0.012   25.697
    LockX                      95770    0.007    1.424
    Mkdir                        384    0.008    0.053
    Rename                   1245411    0.096   12.263
    ReadX                   46103198    0.011   12.116
    WriteX                  14667988    7.375   60.069
    Unlink                   5938936    0.173   30.905
    UnlockX                    95770    0.005    4.147
    FIND_FIRST              10306407    0.041   11.715
    SET_FILE_INFORMATION     2395987    0.048    7.640
    QUERY_FILE_INFORMATION   4672371    0.005    9.291
    QUERY_PATH_INFORMATION  26656735    0.018   19.719
    QUERY_FS_INFORMATION     4887940    0.010    7.654
    NTCreateX               29410811    0.059   28.551

    Throughput 1026.21 MB/sec (sync open) 128 clients 128 procs
    max_latency=60.075 ms

    Cc: # v4.19+
    Tested-by: Xiao Ni
    Signed-off-by: NeilBrown
    Signed-off-by: Song Liu
    Signed-off-by: Jens Axboe

    NeilBrown
     
  • This reverts commit 5a409b4f56d50b212334f338cb8465d65550cd85.

    This patch has two problems.

    1/ It makes multiple calls to submit_bio() from inside a make_request_fn.
    The bios thus submitted will be queued on current->bio_list and not
    submitted immediately. As the bios are allocated from a mempool,
    this can theoretically result in a deadlock - the whole pool of requests
    could be in various ->bio_list queues and a subsequent mempool_alloc
    could block waiting for one of them to be released.

    2/ It aims to handle a case when there are many concurrent flush requests.
    It handles this by submitting many requests in parallel - all of which
    are identical and so most of which do nothing useful.
    It would be more efficient to just send one lower-level request, but
    allow that to satisfy multiple upper-level requests.

    Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios")
    Cc: # v4.19+
    Tested-by: Xiao Ni
    Signed-off-by: NeilBrown
    Signed-off-by: Song Liu
    Signed-off-by: Jens Axboe

    NeilBrown
     

14 Jan, 2019

1 commit


03 Jan, 2019

1 commit

  • Pull the pending 4.21 changes for md from Shaohua.

    * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
    md: fix raid10 hang issue caused by barrier
    raid10: refactor common wait code from regular read/write request
    md: remvoe redundant condition check
    lib/raid6: add option to skip algo benchmarking
    lib/raid6: sort algos in rough performance order
    lib/raid6: check for assembler SSSE3 support
    lib/raid6: avoid __attribute_const__ redefinition
    lib/raid6: add missing include for raid6test
    md: remove set but not used variable 'bi_rdev'

    Jens Axboe
     

21 Dec, 2018

2 commits

  • mempool_destroy() can handle NULL pointer correctly,
    so there is no need to check NULL pointer before calling
    mempool_destroy().

    Signed-off-by: Chengguang Xu
    Signed-off-by: Shaohua Li

    Chengguang Xu
     
  • Fixes gcc '-Wunused-but-set-variable' warning:

    drivers/md/md.c: In function 'md_integrity_add_rdev':
    drivers/md/md.c:2149:24: warning:
    variable 'bi_rdev' set but not used [-Wunused-but-set-variable]

    It is not used any more after commit
    1501efadc524 ("md/raid: only permit hot-add of compatible integrity profiles")

    Signed-off-by: Yue Haibing
    Signed-off-by: Shaohua Li

    Yue Haibing
     

10 Dec, 2018

1 commit


23 Oct, 2018

2 commits

  • flush_pool is leaked when flush bio size is zero

    Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios")
    Signed-off-by: David Jeffery
    Signed-off-by: Xiao Ni
    Signed-off-by: Shaohua Li

    Xiao Ni
     
  • I noticed that kmemleak reports a memory leak when creating and
    stopping md in a loop. Backtrace:
    [] mempool_create_node+0x86/0xd0
    [] md_run+0x1057/0x1410 [md_mod]
    [] do_md_run+0x15/0x130 [md_mod]
    [] md_ioctl+0x1f49/0x25d0 [md_mod]
    [] blkdev_ioctl+0x680/0xd00

    The root cause is that we allocate mddev->flush_pool and
    mddev->flush_bio_pool in md_run, but do_md_stop does not
    call into md_stop, only __md_stop. Moving the
    mempool_destroy calls to __md_stop fixes the problem for me.

    The bug was introduced in 5a409b4f56d5, so the fix should go to
    4.18+

    Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios")
    Signed-off-by: Jack Wang
    Reviewed-by: Xiao Ni
    Signed-off-by: Shaohua Li

    Jack Wang
     

19 Oct, 2018

4 commits

  • We need to continue the reshaping if it was interrupted on the
    original node. So the original node should call resync_bitmap
    in case reshaping is aborted.

    Then a BITMAP_NEEDS_SYNC message is broadcast to the other nodes;
    the node which continues the reshaping should restart the reshape
    from mddev->reshape_position instead of from the very beginning.

    Reviewed-by: NeilBrown
    Signed-off-by: Guoqing Jiang
    Signed-off-by: Shaohua Li

    Guoqing Jiang
     
  • remove_and_add_spares is not needed if a reshape is
    happening on another node, because raid10_add_disk,
    called inside raid10_start_reshape, would handle the
    role changes of the disk. Plus, remove_and_add_spares
    can't deal with the role change due to reshape.

    Reviewed-by: NeilBrown
    Signed-off-by: Guoqing Jiang
    Signed-off-by: Shaohua Li

    Guoqing Jiang
     
  • We need to change the capacity on all nodes after one node
    finishes a reshape. And as before, we can't change the
    capacity directly in md_do_sync; instead, the capacity should
    only be changed in update_size or on receiving a CHANGE_CAPACITY
    msg.

    So the master node calls update_size after it completes the reshape
    in md_reap_sync_thread, but we need to skip ops->update_size if
    MD_CLOSING is set, since the reshape could not have finished.

    Reviewed-by: NeilBrown
    Signed-off-by: Guoqing Jiang
    Signed-off-by: Shaohua Li

    Guoqing Jiang
     
  • For the clustered raid10 scenario, we need to let all the nodes
    know that a new disk has been added to the array. The
    reshape caused by adding the new member only needs to happen on
    one node, but the other nodes should know about the change.

    A reshape reads data from somewhere (which is already
    used by the array) and writes data to an unused region. Obviously, it
    is awful if one node is reading data from an address while another
    node is writing to the same address. Since we have already
    implemented suspending writes in the resyncing area, we can
    just broadcast the reading address to the other nodes to avoid the
    trouble.

    The master node calls reshape_request and then updates the sb
    during the reshape period. To avoid the above trouble, we call
    resync_info_update to send a RESYNCING message in reshape_request.

    Then, from the slave node's view, it receives two types of message:
    1. RESYNCING message
    The slave node adds the address (where the master node is reading
    data from) to the suspend list.

    2. METADATA_UPDATED message
    Once the slave nodes know that reshaping has started on the master
    node, it is time to update the reshape position and call start_reshape
    to follow the master node's step. After the reshape is done, only the
    reshape position needs to be updated, so the majority of the reshaping
    work happens on the master node.

    Reviewed-by: NeilBrown
    Signed-off-by: Guoqing Jiang
    Signed-off-by: Shaohua Li

    Guoqing Jiang
     

15 Oct, 2018

1 commit


04 Oct, 2018

1 commit

  • Commit 35bfc52187f6 ("md: allow metadata update while suspending.")
    added support for allowing md_check_recovery() to still perform
    metadata updates while the array is entering the 'suspended' state.
    This is needed to allow the process of entering the state to
    complete.

    Unfortunately, the patch doesn't really work. The test for
    "mddev->suspended" at the start of md_check_recovery() means that the
    function doesn't try to do anything at all while entering suspend.

    This patch moves the code of updating the metadata while suspending to
    *before* the test on mddev->suspended.
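
    The reordering can be sketched as below. This is a userspace model
    with hypothetical field and function names; counters stand in for
    the metadata write and for the rest of the recovery work.

```c
#include <stdbool.h>

struct suspend_model {
    bool suspended;        /* array is (entering) suspended */
    bool allow_sb_update;  /* metadata updates permitted while suspending */
    bool sb_dirty;         /* metadata needs writing */
    int  sb_writes;        /* counts metadata writes */
    int  other_work;       /* counts all other recovery work */
};

static void check_recovery_fixed(struct suspend_model *m)
{
    /* Metadata update comes first, so it still runs while suspending. */
    if (m->allow_sb_update && m->sb_dirty) {
        m->sb_writes++;
        m->sb_dirty = false;
    }

    /* Everything else is skipped while suspended. */
    if (m->suspended)
        return;

    m->other_work++;
}
```

    If the 'suspended' test came first, as before the fix, the metadata
    write in this model would never happen and the suspend could not
    complete.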

    Reported-by: Jeff Mahoney
    Fixes: 35bfc52187f6 ("md: allow metadata update while suspending.")
    Signed-off-by: NeilBrown
    Signed-off-by: Shaohua Li

    NeilBrown