10 Dec, 2020

1 commit

  • This reverts commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359.

    Matthew Ruffell reported data corruption in raid10 due to the changes
    in discard handling [1]. Revert these changes before we find a proper fix.

    [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/
    Cc: Matthew Ruffell
    Cc: Xiao Ni
    Signed-off-by: Song Liu

    Song Liu
     

25 Sep, 2020

1 commit

  • For the far layout, the discard region is not contiguous on the disks,
    so it needs far-copies r10bios to cover all regions, and a way to know
    whether all the r10bios have finished. As in raid10_sync_request, only
    the first r10bio's master_bio records the discard bio; the other
    r10bios' master_bio records the first r10bio. The first r10bio can
    finish after the other r10bios finish, and then return the discard bio.

    Signed-off-by: Xiao Ni
    Signed-off-by: Song Liu

    Xiao Ni
     

14 May, 2020

1 commit

  • The current codebase makes use of the zero-length array language
    extension to the C90 standard, but the preferred mechanism to declare
    variable-length types such as these is a flexible array member[1][2],
    introduced in C99:

    struct foo {
            int stuff;
            struct boo array[];
    };

    By making use of the mechanism above, we will get a compiler warning
    in case the flexible array does not occur last in the structure, which
    will help us prevent some kind of undefined behavior bugs from being
    inadvertently introduced[3] to the codebase from now on.

    Also, notice that dynamic memory allocations won't be affected by
    this change:

    "Flexible array members have incomplete type, and so the sizeof operator
    may not be applied. As a quirk of the original implementation of
    zero-length arrays, sizeof evaluates to zero."[1]
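    The allocation pattern for such a flexible-array struct can be sketched
    in plain C, using the struct names from the example above (this is an
    illustration, not kernel code):

```c
#include <stdlib.h>

struct boo {
	int v;
};

struct foo {
	int stuff;
	struct boo array[];	/* flexible array member: must be last */
};

/* Allocate a 'struct foo' with room for n trailing elements.
 * sizeof(struct foo) does not count the flexible array, so the
 * trailing storage has to be added to the allocation explicitly. */
static struct foo *foo_alloc(size_t n)
{
	return malloc(sizeof(struct foo) + n * sizeof(struct boo));
}
```

    The kernel's conversions follow the same shape; only the allocation
    size changes, never the users of the array.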

    sizeof(flexible-array-member) triggers a warning because flexible array
    members have incomplete type[1]. There are some instances of code in
    which the sizeof operator is incorrectly applied to zero-length arrays,
    and the result is zero. Such instances may be hiding some bugs. So,
    this work (flexible-array member conversions) will also help to get
    rid of those sorts of issues completely.

    This issue was found with the help of Coccinelle.

    [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
    [2] https://github.com/KSPP/linux/issues/21
    [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Song Liu

    Gustavo A. R. Silva
     

31 May, 2018

1 commit


19 Feb, 2018

1 commit

  • The rdev pointer kept in the local 'config' for each of
    raid1, raid10, and raid4/5/6 has non-obvious lifetime rules.
    Sometimes RCU is needed, sometimes a lock, sometimes nothing.

    Add documentation to explain this.

    Signed-off-by: NeilBrown
    Signed-off-by: Shaohua Li

    NeilBrown
     

15 Nov, 2017

1 commit

  • Pull MD update from Shaohua Li:
    "This update mostly includes bug fixes:

    - md-cluster now supports raid10 from Guoqing

    - raid5 PPL fixes from Artur

    - badblock regression fix from Bo

    - suspend hang related fixes from Neil

    - raid5 reshape fixes from Neil

    - raid1 freeze deadlock fix from Nate

    - memleak fixes from Zdenek

    - bitmap related fixes from Me and Tao

    - other fixes and cleanups"

    * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (33 commits)
    md: free unused memory after bitmap resize
    md: release allocated bitset sync_set
    md/bitmap: clear BITMAP_WRITE_ERROR bit before writing it to sb
    md: be cautious about using ->curr_resync_completed for ->recovery_offset
    badblocks: fix wrong return value in badblocks_set if badblocks are disabled
    md: don't check MD_SB_CHANGE_CLEAN in md_allow_write
    md-cluster: update document for raid10
    md: remove redundant variable q
    raid1: remove obsolete code in raid1_write_request
    md-cluster: Use a small window for raid10 resync
    md-cluster: Suspend writes in RAID10 if within range
    md-cluster/raid10: set "do_balance = 0" if area is resyncing
    md: use lockdep_assert_held
    raid1: prevent freeze_array/wait_all_barriers deadlock
    md: use TASK_IDLE instead of blocking signals
    md: remove special meaning of ->quiesce(.., 2)
    md: allow metadata update while suspending.
    md: use mddev_suspend/resume instead of ->quiesce()
    md: move suspend_hi/lo handling into core md code
    md: don't call bitmap_create() while array is quiesced.
    ...

    Linus Torvalds
     

02 Nov, 2017

2 commits

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side-by-side results from the output
    of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     
  • Suspending the entire device for resync could take
    too long. Resync in small chunks.

    cluster's resync window is maintained in r10conf as
    cluster_sync_low and cluster_sync_high, and processed
    in raid10's sync_request(). If the current resync is
    outside the cluster resync window:

    1. Set cluster_sync_low to curr_resync_completed.
    2. Set cluster_sync_high to cluster_sync_low + stripe size.
    3. Send a message to all nodes so they may add it to their
       suspension list.

    Note:
    We only support the "near" layout so far; resyncing a far or
    offset raid10 array could run into trouble. So raid10_run checks
    the layout of a clustered raid10 and refuses to run if the layout
    is not correct.

    With the "near" layout we process one stripe at a time
    progressing monotonically through the address space.
    So we can have a sliding window of whole-stripes which
    moves through the array suspending IO on other nodes,
    and both resync which uses array addresses and recovery
    which uses device addresses can stay within this window.
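    The window update described above can be sketched in plain C (userspace
    stand-ins: cluster_sync_low/high really live in r10conf, and the
    message-sending step 3 is only indicated by the return value here):

```c
#include <stdbool.h>

typedef unsigned long long sector_t;

/* Hypothetical container for the window; the real fields are
 * cluster_sync_low/cluster_sync_high in r10conf. */
struct sync_window {
	sector_t low;
	sector_t high;
};

/* Advance the window when the current resync point has moved outside
 * it. Returns true when the other nodes must be messaged (step 3).
 * window_size stands in for the stripe-sized step. */
static bool update_sync_window(struct sync_window *w,
			       sector_t curr_resync_completed,
			       sector_t window_size)
{
	if (curr_resync_completed >= w->low &&
	    curr_resync_completed < w->high)
		return false;			/* still inside the window */
	w->low = curr_resync_completed;		/* step 1 */
	w->high = w->low + window_size;		/* step 2 */
	return true;				/* step 3: tell the nodes */
}
```

    With the "near" layout the resync address only moves forward, so this
    window slides monotonically through the array.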

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Shaohua Li

    Guoqing Jiang
     

12 Apr, 2017

1 commit

  • raid10 splits requests in two different ways for two different
    reasons.

    First, bio_split() is used to ensure the bio fits within a chunk.
    Second, multiple r10bio structures are allocated to represent the
    different sections that need to go to different devices, to avoid
    known bad blocks.

    This can be simplified to just use bio_split() once, and not to use
    multiple r10bios.
    We delay the split until we know a maximum bio size that can
    be handled with a single r10bio, and then split the bio and queue
    the remainder for later handling.

    As with raid1, we allocate a new bio_set to help with the splitting.
    It is not correct to use fs_bio_set in a device driver.

    Signed-off-by: NeilBrown
    Signed-off-by: Shaohua Li

    NeilBrown
     

23 Nov, 2016

1 commit

  • If a device is marked FailFast, and it is not the only
    device we can read from, we mark the bio as MD_FAILFAST.

    If this does fail-fast, we don't try read repair but just
    allow failure.

    If it was the last device, it doesn't get marked Faulty so
    the retry happens on the same device - this time without
    FAILFAST. A subsequent failure will not retry but will just
    pass up the error.

    During resync we may use FAILFAST requests, and on a failure
    we will simply use the other device(s).

    During recovery we will only use FAILFAST in the unusual
    case where there are multiple places to read from - i.e. if
    there are > 2 devices. If we get a failure we will fail the
    device and complete the resync/recovery with remaining
    devices.

    Signed-off-by: NeilBrown
    Signed-off-by: Shaohua Li

    NeilBrown
     

20 Jul, 2016

1 commit

  • RAID10 random read performance is lower than expected due to excessive
    spinlock utilisation, which is required mostly for rebuild/resync.
    Simplify allow_barrier, as it is in the IO path and encounters a lot of
    unnecessary contention.

    As lower_barrier just takes a lock in order to decrement a counter,
    convert the counter (nr_pending) into an atomic variable and remove the
    spinlock. There is also contention in wake_up (it takes a lock
    internally), so call it only when it's really needed. As wake_up is not
    called constantly anymore, ensure a process waiting to raise a barrier
    is notified when there are no more waiting IOs.
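    A userspace sketch of the idea, with C11 atomics standing in for the
    kernel's atomic_t (field names and the waiter flag are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-ins for the r10conf fields involved. */
struct conf {
	atomic_int nr_pending;		/* was a plain int under a spinlock */
	atomic_bool barrier_waiter;	/* someone waiting to raise a barrier */
};

/* Lock-free version of the lower_barrier bookkeeping: decrement the
 * pending count and only signal the barrier waiter when the count
 * reaches zero and a waiter actually exists. Returns true when a
 * wake_up() would be issued. */
static bool dec_pending(struct conf *c)
{
	int remaining = atomic_fetch_sub(&c->nr_pending, 1) - 1;

	if (remaining == 0 && atomic_load(&c->barrier_waiter))
		return true;	/* wake_up() only when really needed */
	return false;
}
```

    The point of the second condition is exactly the one made above:
    wake_up takes a lock internally, so it should fire only when a
    waiter can actually make progress.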

    Signed-off-by: Tomasz Majchrzak
    Signed-off-by: Shaohua Li

    Tomasz Majchrzak
     

01 Sep, 2015

1 commit

  • When a write to one of the legs of a RAID10 fails, the failure is
    recorded in the metadata of the other legs so that after a restart
    the data on the failed drive won't be trusted even if that drive seems
    to be working again (maybe a cable was unplugged).

    Currently there is no interlock between the write request completing
    and the metadata update. So it is possible that the write will
    complete, the app will confirm success in some way, and then the
    machine will crash before the metadata update completes.

    This is an extremely small hole for a race to fit in, but it is
    theoretically possible and so should be closed.

    So:
    - set MD_CHANGE_PENDING when requesting a metadata update for a
    failed device, so we can know with certainty when it completes
    - queue requests that experienced an error on a new queue which
    is only processed after the metadata update completes
    - call raid_end_bio_io() on bios in that queue when the time comes.

    Signed-off-by: NeilBrown

    NeilBrown
     

04 Feb, 2015

1 commit

  • There is currently no locking around calls to the 'congested'
    bdi function. If called at an awkward time while an array is
    being converted from one level (or personality) to another, there
    is a tiny chance of running code in an unreferenced module etc.

    So add a 'congested' function to the md_personality operations
    structure, and call it with appropriate locking from a central
    'mddev_congested'.

    When the array personality is changing the array will be 'suspended'
    so no IO is processed.
    If mddev_congested detects this, it simply reports that the
    array is congested, which is a safe guess.
    As mddev_suspend calls synchronize_rcu(), mddev_congested can
    avoid races by including the whole call inside an rcu_read_lock()
    region.
    This requires that the congested functions for all subordinate devices
    can be run under rcu_read_lock. Fortunately this is the case.

    Signed-off-by: NeilBrown

    NeilBrown
     

26 Feb, 2013

1 commit

  • The MD RAID10 'far' and 'offset' algorithms make copies of entire stripe
    widths - copying them to a different location on the same devices after
    shifting the stripe. An example layout of each follows below:

    "far" algorithm
    dev1 dev2 dev3 dev4 dev5 dev6
    ==== ==== ==== ==== ==== ====
    A B C D E F
    G H I J K L
    ...
    F A B C D E --> Copy of stripe0, but shifted by 1
    L G H I J K
    ...

    "offset" algorithm
    dev1 dev2 dev3 dev4 dev5 dev6
    ==== ==== ==== ==== ==== ====
    A B C D E F
    F A B C D E --> Copy of stripe0, but shifted by 1
    G H I J K L
    L G H I J K
    ...

    Redundancy for these algorithms is gained by shifting the copied stripes
    one device to the right. This patch proposes that the array be divided into
    sets of adjacent devices and when the stripe copies are shifted, they wrap
    on set boundaries rather than the array size boundary. That is, for the
    purposes of shifting, the copies are confined to their sets within the
    array. The sets are 'near_copies * far_copies' in size.

    The above "far" algorithm example would change to:
    "far" algorithm
    dev1 dev2 dev3 dev4 dev5 dev6
    ==== ==== ==== ==== ==== ====
    A B C D E F
    G H I J K L
    ...
    B A D C F E --> Copy of stripe0, shifted 1, 2-dev sets
    H G J I L K Dev sets are 1-2, 3-4, 5-6
    ...

    This has the effect of improving the redundancy of the array. We can
    always sustain at least one failure, but sometimes more than one can
    be handled. In the first examples, the pairs of devices that CANNOT fail
    together are:
    (1,2) (2,3) (3,4) (4,5) (5,6) (1, 6) [40% of possible pairs]
    In the example where the copies are confined to sets, the pairs of
    devices that cannot fail together are:
    (1,2) (3,4) (5,6) [20% of possible pairs]

    We cannot simply replace the old algorithms, so the 17th bit of the 'layout'
    variable is used to indicate whether we use the old or new method of computing
    the shift. (This is similar to the way the 16th bit indicates whether the
    "far" algorithm or the "offset" algorithm is being used.)

    This patch only handles the cases where the number of total raid disks is
    a multiple of 'far_copies'. A follow-on patch addresses the condition where
    this is not true.
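    The set-confined shift can be sketched as a small mapping function
    (0-based device indices; this is an illustration, not the kernel's
    actual layout code):

```c
/* Return the device holding the shifted copy of the chunk that
 * originally lives on device 'dev'. Copies are confined to sets of
 * 'set_size' adjacent devices (set_size = near_copies * far_copies).
 * The old behaviour is recovered by passing the total disk count
 * as set_size, so the shift wraps on the array boundary. */
static int copy_device(int dev, int shift, int set_size)
{
	int set_base = (dev / set_size) * set_size;

	return set_base + (dev % set_size + shift) % set_size;
}
```

    For the 6-device example above with 2-device sets, chunk A on dev1
    is copied to dev2 and vice versa, matching the "B A D C F E" row;
    with one 6-device set, A lands on dev2 and F wraps to dev1,
    matching the original "F A B C D E" row.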

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     

18 Aug, 2012

1 commit

  • A 'struct r10bio' has an array of per-copy information at the end.
    This array is declared with size [0] and r10bio_pool_alloc allocates
    enough extra space to store the per-copy information depending on the
    number of copies needed.

    So declaring a 'struct r10bio' on the stack isn't going to work. It
    won't allocate enough space, and memory corruption will ensue.

    So in the two places where this is done, declare a sufficiently large
    structure and use that instead.

    The two call-sites of this bug were introduced in 3.4 and 3.5
    so this is suitable for both those kernels. The patch will have to
    be modified for 3.4 as it only has one bug.
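    The fix pattern can be sketched like this (the field names and the
    copy bound are illustrative, not the kernel's):

```c
struct md_rdev;			/* opaque for this sketch */

struct r10bio {
	/* ...bookkeeping fields elided... */
	int sectors;
	struct {
		struct md_rdev *rdev;
		long long addr;
	} devs[];		/* sized by r10bio_pool_alloc at run time */
};

#define RAID10_MAX_COPIES 8	/* illustrative upper bound on copies */

/* A bare 'struct r10bio' on the stack has no room for devs[], so
 * wrap it in a structure that reserves space for the copies. */
struct on_stack_r10bio {
	struct r10bio r10_bio;
	char pad[RAID10_MAX_COPIES *
		 sizeof(((struct r10bio *)0)->devs[0])];
};
```

    Code then declares a 'struct on_stack_r10bio' and uses its embedded
    r10_bio member, which now has valid storage behind the flexible array.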

    Cc: stable@vger.kernel.org
    Reported-by: Ivan Vasilyev
    Tested-by: Ivan Vasilyev
    Signed-off-by: NeilBrown

    NeilBrown
     

31 Jul, 2012

3 commits

  • md/raid10: Export is_congested test.

    In similar fashion to commits
    11d8a6e3719519fbc0e2c9d61b6fa931b84bf813
    1ed7242e591af7e233234d483f12d33818b189d9
    we export the RAID10 congestion checking function so that dm-raid.c can
    make use of it and make use of the personality. The 'queue' and 'gendisk'
    structures will not be available to the MD code when device-mapper sets
    up the device, so we conditionalize access to these fields also.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • MD RAID1/RAID10: Move some macros from .h file to .c file

    There are three macros (IO_BLOCKED, IO_MADE_GOOD, BIO_SPECIAL) which are
    defined in both raid1.h and raid10.h. They are only used in their
    respective .c files.
    However, if we wish to make RAID10 accessible to the device-mapper RAID
    target (dm-raid.c), then we need to move these macros into the .c files where
    they are used so that they do not conflict with each other.

    The macros from the two files are identical and could be moved into md.h, but
    I chose to leave the duplication and have them remain in the personality
    files.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • MD RAID10: Rename the structure 'mirror_info' to 'raid10_info'

    The same structure name ('mirror_info') is used by raid1. Each of these
    structures is defined in its respective header file. If dm-raid is
    to support both RAID1 and RAID10, the header files will be included and
    the structure names must not collide.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     

22 May, 2012

1 commit

  • A 'near' or 'offset' layout RAID10 array can be reshaped to a different
    'near' or 'offset' layout, a different chunk size, and a different
    number of devices.
    However the number of copies cannot change.

    Unlike RAID5/6, we do not support having user-space backup data that
    is being relocated during a 'critical section'. Rather, the
    data_offset of each device must change so that when writing any block
    to a new location, it will not over-write any data that is still
    'live'.

    This means that RAID10 reshape is not supportable on v0.90 metadata.

    The difference between the old data_offset and the new_offset must be
    at least the larger of the chunksize multiplied by offset copies of
    each of the old and new layout. (For 'near' mode, offset_copies == 1.)
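    That minimum gap can be written as a small helper (a sketch; the
    names and the sectors unit are assumptions, not the kernel's code):

```c
/* Minimum required difference between the old data_offset and the
 * new one: the larger of chunk size * offset-copies for the old and
 * the new layout. offset_copies == 1 for the 'near' layout. All
 * sizes are in sectors for this sketch. */
static unsigned long long min_offset_diff(unsigned long long old_chunk,
					  int old_offset_copies,
					  unsigned long long new_chunk,
					  int new_offset_copies)
{
	unsigned long long a = old_chunk * old_offset_copies;
	unsigned long long b = new_chunk * new_offset_copies;

	return a > b ? a : b;
}
```

    Any actual gap at least this large guarantees a new-layout write can
    never land on still-live old-layout data.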

    A larger difference of around 64M seems useful for in-place reshapes
    as more data can be moved between metadata updates.
    Very large differences (e.g. 512M) seem to slow the process down due
    to lots of long seeks (on oldish consumer-grade devices at least).

    Metadata needs to be updated whenever the place we are about to write
    to is considered - by the current metadata - to still contain data in
    the old layout.

    [unbalanced locking fix from Dan Carpenter]

    Signed-off-by: NeilBrown

    NeilBrown
     

21 May, 2012

2 commits

  • When RAID10 supports reshape it will need a 'previous' and a 'current'
    geometry, so introduce that here.
    Use the 'prev' geometry when before the reshape_position, and the
    current 'geo' when beyond it. At other times, use both as
    appropriate.

    For now, both are identical (and reshape_position is never set).

    When we use the 'prev' geometry, we must use the old data_offset.
    When we use the current (and a reshape is happening) we must use
    the new_data_offset.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • We will shortly be adding reshape support for RAID10 which will
    require it having 2 concurrent geometries (before and after).
    To make that easier, collect most geometry fields into 'struct geom'
    and access them from there. Then we will more easily be able to add
    a second set of fields.

    Note that 'copies' is not in this struct and so cannot be changed.
    There is little need to change this number and doing so is a lot
    more difficult as it requires reallocating more things.
    So leave it out for now.

    Signed-off-by: NeilBrown

    NeilBrown
     

23 Dec, 2011

1 commit


11 Oct, 2011

7 commits


28 Jul, 2011

3 commits

  • When we get a write error (in the data area, not in metadata),
    update the badblock log rather than failing the whole device.

    As the write may well span many blocks, we try writing each
    block individually and only log the ones which fail.
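    The block-by-block retry can be sketched as follows (a userspace
    stand-in: per-sector write results are passed in as an array instead
    of issuing real IO, and a "block" is one sector for brevity):

```c
#include <stdbool.h>

/* Instead of failing the whole device on a multi-block write error,
 * retry the range one sector at a time; ok[i] reports whether the
 * individual write of sector start+i succeeded. Sectors that still
 * fail are appended to badlog. Returns how many were logged. */
static int narrow_write_error(unsigned long long start, int sectors,
			      const bool ok[],
			      unsigned long long badlog[])
{
	int bad = 0;

	for (int i = 0; i < sectors; i++)
		if (!ok[i])
			badlog[bad++] = start + i;
	return bad;
}
```

    Only the sectors recorded in the log are treated as bad afterwards;
    the rest of the device stays in service.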

    Signed-off-by: NeilBrown

    NeilBrown
     
  • If we succeed in writing to a block that was recorded as
    being bad, we clear the bad-block record.

    This requires some delayed handling as the bad-block-list update has
    to happen in process-context.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • This patch just covers the basic read path:
    1/ read_balance needs to check for badblocks, and return not only
       the chosen slot, but also how many good blocks are available
       there.
    2/ read submission must be ready to issue multiple reads to
       different devices, as different bad blocks on different devices
       could mean that a single large read cannot be served by any one
       device, but can still be served by the array.
       This requires keeping count of the number of outstanding requests
       per bio. This count is stored in 'bi_phys_segments'.

    On read error we currently just fail the request if another target
    cannot handle the whole request. Next patch refines that a bit.
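    The per-bio accounting described in 2/ can be sketched as a simple
    counter (illustrative names; the kernel stores the count in
    bi_phys_segments):

```c
#include <stdbool.h>

/* One parent bio split into several sub-reads across devices. */
struct parent_bio {
	int remaining;	/* outstanding sub-requests */
	bool done;	/* set when the last sub-read completes */
};

static void submit_subread(struct parent_bio *p)
{
	p->remaining++;
}

static void end_subread(struct parent_bio *p)
{
	if (--p->remaining == 0)
		p->done = true;	/* all pieces back: end the parent bio */
}
```

    The parent completes exactly once, after the last of however many
    sub-reads the bad-block layout forced the code to issue.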

    Signed-off-by: NeilBrown

    NeilBrown
     

27 Jul, 2011

1 commit

  • When we get a read error during recovery, RAID10 previously
    arranged for the recovering device to appear to fail so that
    the recovery stops and doesn't restart. This is misleading and wrong.

    Instead, make use of the new recovery_disabled handling and mark
    the target device as having recovery disabled.

    Add appropriate checks in add_disk and remove_disk so that devices
    are removed and not re-added when recovery is disabled.

    Signed-off-by: NeilBrown

    NeilBrown
     

31 Mar, 2011

1 commit


24 Jun, 2010

1 commit

  • Most array level changes leave the list of devices largely unchanged,
    possibly causing one at the end to become redundant.
    However conversions between RAID0 and RAID10 need to renumber
    all devices (except 0).

    This renumbering is currently being done in the ->run method when the
    new personality takes over. However this is too late as the common
    code in md.c might already have invalidated some of the devices if
    they had a ->raid_disk number that appeared too high.

    Moving it into the ->takeover method is too early as the array is
    still active at that time and wrong ->raid_disk numbers could cause
    confusion.

    So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate
    the new raid_disk number.
    Now the common code knows exactly which devices need to be renumbered,
    and which can be invalidated, and can do it all at a convenient time
    when the array is suspended.
    It can also update some symlinks in sysfs which previously were not
    being updated correctly.

    Reported-by: Maciej Trela
    Signed-off-by: NeilBrown

    NeilBrown
     

18 May, 2010

1 commit


16 Jun, 2009

1 commit

  • Having a macro just to cast a void* isn't really helpful.
    I would much rather see that we are simply dereferencing ->private
    than have to know what the macro does.

    So open code the macro everywhere and remove the pointless cast.

    Signed-off-by: NeilBrown

    NeilBrown
     

31 Mar, 2009

2 commits