22 Apr, 2015

1 commit

  • v3: s-o-b comment, explanation of performance and decision for
    the start/stop implementation

    Implementing rmw functionality for RAID6 requires optimized syndrome
    calculation. Up to now we can only generate a complete syndrome. The
    target P/Q pages are always overwritten. With this patch we provide
    a framework for in-place P/Q modification. Initially those function
    pointers are simply filled with NULL values.

    xor_syndrome() has two additional parameters: start & stop. These
    will indicate the first and last page that are changing during a
    rmw run. That makes it possible to avoid several unnecessary loops
    and speed up calculation. The caller needs to implement the following
    logic to make the functions work.

    1) xor_syndrome(disks, start, stop, ...): "Remove" all data of source
    blocks inside P/Q between (and including) start and stop.

    2) Modify any data block with start <= block <= stop.

    3) xor_syndrome(disks, start, stop, ...): "Reinsert" all data of source
    blocks into P/Q between (and including) start and stop. (A sketch of
    this sequence follows below.)
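
    As an illustration, here is a minimal sketch of that caller-side
    sequence, assuming the usual gen_syndrome() page-array layout (data
    blocks followed by P and Q); update_data_blocks() is a hypothetical
    helper standing in for step 2.

        #include <linux/raid/pq.h>

        /* Hedged sketch of the rmw sequence above.  'calls' stands in for
         * the selected raid6 algorithm (e.g. raid6_call). */
        static void rmw_update(const struct raid6_calls *calls, int disks,
                               int start, int stop, size_t bytes, void **ptrs)
        {
                /* 1) "Remove" the old contents of blocks start..stop from P/Q. */
                calls->xor_syndrome(disks, start, stop, bytes, ptrs);

                /* 2) Modify any data block with start <= block <= stop. */
                update_data_blocks(ptrs, start, stop, bytes);   /* hypothetical */

                /* 3) "Reinsert" the new contents of start..stop into P/Q. */
                calls->xor_syndrome(disks, start, stop, bytes, ptrs);
        }
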
    Signed-off-by: NeilBrown

    Markus Stockhausen
     

11 Sep, 2013

1 commit

  • Pull md update from Neil Brown:
    "Headline item is multithreading for RAID5 so that more IO/sec can be
    supported on fast (SSD) devices. Also TILE-Gx SIMD support for RAID6
    calculations and an assortment of bug fixes"

    * tag 'md/3.12' of git://neil.brown.name/md:
    raid5: only wakeup necessary threads
    md/raid5: flush out all pending requests before proceeding with reshape.
    md/raid5: use seqcount to protect access to shape in make_request.
    raid5: sysfs entry to control worker thread number
    raid5: offload stripe handle to workqueue
    raid5: fix stripe release order
    raid5: make release_stripe lockless
    md: avoid deadlock when dirty buffers during md_stop.
    md: Don't test all of mddev->flags at once.
    md: Fix apparent cut-and-paste error in super_90_validate
    raid6/test: replace echo -e with printf
    RAID: add tilegx SIMD implementation of raid6
    md: fix safe_mode buglet.
    md: don't call md_allow_write in get_bitmap_file.

    Linus Torvalds
     

27 Aug, 2013

1 commit

  • This change adds TILE-Gx SIMD instructions to the software raid
    (md), modeling the Altivec implementation. This is only for Syndrome
    generation; there is more that could be done to improve recovery,
    as in the recent Intel SSE3 recovery implementation.

    The code unrolls 8 times; this turns out to be the best unroll factor
    on tilegx hardware among 1, 2, 4, 8 and 16. The code reads one
    cache-line of data from each disk, stores P and Q then goes to the
    next cache-line.

    The test code in sys/linux/lib/raid6/test reports 2008 MB/s data
    read rate for syndrome generation using 18 disks (16 data and 2
    parity). It was 1512 MB/s before these SIMD optimizations. This is
    running on 1 core with all the data in cache.

    This is based on the paper The Mathematics of RAID-6.
    (http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf).
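
    For reference, the operation being accelerated is the P/Q computation
    from that paper: P is the XOR of the data blocks and Q is a GF(2^8)
    polynomial sum. A minimal, unoptimized C sketch (not the kernel code;
    the multiply-by-2 step is written out instead of using lookup tables):

        #include <stddef.h>
        #include <stdint.h>

        /* Multiply a GF(2^8) element by 2, modulo x^8+x^4+x^3+x^2+1. */
        static uint8_t gf_mul2(uint8_t v)
        {
                return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
        }

        /* ptrs[0..disks-3] are data blocks, ptrs[disks-2] is P,
         * ptrs[disks-1] is Q. */
        static void gen_syndrome_ref(int disks, size_t bytes, uint8_t **ptrs)
        {
                uint8_t *p = ptrs[disks - 2], *q = ptrs[disks - 1];

                for (size_t i = 0; i < bytes; i++) {
                        uint8_t wp = 0, wq = 0;

                        /* Horner evaluation of Q = sum over d of 2^d * D_d. */
                        for (int d = disks - 3; d >= 0; d--) {
                                wq = gf_mul2(wq) ^ ptrs[d][i];
                                wp ^= ptrs[d][i];
                        }
                        p[i] = wp;
                        q[i] = wq;
                }
        }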

    Signed-off-by: Ken Steele
    Signed-off-by: Chris Metcalf
    Signed-off-by: NeilBrown

    Ken Steele
     

09 Jul, 2013

1 commit

  • Rebased/reworked a patch contributed by Rob Herring that uses
    NEON intrinsics to perform the RAID-6 syndrome calculations.
    It uses the existing unroll.awk code to generate several
    unrolled versions of which the best performing one is selected
    at boot time.
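
    The boot-time selection works by benchmarking each compiled-in variant
    and keeping the fastest. A simplified sketch of that idea (the real
    code lives in lib/raid6 and also handles timing windows and validity
    checks; time_one() here is a hypothetical benchmark helper):

        /* Pick the fastest gen_syndrome() implementation from a
         * NULL-terminated candidate list by timing each one over the
         * same buffers. */
        static const struct raid6_calls *
        pick_fastest(const struct raid6_calls **candidates,
                     int disks, size_t bytes, void **ptrs)
        {
                const struct raid6_calls *best = NULL;
                unsigned long best_perf = 0;

                for (int i = 0; candidates[i]; i++) {
                        unsigned long perf =
                                time_one(candidates[i], disks, bytes, ptrs);

                        if (!best || perf > best_perf) {
                                best = candidates[i];
                                best_perf = perf;
                        }
                }
                return best;
        }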

    Signed-off-by: Ard Biesheuvel
    Acked-by: Nicolas Pitre
    Cc: hpa@linux.intel.com

    Ard Biesheuvel
     

03 Jan, 2013

1 commit


13 Dec, 2012

2 commits


09 Oct, 2012

1 commit


22 May, 2012

1 commit


21 May, 2012

2 commits

  • When reshaping we can avoid costly intermediate backup by
    changing the 'start' address of the array on the device
    (if there is enough room).

    So as a first step, allow such a change to be requested
    through sysfs, and recorded in v1.x metadata.

    (As we didn't previously check that all 'pad' fields were zero,
    we need a new FEATURE flag for this. We also belatedly check that
    all remaining 'pad' fields are zero, to avoid a repeat of this.)

    The new data offset must be requested separately for each device.
    This allows each to have a different change in the data offset.
    This is not likely to be used often but as data_offset can be
    set per-device, new_data_offset should be too.

    This patch also removes the 'acknowledged' arg to rdev_set_badblocks as
    it is never used and never will be. At the same time we add a new
    arg ('in_new') which is currently always zero but will be used more
    soon.

    When a reshape finishes we will need to update the data_offset
    and rdev->sectors. So provide an exported function to do that.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Currently a reshape operation always progresses from the start
    of the array to the end unless the number of devices is being
    reduced, in which case it progressed in the opposite direction.

    To reverse a partial reshape which changes the number of devices
    you can stop the array and re-assemble with the raid-disks numbers
    reversed and it will undo.

    However for a reshape that does not change the number of devices
    it is not possible to reverse the reshape in the middle - you have to
    wait until it completes.

    So add a 'reshape_direction' attribute which is either 'forwards' or
    'backwards' and can be explicitly set when delta_disks is zero.

    This will become more important when we allow the data_offset to
    change in a reshape. Then the explicit statement of what direction is
    being used will be more useful.

    This can be enabled in raid5 trivially as it already supports
    reverse reshape and just needs to use a different trigger to request it.

    Signed-off-by: NeilBrown

    NeilBrown
     

13 Mar, 2012

1 commit


23 Dec, 2011

2 commits

  • hot-replace is a feature being added to md which will allow a
    device to be replaced without removing it from the array first.

    With hot-replace a spare can be activated and recovery can start while
    the original device is still in place, thus allowing a transition from
    an unreliable device to a reliable device without leaving the array
    degraded during the transition. It can also be used when the original
    device is still reliable but is not wanted for some reason.

    This will eventually be supported in RAID4/5/6 and RAID10.

    This patch adds a super-block flag to distinguish the replacement
    device. If an old kernel sees this flag it will reject the device.

    It also adds two per-device flags which are viewable and settable via
    sysfs.
    "want_replacement" can be set to request that a device be replaced.
    "replacement" is set to show that this device is replacing another
    device.

    The "rd%d" links in /sys/block/mdXx/md only apply to the original
    device, not the replacement. We currently don't make links for the
    replacement - there doesn't seem to be a need.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • While using etags to find free_pages(), I stumbled across this debug
    definition of free_pages() that is to be used while debugging some raid
    code in userspace. The __get_free_pages() allocates the correct size,
    but the free_pages() does not match. free_pages(), like
    __get_free_pages(), takes an order and not a size.
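
    For context, both helpers take an order (log2 of the number of pages),
    so an allocation and its matching free must use the same order. A
    minimal kernel-style sketch of the correct pairing, assuming an
    order-1 (two page) buffer:

        #include <linux/errno.h>
        #include <linux/gfp.h>

        static int alloc_demo(void)
        {
                /* order 1 == 2^1 pages */
                unsigned long buf = __get_free_pages(GFP_KERNEL, 1);

                if (!buf)
                        return -ENOMEM;
                /* ... use the two pages ... */
                free_pages(buf, 1);     /* same order, not a byte size */
                return 0;
        }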

    Acked-by: H. Peter Anvin
    Signed-off-by: Steven Rostedt
    Signed-off-by: NeilBrown

    Steven Rostedt
     

28 Jul, 2011

1 commit

  • Space for the bad block list must have been allocated when the array
    was created.
    A feature flag is set when the badblock list is non-empty, to
    ensure old kernels don't load and trust the whole device.

    We only update the on-disk badblocklist when it has changed.
    If the badblocklist (or other metadata) is stored on a bad block, we
    don't cope very well.

    If the metadata has no room for a bad block list, flag bad blocks as
    disabled, and do the same for 0.90 metadata.

    Signed-off-by: NeilBrown

    NeilBrown
     

31 Mar, 2011

1 commit


12 Aug, 2010

1 commit


14 Dec, 2009

1 commit


18 Jun, 2009

1 commit


31 Mar, 2009

7 commits

  • Move the raid6 data processing routines into a standalone module
    (raid6_pq) to prepare them to be called from async_tx wrappers and other
    non-md drivers/modules. This precludes a circular dependency of raid456
    needing the async modules for data processing while those modules in
    turn depend on raid456 for the base level synchronous raid6 routines.

    To support this move:
    1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h
    2/ The raid6_call, recovery calls, and table symbols are exported
    3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to
    compile
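
    With this split in place, a non-md module can call the exported
    routines through the new header. A minimal illustrative sketch (the
    wrapper name is made up; raid6_call is the algorithm selected when
    raid6_pq initialises):

        #include <linux/raid/pq.h>

        /* Hypothetical caller in a non-md module. */
        static void my_gen_syndrome(int disks, size_t bytes, void **ptrs)
        {
                raid6_call.gen_syndrome(disks, bytes, ptrs);
        }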

    Signed-off-by: Dan Williams
    Signed-off-by: NeilBrown

    Dan Williams
     
  • It really is nicer to keep related code together.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • This makes the includes more explicit, and is preparation for moving
    md_k.h to drivers/md/md.h

    Remove include/raid/md.h as its only remaining use was to #include
    other files.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • The extern function definitions are kernel-internal definitions, so
    they belong in md_k.h

    The MD_*_VERSION values could reasonably go in a number of places,
    but md_u.h seems most reasonable.

    This leaves almost nothing in md.h. It will go soon.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Move these definitions to md_u.h, as they are part of the user-space
    interface.
    Also move MdpMinorShift in there so we can remove duplication.

    Lastly move mdp_major in. It is less obviously part of the user-space
    interface, but do_mounts_md.c uses it, and it is acting a bit like
    user-space.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Move the headers with the local structures for the disciplines and
    bitmap.h into drivers/md/ so that they are more easily grepable for
    hacking and not far away. md.h is left where it is for now as there
    are some uses from the outside.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: NeilBrown

    Christoph Hellwig
     
  • There are two problems with is_mddev_idle.

    1/ sync_io is 'atomic_t' and hence 'int'. curr_events and all the
    rest are 'long'.
    So if sync_io were to wrap on a 64bit host, the value of
    curr_events would go very negative suddenly, and take a very
    long time to return to positive.

    So do all calculations as 'int'. That gives us plenty of precision
    for what we need (a small illustration follows after 2/ below).

    2/ To initialise rdev->last_events we simply call is_mddev_idle, on
    the assumption that it will make sure that last_events is in a
    suitable range. It used to do this, but now it does not.
    So now we need to be more explicit about initialisation.
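
    A small user-space illustration of point 1/ (values are made up):
    once the 32-bit counter wraps, a difference computed against a 'long'
    jumps by about 2^32, while a difference computed with 32-bit
    wraparound stays small.

        #include <stdio.h>

        int main(void)
        {
                /* sync_io is an atomic_t, i.e. a 32-bit value that wraps. */
                unsigned int sync_io = 0x7fffff00u;     /* just below INT_MAX */
                long last_events = (int)sync_io;        /* cached as 'long' */

                sync_io += 0x200;                       /* wraps past INT_MAX */

                /* Mixed 'long' arithmetic: difference is about -2^32. */
                printf("as long: %ld\n", (long)(int)sync_io - last_events);

                /* 32-bit wraparound arithmetic: difference is just 512. */
                printf("as int:  %d\n",
                       (int)(sync_io - (unsigned int)last_events));
                return 0;
        }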

    Signed-off-by: NeilBrown

    NeilBrown
     

31 Jan, 2009

1 commit


09 Jan, 2009

10 commits

  • If a raid1 has only one working drive and it has a sector which
    gives an error on read, then an attempt to recover onto a spare will
    fail, but as the single remaining drive is not removed from the
    array, the recovery will be immediately re-attempted, resulting
    in an infinite recovery loop.

    So detect this situation and don't retry recovery once an error
    on the lone remaining drive is detected.

    Allow recovery to be retried once every time a spare is added
    in case the problem wasn't actually a media error.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Using sequential numbers to identify md devices is somewhat artificial.
    Using names can be a lot more user-friendly.

    Also, creating md devices by opening the device special file is a bit
    awkward.

    So this patch provides a new option for creating and naming devices.

    Writing a name such as "md_home" to
    /sys/modules/md_mod/parameters/new_array
    will cause an array with that name to be created. It will appear in
    /sys/block/, /proc/partitions and /proc/mdstat as 'md_home'.
    It will have an arbitrary minor number allocated.

    md devices that are created by an open are destroyed on the last
    close when the device is inactive.
    Named md devices, by contrast, will not be destroyed until the array
    is explicitly stopped, either with the STOP_ARRAY ioctl or by
    writing 'clear' to /sys/block/md_XXXX/md/array_state.

    The name of the array must start with 'md_' to avoid conflict with
    other devices.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Currently md devices, once created, never disappear until the module
    is unloaded. This is essentially because the gendisk holds a
    reference to the mddev, and the mddev holds a reference to the
    gendisk, creating a circular reference.

    If we drop the reference from mddev to gendisk, then we need to ensure
    that the mddev is destroyed when the gendisk is destroyed. However it
    is not possible to hook into the gendisk destruction process to enable
    this.

    So we drop the reference from the gendisk to the mddev and destroy the
    gendisk when the mddev gets destroyed. However this has a
    complication.
    Between the call
    __blkdev_get->get_gendisk->kobj_lookup->md_probe
    and the call
    __blkdev_get->md_open

    there is no obvious way to hold a reference on the mddev any more, so
    unless something is done, it will disappear and gendisk will be
    destroyed prematurely.

    Also, once we decide to destroy the mddev, there will be an unlockable
    moment before the gendisk is unlinked (blk_unregister_region) during
    which a new reference to the gendisk can be created. We need to
    ensure that this reference can not be used. i.e. the ->open must
    fail.

    So:
    1/ in md_probe we set a flag in the mddev (hold_active) which
    indicates that the array should be treated as active, even
    though there are no references, and no appearance of activity.
    This is cleared by md_release when the device is closed if it
    is no longer needed.
    This ensures that the gendisk will survive between md_probe and
    md_open.

    2/ In md_open we check if the mddev we expect to open matches
    the gendisk that we did open.
    If there is a mismatch we return -ERESTARTSYS and modify
    __blkdev_get to retry from the top in that case.
    In the -ERESTARTSYS case we make sure to wait until
    the old gendisk (that we succeeded in opening) is really gone so
    we loop at most once.

    Some udev configurations will always open an md device when it first
    appears. If we allow an md device that was just created by an open
    to disappear on an immediate close, then this can race with such udev
    configurations and result in an infinite loop: the device is opened
    and closed, then re-opened due to the 'ADD' event from the first open,
    then closed again, and so on.
    So we make sure an md device, once created by an open, remains active
    at least until some md 'ioctl' has been made on it. This means that
    all normal usage of md devices will allow them to disappear promptly
    when not needed, but the worst that an incorrect usage will do is
    cause an inactive md device to be left in existence (it can easily be
    removed).

    As an array can be stopped by writing to a sysfs attribute
    echo clear > /sys/block/mdXXX/md/array_state
    we need to use scheduled work for deleting the gendisk and other
    kobjects. This allows us to wait for any pending gendisk deletion to
    complete by simply calling flush_scheduled_work().

    Signed-off-by: NeilBrown

    NeilBrown
     
  • md_print_devices is called in two code paths: MD_BUG(...), and md_ioctl
    with PRINT_RAID_DEBUG. It dumps out information about all in-use md
    devices.

    However, it wrongly processed the two types of superblock as one:

    The header file defines two types of superblock,
    struct mdp_superblock_s (typedef'd as mdp_super_t) for md with
    metadata 0.90, and struct mdp_superblock_1 for md with metadata
    1.0 and later.

    These two types of superblock are very different.

    The md_print_devices code processed them both as mdp_super_t, which
    would lead to wrong information dumps like:

    [ 6742.345877]
    [ 6742.345887] md: **********************************
    [ 6742.345890] md: * *
    [ 6742.345892] md: **********************************
    [ 6742.345896] md1:
    [ 6742.345907] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3
    [ 6742.345909] md: rdev superblock:
    [ 6742.345914] md: SB: (V:0.90.0) ID: CT:4919856d
    [ 6742.345918] md: L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536
    [ 6742.345922] md: UT:4919856d ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:b7992907 E:00000001
    [ 6742.345924] D 0: DISK
    [ 6742.345930] D 1: DISK
    [ 6742.345933] D 2: DISK
    [ 6742.345937] D 3: DISK
    [ 6742.345942] md: THIS: DISK
    ...
    [ 6742.346058] md0:
    [ 6742.346067] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3
    [ 6742.346070] md: rdev superblock:
    [ 6742.346073] md: SB: (V:1.0.0) ID: CT:9a322a9c
    [ 6742.346077] md: L-1507699579 S976570180 ND:48 RD:0 md0 LO:65536 CS:196610
    [ 6742.346081] md: UT:00000018 ST:0 AD:131048 WD:0 FD:8 SD:0 CSUM:00000000 E:00000000
    [ 6742.346084] D 0: DISK
    [ 6742.346089] D 1: DISK
    [ 6742.346092] D 2: DISK
    [ 6742.346096] D 3: DISK
    [ 6742.346102] md: THIS: DISK
    ...
    [ 6742.346219] md: **********************************
    [ 6742.346221]

    Here md1 is metadata 0.90.0, and md0 is metadata 1.2

    With some more code to distinguish these two types of superblock,
    this patch generates dump information like:

    [ 7906.755790]
    [ 7906.755799] md: **********************************
    [ 7906.755802] md: * *
    [ 7906.755804] md: **********************************
    [ 7906.755808] md1:
    [ 7906.755819] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3
    [ 7906.755821] md: rdev superblock (MJ:0):
    [ 7906.755826] md: SB: (V:0.90.0) ID: CT:491989f3
    [ 7906.755830] md: L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536
    [ 7906.755834] md: UT:491989f3 ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:00fb52ad E:00000001
    [ 7906.755836] D 0: DISK
    [ 7906.755842] D 1: DISK
    [ 7906.755845] D 2: DISK
    [ 7906.755849] D 3: DISK
    [ 7906.755855] md: THIS: DISK
    ...
    [ 7906.755972] md0:
    [ 7906.755981] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3
    [ 7906.755984] md: rdev superblock (MJ:1):
    [ 7906.755989] md: SB: (V:1) (F:0) Array-ID:
    [ 7906.755990] md: Name: "DG5:0" CT:1226410480
    [ 7906.755998] md: L5 SZ130944 RD:4 LO:2 CS:128 DO:24 DS:131048 SO:8 RO:0
    [ 7906.755999] md: Dev:00000003 UUID: 9194d744:87f7:a448:85f2:7497b84ce30a
    [ 7906.756001] md: (F:0) UT:1226410480 Events:0 ResyncOffset:-1 CSUM:0dbcd829
    [ 7906.756003] md: (MaxDev:384)
    ...
    [ 7906.756113] md: **********************************
    [ 7906.756116]

    This md0 (metadata 1.2) information is now dumped exactly according to
    struct mdp_superblock_1.
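
    A sketch of the distinction the patch introduces: choose the print
    routine by the rdev's superblock major version instead of always
    interpreting the buffer as mdp_super_t (function names below are
    illustrative, not necessarily the ones used in the patch):

        /* Fragment in the drivers/md/md.c context: dump the superblock
         * of one rdev according to its metadata version. */
        static void print_rdev_sb(mdk_rdev_t *rdev, int major_version)
        {
                void *sb = page_address(rdev->sb_page);

                if (major_version == 0)
                        print_sb_90((mdp_super_t *)sb);             /* 0.90 layout */
                else
                        print_sb_1((struct mdp_superblock_1 *)sb);  /* 1.x layout */
        }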

    Signed-off-by: Cheng Renquan
    Cc: Neil Brown
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: NeilBrown

    Cheng Renquan
     
  • The rdev_for_each macro is identical to list_for_each_entry_safe;
    it should be defined in terms of list_for_each_entry_safe instead of
    reinventing the wheel.

    But some of these loops don't really need the safe version; a direct
    list_for_each_entry is enough, and that saves a temp variable (tmp)
    in every function that used rdev_for_each.

    In this patch, most rdev_for_each loops are replaced by
    list_for_each_entry, saving many tmp variables; the safe version is
    used only where the loop calls list_del to delete an entry (see the
    sketch below).
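
    A minimal sketch of the distinction (structure and field names follow
    the md code of that era and are illustrative; should_remove() and the
    other helpers are hypothetical):

        #include <linux/list.h>

        static void scan_rdevs(mddev_t *mddev)
        {
                mdk_rdev_t *rdev, *tmp;

                /* Read-only walk: plain list_for_each_entry, no tmp needed. */
                list_for_each_entry(rdev, &mddev->disks, same_set)
                        inspect_rdev(rdev);                     /* hypothetical */

                /* A walk that may delete the current entry needs the safe
                 * variant, which caches the next element in 'tmp'. */
                list_for_each_entry_safe(rdev, tmp, &mddev->disks, same_set) {
                        if (should_remove(rdev)) {              /* hypothetical */
                                list_del(&rdev->same_set);
                                release_rdev(rdev);             /* hypothetical */
                        }
                }
        }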

    Signed-off-by: Cheng Renquan
    Signed-off-by: NeilBrown

    Cheng Renquan
     
  • This patch renames the hash_spacing and preshift members of struct
    raid0_private_data to spacing and sector_shift respectively and
    changes the semantics as follows:

    We always have spacing = 2 * hash_spacing. In case
    sizeof(sector_t) > sizeof(u32) we also have sector_shift = preshift + 1
    while sector_shift = preshift = 0 otherwise.

    Note that the values of nb_zone and zone are unaffected by these
    changes because both arguments of the sector_div() preceding the
    assignment of these two variables double (see the check below).
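
    A quick check of that point with made-up numbers (the kernel uses
    sector_div() for the division; arithmetically the quotient is the
    same either way):

        #include <assert.h>
        #include <stdint.h>

        int main(void)
        {
                /* blocks and hash_spacing in 1K units; the same device
                 * measured in 512-byte sectors doubles both numbers. */
                uint64_t blocks = 1000000, hash_spacing = 4096;
                uint64_t sectors = 2 * blocks, spacing = 2 * hash_spacing;

                /* The quotient (nb_zone) is identical either way. */
                assert(blocks / hash_spacing == sectors / spacing);
                return 0;
        }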

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • This completes the block -> sector conversion of struct strip_zone.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • For the same reason as in the previous patch, rename it from zone_offset
    to zone_start.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • Rename zone->dev_offset to zone->dev_start to make sure all users
    have been converted.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • There is no compelling need for this, but sysfs_notify_dirent is a
    nicer interface and the change is good for consistency.

    Signed-off-by: NeilBrown

    NeilBrown
     

21 Oct, 2008

2 commits


13 Oct, 2008

1 commit