28 Jul, 2011

1 commit

  • Space must have been allocated when array was created.
    A feature flag is set when the badblock list is non-empty, to
    ensure old kernels don't load and trust the whole device.

    We only update the on-disk badblocklist when it has changed.
    If the badblocklist (or other metadata) is stored on a bad block, we
    don't cope very well.

    If the metadata has no room for bad blocks, flag bad-blocks as disabled,
    and do the same for 0.90 metadata.

    Signed-off-by: NeilBrown

    NeilBrown
     

31 Mar, 2011

1 commit


12 Aug, 2010

1 commit


14 Dec, 2009

1 commit


18 Jun, 2009

1 commit


31 Mar, 2009

7 commits

  • Move the raid6 data processing routines into a standalone module
    (raid6_pq) to prepare them to be called from async_tx wrappers and other
    non-md drivers/modules. This precludes a circular dependency of raid456
    needing the async modules for data processing while those modules in
    turn depend on raid456 for the base level synchronous raid6 routines.

    To support this move:
    1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h
    2/ The raid6_call, recovery calls, and table symbols are exported
    3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to
    compile

    Signed-off-by: Dan Williams
    Signed-off-by: NeilBrown

    Dan Williams
     
  • It really is nicer to keep related code together..

    Signed-off-by: NeilBrown

    NeilBrown
     
  • This makes the includes more explicit, and is preparation for moving
    md_k.h to drivers/md/md.h

    Remove include/raid/md.h as its only remaining use was to #include
    other files.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • The extern function definitions are kernel-internal definitions, so
    they belong in md_k.h

    The MD_*_VERSION values could reasonably go in a number of places,
    but md_u.h seems most reasonable.

    This leaves almost nothing in md.h. It will go soon.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • .. as they are part of the user-space interface.
    Also move MdpMinorShift into there so we can remove duplication.

    Lastly move mdp_major in. It is less obviously part of the user-space
    interface, but do_mounts_md.c uses it, and it is acting a bit like
    user-space.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Move the headers with the local structures for the disciplines and
    bitmap.h into drivers/md/ so that they are more easily grepable for
    hacking and not far away. md.h is left where it is for now as there
    are some uses from the outside.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: NeilBrown

    Christoph Hellwig
     
  • There are two problems with is_mddev_idle.

    1/ sync_io is 'atomic_t' and hence 'int'. curr_events and all the
    rest are 'long'.
    So if sync_io were to wrap on a 64bit host, the value of
    curr_events would go very negative suddenly, and take a very
    long time to return to positive.

    So do all calculations as 'int'. That gives us plenty of precision
    for what we need.

    2/ To initialise rdev->last_events we simply call is_mddev_idle, on
    the assumption that it will make sure that last_events is in a
    suitable range. It used to do this, but now it does not.
    So now we need to be more explicit about initialisation.

    Signed-off-by: NeilBrown

    NeilBrown
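
The first problem in the is_mddev_idle message above can be reproduced in a small userspace model. This is only a sketch under stated assumptions: `delta_wide`, `delta_narrow` and `wrap_example` are hypothetical names, `sectors` stands in for a 64-bit I/O total, and `sync_io` for a 32-bit counter that has just wrapped. It is not the md code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Widen everything to 64 bits before subtracting (the 'long' style). */
static int64_t delta_wide(int64_t sectors, int32_t sync_io, int64_t last_events)
{
    return (sectors - (int64_t)sync_io) - last_events;
}

/* Subtract modulo 2^32, then view the result as a signed 32-bit value. */
static int32_t delta_narrow(int64_t sectors, int32_t sync_io, int32_t last_events)
{
    uint32_t curr = (uint32_t)sectors - (uint32_t)sync_io;
    uint32_t diff = curr - (uint32_t)last_events;

    /* Out-of-range conversion is implementation-defined before C23;
     * mainstream compilers wrap, which is what this model relies on. */
    return (int32_t)diff;
}

/*
 * Worked case: the totals were 3000000000 sectors with sync_io at INT32_MAX,
 * so last_events was 852516353. Then 100 more resync sectors are counted in
 * both, and sync_io wraps to -2147483549.
 */
static void wrap_example(void)
{
    /* Wide arithmetic sees a jump of 2^32 although nothing else happened. */
    assert(delta_wide(3000000100LL, -2147483549, 852516353LL) == 4294967296LL);
    /* 32-bit arithmetic sees no non-resync activity at all. */
    assert(delta_narrow(3000000100LL, -2147483549, 852516353) == 0);
}
```

Calling `wrap_example()` exercises both variants. The point of the sketch is only that a difference taken in 32-bit arithmetic survives the wrap of a 32-bit counter, while the widened difference does not.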
     

31 Jan, 2009

1 commit


09 Jan, 2009

10 commits

  • If a raid1 has only one working drive and it has a sector which
    gives an error on read, then an attempt to recover onto a spare will
    fail, but as the single remaining drive is not removed from the
    array, the recovery will be immediately re-attempted, resulting
    in an infinite recovery loop.

    So detect this situation and don't retry recovery once an error
    on the lone remaining drive is detected.

    Allow recovery to be retried once every time a spare is added
    in case the problem wasn't actually a media error.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Using sequential numbers to identify md devices is somewhat artificial.
    Using names can be a lot more user-friendly.

    Also, creating md devices by opening the device special file is a bit
    awkward.

    So this patch provides a new option for creating and naming devices.

    Writing a name such as "md_home" to
    /sys/module/md_mod/parameters/new_array
    will cause an array with that name to be created. It will appear in
    /sys/block/, /proc/partitions and /proc/mdstat as 'md_home'.
    It will have an arbitrary minor number allocated.

    md devices that are created by an open are destroyed on the last
    close when the device is inactive.
    For named md devices, they will not be destroyed until the array
    is explicitly stopped, either with the STOP_ARRAY ioctl or by
    writing 'clear' to /sys/block/md_XXXX/md/array_state.

    The name of the array must start 'md_' to avoid conflict with
    other devices.

    Signed-off-by: NeilBrown

    NeilBrown
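
As a rough illustration of the naming rule in the message above, the following sketch writes a name to a parameter file only if it starts with 'md_'. The helper `request_named_array` is hypothetical and the path is passed in by the caller; on a real system it would be the new_array parameter file named in the message, which needs root and a kernel carrying this patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write 'name' to the file at 'param_path'.
 * Returns 0 on success, -1 if the name lacks the "md_" prefix or on I/O error.
 */
static int request_named_array(const char *param_path, const char *name)
{
    FILE *f;
    int ok;

    assert(param_path != NULL && name != NULL);

    if (strncmp(name, "md_", 3) != 0)
        return -1;              /* names must start with "md_" */

    f = fopen(param_path, "w");
    if (!f)
        return -1;

    ok = fputs(name, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

The prefix check is done in the helper only to fail early; the kernel side is what actually enforces the rule.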
     
  • Currently md devices, once created, never disappear until the module
    is unloaded. This is essentially because the gendisk holds a
    reference to the mddev, and the mddev holds a reference to the
    gendisk, which is a circular reference.

    If we drop the reference from mddev to gendisk, then we need to ensure
    that the mddev is destroyed when the gendisk is destroyed. However it
    is not possible to hook into the gendisk destruction process to enable
    this.

    So we drop the reference from the gendisk to the mddev and destroy the
    gendisk when the mddev gets destroyed. However this has a
    complication.
    Between the call
    __blkdev_get->get_gendisk->kobj_lookup->md_probe
    and the call
    __blkdev_get->md_open

    there is no obvious way to hold a reference on the mddev any more, so
    unless something is done, it will disappear and gendisk will be
    destroyed prematurely.

    Also, once we decide to destroy the mddev, there will be an unlockable
    moment before the gendisk is unlinked (blk_unregister_region) during
    which a new reference to the gendisk can be created. We need to
    ensure that this reference can not be used. i.e. the ->open must
    fail.

    So:
    1/ in md_probe we set a flag in the mddev (hold_active) which
    indicates that the array should be treated as active, even
    though there are no references, and no appearance of activity.
    This is cleared by md_release when the device is closed if it
    is no longer needed.
    This ensures that the gendisk will survive between md_probe and
    md_open.

    2/ In md_open we check if the mddev we expect to open matches
    the gendisk that we did open.
    If there is a mismatch we return -ERESTARTSYS and modify
    __blkdev_get to retry from the top in that case.
    In the -ERESTARTSYS case we make sure to wait until
    the old gendisk (that we succeeded in opening) is really gone so
    we loop at most once.

    Some udev configurations will always open an md device when it first
    appears. If we allow an md device that was just created by an open
    to disappear on an immediate close, then this can race with such udev
    configurations and result in an infinite loop of the device being opened
    and closed, then re-opened due to the 'ADD' event from the first open,
    and then closed, and so on.
    So we make sure an md device, once created by an open, remains active
    at least until some md 'ioctl' has been made on it. This means that
    all normal usage of md devices will allow them to disappear promptly
    when not needed, but the worst that an incorrect usage will do is
    cause an inactive md device to be left in existence (it can easily be
    removed).

    As an array can be stopped by writing to a sysfs attribute
    echo clear > /sys/block/mdXXX/md/array_state
    we need to use scheduled work for deleting the gendisk and other
    kobjects. This allows us to wait for any pending gendisk deletion to
    complete by simply calling flush_scheduled_work().

    Signed-off-by: NeilBrown

    NeilBrown
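
The "retry from the top" idea in step 2/ of the message above can be sketched as a tiny userspace model. Everything here is an assumption made for illustration: the structs, `lookup_disk`, `open_disk` and `open_with_retry` are hypothetical, and `MODEL_ERESTARTSYS` simply mirrors the kernel-internal value 512, which userspace `<errno.h>` does not provide.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_ERESTARTSYS 512   /* stand-in for the kernel-internal code */

struct disk_model {
    int generation;             /* which mddev this gendisk belongs to */
};

struct lookup_model {
    struct disk_model *stale;   /* dying gendisk still visible, or NULL */
    struct disk_model *live;    /* gendisk of the live mddev */
    int live_generation;
};

/* A lookup may still find the old gendisk before it is unlinked. */
static struct disk_model *lookup_disk(struct lookup_model *m)
{
    return m->stale ? m->stale : m->live;
}

/* Open fails with a restart code if the disk does not match the live mddev. */
static int open_disk(struct lookup_model *m, struct disk_model *d)
{
    if (d->generation != m->live_generation) {
        m->stale = NULL;        /* models waiting until the old disk is gone */
        return -MODEL_ERESTARTSYS;
    }
    return 0;
}

/* Caller side: redo the lookup whenever open asks for a restart. */
static int open_with_retry(struct lookup_model *m, int *attempts)
{
    int err;

    assert(m != NULL && attempts != NULL);
    *attempts = 0;
    do {
        struct disk_model *d = lookup_disk(m);

        (*attempts)++;
        err = open_disk(m, d);
    } while (err == -MODEL_ERESTARTSYS);
    return err;
}
```

Because the failed open waits for the stale disk to disappear before returning, the second lookup can only find the live one, so the loop body runs at most twice in this model.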
     
    md_print_devices is called in two code paths: MD_BUG(...), and md_ioctl
    with PRINT_RAID_DEBUG. It dumps information about all md devices that
    are in use.

    However, it wrongly treated two types of superblock as one.

    The header file defines two types of superblock:
    struct mdp_superblock_s (typedef'd as mdp_super_t) for md with
    metadata 0.90, and struct mdp_superblock_1 for md with metadata
    1.0 and later.

    These two types of superblock are very different.

    The md_print_devices code processed them both as mdp_super_t, which
    leads to a wrong information dump like:

    [ 6742.345877]
    [ 6742.345887] md: **********************************
    [ 6742.345890] md: * <COMPLETE RAID STATE PRINTOUT> *
    [ 6742.345892] md: **********************************
    [ 6742.345896] md1:
    [ 6742.345907] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3
    [ 6742.345909] md: rdev superblock:
    [ 6742.345914] md: SB: (V:0.90.0) ID: CT:4919856d
    [ 6742.345918] md: L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536
    [ 6742.345922] md: UT:4919856d ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:b7992907 E:00000001
    [ 6742.345924] D 0: DISK
    [ 6742.345930] D 1: DISK
    [ 6742.345933] D 2: DISK
    [ 6742.345937] D 3: DISK
    [ 6742.345942] md: THIS: DISK
    ...
    [ 6742.346058] md0:
    [ 6742.346067] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3
    [ 6742.346070] md: rdev superblock:
    [ 6742.346073] md: SB: (V:1.0.0) ID: CT:9a322a9c
    [ 6742.346077] md: L-1507699579 S976570180 ND:48 RD:0 md0 LO:65536 CS:196610
    [ 6742.346081] md: UT:00000018 ST:0 AD:131048 WD:0 FD:8 SD:0 CSUM:00000000 E:00000000
    [ 6742.346084] D 0: DISK
    [ 6742.346089] D 1: DISK
    [ 6742.346092] D 2: DISK
    [ 6742.346096] D 3: DISK
    [ 6742.346102] md: THIS: DISK
    ...
    [ 6742.346219] md: **********************************
    [ 6742.346221]

    Here md1 is metadata 0.90.0, and md0 is metadata 1.2

    With the extra code in this patch to distinguish these two types of
    superblock, the dump looks like:

    [ 7906.755790]
    [ 7906.755799] md: **********************************
    [ 7906.755802] md: * <COMPLETE RAID STATE PRINTOUT> *
    [ 7906.755804] md: **********************************
    [ 7906.755808] md1:
    [ 7906.755819] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3
    [ 7906.755821] md: rdev superblock (MJ:0):
    [ 7906.755826] md: SB: (V:0.90.0) ID: CT:491989f3
    [ 7906.755830] md: L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536
    [ 7906.755834] md: UT:491989f3 ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:00fb52ad E:00000001
    [ 7906.755836] D 0: DISK
    [ 7906.755842] D 1: DISK
    [ 7906.755845] D 2: DISK
    [ 7906.755849] D 3: DISK
    [ 7906.755855] md: THIS: DISK
    ...
    [ 7906.755972] md0:
    [ 7906.755981] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3
    [ 7906.755984] md: rdev superblock (MJ:1):
    [ 7906.755989] md: SB: (V:1) (F:0) Array-ID:
    [ 7906.755990] md: Name: "DG5:0" CT:1226410480
    [ 7906.755998] md: L5 SZ130944 RD:4 LO:2 CS:128 DO:24 DS:131048 SO:8 RO:0
    [ 7906.755999] md: Dev:00000003 UUID: 9194d744:87f7:a448:85f2:7497b84ce30a
    [ 7906.756001] md: (F:0) UT:1226410480 Events:0 ResyncOffset:-1 CSUM:0dbcd829
    [ 7906.756003] md: (MaxDev:384)
    ...
    [ 7906.756113] md: **********************************
    [ 7906.756116]

    The md0 (metadata 1.2) dump now follows struct mdp_superblock_1
    exactly.

    Signed-off-by: Cheng Renquan
    Cc: Neil Brown
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: NeilBrown

    Cheng Renquan
     
    The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to
    list_for_each_entry_safe from <linux/list.h>, so it should be defined to
    use list_for_each_entry_safe instead of reinventing the wheel.

    But some of these loops don't really need a safe version;
    a plain list_for_each_entry is enough, and it saves a temp
    variable (tmp) in every function that used rdev_for_each.

    In this patch, most rdev_for_each loops are replaced by list_for_each_entry,
    which removes many tmp variables; the safe version is kept only where
    list_del may be called to delete an entry during the walk.

    Signed-off-by: Cheng Renquan
    Signed-off-by: NeilBrown

    Cheng Renquan
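
The difference between the plain and the "safe" walk described above can be shown without the kernel list macros. This is a minimal userspace sketch on a singly linked list; `struct node`, `push`, `count_nodes` and `remove_value` are hypothetical names and do not come from the patch.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
    int val;
    struct node *next;
};

/* Prepend a value; aborts via assert if allocation fails. */
static void push(struct node **head, int val)
{
    struct node *n = malloc(sizeof(*n));

    assert(n != NULL);
    n->val = val;
    n->next = *head;
    *head = n;
}

/* Read-only walk: nothing is unlinked, so no saved successor is needed. */
static int count_nodes(const struct node *head)
{
    const struct node *cur;
    int count = 0;

    for (cur = head; cur; cur = cur->next)
        count++;
    return count;
}

/* Deleting walk: remember the successor before the node may be freed. */
static void remove_value(struct node **head, int val)
{
    struct node **link = head;

    while (*link) {
        struct node *cur = *link;
        struct node *next = cur->next;  /* plays the role of 'tmp' */

        if (cur->val == val) {
            *link = next;
            free(cur);
        } else {
            link = &cur->next;
        }
    }
}
```

Only `remove_value` needs the extra pointer, which is the same reason the patch keeps the safe variant only for loops that may delete an entry.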
     
  • This patch renames the hash_spacing and preshift members of struct
    raid0_private_data to spacing and sector_shift respectively and
    changes the semantics as follows:

    We always have spacing = 2 * hash_spacing. In case
    sizeof(sector_t) > sizeof(u32) we also have sector_shift = preshift + 1
    while sector_shift = preshift = 0 otherwise.

    Note that the values of nb_zone and zone are unaffected by these changes
    because in the sector_div() preceding the assignment of these two
    variables both arguments double.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
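
The note above that the quotient is unaffected when both arguments double can be checked with a userspace stand-in for sector_div(), which divides its first argument in place and returns the remainder. The names `model_sector_div` and `doubling_preserves_quotient` are hypothetical; this is a sketch of the arithmetic only, not of the raid0 code.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for sector_div(): *n becomes the quotient, remainder is returned. */
static uint32_t model_sector_div(uint64_t *n, uint32_t base)
{
    uint32_t rem;

    assert(base != 0);
    rem = (uint32_t)(*n % base);
    *n /= base;
    return rem;
}

/*
 * Returns 1 if doubling both arguments leaves the quotient unchanged
 * and exactly doubles the remainder.
 */
static int doubling_preserves_quotient(uint64_t n, uint32_t base)
{
    uint64_t q1 = n;
    uint64_t q2 = 2 * n;
    uint32_t r1, r2;

    assert(n <= UINT64_MAX / 2 && base <= UINT32_MAX / 2);
    r1 = model_sector_div(&q1, base);
    r2 = model_sector_div(&q2, 2 * base);
    return q1 == q2 && r2 == 2 * r1;
}
```

Since the remainder doubles too, a test such as "is there a remainder?" gives the same answer before and after the change.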
     
  • This completes the block -> sector conversion of struct strip_zone.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • For the same reason as in the previous patch, rename it from zone_offset
    to zone_start.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • Rename zone->dev_offset to zone->dev_start to make sure all users
    have been converted.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • There is no compelling need for this, but sysfs_notify_dirent is a
    nicer interface and the change is good for consistency.

    Signed-off-by: NeilBrown

    NeilBrown
     

21 Oct, 2008

2 commits


13 Oct, 2008

4 commits

  • Having
    function (args)
    instead of
    function(args)

    makes it harder to search for calls of particular functions.
    So remove all those spaces.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • A lot of cruft has gathered over the years. Time to remove it.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • This patch renames hash_spacing and preshift to spacing and
    sector_shift respectively with the following change of semantics:

    Case 1: (sizeof(sector_t) > sizeof(u32)).
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    (aka the shifting dance case). Here we have sector_shift = preshift +
    1 and

    spacing = 2 * hash_spacing

    during the computation of nb_zone and curr_sector, but

    spacing = hash_spacing

    in which_dev() because in the last hunk of the patch for linear.c we
    shift down conf->spacing (= 2 * hash_spacing) by one more bit than
    in the old code.

    Hence in the computation of nb_zone, sz and base have the same value
    as before, so nb_zone is not affected. Also curr_sector in the next
    hunk stays the same.

    In which_dev() the hash table index is computed as

    (sector >> sector_shift) / spacing

    In view of sector_shift = preshift + 1 and spacing = hash_spacing,
    this equals

    ((sector/2) >> preshift) / hash_spacing

    which is the value computed by the old code.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
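
The claim above that the hash table index is unchanged can be checked numerically. The sketch below compares the two expressions from the message for every sector in a range; `index_old`, `index_new` and `indices_agree` are hypothetical helper names, and the parameter values used with them are arbitrary test inputs rather than values taken from linear.c.

```c
#include <assert.h>
#include <stdint.h>

/* Old form: ((sector / 2) >> preshift) / hash_spacing */
static uint64_t index_old(uint64_t sector, unsigned int preshift,
                          uint64_t hash_spacing)
{
    assert(hash_spacing != 0 && preshift < 64);
    return ((sector / 2) >> preshift) / hash_spacing;
}

/* New form: (sector >> sector_shift) / spacing */
static uint64_t index_new(uint64_t sector, unsigned int sector_shift,
                          uint64_t spacing)
{
    assert(spacing != 0 && sector_shift < 64);
    return (sector >> sector_shift) / spacing;
}

/*
 * Returns 1 if both forms agree for all sectors in [0, max_sector] when
 * sector_shift = preshift + 1 and spacing = hash_spacing.
 */
static int indices_agree(uint64_t max_sector, unsigned int preshift,
                         uint64_t hash_spacing)
{
    uint64_t sector;

    assert(preshift < 63 && max_sector < UINT64_MAX);
    for (sector = 0; sector <= max_sector; sector++) {
        if (index_old(sector, preshift, hash_spacing) !=
            index_new(sector, preshift + 1, hash_spacing))
            return 0;
    }
    return 1;
}
```

The agreement follows from `sector / 2` being the same as `sector >> 1` for unsigned values, so the extra bit of shift absorbs the halving.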
     
    Rename them to num_sectors and start_sector, which are more descriptive.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     

24 Jul, 2008

1 commit


21 Jul, 2008

4 commits

  • All modifications and most access to the mddev->disks list are made
    under the reconfig_mutex lock. However there are three places where
    the list is walked without any locking. If a reconfig happens at this
    time, havoc (and oops) can ensue.

    So use RCU to protect these accesses:
    - wrap them in rcu_read_{,un}lock()
    - use list_for_each_entry_rcu
    - add to the list with list_add_rcu
    - delete from the list with list_del_rcu
    - delay the 'free' with call_rcu rather than schedule_work

    Note that export_rdev did a list_del_init on this list. In almost all
    cases the entry was not in the list anymore so it was a no-op and so
    safe. It is no longer safe as after list_del_rcu we may not touch
    the list_head.
    An audit shows that export_rdev is called:
    - after unbind_rdev_from_array, in which case the delete has
    already been done,
    - after bind_rdev_to_array fails, in which case the delete isn't needed.
    - before the device has been put on a list at all (e.g. in
    add_new_disk where reading the superblock fails).
    - and in autorun devices after a failure when the device is on a
    different list.

    So remove the list_del_init call from export_rdev, and add it back
    immediately before the call to export_rdev for that last case.

    Note also that ->same_set is sometimes used for lists other than
    mddev->list (e.g. candidates). In these cases rcu is not needed.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Open isn't the only thing that increments ->active. e.g. reading
    /proc/mdstat will increment it briefly. So to avoid false positives
    in testing for concurrent access, introduce a new counter that counts
    just the number of times the md device is open.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • This patch renames the array_size field of struct mddev_s to array_sectors
    and converts all instances to use units of 512 byte sectors instead of 1k
    blocks.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
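
The unit change above is a factor of two, since a 1k block is two 512-byte sectors. A minimal sketch with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* 1 KiB blocks -> 512-byte sectors. */
static uint64_t kib_blocks_to_sectors(uint64_t blocks)
{
    assert(blocks <= UINT64_MAX / 2);
    return blocks * 2;
}

/* 512-byte sectors -> 1 KiB blocks; an odd trailing sector is dropped. */
static uint64_t sectors_to_kib_blocks(uint64_t sectors)
{
    return sectors / 2;
}
```

The second helper shows why the sector-based field is the more precise of the two: a size with an odd number of sectors cannot be represented in 1k blocks.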
     

11 Jul, 2008

2 commits


01 Jul, 2008

1 commit

  • md_allow_write() marks the metadata dirty while holding mddev->lock and then
    waits for the write to complete. For externally managed metadata this causes a
    deadlock as userspace needs to take the lock to communicate that the metadata
    update has completed.

    Change md_allow_write() in the 'external' case to start the 'mark active'
    operation and then return -EAGAIN. The expected side effects while waiting for
    userspace to write 'active' to 'array_state' are that reshape is held off (the
    code currently handles -ENOMEM), some 'stripe_cache_size' change requests
    fail, some GET_BITMAP_FILE ioctl requests fall back to GFP_NOIO, and
    updates to 'raid_disks' fail. Except for 'stripe_cache_size' changes,
    these failures can be mitigated by coordinating with mdmon.

    md_write_start() still prevents writes from occurring until the metadata
    handler has had a chance to take action as it unconditionally waits for
    MD_CHANGE_CLEAN to be cleared.

    [neilb@suse.de: return -EAGAIN, try GFP_NOIO]
    Signed-off-by: Dan Williams

    Dan Williams
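
The non-blocking behaviour described above can be modelled in a few lines of userspace C. This is a sketch under stated assumptions: `struct array_model`, `allow_write_model` and `userspace_ack` are hypothetical, and the internal-metadata write is modelled as completing immediately.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct array_model {
    int external;        /* metadata is managed by userspace */
    int in_sync;         /* array is currently marked clean */
    int change_pending;  /* 'mark active' started, not yet acknowledged */
};

/* Start marking the array active; never blocks on userspace. */
static int allow_write_model(struct array_model *a)
{
    assert(a != NULL);
    if (a->in_sync) {
        a->in_sync = 0;
        /* Internal metadata: modelled as written before returning. */
        a->change_pending = a->external;
    }
    return a->change_pending ? -EAGAIN : 0;
}

/* Models userspace writing 'active' to 'array_state'. */
static void userspace_ack(struct array_model *a)
{
    assert(a != NULL);
    a->change_pending = 0;
}
```

A caller that gets -EAGAIN is expected to back off and try again later, which is why the message lists the operations that may fail in the meantime.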
     

28 Jun, 2008

3 commits

  • From: Dan Williams

    Currently ops_run_biodrain and other locations have extra logic to determine
    which blocks are processed in the prexor and non-prexor cases. This can be
    eliminated if handle_write_operations5 flags the blocks to be processed in all
    cases via R5_Wantdrain. The presence of the prexor operation is tracked in
    sh->reconstruct_state.

    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown

    Dan Williams
     
  • From: Dan Williams

    Track the state of reconstruct operations (recalculating the parity block,
    usually due to incoming writes or as part of array expansion). This reduces the
    scope of the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether
    a reconstruct operation has been requested via the ops_request field of struct
    stripe_head_state.

    This is the final step in the removal of ops.{pending,ack,complete,count}, i.e.
    the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do
    not track the state of the operation.

    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown

    Dan Williams
     
  • From: Dan Williams

    The STRIPE_OP_* flags record the state of stripe operations which are
    performed outside the stripe lock. Their use in indicating which
    operations need to be run is straightforward; however, interpolating what
    the next state of the stripe should be based on a given combination of
    these flags is not straightforward, and has led to bugs. An easier to read
    implementation with minimal degrees of freedom is needed.

    Towards this goal, this patch introduces explicit states to replace what was
    previously interpolated from the STRIPE_OP_* flags. For now this only converts
    the handle_parity_checks5 path, removing a user of the
    ops.{pending,ack,complete,count} fields of struct stripe_operations.

    This conversion also found a remaining issue with the current code. There is
    a small window for a drive to fail between when we schedule a repair and when
    the parity calculation for that repair completes. When this happens we will
    writeback to 'failed_num' when we really want to write back to 'pd_idx'.

    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown

    Dan Williams
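
To illustrate what "explicit states" buys over combinations of flags, here is a small state machine for a parity check followed by an optional repair. The state names and the `next_state` function are hypothetical and chosen for this sketch; they are not claimed to match the identifiers introduced by the patch.

```c
#include <assert.h>

enum check_state_model {
    CHECK_IDLE,       /* nothing in flight */
    CHECK_RUN,        /* parity check running */
    CHECK_RESULT,     /* check finished, result to be examined */
    REPAIR_RUN,       /* parity recomputation running */
    REPAIR_RESULT     /* recomputed parity ready to be written back */
};

/* One explicit transition per state; 'parity_ok' matters only in CHECK_RESULT. */
static enum check_state_model next_state(enum check_state_model s, int parity_ok)
{
    switch (s) {
    case CHECK_IDLE:
        return CHECK_RUN;
    case CHECK_RUN:
        return CHECK_RESULT;
    case CHECK_RESULT:
        return parity_ok ? CHECK_IDLE : REPAIR_RUN;
    case REPAIR_RUN:
        return REPAIR_RESULT;
    case REPAIR_RESULT:
        return CHECK_IDLE;
    }
    assert(!"unknown state");
    return CHECK_IDLE;
}
```

With a single state variable there is exactly one next step from each state, whereas a set of independent flags admits combinations whose meaning has to be inferred.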