30 Aug, 2010

1 commit

  • MD_CHANGE_CLEAN is used for two different purposes and this leads to
    confusion.
    One of the purposes is largely mirrored by MD_CHANGE_PENDING which is
    not used for anything else, so have MD_CHANGE_PENDING take over that
    purpose fully.

    The two purposes are:
    1/ tell md_update_sb that an update is needed and that it is just a
    clean/dirty transition.
    2/ tell user-space that a transition from clean to dirty is pending
    (something wants to write), and tell the kernel (by clearing the
    flag) that the transition is OK.

    The first purpose remains with MD_CHANGE_CLEAN, the second is moved
    fully to MD_CHANGE_PENDING.

    This means that various places which conditionally set or cleared
    MD_CHANGE_CLEAN no longer need to be conditional.

    Signed-off-by: NeilBrown

    NeilBrown
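
    A minimal sketch of the two-flag split this entry describes, written as
    an illustrative userspace C program (the bit numbers and helper names
    are assumptions, not the md.c code):

        #include <stdio.h>

        /* Illustrative bit positions; the kernel defines its own in md.h. */
        enum sb_flag_bits {
            MD_CHANGE_CLEAN   = 1, /* clean/dirty superblock update needed */
            MD_CHANGE_PENDING = 2, /* clean->dirty transition awaits an OK */
        };

        static unsigned long sb_flags;

        #define set_bit(nr, p)   (*(p) |=  (1UL << (nr)))
        #define clear_bit(nr, p) (*(p) &= ~(1UL << (nr)))
        #define test_bit(nr, p)  ((*(p) >> (nr)) & 1UL)

        /* Something wants to write: announce the pending transition. */
        static void mark_array_dirty(void)
        {
            set_bit(MD_CHANGE_CLEAN, &sb_flags);   /* tell md_update_sb */
            set_bit(MD_CHANGE_PENDING, &sb_flags); /* tell user-space   */
        }

        /* Whoever manages the metadata clears the flag to approve it. */
        static void approve_transition(void)
        {
            clear_bit(MD_CHANGE_PENDING, &sb_flags);
        }

        int main(void)
        {
            mark_array_dirty();
            printf("write may proceed: %s\n",
                   test_bit(MD_CHANGE_PENDING, &sb_flags) ? "no" : "yes");
            approve_transition();
            printf("write may proceed: %s\n",
                   test_bit(MD_CHANGE_PENDING, &sb_flags) ? "no" : "yes");
            return 0;
        }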
     

11 Aug, 2010

1 commit

  • * 'for-linus' of git://neil.brown.name/md: (24 commits)
    md: clean up do_md_stop
    md: fix another deadlock with removing sysfs attributes.
    md: move revalidate_disk() back outside open_mutex
    md/raid10: fix deadlock with unaligned read during resync
    md/bitmap: separate out loading a bitmap from initialising the structures.
    md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.
    md/bitmap: optimise scanning of empty bitmaps.
    md/bitmap: clean up plugging calls.
    md/bitmap: reduce dependence on sysfs.
    md/bitmap: white space clean up and similar.
    md/raid5: export raid5 unplugging interface.
    md/plug: optionally use plugger to unplug an array during resync/recovery.
    md/raid5: add simple plugging infrastructure.
    md/raid5: export is_congested test
    raid5: Don't set read-ahead when there is no queue
    md: add support for raising dm events.
    md: export various start/stop interfaces
    md: split out md_rdev_init
    md: be more careful setting MD_CHANGE_CLEAN
    md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk
    ...

    Linus Torvalds
     

08 Aug, 2010

2 commits

  • Moving the deletion of sysfs attributes from reconfig_mutex to
    open_mutex didn't really help, as a process can try to take
    open_mutex while holding reconfig_mutex, so the same deadlock can
    still happen; it just requires one more process to be involved in
    the chain.

    It looks like I cannot easily use locking to wait for the sysfs
    deletion to complete, so don't.

    The only things that we cannot do while the deletions are still
    pending are other operations which can change the sysfs namespace:
    run, takeover, stop. Each of these can fail with -EBUSY.
    So set a flag while doing a sysfs deletion, and fail run, takeover,
    stop if that flag is set.

    This is suitable for 2.6.35.x

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     
  • Remove the current bio flags and reuse the request flags for the bio, too.
    This allows us to more easily trace the type of I/O from the filesystem
    down to the block driver. There were two flags in the bio that were
    missing from the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
    renamed two request flags that had a superfluous RW in them.

    Note that the flags are in bio.h despite having the REQ_ name - as
    blkdev.h includes bio.h that is the only way to go for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

26 Jul, 2010

8 commits

  • This allows md/raid5 to fully work as a dm target.

    Normally md uses a 'filemap' which contains a list of pages of bits
    each of which may be written separately.
    dm-log uses an all-or-nothing approach to writing the log, so
    when using a dm-log, ->filemap is NULL and the flags normally stored
    in filemap_attr are stored in ->logattrs instead.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • 1/ use md_unplug in bitmap.c as we will soon be using bitmaps under
    arrays with no queue attached.

    2/ Don't bother plugging the queue when we set a bit in the bitmap.
    The reason for this was to encourage as many bits as possible to
    get set before we unplug and write stuff out.
    However every personality already plugs the queue after
    bitmap_startwrite, either directly (raid1/raid10) or by setting
    STRIPE_BIT_DELAY which causes the queue to be plugged later
    (raid5).

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Fixed some whitespace problems.
    Fixed some checkpatch.pl complaints.
    Replaced kmalloc ... memset(0) with kzalloc.
    Fixed an unlikely memory leak on an error path.
    Reformatted a number of 'if/else' sets, sometimes
    replacing a goto with an else clause.
    Removed some old comments and commented-out code.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • If an array doesn't have a 'queue' then md_do_sync cannot
    unplug it.
    In that case it will have a 'plugger', so make that available
    to the mddev, and use it to unplug the array if needed.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • md/raid5 uses the plugging infrastructure provided by the block layer
    and 'struct request_queue'. However when we plug raid5 under dm there
    is no request queue so we cannot use that.

    So create a similar infrastructure that is much lighter weight and use
    it for raid5.

    Signed-off-by: NeilBrown

    NeilBrown
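
    A rough userspace sketch of the lighter-weight plugging idea from this
    entry: a 'plugged' flag plus a single unplug callback. The kernel
    version also uses a timer and a work item; the names here are
    illustrative, not the md interface.

        #include <stdbool.h>
        #include <stdio.h>

        /* Stand-in for request_queue plugging: batch work while plugged,
         * dispatch it through one callback when someone unplugs. */
        struct toy_plugger {
            bool plugged;
            void (*unplug_fn)(struct toy_plugger *);
        };

        static void plug(struct toy_plugger *p)
        {
            p->plugged = true;          /* hold back work for batching */
        }

        static void unplug(struct toy_plugger *p)
        {
            if (p->plugged) {
                p->plugged = false;
                p->unplug_fn(p);        /* dispatch whatever was batched */
            }
        }

        static void raid5_style_unplug(struct toy_plugger *p)
        {
            (void)p;
            printf("flush delayed stripes to the member devices\n");
        }

        int main(void)
        {
            struct toy_plugger pl = {
                .plugged = false,
                .unplug_fn = raid5_style_unplug,
            };

            plug(&pl);      /* requests arrive and are batched up ...    */
            unplug(&pl);    /* ... until resync/recovery (or a timer)
                             * unplugs the array                         */
            return 0;
        }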
     
  • dm uses scheduled work to raise events to user-space.
    So allow an md device to have work_structs and schedule them on an error.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • export entry points for starting and stopping md arrays.
    This will be used by a module to make md/raid5 work under
    dm.
    Also stop calling md_stop_writes from md_stop, as that won't
    work well with dm - it will want to call the two separately.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • This functionality will be needed separately in a subsequent patch, so
    split it into its own exported function.

    Signed-off-by: NeilBrown

    NeilBrown
     

24 Jun, 2010

1 commit

  • Most array level changes leave the list of devices largely unchanged,
    possibly causing one at the end to become redundant.
    However conversions between RAID0 and RAID10 need to renumber
    all devices (except 0).

    This renumbering is currently being done in the ->run method when the
    new personality takes over. However this is too late as the common
    code in md.c might already have invalidated some of the devices if
    they had a ->raid_disk number that appeared too high.

    Moving it into the ->takeover method is too early as the array is
    still active at that time and wrong ->raid_disk numbers could cause
    confusion.

    So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate
    the new raid_disk number.
    Now the common code knows exactly which devices need to be renumbered,
    and which can be invalidated, and can do it all at a convenient time
    when the array is suspended.
    It can also update some symlinks in sysfs which previously were not
    being updated correctly.

    Reported-by: Maciej Trela
    Signed-off-by: NeilBrown

    NeilBrown
     

18 May, 2010

5 commits

  • When updating the event count for a simple clean/dirty transition,
    we try to avoid updating the spares so they can safely spin down.
    As the event_counts across an array must all be within 1 of each
    other, this means decrementing the event_count on a dirty->clean
    transition.
    This is not always safe, and we have to avoid the unsafe times.
    We currently do this with a misguided idea about it being safe or
    not depending on whether the event_count is odd or even. This
    approach only works reliably in a few common instances, but easily
    falls down.

    So instead, simply keep internal state concerning whether it is safe
    or not, and always assume it is not safe when an array is first
    assembled.

    Signed-off-by: NeilBrown

    NeilBrown
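
    A rough illustration of the bookkeeping this entry describes, with an
    illustrative flag name and deliberately simplified conditions: only
    decrement on a clean transition if the last change was an increment we
    made ourselves, and start out assuming it is not safe.

        #include <stdbool.h>
        #include <stdio.h>

        struct toy_array {
            unsigned long long events;
            bool can_decrease_events;       /* false right after assembly */
        };

        static void mark_dirty(struct toy_array *a)
        {
            a->events++;                    /* spares deliberately not updated */
            a->can_decrease_events = true;  /* this increment can be undone */
        }

        static void mark_clean(struct toy_array *a)
        {
            if (a->can_decrease_events) {
                a->events--;                /* undo our own increment so the
                                             * un-updated spares stay in sync */
                a->can_decrease_events = false;
            } else {
                a->events++;                /* not safe: move forward instead */
            }
        }

        int main(void)
        {
            struct toy_array a = { .events = 100, .can_decrease_events = false };

            mark_clean(&a);     /* freshly assembled: not safe, 100 -> 101 */
            mark_dirty(&a);     /* 101 -> 102, spares left alone           */
            mark_clean(&a);     /* safe this time, back to 101             */
            printf("events = %llu\n", a.events);
            return 0;
        }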
     
  • We used to pass the personality make_request function directly
    to the block layer, so the first argument had to be a queue.
    But now we have the intermediary md_make_request, so it makes
    a lot more sense to pass a struct mddev_s.
    It also makes it possible to have an mddev without its own queue.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • We set ->changed to 1 and call check_disk_change at the end
    of md_open so that bd_invalidated would be set and thus
    partition rescan would happen appropriately.

    Now that we call revalidate_disk directly, which sets bd_invalidated,
    that indirection is no longer needed and can be removed.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • This was needed when sysfs files could only be 'notified'
    from process context. Now that we have sysfs_notify_dirent,
    we can call it directly from an interrupt.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • These fields have never been used.
    commit 4b6d287f627b5fb6a49f78f9e81649ff98c62bb7
    added them, but also added identical fields to bitmap_super_s,
    and only used the latter.

    So remove these unused fields.

    Signed-off-by: NeilBrown

    NeilBrown
     

17 May, 2010

1 commit

  • Some levels expect the 'redundancy group' to be present,
    others don't.
    So when we change the level of an array we might need to
    add or remove this group.

    This requires fixing up the current practice of overloading ->private
    to indicate (when ->pers == NULL) that something needs to be removed.
    So create a new ->to_remove to fill that role.

    When changing levels, we may need to add or remove attributes. When
    changing RAID5 -> RAID6, we both add and remove the same thing. It is
    important to catch this and optimise it out as the removal is delayed
    until a lock is released, so trying to add immediately would cause
    problems.

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     

14 Dec, 2009

9 commits

  • We've noticed severe lasting performance degradation of our raid
    arrays when we have drives that yield large amounts of media errors.
    The raid10 module will queue each failed read for retry, and also
    will attempt to call fix_read_error() to perform the read recovery.
    Read recovery is performed while the array is frozen, so repeated
    recovery attempts can degrade the performance of the array for
    extended periods of time.

    With this patch I propose adding a per md device max number of
    corrected read attempts. Each rdev will maintain a count of
    read correction attempts in the rdev->read_errors field (not
    used currently for raid10). When we enter fix_read_error()
    we'll check to see when the last read error occurred, and
    divide the read error count by 2 for every hour since the
    last read error. If at that point our read error count
    exceeds the read error threshold, we'll fail the raid device.

    In addition, this patch adds sysfs nodes (get/set) for the per-md
    max_read_errors attribute and the rdev->read_errors attribute, and
    adds some printk's to indicate when fix_read_error fails to repair
    an rdev.

    For testing I used debugfs->fail_make_request to inject
    IO errors to the rdev while doing IO to the raid array.

    Signed-off-by: Robert Becker
    Signed-off-by: NeilBrown

    Robert Becker
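
    The hourly halving described above is easy to model; a small userspace
    sketch with hypothetical names (not the raid10 code):

        #include <stdio.h>
        #include <time.h>

        /* Halve the stored count once per full hour since the last error,
         * then record the new error and compare against the maximum. */
        static int record_read_error(unsigned int *read_errors,
                                     time_t *last_error,
                                     unsigned int max_read_errors)
        {
            time_t now = time(NULL);
            unsigned long hours = (unsigned long)(now - *last_error) / 3600;

            if (hours >= sizeof(*read_errors) * 8)
                *read_errors = 0;
            else
                *read_errors >>= hours;

            *last_error = now;
            (*read_errors)++;

            return *read_errors > max_read_errors;  /* non-zero: fail the rdev */
        }

        int main(void)
        {
            unsigned int errors = 40;
            time_t last = time(NULL) - 3 * 3600;    /* last error 3 hours ago */

            if (record_read_error(&errors, &last, 20))
                printf("threshold exceeded, the device would be failed\n");
            else
                printf("count decayed to %u, the device is kept\n", errors);
            return 0;
        }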
     
  • In this case, the metadata needs to not be in the same
    sector as the bitmap.
    md will not read/write any bitmap metadata. Config must be
    done via sysfs and when a recovery makes the array non-degraded
    again, writing 'true' to 'bitmap/can_clear' will allow bits in
    the bitmap to be cleared again.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • A new attribute directory 'bitmap' in 'md' is created which
    contains files for configuring the bitmap.
    'location' identifies where the bitmap is, either 'none',
    or 'file' or 'sector offset from metadata'.
    Writing 'location' can create or remove a bitmap.
    Adding a 'file' bitmap this way is not yet supported.
    'chunksize' and 'time_base' must be set before 'location'
    can be set.

    'chunksize' can be set before creating a bitmap, but is
    currently always overridden by the bitmap superblock.

    'time_base' and 'backlog' can be updated at any time.

    Signed-off-by: NeilBrown
    Reviewed-by: Andre Noll

    NeilBrown
     
  • safe_delay_store can parse fixed point numbers (for fractions
    of a second). We will want to do that for another sysfs
    file soon, so factor out the code.

    Signed-off-by: NeilBrown

    NeilBrown
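
    For illustration, a standalone parser in the same spirit (a sketch of
    the idea, not the helper that was actually factored out of
    safe_delay_store):

        #include <ctype.h>
        #include <stdio.h>

        /* Parse a decimal "seconds.fraction" string into milliseconds,
         * keeping at most three fractional digits. */
        static int parse_fixed_point_msec(const char *buf, unsigned long *msec)
        {
            unsigned long whole = 0, frac = 0, scale = 1000;
            int seen_dot = 0;

            for (; *buf && *buf != '\n'; buf++) {
                if (*buf == '.' && !seen_dot) {
                    seen_dot = 1;
                    continue;
                }
                if (!isdigit((unsigned char)*buf))
                    return -1;
                if (!seen_dot) {
                    whole = whole * 10 + (unsigned long)(*buf - '0');
                } else if (scale > 1) {
                    scale /= 10;
                    frac += (unsigned long)(*buf - '0') * scale;
                }
            }
            *msec = whole * 1000 + frac;
            return 0;
        }

        int main(void)
        {
            unsigned long msec;

            if (parse_fixed_point_msec("5.73\n", &msec) == 0)
                printf("%lu ms\n", msec);       /* prints "5730 ms" */
            return 0;
        }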
     
  • For md arrays where metadata is managed externally, the kernel does not
    know about a superblock, so the superblock offset is 0.
    If we want to have a write-intent-bitmap near the end of the
    devices of such an array, we should support a sector_t sized offset.
    The offset may need to be negative, for when the bitmap is placed
    before the metadata, so use loff_t instead.

    Also add a sanity check that the bitmap does not overlap with the data.

    Signed-off-by: NeilBrown

    NeilBrown
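
    A sketch of the kind of sanity check meant here, as a plain
    interval-overlap test with made-up names and units (the kernel works
    on the rdev/bitmap fields, not these parameters):

        #include <stdio.h>

        /* With externally managed metadata the superblock offset is 0 and
         * the signed bitmap offset is relative to it; reject layouts where
         * the bitmap region lands inside the data region.  All values are
         * in sectors and purely illustrative. */
        static int bitmap_overlaps_data(long long bitmap_offset,
                                        unsigned long bitmap_sectors,
                                        long long data_offset,
                                        unsigned long long data_sectors)
        {
            long long bmap_start = bitmap_offset;
            long long bmap_end   = bitmap_offset + (long long)bitmap_sectors;
            long long data_start = data_offset;
            long long data_end   = data_offset + (long long)data_sectors;

            return bmap_start < data_end && data_start < bmap_end;
        }

        int main(void)
        {
            /* bitmap placed just before the offset-0 metadata: no overlap */
            printf("%d\n", bitmap_overlaps_data(-64, 64, 0, 1 << 20));
            /* bitmap placed on top of the data region: overlap */
            printf("%d\n", bitmap_overlaps_data(128, 64, 0, 1 << 20));
            return 0;
        }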
     
  • ... and into bitmap_info. These are all configuration parameters
    that need to be set before the bitmap is created.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • In preparation for making bitmap fields configurable via sysfs,
    start tidying up by making a single structure to contain the
    configuration fields.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Previously barriers were only supported on RAID1. This is because
    other levels require synchronisation across all devices and so need
    a different approach.
    Here is that approach.

    When a barrier arrives, we send a zero-length barrier to every active
    device. When that completes - and if the original request was not
    empty - we submit the barrier request itself (with the barrier flag
    cleared) and then submit a fresh load of zero length barriers.

    The barrier request itself is asynchronous, but any subsequent
    request will block until the barrier completes.

    The reason for clearing the barrier flag is that a barrier request is
    allowed to fail. If we pass a non-empty barrier through a striping
    raid level it is conceivable that part of it could succeed and part
    could fail. That would be way too hard to deal with.
    So if the first run of zero length barriers succeed, we assume all is
    sufficiently well that we send the request and ignore errors in the
    second run of barriers.

    RAID5 needs extra care as write requests may not have been submitted
    to the underlying devices yet. So we flush the stripe cache before
    proceeding with the barrier.

    Note that the second set of zero-length barriers is submitted
    immediately after the original request is submitted. Thus when
    a personality finds mddev->barrier to be set during make_request,
    it should not return from make_request until the corresponding
    per-device request(s) have been queued.

    That will be done in later patches.

    Signed-off-by: NeilBrown
    Reviewed-by: Andre Noll

    NeilBrown
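
    A toy model of the ordering described above (print statements stand in
    for I/O; this is not the md code, and in the real striped case only the
    affected member devices see the data portion):

        #include <stdbool.h>
        #include <stdio.h>

        struct toy_dev { const char *name; };

        static void submit(struct toy_dev *d, const char *what)
        {
            printf("%s: %s\n", d->name, what);
        }

        /* Sequence for one incoming barrier request. */
        static void handle_barrier(struct toy_dev *devs, int ndevs,
                                   bool request_empty)
        {
            int i;

            /* 1. zero-length barrier to every active device */
            for (i = 0; i < ndevs; i++)
                submit(&devs[i], "zero-length barrier (pre)");

            /* 2. if the original request carried data, submit it with the
             *    barrier flag cleared; a partly failed striped barrier
             *    write would otherwise be impossible to handle */
            if (!request_empty)
                for (i = 0; i < ndevs; i++)
                    submit(&devs[i], "data write, barrier flag cleared");

            /* 3. a second run of zero-length barriers; errors in this run
             *    are ignored */
            for (i = 0; i < ndevs; i++)
                submit(&devs[i], "zero-length barrier (post)");
        }

        int main(void)
        {
            struct toy_dev devs[] = { { "sda" }, { "sdb" }, { "sdc" } };

            handle_barrier(devs, 3, false);
            return 0;
        }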
     
  • A write intent bitmap can be removed from an array while the
    array is active.
    When this happens, all IO is suspended and flushed before the
    bitmap is removed.
    However it is possible that bitmap_daemon_work is still running to
    clear old bits from the bitmap. If it is, it can dereference the
    bitmap after it has been freed.

    So introduce a new mutex to protect bitmap_daemon_work and get it
    before destroying a bitmap.

    This is suitable for any current -stable kernel.

    Signed-off-by: NeilBrown
    Cc: stable@kernel.org

    NeilBrown
     


10 Aug, 2009

1 commit

  • A recent commit:
    commit 449aad3e25358812c43afc60918c5ad3819488e7

    introduced the possibility of an A-B/B-A deadlock between
    bd_mutex and reconfig_mutex.

    __blkdev_get holds bd_mutex while calling md_open, which takes
    reconfig_mutex.
    do_md_run is always called with reconfig_mutex held, and it now
    takes bd_mutex in the call to revalidate_disk.

    This potential deadlock was not caught by lockdep due to the
    use of mutex_lock_interruptible_nested, which was introduced
    by
    commit d63a5a74dee87883fda6b7d170244acaac5b05e8
    to avoid a warning about an impossible deadlock.

    It is quite possible to split reconfig_mutex into two locks.
    One protects the array data structures while it is being
    reconfigured, the other ensures that an array is never even partially
    open while it is being deactivated.
    In particular, the second lock prevents an open from completing
    between the time when do_md_stop checks if there are any active opens,
    and the time when the array is either set read-only, or when ->pers is
    set to NULL. So we can be certain that no IO is in flight as the
    array is being destroyed.

    So create a new lock, open_mutex, just to ensure exclusion between
    'open' and 'stop'.

    This avoids the deadlock and also avoids the lockdep warning mentioned
    in commit d63a5a74d

    Reported-by: "Mike Snitzer"
    Reported-by: "H. Peter Anvin"
    Signed-off-by: NeilBrown

    NeilBrown
     

03 Aug, 2009

1 commit

  • This patch replaces md_integrity_check() by two new public functions:
    md_integrity_register() and md_integrity_add_rdev() which are both
    personality-independent.

    md_integrity_register() is called from the ->run and ->hot_remove
    methods of all personalities that support data integrity. The
    function iterates over the component devices of the array and
    determines if all active devices are integrity capable and if their
    profiles match. If this is the case, the common profile is registered
    for the mddev via blk_integrity_register().

    The second new function, md_integrity_add_rdev(), is called from the
    ->hot_add_disk methods, i.e. whenever a new device is being added
    to a raid array. If the new device does not support data integrity,
    or has a profile different from the one already registered, data
    integrity for the mddev is disabled.

    For raid0 and linear, only the call to md_integrity_register() from
    the ->run method is necessary.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     

18 Jun, 2009

6 commits

  • If the superblock of a component device indicates the presence of a
    bitmap but the corresponding raid personality does not support bitmaps
    (raid0, linear, multipath, faulty), then something is seriously wrong
    and we'd better refuse to run such an array.

    Currently, this check is performed while the superblocks are examined,
    i.e. before entering personality code. Therefore the generic md layer
    must know which raid levels support bitmaps and which do not.

    This patch avoids this layer violation without adding identical code
    to various personalities. This is accomplished by introducing a new
    public function to md.c, md_check_no_bitmap(), which replaces the
    hard-coded checks in the superblock loading functions.

    A call to md_check_no_bitmap() is added to the ->run method of each
    personality which does not support bitmaps and assembly is aborted
    if at least one component device contains a bitmap.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • It is easiest to round sizes to multiples of chunk size in
    the personality code for those personalities which care.
    Those personalities now do the rounding, so we can
    remove that function from common code.

    Also remove the upper bound on the size of a chunk, and the lower
    bound on the size of a device (1 chunk), neither of which really buy
    us anything.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • The difference between these two methods is artificial.
    Both check that a pending reshape is valid, and perform any
    aspect of it that can be done immediately.
    'reconfig' handles chunk size and layout.
    'check_reshape' handles raid_disks.

    So make them just one method.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Passing the new layout and chunksize as args is not necessary as
    the mddev has fields for new_chunk and new_layout.

    This is preparation for combining the check_reshape and reconfig
    methods.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • A straightforward conversion which gets rid of some
    multiplications/divisions/shifts. The patch also introduces a couple
    of new ones, most of which are due to conf->chunk_size still being
    represented in bytes. This will be cleaned up in subsequent patches.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • This patch renames the chunk_size field to chunk_sectors with the
    implied change of semantics. Since

    is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9)
    = is_power_of_2(chunk_sectors)

    these bits don't need an adjustment for the shift.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
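
    The stated identity can be checked directly; a quick userspace C
    illustration (shifting by 9 multiplies by 512, itself a power of two,
    so the power-of-two property is preserved in both directions, barring
    overflow):

        #include <stdbool.h>
        #include <stdio.h>

        static bool is_power_of_2(unsigned long n)
        {
            return n != 0 && (n & (n - 1)) == 0;
        }

        int main(void)
        {
            unsigned long chunk_size    = 64 * 1024;       /* bytes   */
            unsigned long chunk_sectors = chunk_size >> 9; /* sectors */

            printf("%d %d\n", is_power_of_2(chunk_size),
                              is_power_of_2(chunk_sectors));
            return 0;
        }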
     

14 Apr, 2009

1 commit

  • - update inclusion guard and make sure it covers the whole file
    - remove superfluous #ifdef CONFIG_BLOCK
    - make sure all required headers are included so that new users aren't
    required to include other headers first

    Signed-off-by: Christoph Hellwig
    Signed-off-by: NeilBrown

    Christoph Hellwig