07 Jan, 2006

40 commits

  • Signed-off-by: Neil Brown
    Acked-by: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Thus the role that a device has in an array can be viewed and set.

    Signed-off-by: Neil Brown
    Acked-by: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Move the checks - that dev size is never less than array size - into
    bind_rdev_to_array to make sure they always happen properly (there is one
    place where currently they don't).

    Also reject any superblock which claims an array size larger than the device
    in question can hold.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • If array is active, try to reshape, else just set the value.

    Signed-off-by: Neil Brown
    Acked-by: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Store this total in superblock (as appropriate), and make it available to
    userspace via sysfs.

    Signed-off-by: Neil Brown
    Acked-by: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Signed-off-by: Neil Brown
    Acked-by: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Allow it to be set to a particular version, or 'none'.

    Signed-off-by: Neil Brown
    Acked-by: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Signed-off-by: Neil Brown
    Acked-by: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • ... only before array is started of course.

    Signed-off-by: Neil Brown
    Acked-by: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • When we do a user-requested check/repair, we lose count of the outstanding
    requests...

    Also make sure that when anything is written to md/sync_action, the
    RECOVERY_NEEDED flag is set and the thread is woken up so any changes take
    effect.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
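
From userspace, the write to md/sync_action mentioned above might look like the following minimal sketch. The helper name is hypothetical, and the directory passed in (for example /sys/block/md0/md on a typical system) is an assumption about where the array's sysfs files live.

```python
import os


def request_sync_action(md_sysfs_dir, action):
    """Write an action such as 'check' or 'repair' to <md_sysfs_dir>/sync_action.

    md_sysfs_dir is assumed to be the array's md directory in sysfs;
    the device name is not taken from the commit message.
    """
    path = os.path.join(md_sysfs_dir, "sync_action")
    with open(path, "w") as f:
        # Equivalent to: echo check > sync_action
        f.write(action + "\n")
    return path
```

On a real array this needs root, and per the commit message any such write should wake the md thread so the change takes effect.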
     
  • When we update a page_cache page in the kernel, we need to flush_dcache_page or
    userspace might not see the change.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Make the needlessly global function md_new_event() static.

    Signed-off-by: Adrian Bunk
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • .. because they aren't used outside md.c

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Commands written to sysfs files may, or may not, be \n terminated. We want to
    accept either case. For this we use cmd_match.

    Signed-off-by: Neil Brown
    Acked-by: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
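
The matching rule can be sketched in a few lines of Python. This is a userspace model of the behaviour described, not the kernel function; the name is borrowed from the commit message and the signature is an assumption.

```python
def cmd_match(cmd: str, expected: str) -> bool:
    """Return True if cmd equals expected, allowing one trailing newline.

    Models a command written to a sysfs file, which may or may not
    be "\n" terminated (as with `echo` versus `echo -n`).
    """
    if cmd.endswith("\n"):
        cmd = cmd[:-1]          # strip at most one terminating newline
    return cmd == expected
```

Both `cmd_match("check\n", "check")` and `cmd_match("check", "check")` succeed, while a prefix or a longer word does not.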
     
  • md sometimes calls put_page on NULL pointers (treating it like kfree). This is
    not safe, so define and use a 'safe_put_page' which checks for NULL.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The kernel should not be imposing these policy limits: The time between
    bitmap updates should certainly be allowed to be more than 15 seconds, and
    if someone wants a bitmap chunk size in excess of 4MB, the kernel isn't the
    place to stop them.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The code to overwrite/reread for addressing read errors in raid1/raid10
    currently assumes that the read will not alter the buffer which could be used
    to write to the next device. This is not a safe assumption to make.

    So we split the loops into an overwrite loop and a separate re-read loop, so
    that the writing is complete before reading is attempted.

    Cc: Paul Clements
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
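
To illustrate why the two loops are kept apart, here is a small Python model. FakeDevice and overwrite_then_reread are hypothetical names, and the shared bytearray stands in for the buffer that a read may alter.

```python
class FakeDevice:
    """Hypothetical in-memory stand-in for a member device."""

    def __init__(self, name, garbage_on_read=False):
        self.name = name
        self.blocks = {}
        self.garbage_on_read = garbage_on_read

    def write(self, sector, buf):
        self.blocks[sector] = bytes(buf)
        return True

    def read_into(self, sector, buf):
        # A read fills the caller's buffer, possibly with bad data.
        data = b"\xff" * len(buf) if self.garbage_on_read else self.blocks[sector]
        buf[:] = data
        return not self.garbage_on_read


def overwrite_then_reread(devices, sector, buf):
    """Write buf to every device, then re-read each; return names that failed."""
    failed = []
    # Loop 1: every write completes while buf still holds the good data.
    for dev in devices:
        if not dev.write(sector, buf):
            failed.append(dev.name)
    # Loop 2: re-reads may clobber buf, which is harmless now.
    for dev in devices:
        if dev.name not in failed and not dev.read_into(sector, buf):
            failed.append(dev.name)
    return failed
```

With a single interleaved loop, a bad re-read from the first device would have been written to the second one; here the second device always receives the good data.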
     
  • md supports multiple different RAID levels, each being implemented by a
    'personality' (which is often in a separate module).

    These personalities have fairly artificial 'numbers'. The numbers
    are used to:
    1- provide an index into an array where the various personalities
    are recorded
    2- identify the module (via an alias) which implements a particular
    personality.

    Neither of these uses really justifies the existence of personality numbers.
    The array can be replaced by a linked list which is searched (array lookup
    only happens very rarely). Module identification can be done using an alias
    based on level rather than 'personality' number.

    The current 'raid5' module supports two levels (4 and 5) but only one
    personality. This slight awkwardness (which was handled in the mapping from
    level to personality) can be better handled by allowing raid5 to register 2
    personalities.

    With this change in place, the core md module does not need to have an
    exhaustive list of all possible personalities, so other personalities can be
    added independently.

    This patch also moves the check for chunksize being non-zero into the ->run
    routines for the personalities that need it, rather than having it in core-md.
    This has a side effect of allowing 'faulty' and 'linear' not to have a
    chunk-size set.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
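
The registration scheme can be modelled roughly as below. This is a Python sketch rather than kernel code: the class and function names are hypothetical, a plain list stands in for the linked list, and the alias format string is an assumption (the commit message only says the alias is based on level).

```python
from dataclasses import dataclass


@dataclass
class Personality:
    name: str
    level: int


_personalities = []   # stands in for the kernel's linked list


def register_personality(pers):
    _personalities.append(pers)


def find_personality(level):
    """Linear search by level; lookups are rare, so this is cheap enough."""
    for pers in _personalities:
        if pers.level == level:
            return pers
    return None


def module_alias(level):
    # Hypothetical alias format, used to request a module when no
    # registered personality handles the level.
    return f"md-level-{level}"
```

A module covering two levels simply registers twice, for example once with level 4 and once with level 5, and the core needs no table of known personalities.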
     
  • Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • ...because that seems to be the preferred practice these days.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • - replace open-coded hash chain with hlist macros

    - Fix hash-table size at one page - it is already quite generous, so there
    will never be a need to use multiple pages, so no need for __get_free_pages

    No functional change.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Replace multiple kmalloc/memset pairs with kzalloc calls.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Substitute:

    page_cache_get -> get_page
    page_cache_release -> put_page
    PAGE_CACHE_SHIFT -> PAGE_SHIFT
    PAGE_CACHE_SIZE -> PAGE_SIZE
    PAGE_CACHE_MASK -> PAGE_MASK
    __free_page -> put_page

    because we aren't using the page cache, we are just using pages.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • With this patch it is possible to poll /proc/mdstat to detect arrays appearing
    or disappearing, to detect failures, recovery starting, recovery completing,
    and devices being added and removed.

    It is similar to the poll-ability of /proc/mounts, though different in that:

    We always report that the file is readable (because face it, it is, even if
    only for EOF).

    We report POLLPRI when there is a change so that select() can detect
    it as an exceptional event. Not only are these exceptional events, but
    that is the mechanism that the current 'mdadm' uses to watch for events
    (It also polls after a timeout).
    (We also report POLLERR like /proc/mounts).

    Finally, we only reset the per-file event counter when the start of the file
    is read, rather than when poll() returns an event. This is more robust as it
    means that an fd will continue to report activity to poll/select until the
    program clearly responds to that activity.

    md_new_event takes an 'mddev' which isn't currently used, but it will be soon.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
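
The polling semantics described above can be captured in a small Python model. It is only a sketch of the behaviour as stated in the commit message: EventSource and MdstatFile are hypothetical names, and the flag values are copied from Linux <poll.h>.

```python
# Flag values match Linux <poll.h>; defined locally to keep the model portable.
POLLIN, POLLPRI, POLLERR = 0x001, 0x002, 0x008


class EventSource:
    """Hypothetical stand-in for the global md event counter."""

    def __init__(self):
        self.count = 0

    def new_event(self):
        # Called for array/device changes, failures, recovery start/finish.
        self.count += 1


class MdstatFile:
    """Hypothetical model of one open /proc/mdstat file descriptor."""

    def __init__(self, source):
        self.source = source
        self.seen = source.count

    def poll(self):
        mask = POLLIN                      # always readable, even if only for EOF
        if self.seen != self.source.count:
            mask |= POLLPRI | POLLERR      # a change shows up as an exceptional event
        return mask

    def read_from_start(self):
        # Only reading from offset 0 acknowledges the events seen so far.
        self.seen = self.source.count
```

Because poll() itself never resets the counter, a watcher that is interrupted between poll() and the re-read will still see the event reported on its next poll().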
     
  • Add in correct read-error handling for resync and read-only situations.

    When read-only, we don't over-write, so we need to mark the failed drive in
    the r10_bio so we don't re-try it. During resync, we always read all blocks,
    so if there is a read error, we simply over-write it with the good block that
    we found (assuming we found one).

    Note that the recovery case still isn't handled in an interesting way. There
    is nothing useful to do for the 2-copies case. If there are 3 or more copies,
    then we could try reading from one of the non-missing copies, but this is a
    bit complicated and very rarely would be used, so I'm leaving it for now.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Largely just a cross-port from raid1.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • We are inadvertently setting the R1BIO_Uptodate bit on read errors when we
    decide not to try correcting (because there are no other working devices).
    This means that the read error is reported to the client as success.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • When performing a user-requested 'check' or 'repair', we read all readable
    devices, and compare the contents. We only write to blocks which had read
    errors, or blocks with content that differs from the first good device found.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
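
The write-selection rule can be sketched as a pure function. This is a Python model under the assumption that each device's read result is either the block contents or None for a read error; the function name is hypothetical.

```python
def blocks_to_rewrite(reads):
    """Return indices of devices whose block should be written.

    reads holds one entry per device: bytes on success, None on read error.
    The first successfully read block is taken as the reference copy.
    """
    good = next((data for data in reads if data is not None), None)
    if good is None:
        return []                 # nothing readable, so nothing to copy from
    return [i for i, data in enumerate(reads)
            if data is None or data != good]
```

For reads of `[None, b"A", b"B", b"A"]` this selects devices 0 (read error) and 2 (differs from the first good copy) and leaves the rest untouched.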
     
  • Also keep count of the number of errors found.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • There is this "FIXME" comment with a typo in it!! that has been annoying me for
    days, so I just had to remove it.

    conf->disks[i].rdev should only be accessed if
    - we know we hold a reference or
    - the mddev->reconfig_sem is down or
    - we have a rcu_readlock

    handle_stripe was referencing rdev in three places without any of these. For
    the first two, get an rcu_readlock. For the last, the same access
    (md_sync_acct call) is made a little later after the rdev has been claimed
    under an rcu_readlock, if R5_Syncio is set. So just use that access...
    However R5_Syncio isn't really needed as the 'syncing' variable contains the
    same information. So use that instead.

    Issues, comment, and fix are identical in raid5 and raid6.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Handling of read errors during resync is separate from handling of read errors
    during normal IO in raid1. A previous patch added support for read errors
    during normal IO. This one adds support for read errors during resync or
    recovery.

    The key differences are that we don't need to freeze the array, because the
    normal handling of resync means that this part of the array will be idle
    except for resync, and the read/overwrite/re-read is needed in a separate
    piece of code.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • We are dereferencing ->rdev without an rcu lock!

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • On a read-error we suspend the array, then synchronously read the block from
    other devices until we find one where we can read it. Then we try writing the
    good data back everywhere and make sure it works. If any write or subsequent
    read fails, only then do we fail the device out of the array.

    To be able to suspend the array, we need to also keep track of how many
    requests are queued for handling by raid1d.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
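
The correction policy can be modelled as follows. This is a Python sketch with hypothetical names (Mirror, correct_read_error); it assumes a successful write clears the bad sector, and it leaves out array suspension and request accounting entirely.

```python
class Mirror:
    """Hypothetical in-memory mirror leg; bad sectors fail reads until rewritten."""

    def __init__(self, name, blocks, bad=(), write_ok=True):
        self.name = name
        self.blocks = dict(blocks)
        self.bad = set(bad)
        self.write_ok = write_ok

    def read(self, sector):
        return None if sector in self.bad else self.blocks.get(sector)

    def write(self, sector, data):
        if self.write_ok:
            self.blocks[sector] = data
            self.bad.discard(sector)   # assumed: a good write clears the bad sector
        return self.write_ok


def correct_read_error(mirrors, sector):
    """Return (good_data, names_to_fail); good_data is None if nothing is readable."""
    good = next((d for d in (m.read(sector) for m in mirrors) if d is not None), None)
    if good is None:
        return None, []
    # Write the good data back everywhere, then verify by reading.
    bad_write = [m.name for m in mirrors if not m.write(sector, good)]
    bad_read = [m.name for m in mirrors
                if m.name not in bad_write and m.read(sector) != good]
    return good, bad_write + bad_read
```

A device that merely had an unreadable sector stays in the array once the rewrite succeeds; only a failed write or failed re-read gets it kicked.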
     
  • This is a simple port of matching functionality across from raid5. If we get a
    read error, we don't kick the drive straight away, but try to over-write with
    good data first.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • raid6 currently does not check the P/Q syndromes when doing a resync; it just
    calculates the correct value and writes it. Doing the check can reduce writes
    (often to 0) for a resync, and it is needed to properly implement the

    echo check > sync_action

    operation.

    This patch implements the appropriate checks and tidies up some related code.

    It also allows raid6 user-requested resync to bypass the intent bitmap.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
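
As a rough illustration of what checking the syndromes involves, here is a byte-at-a-time Python sketch. It uses the usual RAID-6 field, GF(2^8) with polynomial 0x11d and generator 2; the function names are hypothetical and the real code is heavily optimised and structured differently.

```python
def gf_mul2(x):
    """Multiply by 2 in GF(2^8) with the polynomial 0x11d."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
    return x


def pq_syndromes(data_blocks):
    """Compute P (XOR parity) and Q (sum of 2^i * D_i) using Horner's rule."""
    size = len(data_blocks[0])
    p = bytearray(size)
    q = bytearray(size)
    for block in reversed(data_blocks):    # highest data disk index first
        for i, byte in enumerate(block):
            p[i] ^= byte
            q[i] = gf_mul2(q[i]) ^ byte
    return bytes(p), bytes(q)


def stripe_needs_write(data_blocks, stored_p, stored_q):
    """Return (p_wrong, q_wrong) by comparing stored syndromes with computed ones."""
    p, q = pq_syndromes(data_blocks)
    return p != stored_p, q != stored_q
```

When both flags come back False the stripe is already consistent and no write is issued, which is how a resync over a clean array can drop to zero writes.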
     
  • Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • This is important because bitmap_create uses
    mddev->resync_max_sectors
    and that doesn't have a valid value until after the array
    has been initialised (with pers->run()).
    [It doesn't make a difference for current personalities that
    support bitmaps, but will make a difference for raid10]

    This has the added advantage of meaning we can move the thread->timeout
    manipulation inside the bitmap.c code instead of sprinkling identical code
    throughout all personalities.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown