02 Mar, 2007

1 commit

  • There are two errors that can lead to recovery problems with raid10
    when used in 'far' mode (not the default).

    Due to a '>' instead of a '>=', the wrong block is located, which would
    result in garbage being written to some random location, quite possibly
    outside the range of the device, causing the newly reconstructed device to fail.

    The device size calculation had some rounding errors (it didn't round when it
    should), and so recovery would go a few blocks too far, which would again
    cause a write to a random block address and probably a device error.

    The code for working with device sizes was fairly confused and spread out, so
    this has been tidied up a bit.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
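As a rough illustration of the rounding the fix requires (this is a sketch with our own names, not the kernel's actual code): the usable size must be rounded down to a whole number of chunks so that recovery never steps past the end of the device.

```c
#include <assert.h>

/* Illustrative sketch: round a device's sector count down to a whole
 * number of chunks, so recovery stops at a chunk boundary instead of
 * running past the end of the device. */
static long round_to_chunks(long sectors, long chunk_sects)
{
    return sectors - sectors % chunk_sects;
}
```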
     

12 Jan, 2007

1 commit

  • md raidX make_request functions strip off the BIO_RW_SYNC flag, thus
    introducing additional latency.

    Fixing this in raid1 and raid10 seems to be straightforward enough.

    For our particular use case in DRBD, passing this flag improved some
    initialization time from ~5 minutes to ~5 seconds.

    Acked-by: NeilBrown
    Signed-off-by: Lars Ellenberg
    Acked-by: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lars Ellenberg
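The shape of the bug can be sketched as follows (RW_SYNC and the helper are illustrative stand-ins, not the kernel's identifiers): building the child request's rw bits from scratch drops the incoming sync hint, and the fix carries the flag through.

```c
#include <assert.h>

#define RW_SYNC (1u << 4)   /* illustrative stand-in for BIO_RW_SYNC */

/* Fixed behaviour: the child request inherits the parent's sync hint
 * instead of having it stripped off. */
static unsigned child_rw(unsigned parent_rw, unsigned base_rw)
{
    return base_rw | (parent_rw & RW_SYNC);
}
```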
     

14 Dec, 2006

1 commit


29 Oct, 2006

1 commit


22 Oct, 2006

1 commit


03 Oct, 2006

5 commits

  • raid1, raid10 and multipath don't report their 'congested' status through
    bdi_*_congested, but should.

    This patch adds the appropriate functions which just check the 'congested'
    status of all active members (with appropriate locking).

    raid1 read_balance should be modified to prefer devices where
    bdi_read_congested returns false. Then we could use the '&' branch rather
    than the '|' branch. However, that would need some benchmarking first
    to make sure it is actually a good idea.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
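The described behaviour can be sketched like this (plain ints stand in for per-member bdi state; names are ours): the array reports congestion if any active member's backing device is congested, i.e. the '|' combination the text mentions.

```c
#include <assert.h>

/* Sketch: OR together the congested status of all active members. */
static int raid_congested(const int *member_congested, int nmembers)
{
    int ret = 0, i;

    for (i = 0; i < nmembers; i++)
        ret |= member_congested[i];
    return ret;
}
```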
     
  • The error handling routines don't use proper locking, and so two concurrent
    errors could trigger a problem.

    So:
    - use test-and-set and test-and-clear to synchronise the In_sync bits
    with the ->degraded count
    - use the spinlock to protect updates to the degraded count (an atomic_t
    could be used instead, but that would be a bigger change to the code and
    isn't really justified)
    - remove unnecessary locking in raid5

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
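A single-threaded model of the race fix (names mirror the description, not the kernel source): only the caller whose test-and-clear actually flips the In_sync bit adjusts the degraded count, so two racing error paths cannot double-count one failure.

```c
#include <assert.h>

static unsigned long rdev_flags = 1UL;  /* bit 0 = In_sync, initially set */
static int degraded;

/* Returns the previous value of the In_sync bit and clears it. */
static int test_and_clear_insync(void)
{
    int was_set = rdev_flags & 1UL;

    rdev_flags &= ~1UL;
    return was_set;
}

static void md_error_path(void)
{
    if (test_and_clear_insync())
        degraded++;             /* only the first caller gets here */
}
```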
     
  • It isn't needed as mddev->degraded contains equivalent info.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Instead of magic numbers (0,1,2,3) in sb_dirty, we have
    some flags instead:
    MD_CHANGE_DEVS: some device state has changed, requiring a superblock
    update on all devices.
    MD_CHANGE_CLEAN: the array has transitioned from 'clean' to 'dirty' or
    back, requiring a superblock update on active devices, but possibly
    not on spares.
    MD_CHANGE_PENDING: a superblock update is underway.

    We wait for an update to complete by waiting for all flags to be clear. A
    flag can be set at any time, even during an update, without risk that the
    change will be lost.

    Stop exporting md_update_sb - it isn't needed.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
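The flag scheme can be sketched as follows (the bit numbers here are illustrative): an update is finished only when every flag is clear, so a flag set mid-update is never lost.

```c
#include <assert.h>

enum { MD_CHANGE_DEVS, MD_CHANGE_CLEAN, MD_CHANGE_PENDING };

static unsigned long sb_flags;

static void set_sb_flag(int bit)   { sb_flags |= 1UL << bit; }
static void clear_sb_flag(int bit) { sb_flags &= ~(1UL << bit); }

/* An update is complete only when all flags are clear. */
static int sb_update_done(void)    { return sb_flags == 0; }
```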
     
  • raid10d has too many nested blocks, so move the fix_read_error functionality
    out into a separate function.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

11 Jul, 2006

1 commit


27 Jun, 2006

4 commits

  • The size calculation made assumptions which the new offset mode didn't
    follow. This gets the size right in all cases.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The "industry standard" DDF format allows for a stripe/offset layout where
    data is duplicated on different stripes. e.g.

    A B C D
    D A B C
    E F G H
    H E F G

    (columns are drives, rows are stripes, LETTERS are chunks of data).

    This is similar to raid10's 'far' mode, but not quite the same. So enhance
    'far' mode with a 'far/offset' option which follows the layout of DDF's
    stripe/offset.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
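A mapping consistent with the stripe/offset diagram above can be sketched like this (the formulas are ours, for illustration, not lifted from the raid10 source): with n drives and 2 copies, copy k of chunk c lands on drive (c + k) % n on stripe row (c / n) * 2 + k.

```c
#include <assert.h>

/* Which drive holds copy 'copy' of chunk 'chunk'? Each copy row is the
 * previous row rotated one drive to the right, as in the diagram. */
static int copy_drive(int chunk, int copy, int ndrives)
{
    return (chunk + copy) % ndrives;
}

/* Which stripe row holds it? Copies of a chunk sit on adjacent rows. */
static int copy_row(int chunk, int copy, int ndrives)
{
    return (chunk / ndrives) * 2 + copy;
}
```

Checking against the diagram (chunks A..H numbered 0..7 on 4 drives): copy 1 of D (chunk 3) is on drive 0, row 1, and copy 1 of H (chunk 7) is on drive 0, row 3.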
     
  • For a while we have had checkpointing of resync. The version-1 superblock
    allows recovery to be checkpointed as well, and this patch implements that.

    Due to early carelessness we need to add a feature flag to signal that the
    recovery_offset field is in use, otherwise older kernels would assume that a
    partially recovered array is in fact fully recovered.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The largest chunk size the code can support without substantial surgery is
    2^30 bytes, so make that the limit instead of an arbitrary 4Meg. Some day,
    the 'chunksize' should change to a sector-shift instead of a byte-count. Then
    no limit would be needed.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
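The new bound can be sketched as a validation check (helper name is ours; this assumes, as md requires, that chunk sizes are powers of two):

```c
#include <assert.h>

#define MAX_CHUNK_SIZE (1L << 30)   /* 2^30 bytes, the new limit */

/* Accept positive power-of-two chunk sizes up to 2^30 bytes. */
static int chunk_size_valid(long bytes)
{
    return bytes > 0 && bytes <= MAX_CHUNK_SIZE &&
           (bytes & (bytes - 1)) == 0;
}
```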
     

02 May, 2006

2 commits


02 Apr, 2006

1 commit


04 Feb, 2006

1 commit

  • - version-1 superblock
    + The default_bitmap_offset is in sectors, not bytes.
    + the 'size' field in the superblock is in sectors, not KB
    - raid0_run should return a negative number on error, not '1'
    - raid10_read_balance should not return a valid 'disk' number if
    ->rdev turned out to be NULL
    - kmem_cache_destroy doesn't like being passed a NULL.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

15 Jan, 2006

1 commit


07 Jan, 2006

14 commits

  • Store this total in the superblock (as appropriate), and make it available
    to userspace via sysfs.

    Signed-off-by: Neil Brown
    Acked-by: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Signed-off-by: Neil Brown
    Acked-by: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • md sometimes calls put_page on NULL pointers (treating it like kfree). This
    is not safe, so define and use a 'safe_put_page' which checks for NULL.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
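A userspace sketch of the wrapper described (struct page here is a stand-in): put_page() dereferences its argument, so the NULL-tolerant wrapper guards the call, behaving like kfree in that respect.

```c
#include <assert.h>
#include <stddef.h>

struct page { int count; };             /* stand-in for the kernel's */

static void put_page(struct page *p) { p->count--; }

/* NULL-safe wrapper: ignore NULL instead of crashing. */
static void safe_put_page(struct page *p)
{
    if (p)
        put_page(p);
}
```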
     
  • The code to overwrite/reread for addressing read errors in raid1/raid10
    currently assumes that the read will not alter the buffer which could be used
    to write to the next device. This is not a safe assumption to make.

    So we split the loops into a overwrite loop and a separate re-read loop, so
    that the writing is complete before reading is attempted.

    Cc: Paul Clements
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
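A toy model of the restructuring (sizes and names are illustrative): all overwrites complete before any re-read, so a read that alters the shared buffer can no longer corrupt a later write.

```c
#include <assert.h>
#include <string.h>

enum { NDEV = 3, BSZ = 4 };

/* Push known-good data to every device, then re-read to verify. */
static void fix_read_error(char dev[NDEV][BSZ], const char *good)
{
    char buf[BSZ];
    int i;

    memcpy(buf, good, BSZ);
    for (i = 0; i < NDEV; i++)          /* pass 1: overwrite all copies */
        memcpy(dev[i], buf, BSZ);
    for (i = 0; i < NDEV; i++)          /* pass 2: re-read; buf may now  */
        memcpy(buf, dev[i], BSZ);       /* change, but writes are done   */
}
```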
     
  • md supports multiple different RAID levels, each being implemented by a
    'personality' (which is often in a separate module).

    These personalities have fairly artificial 'numbers'. The numbers
    are used to:
    1- provide an index into an array where the various personalities
    are recorded
    2- identify the module (via an alias) which implements a particular
    personality.

    Neither of these uses really justify the existence of personality numbers.
    The array can be replaced by a linked list which is searched (array lookup
    only happens very rarely). Module identification can be done using an alias
    based on level rather than 'personality' number.

    The current 'raid5' module supports two levels (4 and 5) but only one
    personality. This slight awkwardness (which was handled in the mapping from
    level to personality) can be better handled by allowing raid5 to register 2
    personalities.

    With this change in place, the core md module does not need to have an
    exhaustive list of all possible personalities, so other personalities can be
    added independently.

    This patch also moves the check for chunksize being non-zero into the ->run
    routines for the personalities that need it, rather than having it in core-md.
    This has a side effect of allowing 'faulty' and 'linear' not to have a
    chunk-size set.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
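The registration model the text describes can be sketched like this (structure and names are illustrative, not the md API): personalities live on a searched linked list keyed by level, so raid5 can simply register twice, once for level 4 and once for level 5.

```c
#include <assert.h>
#include <stddef.h>

struct pers {
    const char *name;
    int level;
    struct pers *next;
};

static struct pers *pers_list;

/* Push onto the list; lookups are rare, so a linear search is fine. */
static void register_pers(struct pers *p)
{
    p->next = pers_list;
    pers_list = p;
}

static struct pers *find_pers(int level)
{
    struct pers *p;

    for (p = pers_list; p; p = p->next)
        if (p->level == level)
            return p;
    return NULL;
}
```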
     
  • Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Replace multiple kmalloc/memset pairs with kzalloc calls.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
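A userspace analogue of the cleanup: one zeroing allocation replaces a kmalloc/memset pair (kzalloc plays this role in the kernel; the helper name here is ours).

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate and zero in one call, instead of malloc + memset. */
static void *zalloc(size_t n)
{
    return calloc(1, n);
}
```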
     
  • Substitute:

    page_cache_get -> get_page
    page_cache_release -> put_page
    PAGE_CACHE_SHIFT -> PAGE_SHIFT
    PAGE_CACHE_SIZE -> PAGE_SIZE
    PAGE_CACHE_MASK -> PAGE_MASK
    __free_page -> put_page

    because we aren't using the page cache, we are just using pages.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Add in correct read-error handling for resync and read-only situations.

    When read-only, we don't over-write, so we need to mark the failed drive in
    the r10_bio so we don't re-try it. During resync, we always read all blocks,
    so if there is a read error, we simply over-write it with the good block that
    we found (assuming we found one).

    Note that the recovery case still isn't handled in an interesting way. There
    is nothing useful to do for the 2-copies case. If there are 3 or more copies,
    then we could try reading from one of the non-missing copies, but this is a
    bit complicated and very rarely would be used, so I'm leaving it for now.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Largely just a cross-port from raid1.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Also keep count on the number of errors found.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • raid10 needs to put up a barrier to new requests while it does resync or other
    background recovery. The code for this is currently open-coded, slightly
    obscured by its use of two waitqueues, and not documented.

    This patch gathers all the related code into 4 functions, and includes a
    comment which (hopefully) explains what is happening.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
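A single-threaded model of the barrier API the patch gathers the logic into (raise_barrier/lower_barrier for resync, wait_barrier/allow_barrier for normal io): counters stand in for the kernel's waitqueues, and a 0 return models "would sleep here".

```c
#include <assert.h>

static int barrier_up, nr_pending;

static int wait_barrier(void)           /* normal io wants to start */
{
    if (barrier_up)
        return 0;                       /* would sleep until lowered */
    nr_pending++;
    return 1;
}

static void allow_barrier(void) { nr_pending--; }

static int raise_barrier(void)          /* resync wants exclusivity */
{
    if (nr_pending)
        return 0;                       /* would wait for pending io */
    barrier_up = 1;
    return 1;
}

static void lower_barrier(void) { barrier_up = 0; }
```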
     

29 Nov, 2005

1 commit

  • raid10 has two different layouts. One uses near-copies (so multiple
    copies of a block are at the same or similar offsets on different
    devices) and the other uses far-copies (so multiple copies of a block
    are stored at greatly different offsets on different devices). The point
    of far-copies is that it allows the first section (normally the first half)
    to be laid out in normal raid0 style, and thus provide raid0 sequential
    read performance.

    Unfortunately, the read balancing in raid10 makes some poor decisions
    for far-copies arrays and you don't get the desired performance. So
    turn off that bad bit of read_balance for far-copies arrays.

    With this patch, read speed of an 'f2' array is comparable with a raid0
    with the same number of devices, though write speed is of course still
    very slow.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
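An illustrative 'far' (f2) placement, with formulas that are ours rather than the kernel's: copy 0 of every chunk forms a plain raid0 stripe in the first half of each drive (hence the raid0 sequential read speed), while copy 1 sits in the far half, one drive over.

```c
#include <assert.h>

/* Drive holding copy 'copy' of 'chunk'; copy 1 is shifted one drive. */
static int far_drive(long chunk, int copy, int ndrives)
{
    return (int)((chunk + copy) % ndrives);
}

/* Offset within the drive: copy 0 in the first half (raid0-style),
 * copy 1 in the far half, 'half' sectors in. */
static long far_offset(long chunk, int copy, int ndrives, long half)
{
    return copy * half + chunk / ndrives;
}
```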
     

09 Nov, 2005

2 commits


01 Nov, 2005

1 commit

  • Instead of having ->read_sectors and ->write_sectors, combine the two
    into ->sectors[2] and similar for the other fields. This saves a branch
    several places in the io path, since we don't have to care for what the
    actual io direction is. On my x86-64 box, that's 200 bytes less text in
    just the core (not counting the various drivers).

    Signed-off-by: Jens Axboe

    Jens Axboe
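The change can be sketched as follows (field names are illustrative): one array indexed by io direction (0 = read, 1 = write) replaces separate read_/write_ fields, so the accounting path needs no direction branch.

```c
#include <assert.h>

struct disk_stats {
    unsigned long sectors[2];   /* [0] = read, [1] = write */
    unsigned long ios[2];
};

/* Branchless accounting: index by direction instead of testing it. */
static void account(struct disk_stats *s, int rw, unsigned long nsect)
{
    s->sectors[rw] += nsect;    /* no if (rw == WRITE) ... else ... */
    s->ios[rw]++;
}
```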
     

09 Oct, 2005

1 commit

  • - added typedef unsigned int __nocast gfp_t;

    - replaced __nocast uses for gfp flags with gfp_t - it gives exactly
    the same warnings as far as sparse is concerned, doesn't change
    generated code (from gcc's point of view we replaced unsigned int with
    a typedef) and documents what's going on far better.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
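The change in miniature (without the sparse-only __nocast annotation, which gcc ignores): a named typedef for gfp flags documents intent while compiling to exactly the same code as a bare unsigned int.

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Flags combine exactly as plain unsigned ints would. */
static gfp_t gfp_combine(gfp_t a, gfp_t b) { return a | b; }
```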
     

10 Sep, 2005

1 commit