23 Dec, 2011

4 commits

  • During recovery we want to write to the replacement but not
    the original. So we have two new flags
    - R5_NeedReplace if this stripe has a replacement that needs to
    be written at some stage
    - R5_WantReplace if NeedReplace, and the data is available, and
    a 'sync' has been requested on this stripe.

    We also distinguish between 'sync and replace', which needs to read
    all other devices, and 'replace', which only needs to read the
    devices being replaced.

    Note that during resync we always write to any replacement device.
    It might not need to be written to, but as we don't read to compare,
    we have to write to be sure.

    Signed-off-by: NeilBrown

    NeilBrown
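
    A minimal userspace sketch of the two flags described above, assuming a
    much-simplified per-device structure; the flag names follow the commit,
    everything else is illustrative rather than the actual raid5 code.

        #include <stdbool.h>
        #include <stdio.h>

        /* hypothetical per-device flag bits modelled on the commit text */
        enum {
            R5_NeedReplace = 1 << 0, /* replacement exists, must be written eventually */
            R5_WantReplace = 1 << 1, /* write it now: data ready and a sync was requested */
        };

        struct dev_model {
            bool has_replacement;
            bool data_uptodate;
            unsigned flags;
        };

        void mark_replacement(struct dev_model *d, bool sync_requested)
        {
            if (!d->has_replacement)
                return;
            d->flags |= R5_NeedReplace;
            if (d->data_uptodate && sync_requested)
                d->flags |= R5_WantReplace;
        }

        int main(void)
        {
            struct dev_model d = { .has_replacement = true, .data_uptodate = true };

            mark_replacement(&d, true);
            printf("need=%d want=%d\n",
                   !!(d.flags & R5_NeedReplace), !!(d.flags & R5_WantReplace));
            return 0;
        }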
     
  • When writing, we need to submit two writes, one to the original, and
    one to the replacement - if there is a replacement.

    If the write to the replacement results in a write error, we just fail
    the device. We only try to record write errors to the original.

    When writing for recovery, we shouldn't write to the original. This
    will be addressed in a subsequent patch that generally addresses
    recovery.

    Reviewed-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
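
    As a rough illustration of the write path described above, a hedged
    userspace model: one write is submitted to the original device and,
    when a replacement is attached, a second write to the replacement. The
    structure and names are assumptions for the sketch, not md code.

        #include <stdbool.h>
        #include <stdio.h>

        /* hypothetical slot: the original device plus an optional replacement */
        struct slot {
            const char *orig;
            const char *repl;     /* NULL when no replacement is attached */
            bool want_write;
        };

        void submit_write(const char *dev)
        {
            printf("WRITE -> %s\n", dev);
        }

        /* one bio to the original and, when a replacement is present, a second
         * bio to the replacement -- the behaviour described for normal writes */
        void write_slot(const struct slot *s)
        {
            if (!s->want_write)
                return;
            submit_write(s->orig);
            if (s->repl)
                submit_write(s->repl);
        }

        int main(void)
        {
            struct slot s = { .orig = "original", .repl = "replacement", .want_write = true };

            write_slot(&s);
            return 0;
        }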
     
  • Remove some #defines that are no longer used, and replace some
    others with an enum.
    And remove an unused field.

    Reviewed-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
     
  • Just enhance data structures to record a second device per slot to be
    used as a 'replacement' device, replacing the original.
    We also have a second bio in each slot in each stripe_head. This will
    only be used when writing to the array - we need to write to both the
    original and the replacement at the same time, so will need two bios.

    For now, only try using the replacement drive for aligned-reads.
    In this case, we prefer the replacement if it has been recovered far
    enough, otherwise use the original.

    This includes a small enhancement. Previously we would only do
    aligned reads if the target device was fully recovered. Now we also
    do them if it has recovered far enough.

    Reviewed-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
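
    A hedged sketch of the shape of the change: each slot gains a second
    device pointer, and an aligned read prefers the replacement only when
    its recovery has progressed past the end of the request. The types and
    names below are illustrative stand-ins, not the real raid5 structures.

        typedef unsigned long long sector_t;   /* stand-in for the kernel type */

        /* how far the (possibly still recovering) device has valid data */
        struct rdev_model {
            sector_t recovery_offset;
        };

        /* one slot now carries the original device and an optional replacement */
        struct disk_slot {
            struct rdev_model *rdev;          /* original */
            struct rdev_model *replacement;   /* may be NULL */
        };

        /* for an aligned read, prefer the replacement if it has been recovered
         * far enough to cover the request; otherwise use the original */
        struct rdev_model *read_target(struct disk_slot *slot,
                                       sector_t start, sector_t len)
        {
            struct rdev_model *r = slot->replacement;

            if (r && r->recovery_offset >= start + len)
                return r;
            return slot->rdev;
        }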
     

11 Oct, 2011

4 commits


28 Jul, 2011

3 commits


26 Jul, 2011

4 commits

  • Adding these three fields will allow more common code to be moved
    to handle_stripe().

    struct field rearrangement by Namhyung Kim.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     
  • 'struct stripe_head_state' stores state about the 'current' stripe
    that is passed around while handling the stripe.
    For RAID6 there is an extension structure: r6_state, which is also
    passed around.
    There is no value in keeping these separate, so move the fields from
    the latter into the former.

    This means that all code now needs to treat s->failed_num as a small
    array, but this is a small cost.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
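
    A hedged sketch of the merged state structure described above; the
    field names are illustrative, but the point is that failed_num becomes
    a two-entry array (RAID-5 only ever uses entry 0).

        struct stripe_head_state_model {
            int syncing;
            int failed;          /* number of failed devices affecting this stripe */
            int failed_num[2];   /* their slot numbers; one for RAID-5, up to two for RAID-6 */
        };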
     
  • sh->lock is now mainly used to ensure that two threads aren't running
    in the locked part of handle_stripe[56] at the same time.

    That can more neatly be achieved with an 'active' flag which we set
    while running handle_stripe. If we find the flag is set, we simply
    requeue the stripe for later by setting STRIPE_HANDLE.

    For safety we take ->device_lock while examining the state of the
    stripe and creating a summary in 'stripe_head_state / r6_state'.
    This may not be needed, but since shared fields like ->toread and
    ->towrite are examined, it is safer for now.

    We leave the label after the old 'unlock' called "unlock" because it
    will disappear in a few patches, so renaming seems pointless.

    This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
    later, but that is not a problem.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
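
    A minimal model of that scheme, assuming C11 atomics in place of the
    kernel's bitops: the handler claims an 'active' flag atomically, and if
    it is already held it just marks the stripe to be handled again. The
    structure and names are illustrative, not the raid5 ones.

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        struct stripe_model {
            atomic_flag active;     /* stands in for STRIPE_ACTIVE */
            bool handle_again;      /* stands in for setting STRIPE_HANDLE */
        };

        void handle_stripe(struct stripe_model *sh)
        {
            if (atomic_flag_test_and_set(&sh->active)) {
                sh->handle_again = true;   /* someone else is inside; requeue for later */
                return;
            }
            /* ... examine the stripe state and decide what to do ... */
            atomic_flag_clear(&sh->active);
        }

        int main(void)
        {
            struct stripe_model sh = { .active = ATOMIC_FLAG_INIT, .handle_again = false };

            handle_stripe(&sh);
            printf("requeued: %d\n", sh.handle_again);
            return 0;
        }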
     
  • This is the start of a series of patches to remove sh->lock.

    sync_request takes sh->lock before setting STRIPE_SYNCING to ensure
    there is no race with testing it in handle_stripe[56].

    Instead, use a new flag STRIPE_SYNC_REQUESTED and test it early
    in handle_stripe[56] (after getting the same lock) and perform the
    same set/clear operations if it was set.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     

18 Apr, 2011

1 commit

  • md has some plugging infrastructure for RAID5 to use because the
    normal plugging infrastructure required a 'request_queue', and when
    called from dm, RAID5 doesn't have one of those available.

    This relied on the ->unplug_fn callback which doesn't exist any more.

    So remove all of that code, both in md and raid5. Subsequent patches
    will restore the plugging functionality.

    Signed-off-by: NeilBrown

    NeilBrown
     

10 Mar, 2011

1 commit

  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So let's kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Sep, 2010

1 commit

  • This patch converts md to support REQ_FLUSH/FUA instead of the now
    deprecated REQ_HARDBARRIER. In the core part (md.c), the following
    changes are notable.

    * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
    processing of other requests and thus there is no reason to mark the
    queue congested while FLUSH/FUA is in progress.

    * REQ_FLUSH/FUA failures are final and their users don't need retry
    logic. Retry logic is removed.

    * Preflush needs to be issued to all member devices but FUA writes can
    be handled the same way as other writes - their processing can be
    deferred to request_queue of member devices. md_barrier_request()
    is renamed to md_flush_request() and simplified accordingly.

    For linear, raid0 and multipath, the core changes are enough. raid1,
    5 and 10 need the following conversions.

    * raid1: Handling of FLUSH/FUA bios can simply be deferred to
    request_queues of member devices. Barrier related logic removed.

    * raid5: Queue draining logic dropped. FUA bit is propagated through
    biodrain and stripe reconstruction such that all the updated parts
    of the stripe are written out with FUA writes if any of the dirtying
    writes was FUA. preread_active_stripes handling in make_request()
    is updated as suggested by Neil Brown.

    * raid10: FUA bit needs to be propagated to write clones.

    linear, raid0, 1, 5 and 10 tested.

    Signed-off-by: Tejun Heo
    Reviewed-by: Neil Brown
    Signed-off-by: Jens Axboe

    Tejun Heo
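
    A hedged userspace model of the raid5-side FUA handling described
    above: the FUA bit from any bio drained into a stripe is remembered and
    applied to the writes that eventually flush the stripe. The names and
    the flag value are stand-ins, not the block layer's definitions.

        #include <stdio.h>

        #define REQ_FUA_MODEL 0x1u   /* stand-in for the block layer's FUA flag */

        struct stripe_model {
            unsigned pending_flags;  /* flags gathered from bios drained into the stripe */
        };

        /* remember whether any write drained into this stripe carried FUA */
        void biodrain(struct stripe_model *sh, unsigned bio_flags)
        {
            sh->pending_flags |= bio_flags & REQ_FUA_MODEL;
        }

        /* every write that flushes the stripe inherits the accumulated FUA bit */
        unsigned stripe_write_flags(const struct stripe_model *sh)
        {
            return sh->pending_flags;
        }

        int main(void)
        {
            struct stripe_model sh = { 0 };

            biodrain(&sh, 0);               /* an ordinary write dirties the stripe */
            biodrain(&sh, REQ_FUA_MODEL);   /* ...and so does one FUA write */
            printf("stripe writes carry FUA: %d\n",
                   !!(stripe_write_flags(&sh) & REQ_FUA_MODEL));
            return 0;
        }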
     

26 Jul, 2010

4 commits


21 Jul, 2010

1 commit


17 Feb, 2010

1 commit

  • Add __percpu sparse annotations to places which didn't make it in one
    of the previous patches. All conversions are trivial.

    These annotations are to make sparse consider percpu variables to be
    in a different address space and warn if accessed without going
    through percpu accessors. This patch doesn't affect normal builds.

    Signed-off-by: Tejun Heo
    Acked-by: Borislav Petkov
    Cc: Dan Williams
    Cc: Huang Ying
    Cc: Len Brown
    Cc: Neil Brown

    Tejun Heo
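
    A generic kernel-style illustration of what the annotation buys (not
    code from this patch): a dynamically allocated percpu pointer is marked
    __percpu and only touched through the percpu accessors, so sparse can
    flag direct dereferences; normal builds compile the annotation away.

        #include <linux/percpu.h>
        #include <linux/errno.h>

        struct percpu_example {
            void *scribble;
        };

        struct conf_example {
            /* __percpu: this pointer lives in the percpu address space */
            struct percpu_example __percpu *percpu;
        };

        int conf_example_alloc(struct conf_example *conf)
        {
            conf->percpu = alloc_percpu(struct percpu_example);
            if (!conf->percpu)
                return -ENOMEM;
            /* access must go through an accessor such as per_cpu_ptr() */
            per_cpu_ptr(conf->percpu, 0)->scribble = NULL;
            return 0;
        }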
     

16 Oct, 2009

2 commits

  • Signed-off-by: NeilBrown

    NeilBrown
     
  • The percpu conversion allowed a straightforward handoff of stripe
    processing to the async subsystem that initially showed some modest gains
    (+4%). However, this model is too simplistic and leads to stripes
    bouncing between raid5d and the async thread pool for every invocation
    of handle_stripe(). As reported by Holger this can fall into a
    pathological situation severely impacting throughput (6x performance
    loss).

    By downleveling the parallelism to raid_run_ops the pathological
    stripe_head bouncing is eliminated. This version still exhibits an
    average 11% throughput loss for:

    mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
    echo 1024 > /sys/block/md0/md/stripe_cache_size
    dd if=/dev/zero of=/dev/md0 bs=1024k count=2048

    ...but the results are at least stable and can be used as a base for
    further multicore experimentation.

    Reported-by: Holger Kiehl
    Signed-off-by: Dan Williams
    Signed-off-by: NeilBrown

    Dan Williams
     

09 Sep, 2009

1 commit


30 Aug, 2009

4 commits

  • [ Based on an original patch by Yuri Tikhonov ]

    The raid_run_ops routine uses the asynchronous offload api and
    the stripe_operations member of a stripe_head to carry out xor+pq+copy
    operations asynchronously, outside the lock.

    The operations performed by RAID-6 are the same as in the RAID-5 case,
    except that STRIPE_OP_PREXOR is not supported. All the others are
    supported:
    STRIPE_OP_BIOFILL
    - copy data into request buffers to satisfy a read request
    STRIPE_OP_COMPUTE_BLK
    - generate missing blocks (1 or 2) in the cache from the other blocks
    STRIPE_OP_BIODRAIN
    - copy data out of request buffers to satisfy a write request
    STRIPE_OP_RECONSTRUCT
    - recalculate parity for new data that has entered the cache
    STRIPE_OP_CHECK
    - verify that the parity is correct

    The flow is the same as in the RAID-5 case, and reuses some routines, namely:
    1/ ops_complete_postxor (renamed to ops_complete_reconstruct)
    2/ ops_complete_compute (updated to set up to 2 targets uptodate)
    3/ ops_run_check (renamed to ops_run_check_p for xor parity checks)

    [neilb@suse.de: fixes to get it to pass mdadm regression suite]
    Reviewed-by: Andre Noll
    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Replace the flat zero_sum_result with a collection of flags to contain
    the P (xor) zero-sum result and the soon-to-be-utilized Q (RAID-6
    Reed-Solomon syndrome) zero-sum result. Use the SUM_CHECK_ namespace
    instead
    of DMA_ since these flags will be used on non-dma-zero-sum enabled
    platforms.

    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
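
    The flag layout being described is roughly the following (a hedged
    sketch; the exact definitions in the async_tx headers may differ):

        /* one result bit for the P (xor) zero-sum check, one for the Q
         * (RAID-6 Reed-Solomon syndrome) zero-sum check */
        enum sum_check_bits_model {
            SUM_CHECK_P = 0,
            SUM_CHECK_Q = 1,
        };

        enum sum_check_flags_model {
            SUM_CHECK_P_RESULT = 1 << SUM_CHECK_P, /* set when the xor parity check fails */
            SUM_CHECK_Q_RESULT = 1 << SUM_CHECK_Q, /* set when the syndrome check fails */
        };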
     
  • Use percpu memory rather than stack for storing the buffer lists used in
    parity calculations. Include space for dma address conversions and pass
    that to async_tx via the async_submit_ctl.scribble pointer.

    [ Impact: move memory pressure from stack to heap ]

    Signed-off-by: Dan Williams

    Dan Williams
     
  • In preparation for asynchronous handling of raid6 operations move the
    spare page to a percpu allocation to allow multiple simultaneous
    synchronous raid6 recovery operations.

    Make this allocation cpu hotplug aware to maximize allocation
    efficiency.

    Signed-off-by: Dan Williams

    Dan Williams
     

18 Jun, 2009

1 commit


16 Jun, 2009

1 commit

  • Having a macro just to cast a void* isn't really helpful.
    I would much rather see that we are simply dereferencing ->private
    than have to know what the macro does.

    So open code the macro everywhere and remove the pointless cast.

    Signed-off-by: NeilBrown

    NeilBrown
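
    A userspace analogue of the cleanup (names are illustrative, not the
    md ones): a macro that only casts ->private adds nothing, because in C
    a void pointer converts implicitly to the target type.

        #include <stdio.h>

        struct conf { int level; };
        struct dev  { void *private; };

        /* before: something like  #define dev_to_conf(d) ((struct conf *)(d)->private) */

        int main(void)
        {
            struct conf c = { .level = 5 };
            struct dev d = { .private = &c };

            /* after: open-coded dereference of ->private, no macro, no cast */
            struct conf *conf = d.private;

            printf("level=%d\n", conf->level);
            return 0;
        }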
     

31 Mar, 2009

7 commits

  • We currently update the metadata:
    1/ every 3 megabytes
    2/ When the place we will write new-layout data to is recorded in
    the metadata as still containing old-layout data.

    Rule one exists to avoid having to re-do too much reshaping in the
    face of a crash/restart. So it should really be time based rather
    than size based. So change it to "every 10 seconds".

    Rule two turns out to be too harsh when restriping an array
    'in-place', as in that case the metadata must be updated for every
    stripe.
    For the in-place update, it can only possibly be safe from a crash if
    some user-space program takes a backup of, say, every few hundred
    stripes before allowing them to be reshaped. In that case, the
    constant metadata update is pointless.
    So only update the metadata if the new metadata will report that the
    end of the 'old-layout' data is beyond where we are currently
    writing 'new-layout' data.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Add prev_algo to raid5_conf_t along the same lines as prev_chunk
    and previous_raid_disks.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Add "prev_chunk" to raid5_conf_t, similar to "previous_raid_disks", to
    remember what the chunk size was before the reshape that is currently
    underway.

    This seems like duplication with "chunk_size" and "new_chunk" in
    mddev_t, and to some extent it is, but there are differences.
    The values in mddev_t are always defined and often the same.
    The prev* values are only defined if a reshape is underway.

    Also (and more significantly) the raid5_conf_t values will be changed
    at the same time (inside an appropriate lock) that the reshape is
    started by setting reshape_position. In contrast, the new_chunk value
    is set when the sysfs file is written which could be well before the
    reshape starts.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • During a raid5 reshape, we have some stripes in the cache that are
    'before' the reshape (and are still to be processed) and some that are
    'after'. They are currently differentiated by having different
    ->disks values, as the only reshape currently supported involves changing
    the number of disks.

    However, we will soon support reshapes that do not change the number
    of disks (a change of parity layout or chunk size). So make the
    difference more
    explicit with a 'generation' number.

    Signed-off-by: NeilBrown

    NeilBrown
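
    A hedged sketch of the idea: the array keeps a generation counter that
    is bumped when a reshape starts, each stripe records the generation it
    was initialised under, and a mismatch marks the stripe as belonging to
    the previous ('before') geometry. The names are illustrative only.

        typedef unsigned long long sector_t;   /* stand-in for the kernel type */

        struct conf_model {
            int generation;       /* incremented when a reshape begins */
        };

        struct stripe_model {
            sector_t sector;
            int generation;       /* copied from conf->generation at init time */
        };

        /* a stripe whose generation no longer matches the array's was created
         * before the reshape and still uses the old geometry */
        int stripe_uses_old_layout(const struct conf_model *conf,
                                   const struct stripe_model *sh)
        {
            return sh->generation != conf->generation;
        }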
     
  • When reducing the number of devices in a raid4/5/6, the reshape
    process has to start at the end of the array and work down to the
    beginning. So we need to handle expand_progress and expand_lo
    differently.

    This patch renames "expand_progress" and "expand_lo" to avoid the
    implication that anything is getting bigger (expand->reshape) and
    every place they are used, we make sure that they are used the right
    way depending on whether delta_disks is positive or negative.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • We now have this value in stripe_head so we don't need to duplicate
    it.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Move the raid6 data processing routines into a standalone module
    (raid6_pq) to prepare them to be called from async_tx wrappers and other
    non-md drivers/modules. This precludes a circular dependency of raid456
    needing the async modules for data processing while those modules in
    turn depend on raid456 for the base level synchronous raid6 routines.

    To support this move:
    1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h
    2/ The raid6_call, recovery calls, and table symbols are exported
    3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to
    compile

    Signed-off-by: Dan Williams
    Signed-off-by: NeilBrown

    Dan Williams