22 May, 2010

1 commit


18 May, 2010

10 commits


17 May, 2010

1 commit

  • Some levels expect the 'redundancy group' to be present,
    others don't.
    So when we change the level of an array we might need to
    add or remove this group.

    This requires fixing up the current practice of overloading ->private
    to indicate (when ->pers == NULL) that something needs to be removed.
    So create a new ->to_remove to fill that role.

    When changing levels, we may need to add or remove attributes. When
    changing RAID5 -> RAID6, we both add and remove the same thing. It is
    important to catch this and optimise it out: because the removal is
    delayed until a lock is released, trying to add the same group again
    immediately would cause problems.
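
    A minimal sketch of the idea (the ->to_remove field is named in the
    patch; the other identifiers here are illustrative, not the actual md
    code):

    /* When switching personalities, only schedule the old attribute
     * group for delayed removal if the new level does not provide the
     * very same group; otherwise leave it in place. */
    if (old_group && old_group != new_group)
            mddev->to_remove = old_group;  /* removed once the lock is dropped */
    if (new_group && new_group != old_group)
            sysfs_create_group(&mddev->kobj, new_group);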

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     

07 May, 2010

1 commit


23 Apr, 2010

1 commit

  • A previous patch changed stripe and chunk_number to sector_t but
    mistakenly did not update all of the divisions to use sector_div().

    This patch changes all of those divisions (actually the '%' operator)
    to sector_div.
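
    For illustration (a sketch in the spirit of the change, not a quote
    from the patch): sector_div() divides a sector_t in place and returns
    the remainder, so code such as

    dd_idx = chunk_number % data_disks;
    chunk_number = chunk_number / data_disks;

    becomes

    dd_idx = sector_div(chunk_number, data_disks);
    /* chunk_number now holds the quotient, dd_idx the remainder */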

    Signed-off-by: NeilBrown
    Cc: stable@kernel.org
    Tested-by: Stefan Lippers-Hollmann

    NeilBrown
     

20 Apr, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h, making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the
    following script is used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    include gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It is put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.
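
    As a rough illustration of the kind of edit the script produces (a
    generic example, not a hunk from the patch):

    /* before: kmalloc()/kfree() only compiled because percpu.h pulled
     * in slab.h indirectly */
    #include <linux/percpu.h>

    /* after: the dependency is stated explicitly */
    #include <linux/percpu.h>
    #include <linux/slab.h>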

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, and for others adding it to an
    implementation .h or embedding .c file was more appropriate. This
    step added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

03 Mar, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu: add __percpu sparse annotations to what's left
    percpu: add __percpu sparse annotations to fs
    percpu: add __percpu sparse annotations to core kernel subsystems
    local_t: Remove leftover local.h
    this_cpu: Remove pageset_notifier
    this_cpu: Page allocator conversion
    percpu, x86: Generic inc / dec percpu instructions
    local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c
    module: Use this_cpu_xx to dynamically allocate counters
    local_t: Remove cpu_local_xx macros
    percpu: refactor the code in pcpu_[de]populate_chunk()
    percpu: remove compile warnings caused by __verify_pcpu_ptr()
    percpu: make accessors check for percpu pointer in sparse
    percpu: add __percpu for sparse.
    percpu: make access macros universal
    percpu: remove per_cpu__ prefix.

    Linus Torvalds
     

26 Feb, 2010

1 commit


17 Feb, 2010

1 commit

  • Add __percpu sparse annotations to places which didn't make it in one
    of the previous patches. All conversions are trivial.

    These annotations are to make sparse consider percpu variables to be
    in a different address space and warn if accessed without going
    through percpu accessors. This patch doesn't affect normal builds.
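
    A small generic example of what such an annotation looks like (not
    taken from the patch):

    #include <linux/errno.h>
    #include <linux/percpu.h>

    /* __percpu tells sparse this pointer lives in the per-cpu address
     * space, so direct dereferences are flagged */
    static unsigned long __percpu *hits;

    static int hits_init(void)
    {
            hits = alloc_percpu(unsigned long);
            return hits ? 0 : -ENOMEM;
    }

    static void hits_bump(void)
    {
            this_cpu_inc(*hits);   /* access only via percpu accessors */
    }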

    Signed-off-by: Tejun Heo
    Acked-by: Borislav Petkov
    Cc: Dan Williams
    Cc: Huang Ying
    Cc: Len Brown
    Cc: Neil Brown

    Tejun Heo
     

10 Feb, 2010

1 commit

  • ======
    This fix is related to
    http://bugzilla.kernel.org/show_bug.cgi?id=15142
    but does not address that exact issue.
    ======

    sysfs does not like attributes being removed while they are being
    accessed (i.e. read or written) and waits for the access to complete.

    As accessing some md attributes takes the same lock that is held while
    removing those attributes, a deadlock can occur.

    This patch addresses 3 issues in md that could lead to this deadlock.

    Two relate to calling flush_scheduled_work while the lock is held.
    This is probably a bad idea in general and as we use schedule_work to
    delete various sysfs objects it is particularly bad.

    In one case flush_scheduled_work is called from md_alloc (called by
    md_probe, in turn called from do_md_run, which holds the lock). This
    call is only present to ensure that ->gendisk is set. However we can
    be sure that gendisk is always set (though possibly we couldn't when
    that code was originally written). This is because do_md_run is called
    in three different contexts:
    1/ from md_ioctl. This requires that md_open has succeeded, and it
    fails if ->gendisk is not set.
    2/ from writing a sysfs attribute. This can only happen if the
    mddev has been registered in sysfs which happens in md_alloc
    after ->gendisk has been set.
    3/ from autorun_array which is only called by autorun_devices, which
    checks for ->gendisk to be set before calling autorun_array.
    So the call to md_probe in do_md_run can be removed, and the check on
    ->gendisk can also go.

    In the other case flush_scheduled_work is being called in do_md_stop,
    purportedly to wait for all md_delayed_delete calls (which delete the
    component rdevs) to complete. However there really isn't any need to
    wait for them - they have already been disconnected in all important
    ways.

    The third issue is that raid5->stop() removes some attribute names
    while the lock is held. There is already some infrastructure in place
    to delay attribute removal until after the lock is released (using
    schedule_work). So extend that infrastructure to remove the
    raid5_attrs_group.

    This does not address all lockdep issues related to the sysfs
    "s_active" lock. The rest can be addressed by splitting that lockdep
    context between symlinks and non-symlinks, which will hopefully happen.

    Signed-off-by: NeilBrown

    NeilBrown
     

09 Feb, 2010

1 commit

  • This code was written long ago when it was not possible to
    reshape a degraded array. Now it is possible, so the current level
    of degradedness needs to be taken into account. Also, newly added
    devices should only reduce degradedness if they are deemed to be
    in-sync.

    In particular, if you convert a RAID5 to a RAID6, and increase the
    number of devices at the same time, then the 5->6 conversion will
    make the array degraded so the current code will produce a wrong
    value for 'degraded' - "-1" to be precise.
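
    A rough sketch of the accounting this implies (illustrative only, not
    the patch itself): a device counts towards 'degraded' when its slot is
    empty or the device is not in-sync, so adding a spare that still needs
    recovery cannot drive the count below zero.

    int degraded = 0, i;

    for (i = 0; i < conf->raid_disks; i++) {
            mdk_rdev_t *rdev = conf->disks[i].rdev;
            if (!rdev || !test_bit(In_sync, &rdev->flags))
                    degraded++;
    }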

    If the reshape runs to completion end_reshape will calculate a correct
    new value for 'degraded', but if a device fails during the reshape an
    incorrect decision might be made based on the incorrect value of
    "degraded".

    This patch is suitable for 2.6.32-stable and if they are still open,
    2.6.31-stable and 2.6.30-stable as well.

    Cc: stable@kernel.org
    Reported-by: Michael Evans
    Signed-off-by: NeilBrown

    NeilBrown
     

14 Dec, 2009

4 commits

  • Suggested by Oren Held

    Signed-off-by: NeilBrown

    NeilBrown
     
  • The post-barrier-flush is sent by md as soon as make_request on the
    barrier write completes. For raid5, the data might not be in the
    per-device queues yet. So for barrier requests, wait for any
    pre-reading to be done so that the request will be in the per-device
    queues.

    We use the 'preread_active' count to check that nothing is still in
    the preread phase, and delay the decrement of this count until after
    write requests have been submitted to the underlying devices.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Previously barriers were only supported on RAID1. This is because
    other levels require synchronisation across all devices and so needed
    a different approach.
    Here is that approach.

    When a barrier arrives, we send a zero-length barrier to every active
    device. When that completes - and if the original request was not
    empty - we submit the barrier request itself (with the barrier flag
    cleared) and then submit a fresh load of zero length barriers.

    The barrier request itself is asynchronous, but any subsequent
    request will block until the barrier completes.

    The reason for clearing the barrier flag is that a barrier request is
    allowed to fail. If we pass a non-empty barrier through a striping
    raid level it is conceivable that part of it could succeed and part
    could fail. That would be way too hard to deal with.
    So if the first run of zero-length barriers succeeds, we assume all is
    sufficiently well, so we send the request and ignore errors in the
    second run of barriers.

    RAID5 needs extra care as write requests may not have been submitted
    to the underlying devices yet. So we flush the stripe cache before
    proceeding with the barrier.

    Note that the second set of zero-length barriers are submitted
    immediately after the original request is submitted. Thus when
    a personality finds mddev->barrier to be set during make_request,
    it should not return from make_request until the corresponding
    per-device request(s) have been queued.

    That will be done in later patches.
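
    As a hedged outline of the sequence just described (the helper names
    below are invented for illustration; they are not the functions added
    by this patch):

    /* 1. a zero-length barrier to every active device; wait for them */
    submit_empty_barriers(mddev);
    wait_for_empty_barriers(mddev);

    /* 2. the original request, if it carried data, with the barrier
     *    flag cleared so a partial failure cannot happen mid-barrier */
    if (bio->bi_size) {
            bio->bi_rw &= ~(1 << BIO_RW_BARRIER);
            resubmit_through_personality(mddev, bio);  /* illustrative */
    }

    /* 3. a second round of zero-length barriers; errors are ignored
     *    because round 1 already showed that barriers are supported */
    submit_empty_barriers(mddev);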

    Signed-off-by: NeilBrown
    Reviewed-by: Andre Noll

    NeilBrown
     
  • qd_idx is previously declared and given exactly the same value!

    Signed-off-by: NeilBrown

    NeilBrown
     

13 Nov, 2009

2 commits

  • Normally it is not safe to allow a raid5 that is both dirty and
    degraded to be assembled without explicit request from that admin, as
    it can cause hidden data corruption.
    This is because 'dirty' means that the parity cannot be trusted, and
    'degraded' means that the parity needs to be used.

    However, if the device that is missing contains only parity, then
    there is no issue and assembly can continue.
    This particularly applies when a RAID5 is being converted to a RAID6
    and there is an unclean shutdown while the conversion is happening.

    So check for whether the degraded space only contains parity, and
    in that case, allow the assembly.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • When a reshape finds that it can add spare devices into the array,
    those devices might already be 'in_sync' if they are beyond the old
    size of the array, or they might not be if they are within it.

    The first case happens when we change an N-drive RAID5 to an
    N+1-drive RAID5.
    The second happens when we convert an N-drive RAID5 to an
    N+1-drive RAID6.

    So set the flag more carefully.
    Also, ->recovery_offset is only meaningful when the flag is clear,
    so only set it in that case.

    This change needs the preceding two to ensure that the non-in_sync
    device doesn't get evicted from the array when it is stopped, in the
    case where v0.90 metadata is used.
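
    The decision roughly amounts to the following (a sketch, not the
    literal hunk from this patch):

    /* a spare whose slot did not exist in the old geometry has no old
     * data to recover, so it can be In_sync immediately; a slot inside
     * the old geometry must be recovered, and only then is
     * ->recovery_offset meaningful */
    if (rdev->raid_disk >= conf->previous_raid_disks)
            set_bit(In_sync, &rdev->flags);
    else
            rdev->recovery_offset = 0;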

    Signed-off-by: NeilBrown

    NeilBrown
     

06 Nov, 2009

1 commit

  • This value is visible through sysfs and is used by mdadm
    when it manages a reshape (backing up data that is about to be
    rearranged). So it is important that it is always correct.
    Currently it does not get updated properly when a reshape
    starts, which can cause problems when assembling an array
    that is in the middle of being reshaped.

    This is suitable for 2.6.31.y stable kernels.

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     

20 Oct, 2009

1 commit


16 Oct, 2009

6 commits

  • md/raid6 passes a list of 'struct page *' to the async_tx routines,
    which then either DMA map them for offload, or take the page_address
    for CPU based calculations.

    For RAID6 we sometimes leave 'blanks' in the list of pages.
    For CPU based calcs, we want to treat these as a page of zeros.
    For offloaded calculations, we simply don't pass a page to the
    hardware.

    Currently the 'blanks' are encoded as a pointer to
    raid6_empty_zero_page. This is a 4096 byte memory region, not a
    'struct page'. This is mostly handled correctly but is rather ugly.

    So change the code to pass and expect a NULL pointer for the blanks.
    When taking page_address of a page, we need to check for a NULL and
    in that case use raid6_empty_zero_page.
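
    In rough terms the check becomes (a sketch of the pattern, not a
    quote from the patch):

    /* a NULL entry in the source list stands for a block of zeros */
    void *addr = page ? page_address(page)
                      : (void *) raid6_empty_zero_page;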

    Signed-off-by: NeilBrown

    NeilBrown
     
  • When a raid5 (or raid6) array is being reshaped to have fewer devices,
    conf->raid_disks is the latter and hence smaller number of devices.
    However sometimes we want to use a number which is the total number of
    currently required devices - the larger of the 'old' and 'new' sizes.
    Before we implemented reducing the number of devices, this was always
    'new' i.e. ->raid_disks.
    Now we need max(raid_disks, previous_raid_disks) in those places.

    This particularly affects assembling an array that was shutdown while
    in the middle of a reshape to fewer devices.

    md.c needs a similar fix when interpreting the md metadata.
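
    Sketched in one line (illustrative of the pattern, not the exact
    hunks): wherever the full set of currently required devices matters,
    use the larger of the two geometries.

    /* when reshaping to fewer devices, previous_raid_disks > raid_disks */
    int disks = max(conf->raid_disks, conf->previous_raid_disks);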

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Signed-off-by: NeilBrown

    NeilBrown
     
  • The percpu conversion allowed a straightforward handoff of stripe
    processing to the async subsystem that initially showed some modest gains
    (+4%). However, this model is too simplistic and leads to stripes
    bouncing between raid5d and the async thread pool for every invocation
    of handle_stripe(). As reported by Holger this can fall into a
    pathological situation severely impacting throughput (6x performance
    loss).

    By downleveling the parallelism to raid_run_ops the pathological
    stripe_head bouncing is eliminated. This version still exhibits an
    average 11% throughput loss for:

    mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
    echo 1024 > /sys/block/md0/md/stripe_cache_size
    dd if=/dev/zero of=/dev/md0 bs=1024k count=2048

    ...but the results are at least stable and can be used as a base for
    further multicore experimentation.

    Reported-by: Holger Kiehl
    Signed-off-by: Dan Williams
    Signed-off-by: NeilBrown

    Dan Williams
     
  • Deallocating a raid5_conf_t structure requires taking 'device_lock'.
    Ensure it is initialized before it is used, i.e. initialize the lock
    before attempting any further initializations that might fail.
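
    In other words (a sketch of the ordering, not the actual setup_conf
    code):

    conf = kzalloc(sizeof(*conf), GFP_KERNEL);
    if (!conf)
            return ERR_PTR(-ENOMEM);

    /* take care of the lock first, before any allocation that can fail,
     * so the shared error path can always take device_lock safely */
    spin_lock_init(&conf->device_lock);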

    Signed-off-by: Dan Williams
    Signed-off-by: NeilBrown

    Dan Williams
     
  • This reverts commit df10cfbc4d7ab93260d997df754219d390d62a9d.

    This patch was based on a misunderstanding and risks introducing a
    busy-wait loop.
    So revert it.

    Acked-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
     

23 Sep, 2009

4 commits


17 Sep, 2009

1 commit

  • Neil says:
    "It is correct as it stands, but the fact that every branch in
    the 'if' part ends with a 'return' isn't immediately obvious,
    so it is clearer if we are explicit about the if / then / else
    structure."
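
    As a generic illustration of that style point (the names below are
    made up; this is not the raid5 code in question):

    /* before: the reader must notice that every branch of the 'if' returns */
    if (cond) {
            if (sub_cond)
                    return do_a();
            return do_b();
    }
    return do_c();

    /* after: the if / then / else structure is explicit */
    if (cond) {
            if (sub_cond)
                    return do_a();
            return do_b();
    } else
            return do_c();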

    Signed-off-by: Dan Williams

    Dan Williams