18 Aug, 2010

2 commits

  • md_check_recovery expects ->spare_active to return 'true' if any
    spares were activated, but none of them do, so the consequent change
    in 'degraded' is not notified through sysfs.

    So count the number of spares activated, subtract it from 'degraded'
    just once, and return it.

    Reported-by: Adrian Drzewiecki
    Signed-off-by: NeilBrown

    NeilBrown
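
    The counting approach described in this commit can be sketched in
    plain C as below. This is a minimal userspace illustration, not the
    kernel code: the struct and function names (rdev_sketch,
    array_sketch, spare_active_sketch) are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the md structures. */
struct rdev_sketch {
    bool present;   /* a device occupies this slot */
    bool faulty;
    bool in_sync;
};

struct array_sketch {
    struct rdev_sketch *disks;
    size_t raid_disks;
    int degraded;
};

/* Activate every usable spare, adjust 'degraded' once, return the count. */
int spare_active_sketch(struct array_sketch *a)
{
    int count = 0;

    assert(a != NULL);
    for (size_t i = 0; i < a->raid_disks; i++) {
        struct rdev_sketch *d = &a->disks[i];

        if (d->present && !d->faulty && !d->in_sync) {
            d->in_sync = true;
            count++;
        }
    }
    a->degraded -= count;   /* single update, after the loop */
    return count;           /* non-zero lets the caller notify sysfs */
}
```

    A caller in the style of md_check_recovery would treat a non-zero
    return value as "something changed" and send the notification.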
     
  • When RAID1 is done syncing disks, it'll update the state
    of synced rdevs to In_sync. But it neglected to notify
    sysfs that the attribute changed. So any programs that
    are waiting for an rdev's state to change will not be
    woken.

    (raid5/raid10 added by neilb)

    Signed-off-by: Adrian Drzewiecki
    Signed-off-by: NeilBrown

    Adrian Drzewiecki
     

11 Aug, 2010

1 commit

  • * 'for-linus' of git://neil.brown.name/md: (24 commits)
    md: clean up do_md_stop
    md: fix another deadlock with removing sysfs attributes.
    md: move revalidate_disk() back outside open_mutex
    md/raid10: fix deadlock with unaligned read during resync
    md/bitmap: separate out loading a bitmap from initialising the structures.
    md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.
    md/bitmap: optimise scanning of empty bitmaps.
    md/bitmap: clean up plugging calls.
    md/bitmap: reduce dependence on sysfs.
    md/bitmap: white space clean up and similar.
    md/raid5: export raid5 unplugging interface.
    md/plug: optionally use plugger to unplug an array during resync/recovery.
    md/raid5: add simple plugging infrastructure.
    md/raid5: export is_congested test
    raid5: Don't set read-ahead when there is no queue
    md: add support for raising dm events.
    md: export various start/stop interfaces
    md: split out md_rdev_init
    md: be more careful setting MD_CHANGE_CLEAN
    md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk
    ...

    Linus Torvalds
     

08 Aug, 2010

1 commit

  • Remove the current bio flags and reuse the request flags for the bio, too.
    This makes it easier to trace the type of I/O from the filesystem
    down to the block driver. There were two flags in the bio that were
    missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
    renamed two request flags that had a superfluous RW in them.

    Note that the flags are in bio.h despite having the REQ_ name - as
    blkdev.h includes bio.h that is the only way to go for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
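
    The idea of one flag namespace shared by bios and requests can be
    sketched as follows. All names and bit positions here are
    hypothetical (note the _SKETCH suffix); the real REQ_ definitions
    live in the kernel headers and are not reproduced here.

```c
#include <stdint.h>

/* Hypothetical flag bits shared by both structures. */
enum rq_flag_sketch {
    REQ_WRITE_SKETCH  = 1u << 0,
    REQ_SYNC_SKETCH   = 1u << 1,
    REQ_UNPLUG_SKETCH = 1u << 2,   /* formerly a bio-only flag */
    REQ_RAHEAD_SKETCH = 1u << 3,   /* formerly a bio-only flag */
};

struct bio_sketch     { uint32_t bi_rw; };
struct request_sketch { uint32_t cmd_flags; };

/* With a single namespace no translation table is needed:
 * in this sketch the request simply inherits the bio's flags. */
void init_request_from_bio_sketch(struct request_sketch *rq,
                                  const struct bio_sketch *bio)
{
    rq->cmd_flags = bio->bi_rw;
}
```

    Because the same constant means the same thing at both layers, a
    flag set by the filesystem can be tested unchanged in the driver.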
     

26 Jul, 2010

6 commits


21 Jul, 2010

2 commits


24 Jun, 2010

6 commits

  • There are few situations where it would make any sense to add a spare
    when reducing the number of devices in an array, but it is
    conceivable: A 6 drive RAID6 with two missing devices could be
    reshaped to a 5 drive RAID6, and a spare could become available
    just in time for the reshape, but not early enough to have been
    recovered first. 'freezing' recovery can make this easy to
    do without any races.

    However doing such a thing is a bad idea. md will not record the
    partially-recovered state of the 'spare' and when the reshape
    finishes it will think that the spare is still spare.
    The easiest way to avoid this confusion is to simply disallow it.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • As the comment says, the tail of this loop only applies to devices
    that are not fully in sync, so if In_sync was set, we should avoid
    the rest of the loop.

    This bug will hardly ever cause an actual problem. The worst it
    can do is allow an array to be assembled that is dirty and degraded,
    which is not generally a good idea (without warning the sysadmin
    first).

    This will only happen if the array is RAID4 or a RAID5/6 in an
    intermediate state during a reshape and so has one drive that is
    all 'parity' - no data - while some other device has failed.

    This is certainly possible, but not at all common.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • During a recovery or reshape the early part of some devices might be
    in-sync while the later parts are not.
    When we know we are looking at an early part it is good to treat that
    part as in-sync for stripe calculations.

    This is particularly important for a reshape which suffers device
    failure. Treating the data as in-sync can mean the difference between
    data-safety and data-loss.

    Signed-off-by: NeilBrown

    NeilBrown
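
    A minimal sketch of the "early part counts as in-sync" test, assuming
    a per-device recovery_offset that marks how far recovery has reached.
    The type and function names are hypothetical and the check is a
    simplification of whatever the stripe code actually does.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_sketch_t;

struct rdev_sketch {
    bool faulty;
    bool in_sync;                       /* whole device is in sync */
    sector_sketch_t recovery_offset;    /* sectors below this are recovered */
};

/* May the range [sector, sector + nr_sectors) be used as if in-sync? */
bool range_usable_sketch(const struct rdev_sketch *d,
                         sector_sketch_t sector, sector_sketch_t nr_sectors)
{
    if (d->faulty)
        return false;
    if (d->in_sync)
        return true;
    /* Partially recovered: usable only if the range ends at or before
     * the recovery point (written to avoid overflow in the addition). */
    return nr_sectors <= d->recovery_offset &&
           sector <= d->recovery_offset - nr_sectors;
}
```

    A stripe that lies entirely below the recovery point then counts the
    device as good, which is what keeps data safe if another device
    fails during the reshape.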
     
  • When we are reshaping an array, the device failure combinations
    that cause us to decide that the array has failed are more subtle.

    In particular, any 'spare' will be fully in-sync in the section
    of the array that has already been reshaped, thus failures that
    affect only that section are less critical.

    So encode this subtlety in a new function and call it as appropriate.

    The case that showed this problem was a 4 drive RAID5 to 8 drive RAID6
    conversion where the last two devices failed.
    This resulted in:

    good good good good incomplete good good failed failed

    while converting a 5-drive RAID6 to 8 drive RAID5
    The incomplete device causes the whole array to look bad,
    but as it was actually good for the section that had been
    converted to 8-drives, all the data was actually safe.

    Reported-by: Terry Morris
    Signed-off-by: NeilBrown

    NeilBrown
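
    One way to picture the more careful failure check is to count
    degraded devices separately for the not-yet-reshaped section (old
    device count) and the already-reshaped section (new device count).
    The sketch below is a simplified userspace illustration under that
    assumption; the struct layout and names are hypothetical.

```c
#include <stdbool.h>

struct rdev_sketch { bool present; bool faulty; bool in_sync; };

struct conf_sketch {
    const struct rdev_sketch *disks;
    int previous_raid_disks;   /* device count before the reshape */
    int raid_disks;            /* device count after the reshape */
    int max_degraded;          /* 1 for RAID5, 2 for RAID6 */
    int degraded;              /* whole-array count, used when not reshaping */
    bool reshaping;
};

/* Count unusable devices among the first 'ndisks' slots. A device that
 * is present but not in-sync counts only if 'recovering_is_missing'. */
static int section_degraded(const struct conf_sketch *c, int ndisks,
                            bool recovering_is_missing)
{
    int degraded = 0;

    for (int i = 0; i < ndisks; i++) {
        const struct rdev_sketch *d = &c->disks[i];

        if (!d->present || d->faulty)
            degraded++;
        else if (!d->in_sync && recovering_is_missing)
            degraded++;
    }
    return degraded;
}

bool has_failed_sketch(const struct conf_sketch *c)
{
    if (!c->reshaping)
        return c->degraded > c->max_degraded;

    /* Old-geometry section: when growing, a recovering device holds
     * no valid data here yet. */
    if (section_degraded(c, c->previous_raid_disks,
                         c->raid_disks >= c->previous_raid_disks)
        > c->max_degraded)
        return true;

    /* New-geometry section: when growing, the reshape has already
     * rebuilt the recovering device here. */
    return section_degraded(c, c->raid_disks,
                            c->raid_disks <= c->previous_raid_disks)
           > c->max_degraded;
}
```

    With one incomplete and two failed devices in a growing RAID6
    reshape, a single whole-array count gives 3 and declares failure,
    while the per-section counts stay within the limit of 2.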
     
  • When an array is reshaped to have fewer devices, the reshape proceeds
    from the end of the devices to the beginning.

    If a device happens to be non-In_sync (which is possible but rare)
    we would normally update the ->recovery_offset as the reshape
    progresses. However that would be wrong as the recovery_offset records
    that the early part of the device is in_sync, while in fact it would
    only be the later part that is in_sync, and in any case the offset
    number would be measured from the wrong end of the device.

    Relatedly, if after a reshape a spare is discovered to not be
    recovered all the way to the end, do not allow spare_active
    to incorporate it in the array.

    This becomes relevant in the following sample scenario:

    A 4 drive RAID5 is converted to a 6 drive RAID6 in a combined
    operation.
    The RAID5->RAID6 conversion will cause a 5th drive to be included as a
    spare, then the 5drive -> 6drive reshape will effectively rebuild that
    spare as it progresses. The 6th drive is treated as in_sync the whole
    time as there is never any case that we might consider reading from
    it, but must not because there is no valid data.

    If we interrupt this reshape part-way through and reverse it to return
    to a 5-drive RAID6 (or even a 4-drive RAID5), we don't want to update
    the recovery_offset - as that would be wrong - and we don't want to
    include that spare as active in the 5-drive RAID6 when the reversed
    reshape completes, as it will still be mostly out-of-sync.

    Signed-off-by: NeilBrown

    NeilBrown
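
    The two rules in this commit can be sketched as a pair of small
    helpers. This is an illustration only: the names are hypothetical,
    delta_disks is assumed to be negative when the array is shrinking,
    and MAX_SECTOR_SKETCH stands in for "recovered to the very end".

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_sketch_t;
#define MAX_SECTOR_SKETCH UINT64_MAX

struct rdev_sketch {
    bool faulty;
    bool in_sync;
    sector_sketch_t recovery_offset;
};

/* Rule 1: only record progress when the reshape runs from the start of
 * the devices, i.e. when the number of devices is not being reduced. */
void note_reshape_progress_sketch(struct rdev_sketch *d, int delta_disks,
                                  sector_sketch_t done)
{
    if (delta_disks < 0 || d->in_sync || d->faulty)
        return;
    if (d->recovery_offset < done)
        d->recovery_offset = done;
}

/* Rule 2: a spare may be activated only once it is fully recovered. */
bool may_activate_spare_sketch(const struct rdev_sketch *d)
{
    return !d->faulty && !d->in_sync &&
           d->recovery_offset == MAX_SECTOR_SKETCH;
}
```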
     
  • The entries in the stripe_cache maintained by raid5 are enlarged
    when we increase the number of devices in the array, but not
    shrunk when we reduce the number of devices.
    So if entries are added after reducing the number of devices, we
    must be sure to initialise the whole entry, not just the part that
    is currently relevant. Otherwise if we enlarge the array again,
    we will reference uninitialised values.

    As grow_buffers/shrink_buffer now want to use a count that is stored
    explicitly in the raid_conf, they should get it from there rather than
    being passed it as a parameter.

    Signed-off-by: NeilBrown

    NeilBrown
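
    A minimal sketch of "initialise the whole entry": each cache entry is
    sized for pool_size slots, which may exceed the current raid_disks,
    and every slot is initialised on allocation. The structures and the
    function name are hypothetical simplifications.

```c
#include <stdlib.h>

struct dev_slot_sketch { void *page; unsigned long flags; };

struct stripe_sketch {
    int disks;                       /* slots currently in use */
    struct dev_slot_sketch dev[];    /* pool_size slots are allocated */
};

struct conf_sketch {
    int raid_disks;   /* current number of devices */
    int pool_size;    /* largest device count the entries were sized for */
};

/* Allocate one entry and initialise all pool_size slots, not just the
 * first raid_disks, so a later enlargement never sees garbage. */
struct stripe_sketch *grow_one_stripe_sketch(const struct conf_sketch *conf)
{
    struct stripe_sketch *sh;

    sh = malloc(sizeof(*sh) + (size_t)conf->pool_size * sizeof(sh->dev[0]));
    if (!sh)
        return NULL;
    sh->disks = conf->raid_disks;
    for (int i = 0; i < conf->pool_size; i++) {
        sh->dev[i].page = NULL;
        sh->dev[i].flags = 0;
    }
    return sh;
}
```

    Reading the count from the configuration structure, rather than
    taking it as a parameter, mirrors the second paragraph of the commit
    message: there is one authoritative size for every entry.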
     

28 May, 2010

1 commit


22 May, 2010

1 commit


18 May, 2010

10 commits


17 May, 2010

1 commit

  • Some levels expect the 'redundancy group' to be present,
    others don't.
    So when we change level of an array we might need to
    add or remove this group.

    This requires fixing up the current practice of overloading ->private
    to indicate (when ->pers == NULL) that something needs to be removed.
    So create a new ->to_remove to fill that role.

    When changing levels, we may need to add or remove attributes. When
    changing RAID5 -> RAID6, we both add and remove the same thing. It is
    important to catch this and optimise it out as the removal is delayed
    until a lock is released, so trying to add immediately would cause
    problems.

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
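
    The deferred-removal idea, including the "add and remove the same
    thing" shortcut, might look roughly like this. It is a
    single-threaded userspace sketch with hypothetical names; the
    'registered' flag merely stands in for the attribute group existing
    in sysfs.

```c
#include <stdbool.h>
#include <stddef.h>

struct attr_group_sketch { const char *name; bool registered; };

struct mddev_sketch {
    bool locked;
    struct attr_group_sketch *to_remove;   /* removal queued until unlock */
};

/* Called with the lock held: queue the removal instead of doing it. */
void stop_personality_sketch(struct mddev_sketch *m,
                             struct attr_group_sketch *g)
{
    m->to_remove = g;
}

/* Called with the lock held: if the group we need is the one queued
 * for removal, cancel the removal rather than adding it again. */
void start_personality_sketch(struct mddev_sketch *m,
                              struct attr_group_sketch *g)
{
    if (m->to_remove == g)
        m->to_remove = NULL;
    else
        g->registered = true;
}

/* Perform any queued removal only after the lock has been dropped. */
void mddev_unlock_sketch(struct mddev_sketch *m)
{
    struct attr_group_sketch *g = m->to_remove;

    m->to_remove = NULL;
    m->locked = false;
    if (g)
        g->registered = false;
}
```

    In the RAID5 -> RAID6 case the stop and start calls name the same
    group, so the unlock finds nothing queued and the group is never
    touched.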
     

07 May, 2010

1 commit


23 Apr, 2010

1 commit

  • The previous patch changed stripe and chunk_number to sector_t but
    mistakenly did not update all of the divisions to use sector_div().

    This patch changes all of those divisions (actually the '%' operator)
    to sector_div.

    Signed-off-by: NeilBrown
    Cc: stable@kernel.org
    Tested-by: Stefan Lippers-Hollmann

    NeilBrown
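
    For readers unfamiliar with the sector_div contract (divide the
    value in place, return the remainder), here is a userspace stand-in.
    sector_div_sketch and the other names are hypothetical; the kernel's
    sector_div is a macro that takes the lvalue directly rather than a
    pointer.

```c
#include <stdint.h>

typedef uint64_t sector_sketch_t;

/* Same contract as sector_div(): *n becomes the quotient and the
 * remainder is returned. */
static inline uint32_t sector_div_sketch(sector_sketch_t *n, uint32_t base)
{
    uint32_t rem = (uint32_t)(*n % base);

    *n /= base;
    return rem;
}

/* Split a chunk number into a stripe number and a data-disk index. */
uint32_t split_chunk_sketch(sector_sketch_t chunk_number, uint32_t data_disks,
                            sector_sketch_t *stripe)
{
    sector_sketch_t tmp = chunk_number;   /* copy: the helper modifies it */
    uint32_t dd_idx = sector_div_sketch(&tmp, data_disks);

    *stripe = tmp;
    return dd_idx;
}
```

    Working on a copy matters because the helper overwrites its first
    argument; code that previously used '%' on the original variable
    would otherwise lose the value it still needs.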
     

20 Apr, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the
    following script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

03 Mar, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu: add __percpu sparse annotations to what's left
    percpu: add __percpu sparse annotations to fs
    percpu: add __percpu sparse annotations to core kernel subsystems
    local_t: Remove leftover local.h
    this_cpu: Remove pageset_notifier
    this_cpu: Page allocator conversion
    percpu, x86: Generic inc / dec percpu instructions
    local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c
    module: Use this_cpu_xx to dynamically allocate counters
    local_t: Remove cpu_local_xx macros
    percpu: refactor the code in pcpu_[de]populate_chunk()
    percpu: remove compile warnings caused by __verify_pcpu_ptr()
    percpu: make accessors check for percpu pointer in sparse
    percpu: add __percpu for sparse.
    percpu: make access macros universal
    percpu: remove per_cpu__ prefix.

    Linus Torvalds
     

26 Feb, 2010

1 commit


17 Feb, 2010

1 commit

  • Add __percpu sparse annotations to places which didn't make it in one
    of the previous patches. All conversions are trivial.

    These annotations are to make sparse consider percpu variables to be
    in a different address space and warn if accessed without going
    through percpu accessors. This patch doesn't affect normal builds.

    Signed-off-by: Tejun Heo
    Acked-by: Borislav Petkov
    Cc: Dan Williams
    Cc: Huang Ying
    Cc: Len Brown
    Cc: Neil Brown

    Tejun Heo
     

10 Feb, 2010

1 commit

  • ======
    This fix is related to
    http://bugzilla.kernel.org/show_bug.cgi?id=15142
    but does not address that exact issue.
    ======

    sysfs does not like attributes being removed while they are being
    accessed (i.e. read or written) and waits for the access to complete.

    As accessing some md attributes takes the same lock that is held while
    removing those attributes a deadlock can occur.

    This patch addresses 3 issues in md that could lead to this deadlock.

    Two relate to calling flush_scheduled_work while the lock is held.
    This is probably a bad idea in general and as we use schedule_work to
    delete various sysfs objects it is particularly bad.

    In one case flush_scheduled_work is called from md_alloc (called by
    md_probe) called from do_md_run which holds the lock. This call is
    only present to ensure that ->gendisk is set. However we can be sure
    that gendisk is always set (though possibly we couldn't when that code
    was originally written). This is because do_md_run is called in three
    different contexts:
    1/ from md_ioctl. This requires that md_open has succeeded, and it
    fails if ->gendisk is not set.
    2/ from writing a sysfs attribute. This can only happen if the
    mddev has been registered in sysfs which happens in md_alloc
    after ->gendisk has been set.
    3/ from autorun_array which is only called by autorun_devices, which
    checks for ->gendisk to be set before calling autorun_array.
    So the call to md_probe in do_md_run can be removed, and the check on
    ->gendisk can also go.

    In the other case flush_scheduled_work is being called in do_md_stop,
    purportedly to wait for all md_delayed_delete calls (which delete the
    component rdevs) to complete. However there really isn't any need to
    wait for them - they have already been disconnected in all important
    ways.

    The third issue is that raid5->stop() removes some attribute names
    while the lock is held. There is already some infrastructure in place
    to delay attribute removal until after the lock is released (using
    schedule_work). So extend that infrastructure to remove the
    raid5_attrs_group.

    This does not address all lockdep issues related to the sysfs
    "s_active" lock. The rest can be addressed by splitting that lockdep
    context between symlinks and non-symlinks which hopefully will happen.

    Signed-off-by: NeilBrown

    NeilBrown
     

09 Feb, 2010

1 commit

  • This code was written long ago when it was not possible to
    reshape a degraded array. Now it is, so the current level of
    degraded-ness needs to be taken into account. Also newly added
    devices should only reduce degradedness if they are deemed to be
    in-sync.

    In particular, if you convert a RAID5 to a RAID6, and increase the
    number of devices at the same time, then the 5->6 conversion will
    make the array degraded so the current code will produce a wrong
    value for 'degraded' - "-1" to be precise.

    If the reshape runs to completion end_reshape will calculate a correct
    new value for 'degraded', but if a device fails during the reshape an
    incorrect decision might be made based on the incorrect value of
    "degraded".

    This patch is suitable for 2.6.32-stable and if they are still open,
    2.6.31-stable and 2.6.30-stable as well.

    Cc: stable@kernel.org
    Reported-by: Michael Evans
    Signed-off-by: NeilBrown

    NeilBrown
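
    The "-1" mentioned above can be reproduced with a little arithmetic.
    The sketch below assumes an illustrative scenario that is not spelled
    out in the commit: after the 5->6 level conversion the array counts
    5 devices with 1 missing (degraded = 1), the reshape then grows it to
    6 devices, and two devices are added of which only the one in the new
    slot is deemed in-sync. Function names are hypothetical.

```c
/* Old calculation: ignores the existing degraded count and treats
 * every added device as reducing degradedness. */
int degraded_old_sketch(int raid_disks, int previous_raid_disks,
                        int added_devices)
{
    return (raid_disks - previous_raid_disks) - added_devices;
}

/* Revised calculation: starts from the current degraded count and
 * subtracts only the added devices that are in-sync. */
int degraded_new_sketch(int current_degraded, int raid_disks,
                        int previous_raid_disks, int added_in_sync)
{
    return current_degraded + (raid_disks - previous_raid_disks)
           - added_in_sync;
}
```

    Under those assumptions the old form yields (6 - 5) - 2 = -1, while
    the revised form yields 1 + (6 - 5) - 1 = 1, which reflects the one
    device still being rebuilt.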