24 Feb, 2011

1 commit

  • Revert
    b821eaa572fd737faaf6928ba046e571526c36c6
    and
    f3b99be19ded511a1bf05a148276239d9f13eefa

    When I wrote the first of these I had a wrong idea about the
    lifetime of 'struct block_device'. It can disappear at any time that
    the block device is not open if it falls out of the inode cache.

    So relying on the 'size' recorded with it to detect when the
    device size has changed and so we need to revalidate, is wrong.

    Rather, we really do need the 'changed' attribute stored directly in
    the mddev and set/tested as appropriate.

    Without this patch, a sequence of:
    mknod / open / close / unlink

    (which can cause a block_device to be created and then destroyed)
    will result in a rescan of the partition table and consequence removal
    and addition of partitions.
    Several of these in a row can get udev racing to create and unlink and
    other code can get confused.

    With the patch, the rescan is only performed when needed and so there
    are no races.

    This is suitable for any stable kernel from 2.6.35.

    Reported-by: "Wojcik, Krzysztof"
    Signed-off-by: NeilBrown
    Cc: stable@kernel.org

    NeilBrown
     

21 Feb, 2011

1 commit

  • blk_throtl_exit assumes that ->queue_lock still exists,
    so make sure that it does.
    To do this, we stop redirecting ->queue_lock to conf->device_lock
    and leave it pointing where it is initialised - __queue_lock.

    As the blk_plug functions check the ->queue_lock is held, we now
    take that spin_lock explicitly around the plug functions. We don't
    need the locking, just the warning removal.

    This is needed for any kernel with the blk_throtl code, which is
    which is 2.6.37 and later.

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     

16 Feb, 2011

2 commits

  • 'mdp' devices are md devices with preallocated device numbers
    for partitions. As such it is possible to mknod and open a partition
    before opening the whole device.

    this causes md_probe() to be called with a device number of a
    partition, which in-turn calls mddev_find with such a number.

    However mddev_find expects the number of a 'whole device' and
    does the wrong thing with partition numbers.

    So add code to mddev_find to remove the 'partition' part of
    a device number and just work with the 'whole device'.

    This patch addresses https://bugzilla.kernel.org/show_bug.cgi?id=28652

    Reported-by: hkmaly@bigfoot.com
    Signed-off-by: NeilBrown
    Cc:

    NeilBrown
     
  • If the desired size of an array is set (via sysfs) before the array is
    active (which is the normal sequence), we currrently call set_capacity
    immediately.
    This means that a subsequent 'open' (as can be caused by some
    udev-triggers program) will notice the new size and try to probe for
    partitions. However as the array isn't quite ready yet the read will
    fail. Then when the array is read, as the size doesn't change again
    we don't try to re-probe.

    So when setting array size via sysfs, only call set_capacity if the
    array is already active.

    Signed-off-by: NeilBrown

    NeilBrown
     

14 Feb, 2011

1 commit

  • Takeover raid1->raid0 not succeded. Kernel message is shown:
    "md/raid0:md126: too few disks (1 of 2) - aborting!"

    Problem was that we weren't updating ->raid_disks for that
    takeover, unlike all the others.

    Signed-off-by: Krzysztof Wojcik
    Signed-off-by: NeilBrown

    Krzysztof Wojcik
     

08 Feb, 2011

2 commits

  • Following symptoms were observed:
    1. After raid0->raid10 takeover operation we have array with 2
    missing disks.
    When we add disk for rebuild, recovery process starts as expected
    but it does not finish- it stops at about 90%, md126_resync process
    hangs in "D" state.
    2. Similar behavior is when we have mounted raid0 array and we
    execute takeover to raid10. After this when we try to unmount array-
    it causes process umount hangs in "D"

    In scenarios above processes hang at the same function- wait_barrier
    in raid10.c.
    Process waits in macro "wait_event_lock_irq" until the
    "!conf->barrier" condition will be true.
    In scenarios above it never happens.

    Reason was that at the end of level_store, after calling pers->run,
    we call mddev_resume. This calls pers->quiesce(mddev, 0) with
    RAID10, that calls lower_barrier.
    However raise_barrier hadn't been called on that 'conf' yet,
    so conf->barrier becomes negative, which is bad.

    This patch introduces setting conf->barrier=1 after takeover
    operation. It prevents to become barrier negative after call
    lower_barrier().

    Signed-off-by: Krzysztof Wojcik
    Signed-off-by: NeilBrown

    Krzysztof Wojcik
     
  • md_make_request was calling bio_sectors() for part_stat_add
    after it was calling the make_request function. This is
    bad because the make_request function can free the bio and
    because the bi_size field can change around.

    The fix here was suggested by Jens Axboe. It saves the
    sector count before the make_request call. I hit this
    with CONFIG_DEBUG_PAGEALLOC turned on while trying to break
    his pretty fusionio card.

    Cc:
    Signed-off-by: Chris Mason
    Signed-off-by: NeilBrown

    Chris Mason
     

02 Feb, 2011

1 commit


31 Jan, 2011

8 commits

  • There is no need to set this to zero at this point. It will be
    set to zero by remove_and_add_spares or at the start of
    md_do_sync at the latest.
    And setting it to zero before MD_RECOVERY_RUNNING is cleared can
    make a 'zero' appear briefly in the 'sync_completed' sysfs attribute
    just as resync is finishing.

    So simply remove this setting to zero.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • remove_and_add_spares is called in two places where the needs really
    are very different.
    remove_and_add_spares should not be called on an array which is about
    to be reshaped as some extra devices might have been manually added
    and that would remove them. However if the array is 'read-auto',
    that will currently happen, which is bad.

    So in the 'ro != 0' case don't call remove_and_add_spares but simply
    remove the failed devices as the comment suggests is needed.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • This patch introduces raid 1 to raid0 takeover operation
    in kernel space.

    Signed-off-by: Krzysztof Wojcik
    Signed-off-by: Neil Brown

    Krzysztof Wojcik
     
  • This flag is not needed and is used badly.

    Devices that are included in a native-metadata array are reserved
    exclusively for that array - and currently have AllReserved set.
    They all are bd_claimed for the rdev and so cannot be shared.

    Devices that are included in external-metadata arrays can be shared
    among multiple arrays - providing there is no overlap.
    These are bd_claimed for md in general - not for a particular rdev.

    When changing the amount of a device that is used in an array we need
    to check for overlap. This currently includes a check on AllReserved
    So even without overlap, sharing with an AllReserved device is not
    allowed.
    However the bd_claim usage already precludes sharing with these
    devices, so the test on AllReserved is not needed. And in fact it is
    wrong.

    As this is the only use of AllReserved, simply remove all usage and
    definition of AllReserved.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • As spares can be added manually before a reshape starts, we need to
    find them all to mark some of them as in_sync.

    Previously we would abort looking for spares when we found an
    unallocated spare what could not be added to the array (implying there
    was no room for new spares). However already-added spares could be
    later in the list, so we need to keep searching.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • As spares can be added to the array before the reshape is started,
    we need to find and count them when checking there are enough.
    The array could have been degraded, so we need to check all devices,
    no just those out side of the range of devices in the array before
    the reshape.

    So instead of checking the index, check the In_sync flag as that
    reliably tells if the device is a spare or this purpose.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • There are two consecutive 'if' statements.

    if (mddev->delta_disks >= 0)
    ....
    if (mddev->delta_disks > 0)

    The code in the second is equally valid if delta_disks == 0, and these
    two statements are the only place that 'added_devices' is used.

    So make them a single if statement, make added_devices a local
    variable, and re-indent it all.

    No functional change.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • If we try to update_raid_disks and it fails, we should put
    'delta_disks' back to zero. This is important because some code,
    such as slot_store, assumes that delta_disks has been validated.

    Signed-off-by: NeilBrown

    NeilBrown
     

15 Jan, 2011

1 commit

  • Commit e09b457b (block: simplify holder symlink handling) incorrectly
    assumed that there is only one link at maximum. dm may use multiple
    links and expects block layer to track reference count for each link,
    which is different from and unrelated to the exclusive device holder
    identified by @holder when the device is opened.

    Remove the single holder assumption and automatic removal of the link
    and revive the per-link reference count tracking. The code
    essentially behaves the same as before commit e09b457b sans the
    unnecessary kobject reference count dancing.

    While at it, note that this facility should not be used by anyone else
    than the current ones. Sysfs symlinks shouldn't be abused like this
    and the whole thing doesn't belong in the block layer at all.

    Signed-off-by: Tejun Heo
    Reported-by: Milan Broz
    Cc: Jun'ichi Nomura
    Cc: Neil Brown
    Cc: linux-raid@vger.kernel.org
    Cc: Kay Sievers
    Signed-off-by: Jens Axboe

    Tejun Heo
     

14 Jan, 2011

23 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (32 commits)
    dm: raid456 basic support
    dm: per target unplug callback support
    dm: introduce target callbacks and congestion callback
    dm mpath: delay activate_path retry on SCSI_DH_RETRY
    dm: remove superfluous irq disablement in dm_request_fn
    dm log: use PTR_ERR value instead of ENOMEM
    dm snapshot: avoid storing private suspended state
    dm snapshot: persistent make metadata_wq multithreaded
    dm: use non reentrant workqueues if equivalent
    dm: convert workqueues to alloc_ordered
    dm stripe: switch from local workqueue to system_wq
    dm: dont use flush_scheduled_work
    dm snapshot: remove unused dm_snapshot queued_bios_work
    dm ioctl: suppress needless warning messages
    dm crypt: add loop aes iv generator
    dm crypt: add multi key capability
    dm crypt: add post iv call to iv generator
    dm crypt: use io thread for reads only if mempool exhausted
    dm crypt: scale to multiple cpus
    dm crypt: simplify compatible table output
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://neil.brown.name/md:
    md: Fix removal of extra drives when converting RAID6 to RAID5
    md: range check slot number when manually adding a spare.
    md/raid5: handle manually-added spares in start_reshape.
    md: fix sync_completed reporting for very large drives (>2TB)
    md: allow suspend_lo and suspend_hi to decrease as well as increase.
    md: Don't let implementation detail of curr_resync leak out through sysfs.
    md: separate meta and data devs
    md-new-param-to_sync_page_io
    md-new-param-to-calc_dev_sboffset
    md: Be more careful about clearing flags bit in ->recovery
    md: md_stop_writes requires mddev_lock.
    md/raid5: use sysfs_notify_dirent_safe to avoid NULL pointer
    md: Ensure no IO request to get md device before it is properly initialised.
    md: Fix single printks with multiple KERN_s
    md: fix regression resulting in delays in clearing bits in a bitmap
    md: fix regression with re-adding devices to arrays with no metadata

    Linus Torvalds
     
  • When a RAID6 is converted to a RAID5, the extra drive should
    be discarded. However it isn't due to a typo in a comparison.

    This bug was introduced in commit e93f68a1fc6 in 2.6.35-rc4
    and is suitable for any -stable since than.

    As the extra drive is not removed, the 'degraded' counter is wrong and
    so the RAID5 will not respond correctly to a subsequent failure.

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     
  • When adding a spare to an active array, we should check the slot
    number, but allow it to be larger than raid_disks if a reshape
    is being prepared.

    Apply the same test when adding a device to an
    array-under-construction. It already had most of the test in place,
    but not quite all.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • It is possible to manually add spares to specific slots before
    starting a reshape.
    raid5_start_reshape should recognised this possibility and include
    it in the accounting.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • The values exported in the sync_completed file are unsigned long, which
    overflows with very large drives, resulting in wrong values reported.

    Since sync_completed uses sectors as unit, we'll start getting wrong
    values with components larger than 2TB.

    This patch simply replaces the use of unsigned long by unsigned long long.

    Signed-off-by: Rémi Rérolle
    Signed-off-by: NeilBrown

    Rémi Rérolle
     
  • The sysfs attributes 'suspend_lo' and 'suspend_hi' describe a region
    to which read/writes are suspended so that the under lying data can be
    manipulated without user-space noticing.
    Currently the window they describe can only move forwards along the
    device. However this is an unnecessary restriction which will cause
    problems with planned developments.
    So relax this restriction and allow these endpoints to move
    arbitrarily.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • mddev->curr_resync has artificial values of '1' and '2' which are used
    by the code which ensures only one resync is happening at a time on
    any given device.

    These values are internal and should never be exposed to user-space
    (except when translated appropriately as in the 'pending' status in
    /proc/mdstat).

    Unfortunately they are as ->curr_resync is assigned to
    ->curr_resync_completed and that value is directly visible through
    sysfs.

    So change the assignments to ->curr_resync_completed to get the same
    valued from elsewhere in a form that doesn't have the magic '1' or '2'
    values.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Allow the metadata to be on a separate device from the
    data.

    This doesn't mean the data and metadata will by on separate
    physical devices - it simply gives device-mapper and userspace
    tools more flexibility.

    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • Add new parameter to 'sync_page_io'.

    The new parameter allows us to distinguish between metadata and data
    operations. This becomes important later when we add the ability to
    use separate devices for data and metadata.

    Signed-off-by: Jonathan Brassow

    Jonathan Brassow
     
  • When we allow for separate devices for data and metadata
    in a later patch, we will need to be able to calculate
    the superblock offset based on more than the bdev.

    Signed-off-by: Jonathan Brassow

    Jonathan Brassow
     
  • Setting ->recovery to 0 is generally not a good idea as it could clear
    bits that shouldn't be cleared. In particular, MD_RECOVERY_FROZEN
    should only be cleared on explicit request from user-space.

    So when we need to clear things, just clear the bits that need
    clearing.

    As there are a few different places which reap a resync process - and
    some do an incomplte job - factor out the code for doing the from
    md_check_recovery and call that function instead of open coding part
    of it.

    Signed-off-by: NeilBrown
    Reported-by: Jonathan Brassow

    NeilBrown
     
  • As md_stop_writes manipulates the sync_thread and calls md_update_sb,
    it need to be called with mddev_lock held.

    In all internal cases it is, but the symbol is exported for dm-raid to
    call and in that case the lock won't be help.
    Do make an exported version which takes the lock, and an internal
    version which does not.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • With the module parameter 'start_dirty_degraded' set,
    raid5_spare_active() previously called sysfs_notify_dirent() with a NULL
    argument (rdev->sysfs_state) when a rebuild finished.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Mike Snitzer

    Jonathan Brassow
     
  • When an md device is in the process of coming on line it is possible
    for an IO request (typically a partition table probe) to get through
    before the array is fully initialised, which can cause unexpected
    behaviour (e.g. a crash).

    So explicitly record when the array is ready for IO and don't allow IO
    through until then.

    There is no possibility for a similar problem when the array is going
    off-line as there must only be one 'open' at that time, and it is busy
    off-lining the array and so cannot send IO requests. So no memory
    barrier is needed in md_stop()

    This has been a bug since commit 409c57f3801 in 2.6.30 which
    introduced md_make_request. Before then, each personality would
    register its own make_request_fn when it was ready.
    This is suitable for any stable kernel from 2.6.30.y onwards.

    Cc:
    Signed-off-by: NeilBrown
    Reported-by: "Hawrylewicz Czarnowski, Przemyslaw"

    NeilBrown
     
  • Noticed-by: Russell King
    Signed-off-by: Joe Perches
    Signed-off-by: NeilBrown

    Joe Perches
     
  • commit 589a594be1fb (2.6.37-rc4) fixed a problem were md_thread would
    sometimes call the ->run function at a bad time.

    If an error is detected during array start up after the md_thread has
    been started, the md_thread is killed. This resulted in the ->run
    function being called once. However the array may not be in a state
    that it is safe to call ->run.

    However the fix imposed meant that ->run was not called on a timeout.
    This means that when an array goes idle, bitmap bits do not get
    cleared promptly. While the array is busy the bits will still be
    cleared when appropriate so this is not very serious. There is no
    risk to data.

    Change the test so that we only avoid calling ->run when the thread
    is being stopped. This more explicitly addresses the problem situation.

    This is suitable for 2.6.37-stable and any -stable kernel to which
    589a594be1fb was applied.

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     
  • This patch is the skeleton for the DM target that will be
    the bridge from DM to MD (initially RAID456 and later RAID1). It
    provides a way to use device-mapper interfaces to the MD RAID456
    drivers.

    As with all device-mapper targets, the nominal public interfaces are the
    constructor (CTR) tables and the status outputs (both STATUSTYPE_INFO
    and STATUSTYPE_TABLE). The CTR table looks like the following:

    1: raid \
    2: \
    3: ..

    Line 1 contains the standard first three arguments to any device-mapper
    target - the start, length, and target type fields. The target type in
    this case is "raid".

    Line 2 contains the arguments that define the particular raid
    type/personality/level, the required arguments for that raid type, and
    any optional arguments. Possible raid types include: raid4, raid5_la,
    raid5_ls, raid5_rs, raid6_zr, raid6_nr, and raid6_nc. (again, raid1 is
    planned for the future.) The list of required and optional parameters
    is the same for all the current raid types. The required parameters are
    positional, while the optional parameters are given as key/value pairs.
    The possible parameters are as follows:
    Chunk size in sectors.
    [[no]sync] Force/Prevent RAID initialization
    [rebuild ] Rebuild the drive indicated by the index
    [daemon_sleep ] Time between bitmap daemon work to clear bits
    [min_recovery_rate ] Throttle RAID initialization
    [max_recovery_rate ] Throttle RAID initialization
    [max_write_behind ] See '-write-behind=' (man mdadm)
    [stripe_cache ] Stripe cache size for higher RAIDs

    Line 3 contains the list of devices that compose the array in
    metadata/data device pairs. If the metadata is stored separately, a '-'
    is given for the metadata device position. If a drive has failed or is
    missing at creation time, a '-' can be given for both the metadata and
    data drives for a given position.

    Examples:
    # RAID4 - 4 data drives, 1 parity
    # No metadata devices specified to hold superblock/bitmap info
    # Chunk size of 1MiB
    # (Lines separated for easy reading)
    0 1960893648 raid \
    raid4 1 2048 \
    5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

    # RAID4 - 4 data drives, 1 parity (no metadata devices)
    # Chunk size of 1MiB, force RAID initialization,
    # min recovery rate at 20 kiB/sec/disk
    0 1960893648 raid \
    raid4 4 2048 min_recovery_rate 20 sync\
    5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

    Performing a 'dmsetup table' should display the CTR table used to
    construct the mapping (with possible reordering of optional
    parameters).

    Performing a 'dmsetup status' will yield information on the state and
    health of the array. The output is as follows:
    1: raid \
    2:

    Line 1 is standard DM output. Line 2 is best shown by example:
    0 1960893648 raid raid4 5 AAAAA 2/490221568
    Here we can see the RAID type is raid4, there are 5 devices - all of
    which are 'A'live, and the array is 2/490221568 complete with recovery.

    Cc: linux-raid@vger.kernel.org
    Signed-off-by: NeilBrown
    Signed-off-by: Jonathan Brassow
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    NeilBrown
     
  • Add per-target unplug callback support.

    Cc: linux-raid@vger.kernel.org
    Signed-off-by: NeilBrown
    Signed-off-by: Jonathan Brassow
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    NeilBrown
     
  • DM currently implements congestion checking by checking on congestion
    in each component device. For raid456 we need to also check if the
    stripe cache is congested.

    Add per-target congestion checker callback support.

    Extending the target_callbacks structure with additional callback
    functions allows for establishing multiple callbacks per-target (a
    callback is also needed for unplug).

    Cc: linux-raid@vger.kernel.org
    Signed-off-by: NeilBrown
    Signed-off-by: Jonathan Brassow
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    NeilBrown
     
  • This patch adds a user-configurable 'pg_init_delay_msecs' feature. Use
    this feature to specify the number of milliseconds to delay before
    retrying scsi_dh_activate, when SCSI_DH_RETRY is returned.

    SCSI Device Handlers return SCSI_DH_IMM_RETRY if we could retry
    activation immediately and SCSI_DH_RETRY in cases where it is better to
    retry after some delay.

    Currently we immediately retry scsi_dh_activate irrespective of
    SCSI_DH_IMM_RETRY and SCSI_DH_RETRY.

    The 'pg_init_delay_msecs' feature may be provided during table create or
    load, e.g.:
    dmsetup create --table "0 20971520 multipath 3 queue_if_no_path \
    pg_init_delay_msecs 2500 ..." mpatha

    The default for 'pg_init_delay_msecs' is 2000 milliseconds.
    Maximum configurable delay is 60000 milliseconds. Specifying a
    'pg_init_delay_msecs' of 0 will cause immediate retry.

    Signed-off-by: Nikanth Karthikesan
    Signed-off-by: Chandra Seetharaman
    Acked-by: Mike Christie
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Chandra Seetharaman
     
  • This patch changes spin_lock_irq() to spin_lock() in dm_request_fn().
    This patch is just a clean-up and no functional change.

    The spin_lock_irq() was leftover from the early request-based dm code,
    where map_request() used to enable interrupts.
    Since current map_request() never enables interrupts, we can change it
    to spin_lock() to match the prior spin_unlock().

    Auditing through the dm and block-layer code called from
    map_request(), I confirmed all functions save/restore interrupt
    status, so no function returning with interrupts enabled.
    Also I haven't observed any problem on my test environment which
    uses scsi and lpfc driver after heavy I/O testing with occasional
    path down/up.

    Added BUG_ON() to detect breakage in future.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • It's nicer to return the PTR_ERR() value instead of just returning
    -ENOMEM. In the current code the PTR_ERR() value is always equal to
    -ENOMEM so this doesn't actually affect anything, but still...

    In addition, dm_dirty_log_create() doesn't check for a specific -ENOMEM
    return. So this change is safe relative to potential for a non -ENOMEM
    return in the future.

    Signed-off-by: Dan Carpenter
    Acked-by: Jonathan Brassow
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Dan Carpenter