23 Dec, 2011

4 commits

  • commit d0a4bb492772ce5c4bdfba3744a99ed6f6fb238f introduced a
    regression which is annoying but fairly harmless.

    When writing to an array that is undergoing recovery (a spare
    is being integrated into the array), writing to the array will
    set bits in the bitmap, but they will not be cleared when the
    write completes.

    For bits covering areas that have not been recovered yet this is not a
    problem as the recovery will clear the bits. However bits set in
    already-recovered regions will stay set and never be cleared.
    This doesn't risk data integrity. The only negatives are:
    - next time there is a crash, more resyncing than necessary will
    be done.
    - the bitmap doesn't look clean, which is confusing.

    While an array is recovering we don't want to update the
    'events_cleared' setting in the bitmap but we do still want to clear
    bits that have very recently been set - providing they were written to
    the recovering device.

    So split those two needs - which previously both depended on 'success'
    and always clear the bit if the write went to all devices.

    Signed-off-by: NeilBrown

    NeilBrown
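
The split described above can be sketched roughly as follows. This is an illustration only, not the kernel code: `struct write_result` and both function names are hypothetical, and the real condition for updating 'events_cleared' may involve more state than is shown here.

```c
#include <stdbool.h>

/* Hypothetical summary of how one write to the array ended. */
struct write_result {
    bool reached_all_devices; /* every device, including any recovering spare, was written */
    bool array_recovering;    /* a spare is still being integrated */
};

/* The bitmap bit may be cleared whenever the write reached all devices. */
bool may_clear_bit(const struct write_result *w)
{
    return w->reached_all_devices;
}

/* 'events_cleared' is left alone for as long as the array is recovering. */
bool may_update_events_cleared(const struct write_result *w)
{
    return w->reached_all_devices && !w->array_recovering;
}
```

With both decisions tied to a single 'success' value, a write during recovery could satisfy neither; keeping them separate lets the bit be cleared while 'events_cleared' stays untouched.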
     
  • Before performing a recovery we try to remove any spares that
    might not be working, then add any that might have become relevant.

    Currently we abort on the first spare that cannot be added.
    This is a false optimisation.
    It is conceivable that - depending on rules in the personality - a
    subsequent spare might be accepted.
    Also the loop does other things like count the available spares and
    reset the 'recovery_offset' value.

    If we abort early these might not happen properly.

    So remove the early abort.

    In particular if you have an array that is undergoing recovery and
    which has extra spares, then the recovery may not restart after a
    reboot as the count of 'spares' might end up as zero.

    Reported-by: Anssi Hannula
    Signed-off-by: NeilBrown

    NeilBrown
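
A minimal sketch of the loop shape this commit argues for, assuming a hypothetical `struct spare` whose `accepted` field stands in for the personality's decision. The point is only that a rejected spare is skipped rather than ending the loop, so the per-spare bookkeeping and the count still cover every spare.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical spare descriptor; 'accepted' stands in for the personality's verdict. */
struct spare {
    bool accepted;
    long recovery_offset;
};

/* Visit every spare: reset its recovery_offset and count the accepted ones. */
int add_spares(struct spare *spares, size_t n)
{
    int count = 0;

    for (size_t i = 0; i < n; i++) {
        spares[i].recovery_offset = 0;
        if (!spares[i].accepted)
            continue; /* a 'break' here would skip the remaining spares */
        count++;
    }
    return count;
}
```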
     
  • While reshaping a degraded array (as when reshaping a RAID0 by first
    converting it to a degraded RAID4) we currently get confused about
    which devices are in_sync. In most cases we get it right, but in the
    region that is being reshaped we need to treat non-failed devices as
    in-sync when we have the data but haven't actually written it out yet.

    Reported-by: Adam Kwolek
    Signed-off-by: NeilBrown

    NeilBrown
     
  • commit d70ed2e4fafdbef0800e73942482bb075c21578b
    broke hot-add to a linear array.
    After that commit, metadata is not written to devices until they
    have been fully integrated into the array as determined by
    saved_raid_disk. That patch arranged to clear that field after
    a recovery completed.

    However for linear arrays, there is no recovery - the integration is
    instantaneous. So we need to explicitly clear the saved_raid_disk
    field.

    Signed-off-by: NeilBrown

    NeilBrown
     

09 Dec, 2011

1 commit


08 Dec, 2011

5 commits

  • Once a device is failed we really want to completely ignore it.
    It should go away soon anyway.

    In particular the presence of bad blocks on it should not cause us to
    block as we won't be trying to write there anyway.

    So as soon as we can check if a device is Faulty, do so and pretend
    that it is already gone if it is Faulty.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • When we mark blocks as bad we need them to be acknowledged by the
    metadata handler promptly.

    For an in-kernel metadata handler that was already being done. But
    for an external metadata handler we need to alert it of the change by
    sending a notification through the sysfs file. This adds that
    notification.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Once a device is marked Faulty the badblocks - whether acknowledged or
    not - become irrelevant. So they shouldn't cause the device to be
    marked as Blocked.

    Without this patch, a process might write "-blocked" to clear the
    Blocked status, but while that will correctly fail the device, it
    won't remove the apparent 'blocked' status.

    Signed-off-by: NeilBrown

    NeilBrown
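
The rule in this commit can be sketched as a small predicate. The struct and field names below are hypothetical stand-ins, not the kernel's; the sketch assumes that an explicitly set Blocked flag is still reported, and that only unacknowledged bad blocks are ignored once the device is Faulty.

```c
#include <stdbool.h>

/* Hypothetical per-device state. */
struct member_dev {
    bool faulty;            /* device has been marked Faulty */
    bool blocked_flag;      /* Blocked was set explicitly */
    bool unacked_badblocks; /* bad blocks exist that are not yet acknowledged */
};

/* Should the device be reported as 'blocked'? */
bool appears_blocked(const struct member_dev *d)
{
    if (d->blocked_flag)
        return true;
    /* Bad blocks stop mattering once the device is Faulty. */
    return d->unacked_badblocks && !d->faulty;
}
```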
     
  • When we are accessing an mddev via sysfs we know that the
    mddev cannot disappear because it has an embedded kobj which
    is refcounted by sysfs.
    And we also take the mddev_lock.
    However this is not enough.

    The final mddev_put could have been called and the
    mddev_delayed_delete is waiting for sysfs to let go so it can destroy
    the kobj and mddev.
    In this state there are a lot of changes that should not be attempted.

    To guard against this we:
    - initialise mddev->all_mddevs on last put so the state can be
    easily detected.
    - in md_attr_show and md_attr_store, check ->all_mddevs under
    all_mddevs_lock and mddev_get the mddev if it still appears to
    be active.

    This means that if we get to sysfs as the mddev is being deleted we
    will get -EBUSY.

    rdev_attr_store and rdev_attr_show are similar but already have
    sufficient protection. They check that rdev->mddev still points to
    mddev after taking mddev_lock. As this is cleared before delayed
    removal which can only be requested under the mddev_lock, this
    ensures the rdev and mddev are still alive.

    Signed-off-by: NeilBrown

    NeilBrown
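
The guard described in this commit follows a common pattern: check that the object is still active, pin it, and only then act on it. The sketch below is single-threaded and reduces the locking to comments; `struct dev_obj`, `guarded_access`, `final_put` and `read_refcount` are all hypothetical names, not the md functions.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for an mddev. */
struct dev_obj {
    bool on_active_list; /* becomes false once the final put has happened */
    int refcount;
};

/* Final put: mark the object inactive so that later accesses can detect it. */
void final_put(struct dev_obj *d)
{
    d->on_active_list = false;
}

/* Example operation used to exercise the guard. */
int read_refcount(struct dev_obj *d)
{
    return d->refcount;
}

/* Run 'op' only if the object still appears active; otherwise return -EBUSY. */
int guarded_access(struct dev_obj *d, int (*op)(struct dev_obj *))
{
    int rv;

    /* (the list lock would be taken here) */
    if (!d->on_active_list)
        return -EBUSY;
    d->refcount++; /* pin the object before the lock is dropped */
    /* (the list lock would be released here) */

    rv = op(d);
    d->refcount--;
    return rv;
}
```

An access that arrives after `final_put()` gets -EBUSY instead of touching an object that is about to be destroyed.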
     
  • We like md devices to disappear when they really are not needed.
    However it is not possible to tell from the current state whether it
    is needed or not. We can only tell from recent history of changes.

    In particular immediately after we create an md device it looks very
    similar to immediately after we have finished with it.

    So we always preserve a newly created md device until something
    significant happens. This state is stored in 'hold_active'.

    The normal case is to keep it until an ioctl happens, as that will
    normally either activate it, or explicitly de-activate it. If it
    doesn't then it was probably created by mistake and it is now time to
    get rid of it.

    We can also modify an array via sysfs (instead of via ioctl) and we
    currently treat any change via sysfs like an ioctl: as a sign that,
    if the array is not now more active, it should be destroyed.
    However this is not appropriate as changes made via sysfs are more
    gradual so we should look for a more definitive change.

    So this patch changes 'hold_active' from UNTIL_IOCTL to clear only
    when the array_state is changed via sysfs. Other changes via sysfs
    are ignored.

    Signed-off-by: NeilBrown

    NeilBrown
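
A rough sketch of the resulting policy, using hypothetical enum and function names (only 'hold_active', UNTIL_IOCTL and array_state come from the commit message): an ioctl always ends the hold, while a sysfs write ends it only when the attribute written is array_state.

```c
#include <stdbool.h>

/* Hypothetical encodings; the commit only names the UNTIL_IOCTL state. */
enum hold_state { HOLD_NONE, HOLD_UNTIL_IOCTL };
enum sysfs_attr { ATTR_ARRAY_STATE, ATTR_OTHER };

struct md_dev {
    enum hold_state hold_active;
};

/* Any ioctl counts as the "significant" event that ends the hold. */
void on_ioctl(struct md_dev *m)
{
    if (m->hold_active == HOLD_UNTIL_IOCTL)
        m->hold_active = HOLD_NONE;
}

/* Via sysfs, only a change to array_state ends the hold. */
void on_sysfs_store(struct md_dev *m, enum sysfs_attr attr)
{
    if (attr == ATTR_ARRAY_STATE && m->hold_active == HOLD_UNTIL_IOCTL)
        m->hold_active = HOLD_NONE;
}
```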
     

23 Nov, 2011

1 commit

  • Page attributes are set using __set_bit rather than set_bit as
    it is normally called under a spinlock so the extra atomicity is not
    needed.

    However there are two places where we might set or clear page
    attributes without holding the spinlock.
    So add the spinlock in those cases.

    This might be the cause of occasional reports that bits aren't
    getting cleared properly - the theory is that BITMAP_PAGE_PENDING gets lost
    when BITMAP_PAGE_NEEDWRITE is set or cleared. This is an
    inconvenience, not a threat to data safety.

    Signed-off-by: NeilBrown

    NeilBrown
     

08 Nov, 2011

5 commits


07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

05 Nov, 2011

1 commit

  • * 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
    block: don't call blk_drain_queue() if elevator is not up
    blk-throttle: use queue_is_locked() instead of lockdep_is_held()
    blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
    blk-throttle: Free up policy node associated with deleted rule
    block: warn if tag is greater than real_max_depth.
    block: make gendisk hold a reference to its queue
    blk-flush: move the queue kick into
    blk-flush: fix invalid BUG_ON in blk_insert_flush
    block: Remove the control of complete cpu from bio.
    block: fix a typo in the blk-cgroup.h file
    block: initialize the bounce pool if high memory may be added later
    block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
    block: drop @tsk from attempt_plug_merge() and explain sync rules
    block: make get_request[_wait]() fail if queue is dead
    block: reorganize throtl_get_tg() and blk_throtl_bio()
    block: reorganize queue draining
    block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
    block: pass around REQ_* flags instead of broken down booleans during request alloc/free
    block: move blk_throtl prototypes to block/blk.h
    block: fix genhd refcounting in blkio_policy_parse_and_set()
    ...

    Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
    and making the request functions be of type "void" instead of "int" in
    - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
    - drivers/staging/zram/zram_drv.c

    Linus Torvalds
     

03 Nov, 2011

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/steve/linux-dm:
    dm: raid fix device status indicator when array initializing
    dm log userspace: add log device dependency
    dm log userspace: fix comment hyphens
    dm: add thin provisioning target
    dm: add persistent data library
    dm: add bufio
    dm: export dm get md
    dm table: add immutable feature
    dm table: add always writeable feature
    dm table: add singleton feature
    dm kcopyd: add dm_kcopyd_zero to zero an area
    dm: remove superfluous smp_mb
    dm: use local printk ratelimit
    dm table: propagate non rotational flag

    Linus Torvalds
     

01 Nov, 2011

17 commits

  • These files were getting the defines for EXPORT_SYMBOL because
    device.h was including module.h. But we are going to put an
    end to that. So add the proper export.h include now.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • A pending cleanup will mean that module.h won't be implicitly
    everywhere anymore. Make sure the modular drivers in md dir
    are actually calling out for it explicitly in advance.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • * 'for-linus' of git://neil.brown.name/md:
    md/raid10: Fix bug when activating a hot-spare.

    Linus Torvalds
     
  • When devices in a RAID array are not in-sync, they are supposed to be
    reported as such in the status output as an 'a' character, which means
    "alive, but not in-sync". But when the entire array is rebuilt 'A' is
    being used, which is incorrect. This patch corrects this to 'a'.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan E Brassow
     
  • Allow userspace dm log implementations to register their log device so it
    is no longer missing from the list of device dependencies.

    When device mapper targets use a device they normally call dm_get_device
    which includes it in the device list returned to userspace applications
    such as LVM through the DM_TABLE_DEPS ioctl. Userspace log devices
    don't use dm_get_device as userspace opens them so they are missing from
    the list of dependencies.

    This patch extends the DM_ULOG_CTR operation to allow userspace to
    respond with the name of the log device (if appropriate) to be
    registered via 'dm_get_device'. DM_ULOG_REQUEST_VERSION is incremented.

    This is backwards compatible. If the kernel and userspace log server
    have both been updated, the new information will be passed down to the
    kernel and the device will be registered. If the kernel is new, but
    the log server is old, the log server will not pass down any device
    information and the kernel will simply bypass the device registration
    as before. If the kernel is old but the log server is new, the log
    server will see the old version number and not pass the device info.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan E Brassow
     
  • Fix comments: clustered-disk needs a hyphen not an underscore.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • Initial EXPERIMENTAL implementation of device-mapper thin provisioning
    with snapshot support. The 'thin' target is used to create instances of
    the virtual devices that are hosted in the 'thin-pool' target. The
    thin-pool target provides data sharing among devices. This sharing is
    made possible using the persistent-data library in the previous patch.

    The main highlight of this implementation, compared to the previous
    implementation of snapshots, is that it allows many virtual devices to
    be stored on the same data volume, simplifying administration and
    allowing sharing of data between volumes (thus reducing disk usage).

    Another big feature is support for arbitrary depth of recursive
    snapshots (snapshots of snapshots of snapshots ...). The previous
    implementation of snapshots did this by chaining together lookup tables,
    and so performance was O(depth). This new implementation uses a single
    data structure so we don't get this degradation with depth.

    For further information and examples of how to use this, please read
    Documentation/device-mapper/thin-provisioning.txt

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • The persistent-data library offers a re-usable framework for the storage
    and management of on-disk metadata in device-mapper targets.

    It's used by the thin-provisioning target in the next patch and in an
    upcoming hierarchical storage target.

    For further information, please read
    Documentation/device-mapper/persistent-data.txt

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • The dm-bufio interface allows you to do cached I/O on devices,
    holding recently-read blocks in memory and performing delayed writes.

    We don't use buffer cache or page cache already present in the kernel, because:
    * we need to handle block sizes larger than a page
    * we can't allocate memory to perform reads or we'd have deadlocks

    Currently, when a cache is required, we limit its size to a fraction of
    available memory. Usage can be viewed and changed in
    /sys/module/dm_bufio/parameters/ .

    The first user is thin provisioning, but more dm users are planned.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Export dm_get_md() for the new thin provisioning target to use.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Introduce DM_TARGET_IMMUTABLE to indicate that the target type cannot be mixed
    with any other target type, and once loaded into a device, it cannot be
    replaced with a table containing a different type.

    The thin provisioning pool device will use this.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Add a target feature flag DM_TARGET_ALWAYS_WRITEABLE to indicate that a target
    does not support read-only mode.

    The initial implementation of the thin provisioning target uses this.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Introduce the concept of a singleton table which contains exactly one target.

    If a target type sets the DM_TARGET_SINGLETON feature bit device-mapper
    will ensure that any table that includes that target contains no others.

    The thin provisioning pool target uses this.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
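
The singleton rule can be sketched as a check over the feature words of the targets in a table. This is an illustration under assumptions: `FEATURE_SINGLETON` and its value are placeholders chosen for the example, and `singleton_rule_ok` is a hypothetical helper, not a device-mapper function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder feature bit; the value is chosen for this example only. */
#define FEATURE_SINGLETON 0x1u

/*
 * 'features' holds one feature word per target in the table.
 * A target carrying the singleton bit must be the only target present.
 */
bool singleton_rule_ok(const unsigned *features, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if ((features[i] & FEATURE_SINGLETON) && n != 1)
            return false;
    }
    return true;
}
```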
     
  • This patch introduces dm_kcopyd_zero() to make it easy to use
    kcopyd to write zeros into the requested areas instead
    of copying. It is implemented by passing a NULL
    copying source to dm_kcopyd_copy().

    The forthcoming thin provisioning target uses this.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
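
The "NULL source means zero" convention can be shown with a small userspace analogue. The functions below are hypothetical and operate on plain memory buffers; they are not the kcopyd interface, which works on device regions and completes asynchronously.

```c
#include <stddef.h>
#include <string.h>

/* Copy 'len' bytes from 'src' to 'dst'; a NULL 'src' means "write zeros". */
void copy_region(unsigned char *dst, const unsigned char *src, size_t len)
{
    if (src == NULL)
        memset(dst, 0, len);
    else
        memcpy(dst, src, len);
}

/* Thin wrapper that makes the zeroing intent explicit at the call site. */
void zero_region(unsigned char *dst, size_t len)
{
    copy_region(dst, NULL, len);
}
```

Routing both cases through one function means the zeroing path reuses the same plumbing as an ordinary copy.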
     
  • Since set_current_state() contains a memory barrier in it,
    an additional barrier isn't needed.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Alasdair G Kergon

    Namhyung Kim
     
  • printk_ratelimit() shares global ratelimiting state with all
    other subsystems, so its usage is discouraged. Instead,
    define and use dm's local state.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Alasdair G Kergon

    Namhyung Kim
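
To illustrate why local state matters, here is a minimal interval-and-burst limiter with the state passed in explicitly. It is a sketch under assumptions: `struct ratelimit` and `ratelimit_allow` are hypothetical names, and time is an abstract tick counter supplied by the caller rather than a kernel clock.

```c
#include <stdbool.h>

/* Hypothetical per-subsystem limiter state. */
struct ratelimit {
    long interval;     /* window length, in caller-defined ticks */
    int burst;         /* messages allowed per window */
    long window_start; /* tick at which the current window began */
    int printed;       /* messages already allowed in this window */
};

/* Return true if a message may be emitted at time 'now'. */
bool ratelimit_allow(struct ratelimit *rs, long now)
{
    if (now - rs->window_start >= rs->interval) {
        rs->window_start = now; /* start a fresh window */
        rs->printed = 0;
    }
    if (rs->printed >= rs->burst)
        return false;
    rs->printed++;
    return true;
}
```

Because each caller owns its own `struct ratelimit`, a noisy subsystem that exhausts its budget does not suppress messages from anyone else.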
     
  • Allow QUEUE_FLAG_NONROT to propagate up the device stack if all
    underlying devices are non-rotational. Tools like ureadahead will
    schedule IOs differently based on the rotational flag.

    With this patch, I see boot time go from 7.75 s to 7.46 s on my device.

    Suggested-by: J. Richard Barnette
    Signed-off-by: Mandeep Singh Baines
    Signed-off-by: Mike Snitzer
    Cc: Neil Brown
    Cc: Jens Axboe
    Cc: Martin K. Petersen
    Cc: dm-devel@redhat.com
    Signed-off-by: Alasdair G Kergon

    Mandeep Singh Baines
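
The propagation rule amounts to a logical AND over the underlying devices. A minimal sketch, with a hypothetical function name and a plain array of booleans standing in for the per-device queue flags:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * The stacked device is reported as non-rotational only if every
 * underlying device is. An empty list yields true here; how a real
 * implementation treats that case is not covered by this sketch.
 */
bool stack_is_nonrot(const bool *dev_nonrot, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!dev_nonrot[i])
            return false;
    }
    return true;
}
```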
     

31 Oct, 2011

1 commit

  • This is a fairly serious bug in RAID10.

    When a RAID10 array is degraded and a hot-spare is activated, the
    spare does not take up the empty slot, but rather replaces the first
    working device.
    This is likely to make the array non-functional. It would normally
    be possible to recover the data, but that would need care and is not
    guaranteed.

    This bug was introduced in commit
    2bb77736ae5dca0a189829fbb7379d43364a9dac
    which first appeared in 3.1.

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
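
The intended behaviour, choosing the empty slot rather than an occupied one, can be sketched like this. The function and the `occupied` array are hypothetical simplifications; the real code also has to respect the RAID10 layout and device state.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return the index of the first slot with no device, or -1 if every slot is occupied. */
int find_empty_slot(const bool *occupied, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!occupied[i])
            return (int)i;
    }
    return -1;
}
```

For a four-slot array whose second device is missing, the spare should land in index 1 and must never displace the working device in index 0.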
     

27 Oct, 2011

1 commit

  • * 'for-linus' of git://neil.brown.name/md: (34 commits)
    md: Fix some bugs in recovery_disabled handling.
    md/raid5: fix bug that could result in reads from a failed device.
    lib/raid6: Fix filename emitted in generated code
    md.c: trivial comment fix
    MD: Allow restarting an interrupted incremental recovery.
    md: clear In_sync bit on devices added to an active array.
    md: add proper write-congestion reporting to RAID1 and RAID10.
    md: rename "mdk_personality" to "md_personality"
    md/bitmap remove fault injection options.
    md/raid5: typedef removal: raid5_conf_t -> struct r5conf
    md/raid1: typedef removal: conf_t -> struct r1conf
    md/raid10: typedef removal: conf_t -> struct r10conf
    md/raid0: typedef removal: raid0_conf_t -> struct r0conf
    md/multipath: typedef removal: multipath_conf_t -> struct mpconf
    md/linear: typedef removal: linear_conf_t -> struct linear_conf
    md/faulty: remove typedef: conf_t -> struct faulty_conf
    md/linear: remove typedefs: dev_info_t -> struct dev_info
    md: remove typedefs: mirror_info_t -> struct mirror_info
    md: remove typedefs: r10bio_t -> struct r10bio and r1bio_t -> struct r1bio
    md: remove typedefs: mdk_thread_t -> struct md_thread
    ...

    Linus Torvalds
     

26 Oct, 2011

2 commits

  • In 3.0 we changed the way recovery_disabled was handled so that instead
    of testing against zero, we test an mddev-> value against a conf->
    value.
    Two problems:
    1/ one place in raid1 was missed and still sets to '1'.
    2/ We didn't explicitly set the conf-> value at array creation
    time.
    It defaulted to '0' just like the mddev value does so they
    could appear equal and thus disable recovery.
    This did not affect normal 'md' as it calls bind_rdev_to_array
    which changes the mddev value. However the dmraid interface
    doesn't call this and so doesn't change ->recovery_disabled; so at
    array start all recovery is incorrectly disabled.

    So initialise the 'conf' value to one less than the mddev value, so
    they will only be the same when explicitly set that way.

    Reported-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    NeilBrown
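
The comparison scheme and the initialisation fix can be sketched as follows. The two structs and the three functions are hypothetical stand-ins for the mddev-> and conf-> values described above.

```c
#include <stdbool.h>

struct array_state { int recovery_disabled; }; /* stands in for the mddev-> value */
struct pers_conf { int recovery_disabled; };   /* stands in for the conf-> value */

/* Recovery counts as disabled only when the two values match. */
bool recovery_is_disabled(const struct array_state *m, const struct pers_conf *c)
{
    return m->recovery_disabled == c->recovery_disabled;
}

/* At creation, start one below the array-wide value so the two differ. */
void conf_init(struct pers_conf *c, const struct array_state *m)
{
    c->recovery_disabled = m->recovery_disabled - 1;
}

/* Disable recovery explicitly by copying the current array-wide value. */
void disable_recovery(struct pers_conf *c, const struct array_state *m)
{
    c->recovery_disabled = m->recovery_disabled;
}
```

If both values are simply left at zero, `recovery_is_disabled()` returns true by accident, which is the failure mode described for the dmraid interface.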
     
  • This bug was introduced in 415e72d034c50520ddb7ff79e7d1792c1306f0c9
    which was in 2.6.36.

    There is a small window of time between when a device fails and when
    it is removed from the array. During this time we might still read
    from it, but we won't write to it - so it is possible that we could
    read stale data.

    We didn't need the test of 'Faulty' before because the test on
    In_sync was sufficient. Since we started allowing reads from the early
    part of non-In_sync devices we need a test on Faulty too.

    This is suitable for any kernel from 2.6.36 onwards, though the patch
    might need a bit of tweaking in 3.0 and earlier.

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
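
A simplified sketch of the read-eligibility test this commit describes. `struct member` and `can_read` are hypothetical, and the check is made for a single sector rather than a range; the point is that Faulty must be tested first, because a failed device may still have a non-zero recovery_offset.

```c
#include <stdbool.h>

/* Hypothetical per-device state. */
struct member {
    bool faulty;
    bool in_sync;
    unsigned long long recovery_offset; /* sectors below this have been recovered */
};

/* May 'sector' be read from this device? */
bool can_read(const struct member *d, unsigned long long sector)
{
    if (d->faulty)
        return false; /* never read from a failed device */
    return d->in_sync || sector < d->recovery_offset;
}
```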