23 Dec, 2011

2 commits

  • hot-replace is a feature being added to md which will allow a
    device to be replaced without removing it from the array first.

    With hot-replace a spare can be activated and recovery can start while
    the original device is still in place, thus allowing a transition from
    an unreliable device to a reliable device without leaving the array
    degraded during the transition. It can also be used when the original
    device is still reliable but is not wanted for some reason.

    This will eventually be supported in RAID4/5/6 and RAID10.

    This patch adds a super-block flag to distinguish the replacement
    device. If an old kernel sees this flag it will reject the device.

    It also adds two per-device flags which are viewable and settable via
    sysfs.
    "want_replacement" can be set to request that a device be replaced.
    "replacement" is set to show that this device is replacing another
    device.

    The "rd%d" links in /sys/block/mdXx/md only apply to the original
    device, not the replacement. We currently don't make links for the
    replacement - there doesn't seem to be a need.
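
    As an illustration of driving these flags from user space, here is a
    minimal sketch. It assumes the flags are exposed as keywords in a
    per-device state file (the path is supplied by the caller and is not
    taken from this patch); md_write_state and md_state_has are
    hypothetical helper names. Keywords are matched as whole tokens so
    that "replacement" is not confused with "want_replacement".

```c
#include <stdio.h>
#include <string.h>

/*
 * Write a flag keyword such as "want_replacement" to a per-device state
 * file. The path is an assumption supplied by the caller.
 * Returns 0 on success, -1 on failure.
 */
static int md_write_state(const char *state_path, const char *keyword)
{
    FILE *f = fopen(state_path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fputs(keyword, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Return 1 if the state file lists 'keyword' as a whole token,
 * 0 if it does not, -1 if the file cannot be read.
 */
static int md_state_has(const char *state_path, const char *keyword)
{
    char buf[256];
    size_t n;
    char *tok;
    FILE *f = fopen(state_path, "r");

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';

    /* Tokens are assumed to be separated by commas, spaces or newlines. */
    for (tok = strtok(buf, ", \n"); tok; tok = strtok(NULL, ", \n"))
        if (strcmp(tok, keyword) == 0)
            return 1;
    return 0;
}
```

    A caller would pass the device's state file to md_write_state with
    "want_replacement", then poll md_state_has on the spare for
    "replacement".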

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Soon an array will be able to have multiple devices with the
    same raid_disk number (an original and a replacement). So removing
    a device based on the number won't work. So pass the actual device
    handle instead.

    Reviewed-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
     

05 Nov, 2011

1 commit

  • * 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
    block: don't call blk_drain_queue() if elevator is not up
    blk-throttle: use queue_is_locked() instead of lockdep_is_held()
    blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
    blk-throttle: Free up policy node associated with deleted rule
    block: warn if tag is greater than real_max_depth.
    block: make gendisk hold a reference to its queue
    blk-flush: move the queue kick into
    blk-flush: fix invalid BUG_ON in blk_insert_flush
    block: Remove the control of complete cpu from bio.
    block: fix a typo in the blk-cgroup.h file
    block: initialize the bounce pool if high memory may be added later
    block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
    block: drop @tsk from attempt_plug_merge() and explain sync rules
    block: make get_request[_wait]() fail if queue is dead
    block: reorganize throtl_get_tg() and blk_throtl_bio()
    block: reorganize queue draining
    block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
    block: pass around REQ_* flags instead of broken down booleans during request alloc/free
    block: move blk_throtl prototypes to block/blk.h
    block: fix genhd refcounting in blkio_policy_parse_and_set()
    ...

    Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
    and making the request functions be of type "void" instead of "int" in
    - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
    - drivers/staging/zram/zram_drv.c

    Linus Torvalds
     

19 Oct, 2011

1 commit


11 Oct, 2011

4 commits


21 Sep, 2011

2 commits

  • Signed-off-by: Wang Sheng-Hui
    Signed-off-by: NeilBrown

    Wang Sheng-Hui
     
  • Two related problems:

    1/ some error paths call "md_unregister_thread(mddev->thread)"
    without subsequently clearing ->thread. A subsequent call
    to mddev_unlock will try to wake the thread, and crash.

    2/ Most calls to md_wakeup_thread are protected against the thread
    disappearing either by:
    - holding the ->mutex
    - having an active request, so something else must be keeping
    the array active.
    However mddev_unlock calls md_wakeup_thread after dropping the
    mutex and without any certainty of an active request, so the
    ->thread could theoretically disappear.
    So we need a spinlock to provide some protection.

    So change md_unregister_thread to take a pointer to the thread
    pointer, and ensure that it always does the required locking, and
    clears the pointer properly.
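
    The pointer-to-pointer pattern can be sketched in user space as
    follows. This is not the kernel code: a pthread mutex stands in for
    the spinlock, and struct worker, wakeup_worker and unregister_worker
    are hypothetical names. The point is that the slot is cleared under
    the same lock that the wakeup path takes, so a waker never sees a
    stale pointer.

```c
#include <pthread.h>
#include <stdlib.h>

/* Stand-in for the real per-array thread state. */
struct worker {
    int id;
};

/* One lock guards every read and write of a worker slot. */
static pthread_mutex_t thread_ptr_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Wake the worker if it still exists. Returns 1 if a worker was present
 * (real code would signal it while still holding the lock), 0 otherwise.
 */
static int wakeup_worker(struct worker **slot)
{
    int woke = 0;

    pthread_mutex_lock(&thread_ptr_lock);
    if (*slot)
        woke = 1;
    pthread_mutex_unlock(&thread_ptr_lock);
    return woke;
}

/*
 * Detach the worker from its slot under the lock, then free it.
 * Taking the address of the slot lets this function clear the caller's
 * pointer, so no error path can forget to do so.
 */
static void unregister_worker(struct worker **slot)
{
    struct worker *w;

    pthread_mutex_lock(&thread_ptr_lock);
    w = *slot;
    *slot = NULL;
    pthread_mutex_unlock(&thread_ptr_lock);

    free(w); /* free(NULL) is a no-op, so a repeated call is harmless */
}
```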

    Reported-by: "Moshe Melnikov"
    Signed-off-by: NeilBrown
    cc: stable@kernel.org

    NeilBrown
     

12 Sep, 2011

1 commit

  • There is very little benefit in letting a ->make_request
    instance update the bio's device and sector and loop around it in
    __generic_make_request when we can achieve the same by calling
    generic_make_request from the driver and letting the loop in
    generic_make_request handle it.

    Note that various drivers got the return value from ->make_request
    wrong and returned non-zero values for errors.

    Signed-off-by: Christoph Hellwig
    Acked-by: NeilBrown
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

28 Jul, 2011

5 commits

  • It is only safe to choose not to write to a bad block if that bad
    block is safely recorded in metadata - i.e. if it has been
    'acknowledged'.

    If it hasn't we need to wait for the acknowledgement.

    We support that using rdev->blocked_wait and
    md_wait_for_blocked_rdev by introducing a new device flag
    'BlockedBadBlocks'.

    This flag is only advisory.
    It is cleared whenever we acknowledge a bad block, so that a waiter
    can re-check the particular bad blocks that it is interested in.

    It should be set by a caller when they find they need to wait.
    This (set after test) is inherently racy, but as
    md_wait_for_blocked_rdev already has a timeout, losing the race will
    have minimal impact.

    When we clear "Blocked" we also clear "BlockedBadBlocks" in case it
    was set incorrectly (see above race).

    We also modify the way we manage 'Blocked' to fit better with the new
    handling of 'BlockedBadBlocks' and to make it consistent between
    externally managed and internally managed metadata. This requires
    that each raidXd loop checks if the metadata needs to be written and
    triggers a write (md_check_recovery) if needed. Otherwise a queued
    write request might cause raidXd to wait for the metadata to write,
    and only that thread can write it.

    Before writing metadata, we set FaultRecorded for all devices that
    are Faulty, then after writing the metadata we clear Blocked for any
    device for which the Fault was certainly Recorded.

    The 'faulty' device flag now appears in sysfs if the device is faulty
    *or* it has unacknowledged bad blocks. So user-space which does not
    understand bad blocks can continue to function correctly.
    User space which does, should not assume a device is faulty until it
    sees the 'faulty' flag, and then sees the list of unacknowledged bad
    blocks is empty.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • If a device has ever seen a write error, we will want to handle
    known-bad-blocks differently.
    So create an appropriate state flag and export it via sysfs.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     
  • Now that we have a bad block list, we should not read from those
    blocks.
    There are several main parts to this:
    1/ read_balance needs to check for bad blocks, and return not only
    the chosen device, but also how many good blocks are available
    there.
    2/ fix_read_error needs to avoid trying to read from bad blocks.
    3/ read submission must be ready to issue multiple reads to
    different devices as different bad blocks on different devices
    could mean that a single large read cannot be served by any one
    device, but can still be served by the array.
    This requires keeping count of the number of outstanding requests
    per bio. This count is stored in 'bi_phys_segments'
    4/ retrying a read needs to also be ready to submit a smaller read
    and queue another request for the rest.

    This does not yet handle bad blocks when reading to perform resync,
    recovery, or check.

    'md_trim_bio' will also be used for RAID10, so put it in md.c and
    export it.
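
    The splitting described in points 1/, 3/ and 4/ can be sketched as
    below. This is an illustration only, not the raid1 code: struct
    bad_range, struct dev_view, good_sectors and count_sub_reads are
    hypothetical names, and the device choice is simply "whichever
    device offers the longest good run", which is an assumption rather
    than the real read_balance policy.

```c
#include <stddef.h>

struct bad_range {
    unsigned long long start; /* first bad sector */
    unsigned long long len;   /* number of bad sectors */
};

struct dev_view {
    const struct bad_range *bad; /* bad ranges known for this device */
    size_t nbad;
};

/*
 * Return how many sectors, starting at 'sector' and capped at 'want',
 * can be read before reaching a bad range. Returns 0 if 'sector'
 * itself is bad.
 */
static unsigned long long good_sectors(const struct bad_range *bad,
                                       size_t nbad,
                                       unsigned long long sector,
                                       unsigned long long want)
{
    unsigned long long good = want;
    size_t i;

    for (i = 0; i < nbad; i++) {
        unsigned long long end = bad[i].start + bad[i].len;

        if (sector >= bad[i].start && sector < end)
            return 0;
        if (bad[i].start > sector && bad[i].start - sector < good)
            good = bad[i].start - sector;
    }
    return good;
}

/*
 * Count how many sub-reads are needed to serve [sector, sector + len)
 * from the given devices. Returns -1 if some sector is bad everywhere.
 */
static int count_sub_reads(const struct dev_view *devs, size_t ndev,
                           unsigned long long sector,
                           unsigned long long len)
{
    int pieces = 0;

    while (len > 0) {
        unsigned long long best = 0;
        size_t d;

        for (d = 0; d < ndev; d++) {
            unsigned long long g =
                good_sectors(devs[d].bad, devs[d].nbad, sector, len);
            if (g > best)
                best = g;
        }
        if (best == 0)
            return -1; /* no device can serve this sector */
        sector += best;
        len -= best;
        pieces++;
    }
    return pieces;
}
```

    The returned piece count corresponds to the number of outstanding
    requests that must be tracked per original bio.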

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Space must have been allocated when array was created.
    A feature flag is set when the badblock list is non-empty, to
    ensure old kernels don't load and trust the whole device.

    We only update the on-disk badblocklist when it has changed.
    If the badblocklist (or other metadata) is stored on a bad block, we
    don't cope very well.

    If metadata has no room for bad block, flag bad-blocks as disabled,
    and do the same for 0.90 metadata.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • This is the first step in allowing md to track bad-blocks per-device so
    that we can fail individual blocks rather than the whole device.

    This patch just adds a data structure for recording bad blocks, with
    routines to add to, remove from, and search the list.
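
    For readers unfamiliar with the idea, a per-device bad block table
    with those three operations might look like the sketch below. The
    layout here (a small fixed array of start/length pairs, no merging,
    exact-match removal) is an assumption made for brevity and is not
    the structure this patch adds; all names are hypothetical.

```c
#include <stddef.h>

#define MAX_BAD 16 /* arbitrary capacity for the sketch */

struct bad_entry {
    unsigned long long start; /* first bad sector */
    unsigned long long len;   /* number of bad sectors */
};

struct bad_list {
    struct bad_entry e[MAX_BAD];
    size_t count;
};

/* Record a bad range. Returns 0 on success, -1 if full or len is 0. */
static int bad_list_add(struct bad_list *bl, unsigned long long start,
                        unsigned long long len)
{
    if (len == 0 || bl->count == MAX_BAD)
        return -1;
    bl->e[bl->count].start = start;
    bl->e[bl->count].len = len;
    bl->count++;
    return 0;
}

/* Forget a recorded range (exact match only). Returns 0 if found. */
static int bad_list_remove(struct bad_list *bl, unsigned long long start,
                           unsigned long long len)
{
    size_t i;

    for (i = 0; i < bl->count; i++) {
        if (bl->e[i].start == start && bl->e[i].len == len) {
            /* Order does not matter here, so fill the hole with the last entry. */
            bl->e[i] = bl->e[--bl->count];
            return 0;
        }
    }
    return -1;
}

/* Return 1 if [start, start + len) overlaps any recorded range. */
static int bad_list_is_bad(const struct bad_list *bl,
                           unsigned long long start,
                           unsigned long long len)
{
    size_t i;

    for (i = 0; i < bl->count; i++) {
        const struct bad_entry *b = &bl->e[i];

        if (start < b->start + b->len && b->start < start + len)
            return 1;
    }
    return 0;
}
```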

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     

27 Jul, 2011

3 commits

  • Revert most of commit e384e58549a2e9a83071ad80280c1a9053cfd84c
    md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.

    MD should not need to use DM's dirty log - we decided to use md's
    bitmaps instead.

    Keeping the DIV_ROUND_UP clean-ups that were part of commit
    e384e58549a2e9a83071ad80280c1a9053cfd84c, however.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • If we hit a read error while recovering a mirror, we want to abort the
    recovery without necessarily failing the disk - as having a disk with
    a read error is better than not having an array at all.

    Currently this is managed with a per-array flag "recovery_disabled"
    and is only implemented for RAID1. For RAID10 we will need finer
    grained control as we might want to disable recovery for individual
    devices separately.

    So push more of the decision making into the personality.
    'recovery_disabled' is now a 'cookie' which is copied when the
    personality wants to disable recovery and is changed when a device is
    added to the array as this is used as a trigger to 'try recovery
    again'.

    This will allow RAID10 to get the control that it needs.
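
    The cookie scheme can be sketched as follows. The struct and
    function names are hypothetical, and starting the personality's
    copy one below the array's value (so that the two differ) is an
    assumption of this sketch. Recovery is treated as disabled only
    while the personality's copy equals the array's current value.

```c
/* Array-wide state: the cookie changes whenever a device is added. */
struct array_state {
    unsigned int recovery_disabled;
};

/* Per-personality state: holds a copy of the cookie, or a stale value. */
struct personality_conf {
    unsigned int recovery_disabled;
};

/* Start with a value that cannot match the array's current cookie. */
static void conf_init(struct personality_conf *conf,
                      const struct array_state *a)
{
    conf->recovery_disabled = a->recovery_disabled - 1;
}

/* The personality copies the cookie when it wants recovery to stop. */
static void disable_recovery(struct personality_conf *conf,
                             const struct array_state *a)
{
    conf->recovery_disabled = a->recovery_disabled;
}

/* Adding a device changes the cookie, which re-enables recovery. */
static void add_device(struct array_state *a)
{
    a->recovery_disabled++;
}

/* Recovery may proceed unless the copy matches the current cookie. */
static int recovery_allowed(const struct personality_conf *conf,
                            const struct array_state *a)
{
    return conf->recovery_disabled != a->recovery_disabled;
}
```

    Because each personality keeps its own copy, a personality such as
    RAID10 could hold one copy per device pair rather than one per
    array.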

    Signed-off-by: NeilBrown

    NeilBrown
     
  • There are places where sysfs links to rdev are handled
    in the same way. Add helper functions to consolidate
    them.

    Signed-off-by: Namhyung Kim
    Signed-off-by: NeilBrown

    Namhyung Kim
     

09 Jun, 2011

1 commit


08 Jun, 2011

1 commit

  • Add the 'sync_super' function pointer to MD array structure (struct mddev_s)

    If device-mapper (dm-raid.c) is to define its own on-disk superblock and be
    able to load it, there must still be a way for MD to initiate superblock
    updates. The simplest way to make this happen is to provide a pointer in
    the MD array structure that can be set by device-mapper (or other module)
    with a function to do this. If the function has been set, it will be used;
    otherwise, the method will be looked up via 'super_types' as usual.
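
    The "use the pointer if set, otherwise fall back to the table"
    dispatch can be sketched as below. Only the names 'sync_super' and
    'super_types' come from the message above; the struct layout, the
    table contents, and external_sync (standing in for a function
    supplied by another module) are assumptions for illustration.

```c
#include <stddef.h>

struct array;
typedef void (*sync_super_fn)(struct array *a);

struct super_type {
    const char *name;
    sync_super_fn sync_super;
};

struct array {
    int major_version;        /* index into super_types */
    sync_super_fn sync_super; /* optional override, NULL if unset */
    int last_sync;            /* records which handler ran (illustration) */
};

static void default_sync_v0(struct array *a) { a->last_sync = 0; }
static void default_sync_v1(struct array *a) { a->last_sync = 1; }

/* Stand-in for a handler installed by an external module. */
static void external_sync(struct array *a) { a->last_sync = 100; }

static const struct super_type super_types[] = {
    { "0.90", default_sync_v0 },
    { "1.x",  default_sync_v1 },
};

/* Prefer the per-array override; otherwise look up the usual handler. */
static void array_sync_super(struct array *a)
{
    if (a->sync_super) {
        a->sync_super(a);
        return;
    }
    super_types[a->major_version].sync_super(a);
}
```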

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     

18 Apr, 2011

2 commits

  • When an md device adds a request to a queue, it can call
    mddev_check_plugged.
    If this succeeds then we know that the md thread will be woken up
    shortly, and ->plug_cnt will be non-zero until then, so some
    processing can be delayed.

    If it fails, then no unplug callback is expected and the make_request
    function needs to do whatever is required to make the request happen.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • md has some plugging infrastructure for RAID5 to use because the
    normal plugging infrastructure required a 'request_queue', and when
    called from dm, RAID5 doesn't have one of those available.

    This relied on the ->unplug_fn callback which doesn't exist any more.

    So remove all of that code, both in md and raid5. Subsequent patches
    will restore the plugging functionality.

    Signed-off-by: NeilBrown

    NeilBrown
     

31 Mar, 2011

1 commit


24 Feb, 2011

1 commit

  • Revert
    b821eaa572fd737faaf6928ba046e571526c36c6
    and
    f3b99be19ded511a1bf05a148276239d9f13eefa

    When I wrote the first of these I had a wrong idea about the
    lifetime of 'struct block_device'. It can disappear at any time that
    the block device is not open if it falls out of the inode cache.

    So relying on the 'size' recorded with it to detect when the
    device size has changed and so we need to revalidate, is wrong.

    Rather, we really do need the 'changed' attribute stored directly in
    the mddev and set/tested as appropriate.

    Without this patch, a sequence of:
    mknod / open / close / unlink

    (which can cause a block_device to be created and then destroyed)
    will result in a rescan of the partition table and consequent removal
    and addition of partitions.
    Several of these in a row can get udev racing to create and unlink and
    other code can get confused.

    With the patch, the rescan is only performed when needed and so there
    are no races.

    This is suitable for any stable kernel from 2.6.35.

    Reported-by: "Wojcik, Krzysztof"
    Signed-off-by: NeilBrown
    Cc: stable@kernel.org

    NeilBrown
     

31 Jan, 2011

1 commit

  • This flag is not needed and is used badly.

    Devices that are included in a native-metadata array are reserved
    exclusively for that array - and currently have AllReserved set.
    They all are bd_claimed for the rdev and so cannot be shared.

    Devices that are included in external-metadata arrays can be shared
    among multiple arrays - providing there is no overlap.
    These are bd_claimed for md in general - not for a particular rdev.

    When changing the amount of a device that is used in an array we need
    to check for overlap. This currently includes a check on AllReserved.
    So even without overlap, sharing with an AllReserved device is not
    allowed.
    However the bd_claim usage already precludes sharing with these
    devices, so the test on AllReserved is not needed. And in fact it is
    wrong.

    As this is the only use of AllReserved, simply remove all usage and
    definition of AllReserved.

    Signed-off-by: NeilBrown

    NeilBrown
     

14 Jan, 2011

3 commits

  • Allow the metadata to be on a separate device from the
    data.

    This doesn't mean the data and metadata will be on separate
    physical devices - it simply gives device-mapper and userspace
    tools more flexibility.

    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • Add new parameter to 'sync_page_io'.

    The new parameter allows us to distinguish between metadata and data
    operations. This becomes important later when we add the ability to
    use separate devices for data and metadata.

    Signed-off-by: Jonathan Brassow

    Jonathan Brassow
     
  • When an md device is in the process of coming on line it is possible
    for an IO request (typically a partition table probe) to get through
    before the array is fully initialised, which can cause unexpected
    behaviour (e.g. a crash).

    So explicitly record when the array is ready for IO and don't allow IO
    through until then.

    There is no possibility for a similar problem when the array is going
    off-line as there must only be one 'open' at that time, and it is busy
    off-lining the array and so cannot send IO requests. So no memory
    barrier is needed in md_stop()

    This has been a bug since commit 409c57f3801 in 2.6.30 which
    introduced md_make_request. Before then, each personality would
    register its own make_request_fn when it was ready.
    This is suitable for any stable kernel from 2.6.30.y onwards.
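
    A user-space sketch of gating IO on a "ready" flag follows. It is
    not the md code: the names are hypothetical, C11 atomics stand in
    for the kernel's primitives, and rejecting early IO with -EIO is an
    assumption of the sketch. The release store in array_start pairs
    with the acquire load in submit_io, so a submitter that sees the
    flag set also sees the completed initialisation.

```c
#include <errno.h>
#include <stdatomic.h>

struct array {
    atomic_int ready; /* set once initialisation is complete */
    int stripe_count; /* example of state that must be set up first */
    int handled;      /* number of requests accepted (illustration) */
};

static void array_init(struct array *a)
{
    atomic_init(&a->ready, 0);
    a->stripe_count = 0;
    a->handled = 0;
}

/* Finish setup, then publish readiness with release semantics. */
static void array_start(struct array *a, int stripes)
{
    a->stripe_count = stripes;
    atomic_store_explicit(&a->ready, 1, memory_order_release);
}

/* Refuse IO until the array has been marked ready. */
static int submit_io(struct array *a)
{
    if (!atomic_load_explicit(&a->ready, memory_order_acquire))
        return -EIO;
    a->handled++;
    return 0;
}
```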

    Cc:
    Signed-off-by: NeilBrown
    Reported-by: "Hawrylewicz Czarnowski, Przemyslaw"

    NeilBrown
     

28 Oct, 2010

2 commits

  • bio_clone and bio_alloc allocate from a common bio pool.
    If an md device is stacked with other devices that use this pool, or under
    something like swap which uses the pool, then the multiple calls on
    the pool can cause deadlocks.

    So allocate a local bio pool for each md array and use that rather
    than the common pool.

    This pool is used both for regular IO and metadata updates.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Currently sync_page_io takes a 'bdev'.
    Every caller passes 'rdev->bdev'.
    We will soon want another field out of the rdev in sync_page_io,
    So just pass the rdev instead of the bdev out of it.

    Signed-off-by: NeilBrown

    NeilBrown
     

19 Oct, 2010

1 commit


10 Sep, 2010

1 commit

  • This patch converts md to support REQ_FLUSH/FUA instead of now
    deprecated REQ_HARDBARRIER. In the core part (md.c), the following
    changes are notable.

    * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
    processing of other requests and thus there is no reason to mark the
    queue congested while FLUSH/FUA is in progress.

    * REQ_FLUSH/FUA failures are final and its users don't need retry
    logic. Retry logic is removed.

    * Preflush needs to be issued to all member devices but FUA writes can
    be handled the same way as other writes - their processing can be
    deferred to request_queue of member devices. md_barrier_request()
    is renamed to md_flush_request() and simplified accordingly.

    For linear, raid0 and multipath, the core changes are enough. raid1,
    5 and 10 need the following conversions.

    * raid1: Handling of FLUSH/FUA bio's can simply be deferred to
    request_queues of member devices. Barrier related logic removed.

    * raid5: Queue draining logic dropped. FUA bit is propagated through
    biodrain and stripe reconstruction such that all the updated parts
    of the stripe are written out with FUA writes if any of the dirtying
    writes was FUA. preread_active_stripes handling in make_request()
    is updated as suggested by Neil Brown.

    * raid10: FUA bit needs to be propagated to write clones.

    linear, raid0, 1, 5 and 10 tested.

    Signed-off-by: Tejun Heo
    Reviewed-by: Neil Brown
    Signed-off-by: Jens Axboe

    Tejun Heo
     

30 Aug, 2010

1 commit

  • MD_CHANGE_CLEAN is used for two different purposes and this leads to
    confusion.
    One of the purposes is largely mirrored by MD_CHANGE_PENDING which is
    not used for anything else, so have MD_CHANGE_PENDING take over that
    purpose fully.

    The two purposes are:
    1/ tell md_update_sb that an update is needed and that it is just a
    clean/dirty transition.
    2/ tell user-space that a transition from clean to dirty is pending
    (something wants to write), and tell the kernel (by clearing the
    flag) that the transition is OK.

    The first purpose remains with MD_CHANGE_CLEAN, the second is moved
    fully to MD_CHANGE_PENDING.

    This means that various places which conditionally set or cleared
    MD_CHANGE_CLEAN no longer need to be conditional.

    Signed-off-by: NeilBrown

    NeilBrown
     

11 Aug, 2010

1 commit

  • * 'for-linus' of git://neil.brown.name/md: (24 commits)
    md: clean up do_md_stop
    md: fix another deadlock with removing sysfs attributes.
    md: move revalidate_disk() back outside open_mutex
    md/raid10: fix deadlock with unaligned read during resync
    md/bitmap: separate out loading a bitmap from initialising the structures.
    md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.
    md/bitmap: optimise scanning of empty bitmaps.
    md/bitmap: clean up plugging calls.
    md/bitmap: reduce dependence on sysfs.
    md/bitmap: white space clean up and similar.
    md/raid5: export raid5 unplugging interface.
    md/plug: optionally use plugger to unplug an array during resync/recovery.
    md/raid5: add simple plugging infrastructure.
    md/raid5: export is_congested test
    raid5: Don't set read-ahead when there is no queue
    md: add support for raising dm events.
    md: export various start/stop interfaces
    md: split out md_rdev_init
    md: be more careful setting MD_CHANGE_CLEAN
    md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk
    ...

    Linus Torvalds
     

08 Aug, 2010

2 commits

  • Moving the deletion of sysfs attributes from reconfig_mutex to
    open_mutex didn't really help as a process can try to take
    open_mutex while holding reconfig_mutex, so the same deadlock can
    happen, just requiring one more process to be involved in the chain.

    It looks like I cannot easily use locking to wait for the sysfs
    deletion to complete, so don't.

    The only things that we cannot do while the deletions are still
    pending is other things which can change the sysfs namespace: run,
    takeover, stop. Each of these can fail with -EBUSY.
    So set a flag while doing a sysfs deletion, and fail run, takeover,
    stop if that flag is set.
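
    The approach can be sketched as below. The struct and function
    names are hypothetical and only run and stop are shown (takeover
    would be guarded the same way). Instead of blocking on a lock until
    the deletion finishes, each operation checks the flag and returns
    -EBUSY so the caller can retry later.

```c
#include <errno.h>

struct array {
    int sysfs_delete_pending; /* set while attribute deletion is queued */
    int running;
};

static void begin_sysfs_delete(struct array *a)
{
    a->sysfs_delete_pending = 1;
}

static void finish_sysfs_delete(struct array *a)
{
    a->sysfs_delete_pending = 0;
}

/* Operations that change the sysfs namespace refuse to run meanwhile. */
static int array_run(struct array *a)
{
    if (a->sysfs_delete_pending)
        return -EBUSY;
    a->running = 1;
    return 0;
}

static int array_stop(struct array *a)
{
    if (a->sysfs_delete_pending)
        return -EBUSY;
    a->running = 0;
    return 0;
}
```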

    This is suitable for 2.6.35.x

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     
  • Remove the current bio flags and reuse the request flags for the bio, too.
    This makes it easier to trace the type of I/O from the filesystem
    down to the block driver. There were two flags in the bio that were
    missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
    renamed two request flags that had a superfluous RW in them.

    Note that the flags are in bio.h despite having the REQ_ name - as
    blkdev.h includes bio.h that is the only way to go for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

26 Jul, 2010

3 commits

  • This allows md/raid5 to fully work as a dm target.

    Normally md uses a 'filemap' which contains a list of pages of bits
    each of which may be written separately.
    dm-log uses an all-or-nothing approach to writing the log, so
    when using a dm-log, ->filemap is NULL and the flags normally stored
    in filemap_attr are stored in ->logattrs instead.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • 1/ use md_unplug in bitmap.c as we will soon be using bitmaps under
    arrays with no queue attached.

    2/ Don't bother plugging the queue when we set a bit in the bitmap.
    The reason for this was to encourage as many bits as possible to
    get set before we unplug and write stuff out.
    However every personality already plugs the queue after
    bitmap_startwrite either directly (raid1/raid10) or by setting
    STRIPE_BIT_DELAY which causes the queue to be plugged later
    (raid5).

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Fixed some whitespace problems.
    Fixed some checkpatch.pl complaints.
    Replaced kmalloc ... memset(0), with kzalloc
    Fixed an unlikely memory leak on an error path.
    Reformatted a number of 'if/else' sets, sometimes
    replacing goto with an else clause.
    Removed some old comments and commented-out code.

    Signed-off-by: NeilBrown

    NeilBrown