05 Jul, 2013

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "The usual stuff from trivial tree"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    treewide: relase -> release
    Documentation/cgroups/memory.txt: fix stat file documentation
    sysctl/net.txt: delete reference to obsolete 2.4.x kernel
    spinlock_api_smp.h: fix preprocessor comments
    treewide: Fix typo in printk
    doc: device tree: clarify stuff in usage-model.txt.
    open firmware: "/aliasas" -> "/aliases"
    md: bcache: Fixed a typo with the word 'arithmetic'
    irq/generic-chip: fix a few kernel-doc entries
    frv: Convert use of typedef ctl_table to struct ctl_table
    sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
    doc: clk: Fix incorrect wording
    Documentation/arm/IXP4xx fix a typo
    Documentation/networking/ieee802154 fix a typo
    Documentation/DocBook/media/v4l fix a typo
    Documentation/video4linux/si476x.txt fix a typo
    Documentation/virtual/kvm/api.txt fix a typo
    Documentation/early-userspace/README fix a typo
    Documentation/video4linux/soc-camera.txt fix a typo
    lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
    ...

    Linus Torvalds
     

04 Jul, 2013

2 commits

  • The recent commit:
    commit 7e83ccbecd608b971f340e951c9e84cd0343002f
    md/raid10: Allow skipping recovery when clean arrays are assembled

    causes raid10 to skip a recovery in certain cases where it is safe to
    do so. Unfortunately it also causes a reshape to be skipped, which is
    never safe. The result is that an attempt to reshape a RAID10 will
    appear to complete instantly, but no data will have been moved, so the
    array will now contain garbage. (A sketch of the intended gating is
    shown below.)
    (If nothing is written, you can recover by simply performing the
    reverse reshape, which will also complete instantly.)

    Bug was introduced in 3.10, so this is suitable for 3.10-stable.

    Cc: stable@vger.kernel.org (3.10)
    Cc: Martin Wilck
    Signed-off-by: NeilBrown

    NeilBrown
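
    A minimal sketch of the gating this implies, assuming the md flag
    names from drivers/md (MD_RECOVERY_RESHAPE in mddev->recovery); the
    helper shown is hypothetical, not the upstream diff:

        /* Hypothetical helper: a reshape moves data, so the "clean
         * array" shortcut must never apply while a reshape is pending. */
        static int may_skip_initial_sync(struct mddev *mddev)
        {
            if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
                return 0;       /* reshapes can never be skipped */
            /* ... existing checks that the array is known clean ... */
            return 1;
        }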
     
  • There is a bug in 'check_reshape' for raid5.c. It checks
    that the new minimum number of devices is large enough (which is
    good), but it does so also after the reshape has started (bad).

    This is bad because
    - the calculation is now wrong as mddev->raid_disks has changed
    already, and
    - it is pointless because it is now too late to stop.

    So only perform that test when the reshape has not been committed to
    (the check is sketched below).

    Signed-off-by: NeilBrown

    NeilBrown
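
    A minimal sketch of the intended ordering, assuming the mddev fields
    reshape_position, delta_disks and raid_disks and the MaxSector
    sentinel; the min_disks threshold is illustrative:

        static int check_reshape_sketch(struct mddev *mddev, int min_disks)
        {
            /* Only validate the new device count while no reshape has
             * been committed (reshape_position still MaxSector). */
            if (mddev->reshape_position == MaxSector &&
                mddev->raid_disks + mddev->delta_disks < min_disks)
                return -EINVAL;
            /* ... rest of check_reshape ... */
            return 0;
        }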
     

03 Jul, 2013

1 commit

  • 1/ If a RAID10 is being reshaped to a smaller number of devices
    and is stopped while this is ongoing, then when the array is
    reassembled the 'mirrors' array will be allocated too small.
    This will lead to an access error or memory corruption.
    (The corrected allocation is sketched below.)

    2/ A sanity test performed when a reshaping RAID10 array is
    restarted is slightly incorrect.

    Due to the first bug, this is suitable for any -stable
    kernel since 3.5 where this code was introduced.

    Cc: stable@vger.kernel.org (v3.5+)
    Signed-off-by: NeilBrown

    NeilBrown
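
    A minimal sketch of the allocation fix for point 1/, assuming the
    raid10 'mirrors' array and md's delta_disks convention (negative
    while shrinking); the surrounding error handling is illustrative:

        /* Size 'mirrors' for the larger of the old and new device
         * counts, so a reassembled, still-shrinking array cannot
         * index past the end of it. */
        int slots = mddev->raid_disks + max(0, -mddev->delta_disks);

        conf->mirrors = kcalloc(slots, sizeof(*conf->mirrors), GFP_KERNEL);
        if (!conf->mirrors)
            goto out_free;          /* allocation failure path */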
     

26 Jun, 2013

2 commits

  • MD: Remember the last sync operation that was performed

    This patch adds a field to the mddev structure to track the last
    sync operation that was performed. This is especially useful when
    interpreting what is recorded in mismatch_cnt in sysfs. If the
    last operation was "data-check", then it reports the number of
    discrepancies found by the user-initiated check. If it was a
    "repair" operation, then it is reporting the number of
    discrepancies repaired, and so on. (A sketch of how such a field
    might be set follows below.)

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
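
    A minimal sketch of how such a field might be set when a sync
    starts, assuming a 'char *last_sync_action' member alongside the
    existing MD_RECOVERY_* flags; the exact strings and the placement
    in md_do_sync() are illustrative:

        if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery))
            mddev->last_sync_action = "check";
        else if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
            mddev->last_sync_action = "repair";
        else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
            mddev->last_sync_action = "reshape";
        else if (test_bit(MD_RECOVERY_RECOVER, &mddev->recovery))
            mddev->last_sync_action = "recover";
        else
            mddev->last_sync_action = "resync";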
     
  • RAID5 uses a 'per-array' value for the 'size' of each device.
    RAID0 uses a 'per-device' value - it can be different for each device.

    When converting a RAID5 to a RAID0 we must ensure that the per-device
    size of each device matches the per-array size for the RAID5, else
    the array will change size (see the sketch below).

    If the metadata cannot record a changed per-device size (as is the
    case with v0.90 metadata) the array could get bigger on restart. This
    does not cause data corruption, so it is not a big issue and is mainly
    yet another reason not to use 0.90.

    Signed-off-by: NeilBrown

    NeilBrown
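
    A minimal sketch of the clamp described above, using the real md
    fields rdev->sectors and mddev->dev_sectors and the rdev_for_each()
    iterator; the takeover call site is assumed:

        struct md_rdev *rdev;

        /* Force every member's usable size down to the RAID5's
         * per-array device size so the converted RAID0 keeps the
         * same capacity. */
        rdev_for_each(rdev, mddev)
            rdev->sectors = mddev->dev_sectors;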
     

14 Jun, 2013

9 commits

  • It isn't really enough to check that the rdev is present; we also
    need to be sure that the device is still In_sync.

    Doing this requires using rcu_dereference to access the rdev, and
    holding the rcu_read_lock() to ensure the rdev doesn't disappear while
    we look at it (the access pattern is sketched below).

    Signed-off-by: NeilBrown

    NeilBrown
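
    A minimal sketch of that access pattern, using the real
    rcu_read_lock()/rcu_dereference() primitives and the In_sync rdev
    flag; the surrounding names are illustrative:

        struct md_rdev *rdev;

        rcu_read_lock();
        rdev = rcu_dereference(conf->mirrors[disk].rdev);
        if (rdev && test_bit(In_sync, &rdev->flags)) {
            /* device is present *and* still in sync: safe to use */
        }
        rcu_read_unlock();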
     
  • As 'enough' accesses conf->prev and conf->geo, which can change
    spontaneously, it should guard against changes.
    This can be done with device_lock as start_reshape holds device_lock
    while updating 'geo' and end_reshape holds it while updating 'prev'.

    So 'error' needs to hold 'device_lock'.

    On the other hand, raid10_end_read_request knows which of the two it
    really wants to access, and as it is an active request on that one,
    the value cannot change underneath it.

    So change _enough to take a flag rather than a pointer, pass the
    appropriate flag from raid10_end_read_request(), and remove the
    locking (see the sketch below).

    All other calls to 'enough' are made with reconfig_mutex held, so
    neither 'prev' nor 'geo' can change.

    Signed-off-by: NeilBrown

    NeilBrown
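
    A minimal sketch of the interface change; the body is paraphrased
    from the commit text, not copied from the source:

        /* 'previous' selects conf->prev vs conf->geo instead of the
         * caller passing a pointer into conf. */
        static int _enough(struct r10conf *conf, int previous, int ignore)
        {
            struct geom *geo = previous ? &conf->prev : &conf->geo;

            /* ... count working devices in each copy-set of *geo,
             * skipping 'ignore', and report whether enough remain ... */
            return 1;
        }

        /* raid10_end_read_request() knows which geometry its request
         * used, so it can call _enough() without taking device_lock. */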
     
  • strict_strtoul() is obsolete and its use is discouraged, so
    kstrtoul() should be used instead (see the sketch below).

    Signed-off-by: Jingoo Han
    Signed-off-by: NeilBrown

    Jingoo Han
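
    A minimal sketch of the replacement pattern; kstrtoul() is the real
    kernel helper and reports failure through its return value:

        unsigned long val;
        int ret;

        ret = kstrtoul(buf, 10, &val);  /* was: strict_strtoul(buf, 10, &val) */
        if (ret)
            return ret;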
     
  • When a device has failed, it needs to be removed from the personality
    module before it can be removed from the array as a whole.
    The first step is performed by md_check_recovery() which is called
    from the raid management thread.

    So when a HOT_REMOVE ioctl arrives, wait briefly for md_check_recovery
    to have run. This increases the chance that the ioctl will succeed
    (see the sketch below).

    Signed-off-by: Hannes Reinecke
    Signed-off-by: Neil Brown

    Hannes Reinecke
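
    A minimal sketch of the brief wait, using the real md_wakeup_thread()
    and msleep() helpers; the retry count and the placement in the ioctl
    path are illustrative:

        int tries = 5;

        /* Give the md thread a chance to run md_check_recovery() and
         * detach the Faulty device from the personality first. */
        while (rdev->raid_disk >= 0 && tries--) {
            md_wakeup_thread(mddev->thread);
            msleep(20);
        }
        if (rdev->raid_disk >= 0)
            return -EBUSY;      /* still attached: HOT_REMOVE fails */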
     
  • This doesn't really need to be initialised, but it doesn't hurt,
    silences the compiler, and as it is a counter it makes sense for it to
    start at zero.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • DM RAID: Fix raid_resume not reviving failed devices in all cases

    When a device fails in a RAID array, it is marked as Faulty. Later,
    md_check_recovery is called which (through the call chain) calls
    'hot_remove_disk' in order to have the personalities remove the device
    from use in the array.

    Sometimes, it is possible for the array to be suspended before the
    personalities get their chance to perform 'hot_remove_disk'. This is
    normally not an issue. If the array is deactivated, then the failed
    device will be noticed when the array is reinstantiated. If the
    array is resumed and the disk is still missing, md_check_recovery will
    be called upon resume and 'hot_remove_disk' will be called at that
    time. However, (for dm-raid) if the device has been restored,
    a resume on the array would cause it to attempt to revive the device
    by calling 'hot_add_disk'. If 'hot_remove_disk' had not been called,
    a situation is then created where the device is thought to concurrently
    be the replacement and the device to be replaced. Thus, the device
    is first sync'ed with the rest of the array (because it is the replacement
    device) and then marked Faulty and removed from the array (because
    it is also the device being replaced).

    The solution is to check and see if the device had properly been removed
    before the array was suspended. This is done by seeing whether the
    device's 'raid_disk' field is -1 - a condition that implies that
    'md_check_recovery -> remove_and_add_spares (where raid_disk is set to -1)
    -> hot_remove_disk' has been called. If 'raid_disk' is not -1, then
    'hot_remove_disk' must be called to complete the removal of the previously
    faulty device before it can be revived via 'hot_add_disk' (see the
    sketch below).

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
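
    A minimal sketch of that check, run per device on resume; the flag
    and callback names are the real md ones, while the surrounding
    control flow is illustrative:

        /* raid_disk >= 0 means md_check_recovery() never got to run
         * hot_remove_disk() for this Faulty device: finish the removal
         * before trying to revive it. */
        if (test_bit(Faulty, &rdev->flags) && rdev->raid_disk >= 0) {
            if (mddev->pers->hot_remove_disk(mddev, rdev))
                return;         /* still busy; try again later */
            rdev->raid_disk = -1;
        }
        clear_bit(Faulty, &rdev->flags);
        /* ... then hot_add_disk() can revive the device cleanly ... */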
     
  • DM RAID: Break-up untidy function

    Clean-up excessive indentation by moving some code in raid_resume()
    into its own function.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • DM RAID: Add ability to restore transiently failed devices on resume

    This patch adds code to the resume function to check over the devices
    in the RAID array. If any are found to be marked as failed and their
    superblocks can be read, an attempt is made to reintegrate them into
    the array. This allows the user to refresh the array with a simple
    suspend and resume of the array - rather than having to load a
    completely new table, allocate and initialize all the structures and
    throw away the old instantiation. (The scan is sketched below.)

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
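
    A minimal sketch of the resume-time scan, assuming dm-raid's
    per-device rdevs and the real sync_page_io() helper for re-reading
    the superblock; the loop shape and field usage are illustrative:

        for (i = 0; i < rs->md.raid_disks; i++) {
            struct md_rdev *r = &rs->dev[i].rdev;

            if (!test_bit(Faulty, &r->flags) || !r->sb_page)
                continue;
            /* Superblock still readable: treat the failure as
             * transient and let md re-add the returning member. */
            if (sync_page_io(r, 0, r->sb_size, r->sb_page, READ, 1)) {
                clear_bit(Faulty, &r->flags);
                clear_bit(In_sync, &r->flags);
                r->saved_raid_disk = r->raid_disk;
            }
        }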
     
  • Pull md bugfixes from Neil Brown:
    "A few bugfixes for md

    Some tagged for -stable"

    * tag 'md-3.10-fixes' of git://neil.brown.name/md:
    md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
    md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
    md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
    md: md_stop_writes() should always freeze recovery.

    Linus Torvalds
     

13 Jun, 2013

5 commits

  • There are cases where the kernel will believe that the WRITE SAME
    command is supported by a block device which does not, in fact,
    support WRITE SAME. This currently happens for SATA drives behind a
    SAS controller, but there are probably a hundred other ways that can
    happen, including drive firmware bugs.

    After receiving an error for WRITE SAME the block layer will retry the
    request as a plain write of zeroes, but mdraid will consider the
    failure as fatal and consider the drive failed. This has the effect
    that all the mirrors containing a specific set of data are each
    offlined in very rapid succession resulting in data loss.

    However, just bouncing the request back up to the block layer isn't
    ideal either, because the whole initial request-retry sequence should
    be inside the write bitmap fence, which probably means that md needs
    to do its own conversion of WRITE SAME to write zero.

    Until the failure scenario has been sorted out, disable WRITE SAME for
    raid1, raid5, and raid10 (see the sketch below).

    [neilb: added raid5]

    This patch is appropriate for any -stable since 3.7 when write_same
    support was added.

    Cc: stable@vger.kernel.org
    Signed-off-by: H. Peter Anvin
    Signed-off-by: NeilBrown

    H. Peter Anvin
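
    The mechanism itself is a one-liner on the array's request queue;
    blk_queue_max_write_same_sectors() is the real block-layer helper,
    while the call site in each personality's run() is assumed here:

        /* Advertise zero WRITE SAME capacity so the block layer never
         * sends the command to the md array at all. */
        if (mddev->queue)
            blk_queue_max_write_same_sectors(mddev->queue, 0);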
     
  • Various places in raid1 and raid10 are calling raise_barrier when they
    really should call freeze_array.
    The former is only intended to be called from "make_request".
    The latter has extra checks for 'nr_queued' and makes a call to
    flush_pending_writes(), so it is safe to call it from within the
    management thread.

    Using raise_barrier will sometimes deadlock. Using freeze_array
    should not.

    As 'freeze_array' currently expects one request to be pending (in
    handle_read_error - the only previous caller), we need to pass
    it the number of pending requests (extra) to ignore (see the
    sketch below).

    The deadlock was made particularly noticeable by commits
    050b66152f87c7 (raid10) and 6b740b8d79252f13 (raid1) which
    appeared in 3.4, so the fix is appropriate for any -stable
    kernel since then.

    This patch probably won't apply directly to some early kernels and
    will need to be applied by hand.

    Cc: stable@vger.kernel.org
    Reported-by: Alexander Lyakas
    Signed-off-by: NeilBrown

    NeilBrown
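
    A minimal sketch of the new calling convention; the wait condition
    is paraphrased from the commit text rather than copied from the
    source:

        /* 'extra' = number of requests the caller itself has in
         * flight and therefore wants the freeze to ignore. */
        static void freeze_array(struct r1conf *conf, int extra)
        {
            /* ... wait until nr_pending == nr_queued + extra, calling
             * flush_pending_writes() while waiting ... */
        }

        /* handle_read_error() has one request of its own pending:
         *     freeze_array(conf, 1);
         * management-thread callers that used raise_barrier() pass:
         *     freeze_array(conf, 0);
         */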
     
  • md/raid1: consider WRITE as successful only if at least one non-Faulty
    and non-rebuilding drive completed it.

    Without that fix, the following scenario could happen:

    - RAID1 with drives A and B; drive B was freshly-added and is rebuilding
    - Drive A fails
    - WRITE request arrives to the array. It is failed by drive A, so
    r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
    succeeds in writing it, so the same r1_bio is marked as
    R1BIO_Uptodate.
    - r1_bio arrives to handle_write_finished, badblocks are disabled,
    md_error()->error() does nothing because we don't fail the last drive
    of raid1
    - raid_end_bio_io() calls call_bio_endio()
    - As a result, in call_bio_endio():
          if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
              clear_bit(BIO_UPTODATE, &bio->bi_flags);
      this code doesn't clear the BIO_UPTODATE flag, and the whole master
      WRITE succeeds, back to the upper layer.

    So we returned success to the upper layer, even though we had written
    the data onto the rebuilding drive only. But when we want to read the
    data back, we would not read from the rebuilding drive, so this data
    is lost (the fix is sketched below).

    [neilb - applied identical change to raid10 as well]

    This bug can result in lost data, so it is suitable for any
    -stable kernel.

    Cc: stable@vger.kernel.org
    Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
    Signed-off-by: NeilBrown <neilb@suse.de>

    Alex Lyakas
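
    A minimal sketch of the fix on the write-completion path, using the
    real In_sync/Faulty rdev flags and the R1BIO_Uptodate bit; where the
    rdev pointer comes from is left out:

        /* Only a device that is neither Faulty nor still rebuilding
         * may mark the master WRITE as successful. */
        if (test_bit(In_sync, &rdev->flags) &&
            !test_bit(Faulty, &rdev->flags))
            set_bit(R1BIO_Uptodate, &r1_bio->state);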
     
  • __md_stop_writes() will currently sometimes freeze recovery.
    So any caller must be ready for that to happen, and indeed they are.

    However, if __md_stop_writes() doesn't freeze recovery, then
    a recovery could start before mddev_suspend() is called, which
    could be awkward. This can particularly cause problems for dm-raid.

    So change __md_stop_writes() to always freeze recovery (see the
    sketch below). This is safe and more predictable.

    Reported-by: Brassow Jonathan
    Tested-by: Brassow Jonathan
    Signed-off-by: NeilBrown

    NeilBrown
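
    A minimal sketch of the change; MD_RECOVERY_FROZEN and
    mddev->recovery are the real flag and field, the rest of the
    function is elided:

        static void __md_stop_writes(struct mddev *mddev)
        {
            /* Always freeze, so no new recovery can start between
             * here and a later mddev_suspend(). */
            set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
            /* ... stop the sync thread, flush pending writes, ... */
        }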
     
  • Pull block layer fixes from Jens Axboe:
    "Outside of bcache (which really isn't super big), these are all
    few-liners. There are a few important fixes in here:

    - Fix blk pm sleeping when holding the queue lock

    - A small collection of bcache fixes that have been done and tested
    since bcache was included in this merge window.

    - A fix for a raid5 regression introduced with the bio changes.

    - Two important fixes for mtip32xx, fixing an oops and potential data
    corruption (or hang) due to wrong bio iteration on stacked devices."

    * 'for-linus' of git://git.kernel.dk/linux-block:
    scatterlist: sg_set_buf() argument must be in linear mapping
    raid5: Initialize bi_vcnt
    pktcdvd: silence static checker warning
    block: remove refs to XD disks from documentation
    blkpm: avoid sleep when holding queue lock
    mtip32xx: Correctly handle bio->bi_idx != 0 conditions
    mtip32xx: Fix NULL pointer dereference during module unload
    bcache: Fix error handling in init code
    bcache: clarify free/available/unused space
    bcache: drop "select CLOSURES"
    bcache: Fix incompatible pointer type warning

    Linus Torvalds
     

30 May, 2013

1 commit

  • The patch that converted raid5 to use bio_reset() forgot to initialize
    bi_vcnt (see the sketch below).

    Signed-off-by: Kent Overstreet
    Cc: NeilBrown
    Cc: linux-raid@vger.kernel.org
    Tested-by: Ilia Mirkin
    Signed-off-by: Jens Axboe

    Kent Overstreet
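
    A minimal sketch of what the re-initialisation after bio_reset()
    has to cover; bi_io_vec, bi_vcnt and bi_size are the real 3.10-era
    bio fields, the single-STRIPE_SIZE-page setup mirrors how raid5
    builds its per-stripe bios:

        bio_reset(bio);                 /* clears bi_vcnt, bi_size, ... */
        bio->bi_io_vec[0].bv_page   = page;
        bio->bi_io_vec[0].bv_len    = STRIPE_SIZE;
        bio->bi_io_vec[0].bv_offset = 0;
        bio->bi_vcnt = 1;               /* the forgotten field */
        bio->bi_size = STRIPE_SIZE;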
     

20 May, 2013

1 commit

  • Fix detection of the need to resize the dm thin metadata device.

    The code incorrectly tried to extend the metadata device when it
    didn't need to due to a merging error with patch 24347e9 ("dm thin:
    detect metadata device resizing").

    device-mapper: transaction manager: couldn't open metadata space map
    device-mapper: thin metadata: tm_open_with_sm failed
    device-mapper: thin: aborting transaction failed
    device-mapper: thin: switching pool to failure mode

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     

10 May, 2013

13 commits