27 Jul, 2013

1 commit


25 Jul, 2013

2 commits

  • If a device in a RAID4/5/6 is being replaced while another is being
    recovered, then the writes to the replacement device currently don't
    happen, resulting in corruption when the replacement completes and the
    new drive takes over.

    This is because the replacement writes are only triggered when
    's.replacing' is set and not when the similar 's.sync' is set (which
    is the case during resync and recovery - it means all devices need to
    be read).

    So schedule those writes when 's.sync' is set as well.

    In this case we cannot use "STRIPE_INSYNC" to record that the
    replacement has happened as that is needed for recording that any
    parity calculation is complete. So introduce STRIPE_REPLACED to
    record if the replacement has happened.

    For safety we should also check that STRIPE_COMPUTE_RUN is not set.
    This has a similar effect to the "s.locked == 0" test. The latter
    ensures that no IO has been flagged but not started. The former
    checks whether any parity calculation has been flagged but not started.
    We must wait for both of these to complete before triggering the
    'replace'.

    Add a similar test to the subsequent check for "are we finished yet".
    This possibly isn't needed (is subsumed in the STRIPE_INSYNC test),
    but it makes it more obvious that the REPLACE will happen before we
    think we are finished.

    Finally if a NeedReplace device is not UPTODATE then that is an
    error. We really must trigger a warning.
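
    The conditions described above can be summarised as a small predicate.
    This is only an illustrative Python sketch, not the kernel code: the
    StripeState class and its field names are hypothetical stand-ins for the
    stripe state and flags named in the message.

```python
from dataclasses import dataclass


@dataclass
class StripeState:
    """Hypothetical model of the stripe fields named in the commit message."""
    replacing: bool = False    # models s.replacing
    sync: bool = False         # models s.sync (resync or recovery in progress)
    locked: int = 0            # models s.locked: IO flagged but not started
    compute_run: bool = False  # models STRIPE_COMPUTE_RUN
    replaced: bool = False     # models STRIPE_REPLACED


def should_write_replacement(s: StripeState) -> bool:
    """Return True when writes to the replacement device may be scheduled."""
    return ((s.replacing or s.sync)   # trigger during recovery as well
            and s.locked == 0         # no IO flagged but not yet started
            and not s.compute_run     # no parity calculation still pending
            and not s.replaced)       # the replacement has not happened yet


# A stripe under recovery now qualifies, but not while a compute is pending.
print(should_write_replacement(StripeState(sync=True)))
print(should_write_replacement(StripeState(sync=True, compute_run=True)))
```

    Keeping "replaced" separate from any in-sync flag mirrors the reason given
    above for introducing STRIPE_REPLACED.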

    This bug was introduced in commit 9a3e1101b827a59ac9036a672f5fa8d5279d0fe2
    (md/raid5: detect and handle replacements during recovery.)
    which introduced replacement for raid5.
    That was in 3.3-rc3, so any stable kernel since then would benefit
    from this fix.

    Cc: stable@vger.kernel.org (3.3+)
    Reported-by: qindehua
    Tested-by: qindehua
    Signed-off-by: NeilBrown

    NeilBrown
     
  • We always need to be careful when calling generic_make_request, as it
    can start a chain of events which might free something that we are
    using.

    Here is one place I wasn't careful enough. If the wbio2 is not in
    use, then it might get freed at the first generic_make_request call.
    So perform all necessary tests first.

    This bug was introduced in 3.3-rc3 (24afd80d99) and can cause an
    oops, so fix is suitable for any -stable since then.

    Cc: stable@vger.kernel.org (3.3+)
    Signed-off-by: NeilBrown

    NeilBrown
     

23 Jul, 2013

1 commit

  • Pull block IO driver bits from Jens Axboe:
    "As I mentioned in the core block pull request, due to real life
    circumstances the driver pull request would be late. Now it looks
    like -rc2 late... On the plus side, apart from the rsxx update, these
    are all things that I could argue could go in later in the cycle as
    they are fixes and not features. So even though things are late, it's
    not ALL bad.

    The pull request contains:

    - Updates to bcache, all bug fixes, from Kent.

    - A pile of drbd bug fixes (no big features this time!).

    - xen blk front/back fixes.

    - rsxx driver updates, some of them deferred from 3.10. So should be
    well cooked by now"

    * 'for-3.11/drivers' of git://git.kernel.dk/linux-block: (63 commits)
    bcache: Allocation kthread fixes
    bcache: Fix GC_SECTORS_USED() calculation
    bcache: Journal replay fix
    bcache: Shutdown fix
    bcache: Fix a sysfs splat on shutdown
    bcache: Advertise that flushes are supported
    bcache: check for allocation failures
    bcache: Fix a dumb race
    bcache: Use standard utility code
    bcache: Update email address
    bcache: Delete fuzz tester
    bcache: Document shrinker reserve better
    bcache: FUA fixes
    drbd: Allow online change of al-stripes and al-stripe-size
    drbd: Constants should be UPPERCASE
    drbd: Ignore the exit code of a fence-peer handler if it returns too late
    drbd: Fix rcu_read_lock balance on error path
    drbd: fix error return code in drbd_init()
    drbd: Do not sleep inside rcu
    bcache: Refresh usage docs
    ...

    Linus Torvalds
     

18 Jul, 2013

3 commits

  • Recent change to use bio_copy_data() in raid1 when repairing
    an array is faulty.

    The underlying device may have changed the bio in various ways using
    bio_advance and these need to be undone not just for the 'sbio' which
    is being copied to, but also the 'pbio' (primary) which is being
    copied from.

    So perform the reset on all bios that were read from and do it early.

    This also ensures that the sbio->bi_io_vec[j].bv_len passed to
    memcmp is correct.

    This fixes a crash during a 'check' of a RAID1 array. The crash was
    introduced in 3.10 so this is suitable for 3.10-stable.

    Cc: stable@vger.kernel.org (3.10)
    Reported-by: Joe Lawrence
    Signed-off-by: NeilBrown

    NeilBrown
     
  • commit 7ceb17e87bde79d285a8b988cfed9eaeebe60b86
    md: Allow devices to be re-added to a read-only array.

    allowed a bit more than just that. It also allows devices to be added
    to a read-write array and to end up skipping recovery.

    This patch removes the offending piece of code pending a rewrite for a
    subsequent release.

    More specifically:
    If the array has a bitmap, then the device will still need a bitmap
    based resync ('saved_raid_disk' is set under different conditions
    if a bitmap is present).
    If the array doesn't have a bitmap, then this is correct as long as
    nothing has been written to the array since the metadata was checked
    by ->validate_super. However there is no locking to ensure that there
    was no write.

    Bug was introduced in 3.10 and causes data corruption so
    patch is suitable for 3.10-stable.

    Cc: stable@vger.kernel.org (3.10)
    Reported-by: Joe Lawrence
    Signed-off-by: NeilBrown

    NeilBrown
     
  • 1/ When a difference between blocks is found, data is copied from
    one bio to the other. However bv_len is used as the length to
    copy and this could be zero. So use r10_bio->sectors to calculate
    length instead.
    Using bv_len was probably always a bit dubious, but the introduction
    of bio_advance made it much more likely to be a problem.

    2/ When preparing some blocks for sync, we don't set BIO_UPTODATE
    except on bios that we schedule for a read. This ensures that
    missing/failed devices don't confuse the loop at the top of
    sync_request_write.
    Commit 8be185f2c9d54d6 "raid10: Use bio_reset()"
    removed a loop which set BIO_UPTODATE on all appropriate bios.
    So we need to re-add that flag.

    These bugs were introduced in 3.10, so this patch is suitable for
    3.10-stable, and can remove a potential for data corruption.
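
    The first point can be illustrated by deriving each copy length from the
    total sector count instead of a per-vector length. This is a minimal
    Python sketch under stated assumptions: copy_lengths() is a hypothetical
    helper, and the 4096-byte page and 512-byte sector sizes are assumed
    values, not taken from the message.

```python
from typing import List

PAGE_SIZE = 4096    # assumed page size in bytes
SECTOR_SIZE = 512   # assumed sector size in bytes


def copy_lengths(sectors: int, page_count: int) -> List[int]:
    """Return the number of bytes to copy from each page of a bio.

    The lengths come from the total sector count (the role r10_bio->sectors
    plays in the message) rather than from a per-vector bv_len, which may
    have been reduced to zero.
    """
    lengths = []
    remaining = sectors
    for _ in range(page_count):
        # Copy a full page, or whatever is left for the final partial page.
        length = min(PAGE_SIZE, remaining * SECTOR_SIZE)
        lengths.append(length)
        remaining -= length // SECTOR_SIZE
    return lengths


print(copy_lengths(10, 2))  # prints [4096, 1024]
```

    The result depends only on the request size, so it is unaffected by
    whatever has been done to the bio's vectors in the meantime.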

    Cc: stable@vger.kernel.org (3.10)
    Reported-by: Brassow Jonathan
    Signed-off-by: NeilBrown

    NeilBrown
     

12 Jul, 2013

9 commits

  • The alloc kthread should've been using try_to_freeze() - and also there
    was the potential for the alloc kthread to get woken up after it had
    shut down, which would have been bad.

    Signed-off-by: Kent Overstreet

    Kent Overstreet
     
  • Part of the job of garbage collection is to add up however many sectors
    of live data it finds in each bucket, but that doesn't work very well if
    it doesn't reset GC_SECTORS_USED() when it starts. Whoops.

    This wouldn't have broken anything horribly, but allocation tries to
    preferentially reclaim buckets that are mostly empty and that's not
    gonna work with an incorrect GC_SECTORS_USED() value.
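
    A toy model of the accounting problem, as a hedged Python sketch: gc_pass()
    and its arguments are hypothetical and only imitate the idea of a
    per-bucket counter that must be reset before each pass.

```python
from typing import Dict, List, Tuple


def gc_pass(sectors_used: Dict[int, int],
            live_extents: List[Tuple[int, int]],
            reset_first: bool = True) -> None:
    """Recount live sectors per bucket, updating sectors_used in place.

    live_extents holds (bucket, sectors) pairs found during this pass.
    """
    if reset_first:
        for bucket in sectors_used:
            sectors_used[bucket] = 0   # start every pass from zero
    for bucket, sectors in live_extents:
        sectors_used[bucket] += sectors


# With the reset, a bucket whose data has gone away is seen as empty again.
fixed = {0: 0, 1: 0}
gc_pass(fixed, [(0, 8), (1, 16)])
gc_pass(fixed, [(0, 8)])
print(fixed)  # prints {0: 8, 1: 0}

# Without the reset, counts from the previous pass leak into the next one.
stale = {0: 0, 1: 0}
gc_pass(stale, [(0, 8), (1, 16)], reset_first=False)
gc_pass(stale, [(0, 8)], reset_first=False)
print(stale)  # prints {0: 16, 1: 16}
```

    In the stale case bucket 1 still looks full, so an allocator that prefers
    mostly empty buckets would pass over it.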

    Signed-off-by: Kent Overstreet
    Cc: linux-stable # >= v3.10

    Kent Overstreet
     
  • The journal replay code starts by finding something that looks like a
    valid journal entry, then it does a binary search over the unchecked
    region of the journal for the journal entries with the highest sequence
    numbers.

    Trouble is, the logic was wrong. journal_read_bucket() returns true if
    it found journal entries we need, but the range of journal entries
    we're looking for can loop around the end of the journal. In that case
    journal_read_bucket() could return true when it hadn't found the highest
    sequence number we'd seen yet, and the binary search did the wrong
    thing. Whoops.

    Signed-off-by: Kent Overstreet
    Cc: linux-stable # >= v3.10

    Kent Overstreet
     
  • Stopping a cache set is supposed to make it stop attached backing
    devices, but somewhere along the way that code got lost. Fixing this
    mainly has the effect of fixing our reboot notifier.

    Signed-off-by: Kent Overstreet
    Cc: linux-stable # >= v3.10

    Kent Overstreet
     
  • If we stopped a bcache device when we were already detaching (or
    something like that), bcache_device_unlink() would try to remove a
    symlink from sysfs that was already gone because the bcache dev kobject
    had already been removed from sysfs.

    So keep track of whether we've removed stuff from sysfs.

    Signed-off-by: Kent Overstreet
    Cc: linux-stable # >= v3.10

    Kent Overstreet
     
  • Whoops - bcache's flush/FUA was mostly correct, but flushes get filtered
    out unless we say we support them...

    Signed-off-by: Kent Overstreet
    Cc: linux-stable # >= v3.10

    Kent Overstreet
     
  • There is a missing NULL check after the kzalloc().

    Signed-off-by: Dan Carpenter

    Dan Carpenter
     
  • In the far-too-complicated closure code - closures can have destructors,
    for probably dubious reasons; they get run after the closure is no
    longer waiting on anything but before dropping the parent ref, intended
    just for freeing whatever memory the closure is embedded in.

    Trouble is, when remaining goes to 0 and we've got nothing more to run -
    we also have to unlock the closure, setting remaining to -1. If there's
    a destructor, that unlock isn't doing anything - nobody could be trying
    to lock it if we're about to free it - but if the unlock _is_ needed...
    that check for a destructor was racy. Argh.

    Signed-off-by: Kent Overstreet
    Cc: linux-stable # >= v3.10

    Kent Overstreet
     
  • Pull device-mapper changes from Alasdair G Kergon:
    "Add a device-mapper target called dm-switch to provide a multipath
    framework for storage arrays that dynamically reconfigure their
    preferred paths for different device regions.

    Fix a bug in the verity target that prevented its use with some
    specific sizes of devices.

    Improve some locking mechanisms in the device-mapper core and bufio.

    Add Mike Snitzer as a device-mapper maintainer.

    A few more clean-ups and fixes"

    * tag 'dm-3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
    dm: add switch target
    dm: update maintainers
    dm: optimize reorder structure
    dm: optimize use SRCU and RCU
    dm bufio: submit writes outside lock
    dm cache: fix arm link errors with inline
    dm verity: use __ffs and __fls
    dm flakey: correct ctr alloc failure mesg
    dm verity: remove pointless comparison
    dm: use __GFP_HIGHMEM in __vmalloc
    dm verity: fix inability to use a few specific devices sizes
    dm ioctl: set noio flag to avoid __vmalloc deadlock
    dm mpath: fix ioctl deadlock when no paths

    Linus Torvalds
     

11 Jul, 2013

12 commits

  • dm-switch is a new target that maps IO to underlying block devices
    efficiently when there is a large number of fixed-sized address regions
    but there is no simple pattern to allow for a compact mapping
    representation such as dm-stripe.
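
    The mapping idea can be sketched as a plain table lookup. This Python
    example is an assumption-laden illustration, not the target's real data
    structure: path_for_sector(), the list-based region table and the region
    size used below are all hypothetical.

```python
from typing import List


def path_for_sector(sector: int, region_size: int,
                    region_table: List[int]) -> int:
    """Return the index of the underlying device that serves `sector`.

    region_table[i] holds the device index for fixed-size region i. There is
    no pattern to exploit, so every region needs its own entry.
    """
    if region_size <= 0:
        raise ValueError("region_size must be positive")
    if sector < 0:
        raise ValueError("sector must be non-negative")
    region = sector // region_size
    if region >= len(region_table):
        raise IndexError("sector lies beyond the mapped regions")
    return region_table[region]


table = [0, 2, 1, 2]                     # four regions spread over three paths
print(path_for_sector(300, 128, table))  # prints 1 (sector 300 is in region 2)
```

    A striped layout could compute the device arithmetically from the sector;
    an explicit table like this is what the lack of a simple pattern forces.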

    Though we have developed this target for a specific storage device, Dell
    EqualLogic, we have made an effort to keep it as general purpose as
    possible in the hope that others may benefit.

    Originally developed by Jim Ramsay. Simplified by Mikulas Patocka.

    Signed-off-by: Jim Ramsay
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Jim Ramsay
     
  • This reorder actually improves performance by 20% (from 39.1s to 32.8s)
    on x86-64 quad core Opteron.

    I have no explanation for this; possibly it leaves some other entries
    better cache-aligned.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • This patch removes "io_lock" and "map_lock" in struct mapped_device and
    "holders" in struct dm_table and replaces these mechanisms with
    sleepable-rcu.

    Previously, the code would call "dm_get_live_table" and "dm_table_put" to
    get and release the table. Now, the code is changed to call "dm_get_live_table"
    and "dm_put_live_table". dm_get_live_table locks sleepable-rcu and
    dm_put_live_table unlocks it.

    dm_get_live_table_fast/dm_put_live_table_fast can be used instead of
    dm_get_live_table/dm_put_live_table. These *_fast functions use
    non-sleepable RCU, so the caller must not block between them.

    If the code changes the active or inactive dm table, it must call
    dm_sync_table before destroying the old table.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • This patch changes dm-bufio so that it submits write I/Os outside of the
    lock. If the number of submitted buffers is greater than the number of
    requests on the target queue, submit_bio blocks. We want to block outside
    of the lock to improve latency of other threads that may need the lock.
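
    The general pattern, collecting work under the lock and issuing the
    blocking calls after releasing it, can be sketched in Python as follows.
    WriteQueue and its methods are hypothetical names for illustration and do
    not correspond to dm-bufio's actual interfaces.

```python
import threading
from typing import Callable, List


class WriteQueue:
    """Hypothetical buffer cache that queues dirty buffers for writing."""

    def __init__(self, submit: Callable[[str], None]) -> None:
        self._lock = threading.Lock()
        self._dirty: List[str] = []
        self._submit = submit   # may block, so it is never called under the lock

    def mark_dirty(self, buf: str) -> None:
        with self._lock:
            self._dirty.append(buf)

    def flush(self) -> int:
        """Submit all queued buffers and return how many were submitted."""
        # Under the lock: only move the buffers onto a private list.
        with self._lock:
            to_write, self._dirty = self._dirty, []
        # Outside the lock: perform the potentially blocking submissions.
        for buf in to_write:
            self._submit(buf)
        return len(to_write)


queue = WriteQueue(submit=lambda buf: print("writing", buf))
queue.mark_dirty("block-1")
queue.mark_dirty("block-2")
queue.flush()
```

    Other threads can take the lock while flush() is blocked in submit, which
    is the latency benefit the message describes.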

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Use __always_inline to avoid a link failure with gcc 4.6 on ARM.
    gcc 4.7 is OK.

    gcc 4.6 creates a function block_div.part.8 that references __udivdi3 and
    __umoddi3 but is never called. These references cause a link failure.

    Reported-by: Rob Herring
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • This patch changes ffs() to __ffs() and fls() to __fls() which don't add
    one to the result.
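
    The off-by-one difference between the two families can be modelled in a
    few lines of Python. The function names below are illustrative models, not
    the kernel helpers themselves; the zero-based variants assume a non-zero
    argument, since __ffs() and __fls() are undefined for zero.

```python
def ffs(x: int) -> int:
    """Model of ffs(): 1-based index of the lowest set bit, 0 if x == 0."""
    return (x & -x).bit_length()


def fls(x: int) -> int:
    """Model of fls(): 1-based index of the highest set bit, 0 if x == 0."""
    return x.bit_length()


def ffs_zero_based(x: int) -> int:
    """Model of __ffs(): 0-based index of the lowest set bit (x must be > 0)."""
    if x <= 0:
        raise ValueError("argument must be positive")
    return ffs(x) - 1


def fls_zero_based(x: int) -> int:
    """Model of __fls(): 0-based index of the highest set bit (x must be > 0)."""
    if x <= 0:
        raise ValueError("argument must be positive")
    return fls(x) - 1


# For a power of two such as 4096 the zero-based result is the shift count.
print(ffs(4096), ffs_zero_based(4096))  # prints 13 12
```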

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Remove the reference to the "linear" target from the error message
    issued when allocation fails in the flakey target.

    Cc: Robin Dong
    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Remove num < 0 test in verity_ctr because num is unsigned.
    (Found by Coverity.)

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Use __GFP_HIGHMEM in __vmalloc.

    Pages allocated with __vmalloc can be allocated in high memory that is not
    directly mapped to kernel space, so use __GFP_HIGHMEM just like vmalloc
    does. This patch reduces memory pressure slightly because pages can be
    allocated in the high zone.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Fix a boundary condition that caused failure for certain device sizes.

    The problem is reported at
    http://code.google.com/p/cryptsetup/issues/detail?id=160

    For certain device sizes the number of hashes at a specific level was
    calculated incorrectly.

    It happens for example for a device with data and metadata block size 4096
    that has 16385 blocks and algorithm sha256.

    The user can test if he is affected by this bug by running the
    "veritysetup verify" command and also by activating the dm-verity kernel
    driver and reading the whole block device. If it passes without an error,
    then the user is not affected.

    The condition for the bug is:

    Split the total number of data blocks (data_block_bits) into bit strings,
    each string has hash_per_block_bits bits. hash_per_block_bits is
    rounddown(log2(metadata_block_size/hash_digest_size)). Equivalently, you
    can say that you convert data_block_bits to base 2^hash_per_block_bits.

    If there is some zero bit string below the most significant bit string and
    at least one bit below this zero bit string is set, then the bug happens.

    The same bug exists in the userspace veritysetup tool, so you must use a
    fixed veritysetup too if you want to use devices that are affected by
    this boundary condition.
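
    The condition stated above can be written down directly. The following
    Python sketch is a literal transcription of that description, not code
    taken from the kernel or from veritysetup; is_affected() and
    hash_per_block_bits() are hypothetical helper names, and 32 is the size in
    bytes of a sha256 digest.

```python
def hash_per_block_bits(metadata_block_size: int, digest_size: int) -> int:
    """rounddown(log2(metadata_block_size / digest_size))."""
    return (metadata_block_size // digest_size).bit_length() - 1


def is_affected(data_blocks: int, metadata_block_size: int,
                digest_size: int) -> bool:
    """Check the stated condition for a device with `data_blocks` blocks."""
    bits = hash_per_block_bits(metadata_block_size, digest_size)
    if bits < 1:
        raise ValueError("a metadata block must hold at least two digests")
    mask = (1 << bits) - 1

    # Split the block count into bit strings, least significant first.
    strings = []
    n = data_blocks
    while n:
        strings.append(n & mask)
        n >>= bits

    # Scan every string below the most significant one, lowest first.
    lower_bit_set = False
    for s in strings[:-1]:
        if s == 0 and lower_bit_set:
            return True     # a zero string with a set bit somewhere below it
        if s != 0:
            lower_bit_set = True
    return False


# The example from the message: 4096-byte blocks, sha256, 16385 data blocks.
print(is_affected(16385, 4096, 32))  # prints True
print(is_affected(16384, 4096, 32))  # prints False
```

    Here each string has 7 bits, and 16385 splits into the strings 1, 0, 1: a
    zero string below the top one with a set bit beneath it.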

    Signed-off-by: Mikulas Patocka
    Cc: stable@vger.kernel.org # 3.4+
    Cc: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
    Set the noio flag while calling __vmalloc(), because it doesn't fully
    respect gfp flags, to avoid a possible deadlock (see commit
    502624bdad3dba45dfaacaf36b7d83e39e74b2d2).

    This should be backported to stable kernels 3.8 and newer. The kernel 3.8
    doesn't have memalloc_noio_save(), so we should set and restore process
    flag PF_MEMALLOC instead.

    Signed-off-by: Mikulas Patocka
    Cc: stable@vger.kernel.org
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • When multipath needs to retry an ioctl the reference to the
    current live table needs to be dropped. Otherwise a deadlock
    occurs when all paths are down:
    - dm_blk_ioctl takes a reference to the current table
    and spins in multipath_ioctl().
    - A new table is being loaded, but upon resume the process
    hangs in dm_table_destroy() waiting for references to
    drop to zero.

    With this patch the reference to the old table is dropped
    prior to retry, thereby avoiding the deadlock.

    Signed-off-by: Hannes Reinecke
    Cc: Mike Snitzer
    Cc: stable@vger.kernel.org
    Signed-off-by: Alasdair G Kergon

    Hannes Reinecke
     

05 Jul, 2013

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "The usual stuff from trivial tree"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    treewide: relase -> release
    Documentation/cgroups/memory.txt: fix stat file documentation
    sysctl/net.txt: delete reference to obsolete 2.4.x kernel
    spinlock_api_smp.h: fix preprocessor comments
    treewide: Fix typo in printk
    doc: device tree: clarify stuff in usage-model.txt.
    open firmware: "/aliasas" -> "/aliases"
    md: bcache: Fixed a typo with the word 'arithmetic'
    irq/generic-chip: fix a few kernel-doc entries
    frv: Convert use of typedef ctl_table to struct ctl_table
    sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
    doc: clk: Fix incorrect wording
    Documentation/arm/IXP4xx fix a typo
    Documentation/networking/ieee802154 fix a typo
    Documentation/DocBook/media/v4l fix a typo
    Documentation/video4linux/si476x.txt fix a typo
    Documentation/virtual/kvm/api.txt fix a typo
    Documentation/early-userspace/README fix a typo
    Documentation/video4linux/soc-camera.txt fix a typo
    lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
    ...

    Linus Torvalds
     

04 Jul, 2013

2 commits

  • The recent commit:
    commit 7e83ccbecd608b971f340e951c9e84cd0343002f
    md/raid10: Allow skipping recovery when clean arrays are assembled

    Causes raid10 to skip a recovery in certain cases where it is safe to
    do so. Unfortunately it also causes a reshape to be skipped which is
    never safe. The result is that an attempt to reshape a RAID10 will
    appear to complete instantly, but no data will have been moved so the
    array will now contain garbage.
    (If nothing is written, you can recover by simply performing the
    reverse reshape which will also complete instantly).

    Bug was introduced in 3.10, so this is suitable for 3.10-stable.

    Cc: stable@vger.kernel.org (3.10)
    Cc: Martin Wilck
    Signed-off-by: NeilBrown

    NeilBrown
     
  • There is a bug in 'check_reshape' for raid5.c. It checks
    that the new minimum number of devices is large enough (which is
    good), but it does so also after the reshape has started (bad).

    This is bad because
    - the calculation is now wrong as mddev->raid_disks has changed
    already, and
    - it is pointless because it is now too late to stop.

    So only perform that test when reshape has not been committed to.
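
    A small model of why the late test is wrong, as a hedged Python sketch:
    check_min_devices() and its parameters are hypothetical and only mirror
    the reasoning in this message, not the actual function in raid5.c.

```python
def check_min_devices(raid_disks: int, delta_disks: int,
                      min_disks: int, reshape_started: bool) -> bool:
    """Return True if the reshape request passes the minimum-device test."""
    if reshape_started:
        # raid_disks already reflects the new geometry, so the sum below
        # would count the change twice, and it is too late to refuse anyway.
        return True
    return raid_disks + delta_disks >= min_disks


# Before the reshape starts: going from 4 devices to 3 with a minimum of 3.
print(check_min_devices(4, -1, 3, reshape_started=False))  # prints True
# After it has started raid_disks is already 3, so the test must be skipped.
print(check_min_devices(3, -1, 3, reshape_started=True))   # prints True
```

    Running the arithmetic in the second case would give 3 - 1 = 2, which is
    below the minimum and would wrongly reject a reshape already under way.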

    Signed-off-by: NeilBrown

    NeilBrown
     

03 Jul, 2013

1 commit

  • 1/ If a RAID10 is being reshaped to fewer devices
    and is stopped while this is ongoing, then when the array is
    reassembled the 'mirrors' array will be allocated too small.
    This will lead to an access error or memory corruption.

    2/ A sanity test performed when a reshaping RAID10 array is restarted
    is slightly incorrect.

    Due to the first bug, this is suitable for any -stable
    kernel since 3.5 where this code was introduced.

    Cc: stable@vger.kernel.org (v3.5+)
    Signed-off-by: NeilBrown

    NeilBrown
     

02 Jul, 2013

4 commits

  • Some of bcache's utility code has made it into the rest of the kernel,
    so drop the bcache versions.

    Bcache used to have a workaround for allocating from a bio set under
    generic_make_request() (if you allocated more than once, the bios you
    already allocated would get stuck on current->bio_list when you
    submitted, and you'd risk deadlock) - bcache would mask out __GFP_WAIT
    when allocating bios under generic_make_request() so that allocation
    could fail and it could retry from workqueue. But bio_alloc_bioset() has
    a workaround now, so we can drop this hack and the associated error
    handling.

    Signed-off-by: Kent Overstreet

    Kent Overstreet
     
  • This code has rotted and it hasn't been used in ages anyways.

    Signed-off-by: Kent Overstreet

    Kent Overstreet
     
  • Signed-off-by: Kent Overstreet

    Kent Overstreet
     
  • Journal writes need to be marked FUA, not just REQ_FLUSH. And btree node
    writes have... weird ordering requirements.

    Signed-off-by: Kent Overstreet

    Kent Overstreet
     

27 Jun, 2013

4 commits