19 Aug, 2014

4 commits

  • Most places which allocate an r10_bio zero ->state, but some don't.
    As the r10_bio comes from a mempool, and the allocation function uses
    kzalloc, it is often zero anyway. But sometimes it isn't, and it is
    best to be safe (see the sketch after this entry).

    I only noticed this because of the bug fixed by an earlier patch
    where the r10_bios allocated for a reshape were left around to
    be used by a subsequent resync. In that case the R10BIO_IsReshape
    flag caused problems.

    Signed-off-by: NeilBrown

    NeilBrown
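
    A minimal sketch of the point being made, under a simplified structure
    (the type and helper names are illustrative, not the exact md code): an
    object handed back by a mempool may still carry whatever state its
    previous user left behind, so ->state is cleared explicitly instead of
    trusting that kzalloc was the original source.

    /* Illustrative only -- simplified from the raid10 allocation path. */
    struct r10bio_like {
        unsigned long state;        /* bit flags such as R10BIO_IsReshape */
        /* ... other fields ... */
    };

    static struct r10bio_like *alloc_r10bio(mempool_t *pool, gfp_t gfp)
    {
        struct r10bio_like *r10_bio = mempool_alloc(pool, gfp);

        if (r10_bio)
            r10_bio->state = 0;     /* don't trust bits left by a prior user */
        return r10_bio;
    }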
     
  • If a raid10 reshape fails to find somewhere to read a block
    from, it returns without freeing memory... (A sketch of the
    error-path fix follows this entry.)

    Signed-off-by: NeilBrown

    NeilBrown
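
    A hedged sketch of the shape of such a fix (field names are
    hypothetical, not the actual reshape code): the error path releases
    the r10_bio it allocated before returning.

    /* Illustrative only -- hypothetical field names. */
    r10_bio = mempool_alloc(conf->buf_pool, GFP_NOIO);
    /* ... try to pick a device to read the block from ... */
    if (!rdev) {
        /* nowhere to read from: free what was allocated, then bail out */
        mempool_free(r10_bio, conf->buf_pool);
        return sectors_done;
    }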
     
  • When a raid10 array commences a resync/recovery/reshape it allocates
    some buffer space.
    When a resync/recovery completes, that buffer space is freed, but it
    is not freed when a reshape completes.
    This can result in a small memory leak (see the sketch after this
    entry).

    There is a subtle side-effect of this bug. When a RAID10 is reshaped
    to a larger array (more devices), the reshape is immediately followed
    by a "resync" of the new space. This "resync" will use the buffer
    space which was allocated for "reshape". This can cause problems
    including a "BUG" in the SCSI layer. So this is suitable for -stable.

    Cc: stable@vger.kernel.org (v3.5+)
    Fixes: 3ea7daa5d7fde47cd41f4d56c2deb949114da9d6
    Signed-off-by: NeilBrown

    NeilBrown
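
    A hedged sketch of the intent (the recovery flag names are real md
    bits, but the helper is hypothetical): the buffer pool should be
    released on the reshape-completion path just as it already is when a
    resync or recovery completes.

    /* Illustrative only -- free_sync_buffers() is a hypothetical helper. */
    if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ||
        test_bit(MD_RECOVERY_RECOVER, &mddev->recovery) ||
        test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))   /* previously missed */
        free_sync_buffers(conf);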
     
  • raid10 reshape clears unwanted bits from bio->bi_flags using
    a method which, while clumsy, worked until 3.10 when BIO_OWNS_VEC
    was added.
    Since then it has also been clearing that bit, which it shouldn't.
    This results in a memory leak.

    So change to use the approved method of clearing unwanted bits
    (see the sketch after this entry).

    As this causes a memory leak which can consume all of memory,
    the fix is suitable for -stable.

    Fixes: a38352e0ac02dbbd4fa464dc22d1352b5fbd06fd
    Cc: stable@vger.kernel.org (v3.10+)
    Reported-by: mdraid.pkoch@dfgh.net (Peter Koch)
    Signed-off-by: NeilBrown

    NeilBrown
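
    A hedged sketch of the difference (flag names are from the 3.16-era
    bio; this is an illustration, not the literal patch):

    /* Old idiom (clumsy): blanket-clear the low bi_flags bits, which
     * since 3.10 also wipes BIO_OWNS_VEC, so the bio's vector is never
     * freed and memory leaks. */

    /* Preferred: touch only the bits that actually need changing. */
    __clear_bit(BIO_SEG_VALID, &bio->bi_flags);
    __set_bit(BIO_UPTODATE, &bio->bi_flags);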
     

18 Aug, 2014

2 commits

  • During recovery of a double-degraded RAID6 it is possible for
    some blocks not to be recovered properly, leading to corruption.

    If a write happens to one block in a stripe that would be written to a
    missing device, and at the same time that stripe is recovering data
    to the other missing device, then that recovered data may not be written.

    This patch skips, in the double-degraded case, an optimisation that is
    only safe for single-degraded arrays (see the sketch after this entry).

    The bug was introduced in 2.6.32 and the fix is suitable for any kernel
    since then. In an older kernel with separate handle_stripe5() and
    handle_stripe6() functions the patch must change handle_stripe6().

    Cc: stable@vger.kernel.org (2.6.32+)
    Fixes: 6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8
    Cc: Yuri Tikhonov
    Cc: Dan Williams
    Reported-by: "Manibalan P"
    Tested-by: "Manibalan P"
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423
    Signed-off-by: NeilBrown
    Acked-by: Dan Williams

    NeilBrown
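
    A hedged sketch of the kind of guard involved (simplified, not the
    exact stripe-handling code): the shortcut that folds recovered data
    into a pending write is only taken when at most one device is failed.

    /* Illustrative only -- inside the per-device loop of stripe handling. */
    if (s.failed > 1)
        /* double-degraded: the single-degraded optimisation is unsafe,
         * so fall back to the full recovery write-out path */
        continue;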
     
  • If a stripe in a raid6 array receives a write to each data block while
    the array is degraded, and if any of these writes to a missing device
    are not page-aligned, then a live-lock happens.

    In this case the P and Q blocks need to be read so that the part of
    the missing block which is *not* being updated by the write can be
    constructed. Due to a logic error, these blocks are not loaded, so
    the update cannot proceed and the stripe is 'handled' repeatedly in an
    infinite loop (see the sketch after this entry).

    This bug is unlikely to be hit, as most writes are page-aligned. However,
    as it can lead to a livelock it is suitable for -stable. It was introduced
    in 3.16.

    Cc: stable@vger.kernel.org (v3.16)
    Fixed: 67f455486d2ea20b2d94d6adf5b9b783d079e321
    Signed-off-by: NeilBrown

    NeilBrown
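
    A hedged sketch of the decision being fixed (hypothetical types, not
    the raid5.c implementation): a write that only partially covers a
    block on a failed device must force P and Q to be read so the
    untouched part of the missing block can be reconstructed.

    /* Illustrative only -- hypothetical types and field names. */
    struct dev_state    { bool towrite; bool fully_overwritten; };
    struct stripe_state { int failed; };

    static bool need_parity_read(const struct stripe_state *s,
                                 const struct dev_state *fdev)
    {
        /* a partial write to a failed device needs P/Q for reconstruction */
        return s->failed && fdev->towrite && !fdev->fully_overwritten;
    }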
     

14 Aug, 2014

2 commits

  • Pull device mapper changes from Mike Snitzer:

    - Allow the thin target to be paired with any size external origin; also
    allow thin snapshots to be larger than the external origin.

    - Add support for quickly loading a repetitive pattern into the
    dm-switch target.

    - Use per-bio data in the dm-crypt target instead of always using a
    mempool for each allocation. Required switching to kmalloc alignment
    for the bio slab.

    - Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag

    - Fix the dm-cache and dm-thin targets' export of the minimum_io_size
    to match the data block size -- this fixes an issue where mkfs.xfs
    would improperly infer raid striping was in place on the underlying
    storage.

    - Small cleanups in dm-io, dm-mpath and dm-cache

    * tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dm table: propagate QUEUE_FLAG_NO_SG_MERGE
    dm switch: efficiently support repetitive patterns
    dm switch: factor out switch_region_table_read
    dm cache: set minimum_io_size to cache's data block size
    dm thin: set minimum_io_size to pool's data block size
    dm crypt: use per-bio data
    block: use kmalloc alignment for bio slab
    dm table: make dm_table_supports_discards static
    dm cache metadata: use dm-space-map-metadata.h defined size limits
    dm cache: fail migrations in the do_worker error path
    dm cache: simplify deferred set reference count increments
    dm thin: relax external origin size constraints
    dm thin: switch to an atomic_t for tracking pending new block preparations
    dm mpath: eliminate pg_ready() wrapper
    dm io: simplify dec_count and sync_io

    Linus Torvalds
     
  • Pull block driver changes from Jens Axboe:
    "Nothing out of the ordinary here, this pull request contains:

    - A big round of fixes for bcache from Kent Overstreet, Slava Pestov,
    and Surbhi Palande. No new features, just a lot of fixes.

    - The usual round of drbd updates from Andreas Gruenbacher, Lars
    Ellenberg, and Philipp Reisner.

    - virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei
    has taken it one step further and added support for actually using
    more than one queue.

    - Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to
    complement the default behavior of adding to the tail of the
    queue. From Douglas Gilbert"

    * 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits)
    bcache: Drop unneeded blk_sync_queue() calls
    bcache: add mutex lock for bch_is_open
    bcache: Correct printing of btree_gc_max_duration_ms
    bcache: try to set b->parent properly
    bcache: fix memory corruption in init error path
    bcache: fix crash with incomplete cache set
    bcache: Fix more early shutdown bugs
    bcache: fix use-after-free in btree_gc_coalesce()
    bcache: Fix an infinite loop in journal replay
    bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
    bcache: bcache_write tracepoint was crashing
    bcache: fix typo in bch_bkey_equal_header
    bcache: Allocate bounce buffers with GFP_NOWAIT
    bcache: Make sure to pass GFP_WAIT to mempool_alloc()
    bcache: fix uninterruptible sleep in writeback thread
    bcache: wait for buckets when allocating new btree root
    bcache: fix crash on shutdown in passthrough mode
    bcache: fix lockdep warnings on shutdown
    bcache allocator: send discards with correct size
    bcache: Fix to remove the rcu_sched stalls.
    ...

    Linus Torvalds
     

11 Aug, 2014

2 commits

  • Pull md updates from Neil Brown:
    "Most interesting is that md devices (major == 9) with minor numbers of
    512 or more will no longer be created simply by opening a block device
    file. They can only be created by writing to

    /sys/module/md_mod/parameters/new_array

    The 'auto-create-on-open' semantic is cumbersome and we need to start
    moving away from it"

    * tag 'md/3.17' of git://neil.brown.name/md:
    md: don't allow bitmap file to be added to raid0/linear.
    md/raid0: check for bitmap compatability when changing raid levels.
    md: Recovery speed is wrong
    md: disable probing for md devices 512 and over.
    md/raid1,raid10: always abort recover on write error.

    Linus Torvalds
     
  • Commit 05f1dd5 ("block: add queue flag for disabling SG merging")
    introduced a new queue flag: QUEUE_FLAG_NO_SG_MERGE. This gets set by
    default in blk_mq_init_queue for mq-enabled devices. The effect of
    the flag is to bypass the SG segment merging. Instead, the
    bio->bi_vcnt is used as the number of hardware segments.

    With a device mapper target on top of a device with
    QUEUE_FLAG_NO_SG_MERGE set, we can end up sending down more segments
    than a driver is prepared to handle. I ran into this when backporting
    the virtio_blk mq support. It triggered this BUG_ON, in
    virtio_queue_rq:

    BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);

    The queue's max is set here:
    blk_queue_max_segments(q, vblk->sg_elems-2);

    Basically, what happens is that a bio is built up for the dm device
    (which does not have the QUEUE_FLAG_NO_SG_MERGE flag set) using
    bio_add_page. That path will call into __blk_recalc_rq_segments, so
    what you end up with is bi_phys_segments being much smaller than bi_vcnt
    (and bi_vcnt grows beyond the maximum sg elements). Then, when the bio
    is submitted, it gets cloned. When the cloned bio is submitted, it will
    end up in blk_recount_segments, here:

    if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags))
            bio->bi_phys_segments = bio->bi_vcnt;

    and now we've set bio->bi_phys_segments to a number that is beyond what
    was registered as queue_max_segments by the driver.

    The right way to fix this is to propagate the queue flag up the stack.

    The rules for propagating the flag are simple:
    - if the flag is set for any underlying device, it must be set for the
    upper device;
    - consequently, if the flag is not set for any underlying device, it
    should not be set for the upper device (see the sketch after this entry).

    Signed-off-by: Jeff Moyer
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org # 3.16+

    Jeff Moyer
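
    A hedged sketch of that propagation rule (the iterate-devices callback
    signature is real, but the surrounding helper names are assumptions,
    not the literal patch):

    /* Illustrative sketch -- helper names are assumptions. */
    static int device_no_sg_merge(struct dm_target *ti, struct dm_dev *dev,
                                  sector_t start, sector_t len, void *data)
    {
        struct request_queue *q = bdev_get_queue(dev->bdev);

        return q && test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags);
    }

    /* in dm_table_set_restrictions(): if any underlying device has the
     * flag set, the stacked dm queue must set it too */
    if (dm_table_any_device_attribute(t, device_no_sg_merge))  /* assumed helper */
        queue_flag_set_unlocked(QUEUE_FLAG_NO_SG_MERGE, q);
    else
        queue_flag_clear_unlocked(QUEUE_FLAG_NO_SG_MERGE, q);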
     

02 Aug, 2014

4 commits

  • Add support for quickly loading a repetitive pattern into the
    dm-switch target.

    In the "set_region_mappings" message, the user may now use "Rn,m" as
    one of the arguments. "n" and "m" are hexadecimal numbers. The "Rn,m"
    argument repeats the last "n" arguments in the following "m" slots
    (a model of the expansion follows this entry).

    For example:
    dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
    is equivalent to
    dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
    :1 :2 :1 :2 :1 :2 :1 :2 :1 :2

    Requested-by: Jay Wang
    Tested-by: Jay Wang
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer

    Mikulas Patocka
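
    A self-contained model of how an "Rn,m" argument expands (this mirrors
    the semantics described above; it is not the dm-switch parser itself):

    /* Model of "R2,10": repeat the last n=2 mappings into the next
     * m=0x10 (16) region slots, reproducing the long-form example above. */
    #include <stdio.h>

    int main(void)
    {
        unsigned mappings[64];
        unsigned count = 0;

        mappings[count++] = 1;              /* "1000:1" */
        mappings[count++] = 2;              /* ":2"     */

        unsigned n = 2, m = 0x10;           /* "R2,10" -- n and m are hex */
        for (unsigned i = 0; i < m; i++, count++)
            mappings[count] = mappings[count - n];

        for (unsigned i = 0; i < count; i++)
            printf(":%u ", mappings[i]);    /* prints the expanded mapping list */
        printf("\n");
        return 0;
    }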
     
  • Move the code that reads the table into a new function,
    switch_region_table_read. It will be needed for the next commit.
    No functional change.

    Tested-by: Jay Wang
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer

    Mikulas Patocka
     
  • Before, if the block layer's limit stacking didn't establish an
    optimal_io_size that was compatible with the cache's data block size
    we'd set optimal_io_size to the data block size and minimum_io_size to 0
    (which the block layer adjusts to be physical_block_size).

    Update cache_io_hints() to set both minimum_io_size and optimal_io_size
    to the cache's data block size (see the sketch after this entry). This
    fixes an issue where mkfs.xfs would create more XFS Allocation Groups
    on cache volumes than on a normal linear LV of comparable size.

    Signed-off-by: Mike Snitzer

    Mike Snitzer
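
    A hedged sketch of the io_hints change (simplified; blk_limits_io_min()
    and blk_limits_io_opt() are real block-layer helpers, the rest is
    illustrative). The thin-pool change in the next entry follows the same
    pattern.

    /* Illustrative sketch -- simplified dm-cache io_hints hook. */
    static void cache_io_hints(struct dm_target *ti, struct queue_limits *limits)
    {
        struct cache *cache = ti->private;
        uint64_t io_opt_sectors = limits->io_opt >> 9;   /* bytes -> sectors */

        /* Only override when the stacked limits are incompatible with the
         * cache's data block size (io_opt not a whole multiple of it). */
        if (io_opt_sectors < cache->sectors_per_block ||
            do_div(io_opt_sectors, cache->sectors_per_block)) {
            blk_limits_io_min(limits, cache->sectors_per_block << 9);
            blk_limits_io_opt(limits, cache->sectors_per_block << 9);
        }
    }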
     
  • Before, if the block layer's limit stacking didn't establish an
    optimal_io_size that was compatible with the thin-pool's data block size
    we'd set optimal_io_size to the data block size and minimum_io_size to 0
    (which the block layer adjusts to be physical_block_size).

    Update pool_io_hints() to set both minimum_io_size and optimal_io_size
    to the thin-pool's data block size. This fixes an issue reported where
    mkfs.xfs would create more XFS Allocation Groups on thinp volumes than
    on a normal linear LV of comparable size, see:
    https://bugzilla.redhat.com/show_bug.cgi?id=1003227

    Reported-by: Chris Murphy
    Signed-off-by: Mike Snitzer

    Mike Snitzer