14 Oct, 2012

1 commit

  • Pull md updates from NeilBrown:
    - "discard" support, some dm-raid improvements and other assorted bits
    and pieces.

    * tag 'md-3.7' of git://neil.brown.name/md: (29 commits)
    md: refine reporting of resync/reshape delays.
    md/raid5: be careful not to resize_stripes too big.
    md: make sure manual changes to recovery checkpoint are saved.
    md/raid10: use correct limit variable
    md: writing to sync_action should clear the read-auto state.
    md: change resync_mismatches to atomic64_t to avoid races
    md/raid5: make sure to_read and to_write never go negative.
    md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.
    md/raid5: protect debug message against NULL dereference.
    md/raid5: add some missing locking in handle_failed_stripe.
    MD: raid5 avoid unnecessary zero page for trim
    MD: raid5 trim support
    md/bitmap: Don't use IS_ERR to judge alloc_page().
    md/raid1: Don't release reference to device while handling read error.
    raid: replace list_for_each_continue_rcu with new interface
    add further __init annotations to crypto/xor.c
    DM RAID: Fix for "sync" directive ineffectiveness
    DM RAID: Fix comparison of index and quantity for "rebuild" parameter
    DM RAID: Add rebuild capability for RAID10
    DM RAID: Move 'rebuild' checking code to its own function
    ...

    Linus Torvalds
     

13 Oct, 2012

4 commits

  • Use the recently-added bio front_pad field to allocate struct dm_target_io.

    Prior to this patch, dm_target_io was allocated from a mempool. For each
    dm_target_io, there is exactly one bio allocated from a bioset.

    This patch merges these two allocations into one allocation: we create a
    bioset with front_pad equal to the size of dm_target_io so that every
    bio allocated from the bioset has sizeof(struct dm_target_io) bytes
    before it. We allocate a bio and use the bytes before the bio as
    dm_target_io.

    _tio_cache is removed and the tio_pool mempool is now only used for
    request-based devices.

    This idea was introduced by Kent Overstreet.

    Signed-off-by: Mikulas Patocka
    Cc: Kent Overstreet
    Cc: Jens Axboe
    Cc: tj@kernel.org
    Cc: Vivek Goyal
    Cc: Bill Pemberton
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • The bio prison code will be useful to other future DM targets so
    move it to a separate module.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • The bio prison code will be useful to share with future DM targets.

    Prepare to move this code into a separate module, adding a dm prefix
    to structures and functions that will be exported.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Support discards when the pool's block size is not a power of 2.
    The block layer assumes discard_granularity is a power of 2 (in
    blkdev_issue_discard), so we set it to the largest power of 2 that
    divides into the number of sectors in each block, but never less than
    DATA_DEV_BLOCK_SIZE_MIN_SECTORS.

    This patch eliminates the "Discard support must be disabled when the
    block size is not a power of 2" constraint that was imposed in commit
    55f2b8b ("dm thin: support for non power of 2 pool blocksize"). That
    commit was incomplete: using a block size that is not a power of 2
    shouldn't mean disabling discard support on the device completely.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     

12 Oct, 2012

4 commits


11 Oct, 2012

29 commits

  • If 'resync_max' is set to 0 (as is often done when starting a
    reshape, so that mdadm can remain in control during a sensitive
    period), and if the reshape request is initially delayed because
    another array using the same devices is resyncing or reshaping etc.,
    then user-space cannot easily tell when the delay changes from being
    due to a conflicting reshape to being due to resync_max = 0.

    So introduce a new state: (curr_resync == 3) to reflect this, make
    sure it is visible both via /proc/mdstat and via the "sync_completed"
    sysfs attribute, and ensure that the event transition from one delay
    state to the other is properly notified.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • When a RAID5 is reshaping, conf->raid_disks is increased
    before mddev->delta_disks becomes zero.
    This can result in check_reshape calling resize_stripes with a
    number that is too large. This particularly happens
    when md_check_recovery calls ->check_reshape().

    If we use ->previous_raid_disks, we don't risk this.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • If you make an array bigger but suppress resync of the new region with
    mdadm --grow /dev/mdX --size=max --assume-clean

    then stop the array before anything is written to it, the effect of
    the "--assume-clean" is lost and the array will resync the new space
    when restarted.
    So ensure that we update the metadata in this case.

    Reported-by: Sebastian Riemer
    Signed-off-by: NeilBrown

    NeilBrown
     
  • Clang complains that we are assigning a variable to itself. This should
    be using bad_sectors like the similar earlier check does.

    Bug has been present since 3.1-rc1. It is minor but could
    conceivably cause corruption or other bad behaviour.

    Cc: stable@vger.kernel.org
    Signed-off-by: Dan Carpenter
    Signed-off-by: NeilBrown

    Dan Carpenter
     
  • In some cases arrays are started in the 'read-auto' state, in which
    nothing gets written to any device until the array is written
    to. The purpose of this is to make accidental auto-assembly
    of the wrong arrays less of a risk, and to allow arrays to be
    started to read suspend-to-disk images without actually changing
    anything (as might happen if the array were dirty and a
    resync seemed necessary).

    Explicitly writing the 'sync_action' for a read-auto array currently
    doesn't clear the read-auto state, so the sync action doesn't
    happen, which can be confusing.

    So allow any successful write to sync_action to clear any read-auto
    state.

    Reported-by: Alexander Kühn
    Signed-off-by: NeilBrown

    NeilBrown
     
  • Now that multiple threads can handle stripes, it is safer to
    use an atomic64_t for resync_mismatches, to avoid update races.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    Jianpeng Ma
     
  • to_read and to_write are part of the result of analysing
    a stripe before handling it.
    Their use is to avoid some loops and tests if the values are
    known to be zero. Thus it is not a problem if they are a
    little bit larger than they should be.

    So decrementing them in handle_failed_stripe serves little value, and
    due to races it could cause some loops to be skipped incorrectly.

    So remove those decrements.

    Reported-by: "Jianpeng Ma"
    Signed-off-by: NeilBrown

    NeilBrown
     
  • Signed-off-by: Alex Lyakas
    Suggested-by: Yair Hershko
    Signed-off-by: NeilBrown

    Alexander Lyakas
     
  • The pr_debug in add_stripe_bio could race with something
    changing *bip, so it is best to hold the lock until
    after the pr_debug.

    Reported-by: "Jianpeng Ma"
    Signed-off-by: NeilBrown

    NeilBrown
     
  • We really should hold the stripe_lock while accessing
    'toread' else we could race with add_stripe_bio and corrupt
    a list.

    Reported-by: "Jianpeng Ma"
    Signed-off-by: NeilBrown

    NeilBrown
     
  • We want to avoid zeroing the discarded dev page, because zeroing is
    useless for discard. But if we don't zero it, a subsequent read/write that
    hits such a page in the cache will get inconsistent data.

    To avoid zeroing the page, we don't set the R5_UPTODATE flag after
    construction is done. In this way, the discard write request is still issued
    and finished, but reads will not hit the page. If the stripe gets accessed
    soon, we need to reread the stripe, but since the chance is low, the reread
    isn't a big deal.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • Discard for raid4/5/6 has a limitation. If the discard request size is
    small, we discard on one disk, but we still need to calculate parity and
    write the parity disk. To correctly calculate parity, zero_after_discard
    must be guaranteed. Even if it's true, we would discard on one disk
    but write to the other disks, which makes the parity disks wear out
    fast. This doesn't make sense. So an efficient discard for raid4/5/6
    should discard all data disks and parity disks, which requires the
    write pattern to be (A, A+chunk_size, A+chunk_size*2, ...). If A's size
    is smaller than chunk_size, such a pattern is almost impossible in
    practice. So in this patch, I only handle the case where A's size
    equals chunk_size. That is, the discard request should be aligned to the
    stripe size and its size should be a multiple of the stripe size.

    Since we can only handle requests with a specific alignment and size (or
    the part of a request fitting whole stripes), we can't guarantee
    zero_after_discard even if zero_after_discard is true in the low-level
    drives.

    The block layer doesn't send down correctly aligned requests even when
    the correct discard alignment is set, so I must filter them out.

    For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
    zero_after_discard is true for all disks, data is consistent after
    discard. Otherwise, data might be lost. Let's consider a scenario:
    discard a stripe, then write data to one disk and write the parity disk.
    Until then the stripe could still be inconsistent, depending on whether
    data from the other data disks or the parity disk is used to calculate
    the new parity. If a disk is broken, we can't restore it. So in this
    patch, we only enable discard support if all disks have
    zero_after_discard.

    If discard fails on one disk, we face a similar inconsistency issue.
    The patch makes discard follow the same path as a normal
    write request. If discard fails, a resync is scheduled to make
    the data consistent. Having extra writes isn't good, but data
    consistency is important.

    If a subsequent read/write request hits the raid5 cache of a discarded
    stripe, the discarded dev page should be zero-filled, so the data is
    consistent. This patch always zeroes the dev page for discard request
    stripes. This isn't optimal because a discard request doesn't need such
    payload. The next patch will avoid it.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    Jianpeng Ma
     
  • When we get a read error, we arrange for raid1d to handle it.
    Currently we release the reference on the device. This can result
    in
    conf->mirrors[read_disk].rdev
    being NULL in fix_read_error, if the device happens to get removed
    before the read error is handled.

    So instead keep the reference until the read error has been fully
    handled.

    Reported-by: hank
    Signed-off-by: NeilBrown

    NeilBrown
     
  • This patch replaces list_for_each_continue_rcu() with
    list_for_each_entry_continue_rcu() to save a few lines
    of code and allow removing list_for_each_continue_rcu().

    Reviewed-by: Paul E. McKenney
    Signed-off-by: Michael Wang
    Signed-off-by: NeilBrown

    Michael Wang
     
  • There are two table arguments that can be given to a DM RAID target
    that control whether the array is forced to (re)synchronize or skip
    initialization: "sync" and "nosync". When "sync" is given, we set
    mddev->recovery_cp to 0 in order to cause the device to resynchronize.
    This is insufficient if there is a bitmap in use, because the array
    will simply look at the bitmap and see that there is no recovery
    necessary.

    The fix is to skip over the loading of the superblocks when "sync" is
    given, causing new superblocks to be written that will force the array
    to go through initialization (i.e. synchronization).

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • DM RAID: Fix comparison of index and quantity for "rebuild" parameter

    The "rebuild" parameter takes an index argument that starts counting from
    zero. The conditional used to validate the index was using '>' rather than
    '>=', leaving the door open for an index value that would be 1 too large.

    Reported-by: Neil Brown
    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • DM RAID: Add code to validate replacement slots for RAID10 arrays

    RAID10 can handle 'copies - 1' failures for each mirror group. This code
    ensures the user has provided a valid array - one whose devices specified for
    rebuild do not exceed the amount of redundancy available.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • DM RAID: Move chunk of code to its own function

    The code that checks whether device replacements/rebuilds are possible given
    a specific RAID type is moved to its own function. It will expand further
    when the code to check RAID10 is added. A separate function makes it easier
    to read.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • MD RAID10: Fix a couple of potential kernel panics if RAID10 is used by dm-raid

    When device-mapper uses the RAID10 personality through dm-raid.c, there is no
    'gendisk' structure in mddev and some sysfs information is also not populated.

    This patch avoids touching those non-existent structures.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • Some ioctls don't need to take the mutex and doing so can cause
    a delay as it is held during super-block update.
    So move those ioctls out of the mutex and rely on rcu locking
    to ensure we don't access stale data.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Change the thread parameter, so the thread can carry extra info. Next patch
    will use it.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • Queuing writes to the md thread means that all requests go through the
    one processor, which may not be able to keep up with very high request
    rates.

    So use the plugging infrastructure to submit all requests on unplug.
    If a 'schedule' is needed, we fall back on the old approach of handing
    the requests to the thread for it to handle.

    This is nearly identical to a recent patch which provided similar
    functionality to RAID1.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • This makes md raid 10 support TRIM.

    If one disk supports discard and another does not, or one has
    discard_zero_data and another does not, the data from such disks could be
    inconsistent. But this should not matter; discarded data is
    useless. This will add an extra copy in rebuild, though.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • This makes md raid 1 support TRIM.
    If one disk supports discard and another does not, or one has
    discard_zero_data and another does not, the data from such disks could be
    inconsistent. But this should not matter; discarded data is useless. This
    will add an extra copy in rebuild, though.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • This makes md raid 0 support TRIM.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • This makes md linear support TRIM.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • According to the comment in the linear_stop function,
    rcu_dereference in the linear_start and linear_stop functions
    occurs under reconfig_mutex. The patch expresses this
    assumption in code and prevents a lockdep complaint.

    Found by Linux Driver Verification project (linuxtesting.org)

    Signed-off-by: Denis Efremov
    Signed-off-by: NeilBrown

    Denis Efremov
     
  • Pull block IO update from Jens Axboe:
    "Core block IO bits for 3.7. Not a huge round this time, it contains:

    - First series from Kent cleaning up and generalizing bio allocation
    and freeing.

    - WRITE_SAME support from Martin.

    - Mikulas patches to prevent O_DIRECT crashes when someone changes
    the block size of a device.

    - Make bio_split() work on data-less bio's (like trim/discards).

    - A few other minor fixups."

    Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew
    Morton. It is due to the VM no longer using a prio-tree (see commit
    6b2dbba8b6ac: "mm: replace vma prio_tree with an interval tree").

    So make set_blocksize() use mapping_mapped() instead of open-coding the
    internal VM knowledge that has changed.

    * 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits)
    block: makes bio_split support bio without data
    scatterlist: refactor the sg_nents
    scatterlist: add sg_nents
    fs: fix include/percpu-rwsem.h export error
    percpu-rw-semaphore: fix documentation typos
    fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared
    blockdev: turn a rw semaphore into a percpu rw semaphore
    Fix a crash when block device is read and block size is changed at the same time
    block: fix request_queue->flags initialization
    block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
    block: ioctl to zero block ranges
    block: Make blkdev_issue_zeroout use WRITE SAME
    block: Implement support for WRITE SAME
    block: Consolidate command flag and queue limit checks for merges
    block: Clean up special command handling logic
    block/blk-tag.c: Remove useless kfree
    block: remove the duplicated setting for congestion_threshold
    block: reject invalid queue attribute values
    block: Add bio_clone_bioset(), bio_clone_kmalloc()
    block: Consolidate bio_alloc_bioset(), bio_kmalloc()
    ...

    Linus Torvalds
     

03 Oct, 2012

1 commit

  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky leading to confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use new IRQ safe timer and cancelation now works as
    expected.

    * Another deficiency of delayed_work was lack of the counterpart of
    mod_timer() which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide interface
    and behave like timer which is executed with process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs and even in simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
     

29 Sep, 2012

1 commit

  • Pull dm fixes from Alasdair G Kergon:
    "A few fixes for problems discovered during the 3.6 cycle.

    Of particular note, are fixes to the thin target's discard support,
    which I hope is finally working correctly; and fixes for multipath
    ioctls and device limits when there are no paths."

    * tag 'dm-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
    dm verity: fix overflow check
    dm thin: fix discard support for data devices
    dm thin: tidy discard support
    dm: retain table limits when swapping to new table with no devices
    dm table: clear add_random unless all devices have it set
    dm: handle requests beyond end of device instead of using BUG_ON
    dm mpath: only retry ioctl when no paths if queue_if_no_path set
    dm thin: do not set discard_zeroes_data

    Linus Torvalds