03 Oct, 2012

1 commit

  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky, leading to a confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use the new IRQ-safe timer, and cancelation now works as
    expected.

    * Another deficiency of delayed_work was the lack of a counterpart to
    mod_timer(), which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide an interface
    and behave like a timer which executes in process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While the non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs, and even in a simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
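
    A minimal sketch of the cancel+queue to mod_delayed_work() conversion
    this series enables (the my_* names and the 100ms delay are invented
    for illustration; the workqueue calls themselves are the real API):

        #include <linux/workqueue.h>
        #include <linux/jiffies.h>

        static void my_timeout_fn(struct work_struct *work)
        {
                /* timeout handling elided */
        }

        static DECLARE_DELAYED_WORK(my_dwork, my_timeout_fn);

        static void my_rearm_old_style(void)
        {
                /* pre-3.7 pattern: racy cancel followed by requeue */
                cancel_delayed_work(&my_dwork);
                queue_delayed_work(system_wq, &my_dwork, msecs_to_jiffies(100));
        }

        static void my_rearm_new_style(void)
        {
                /* 3.7+: behaves like mod_timer(), but the callback runs
                 * in process context */
                mod_delayed_work(system_wq, &my_dwork, msecs_to_jiffies(100));
        }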
     

29 Sep, 2012

1 commit

  • Pull dm fixes from Alasdair G Kergon:
    "A few fixes for problems discovered during the 3.6 cycle.

    Of particular note are fixes to the thin target's discard support,
    which I hope is finally working correctly, and fixes for multipath
    ioctls and device limits when there are no paths."

    * tag 'dm-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
    dm verity: fix overflow check
    dm thin: fix discard support for data devices
    dm thin: tidy discard support
    dm: retain table limits when swapping to new table with no devices
    dm table: clear add_random unless all devices have it set
    dm: handle requests beyond end of device instead of using BUG_ON
    dm mpath: only retry ioctl when no paths if queue_if_no_path set
    dm thin: do not set discard_zeroes_data

    Linus Torvalds
     

27 Sep, 2012

9 commits

  • The 'enough' function is written to work with 'near' arrays only
    in that it implicitly assumes that the offset from one 'group' of
    devices to the next is the same as the number of copies.
    In reality it is the number of 'near' copies.

    So change it to make this number explicit.

    This bug makes it possible to run arrays without enough drives
    present, which is dangerous.
    It is appropriate for an -stable kernel, but will almost certainly
    need to be modified for some of them.

    Cc: stable@vger.kernel.org
    Reported-by: Jakub Husák
    Signed-off-by: NeilBrown

    NeilBrown
     
  • This patch fixes sector_t overflow checking in dm-verity.

    Without this patch, the code checks for overflow only if sector_t is
    smaller than long long, not if sector_t and long long have the same size.

    Signed-off-by: Mikulas Patocka
    Cc: stable@vger.kernel.org
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
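
    A width-independent way to express such a check, as a hedged sketch
    (not the dm-verity code; my_shift_overflows is invented):

        #include <linux/types.h>

        static inline bool my_shift_overflows(sector_t blocks, unsigned int shift)
        {
                /* works whether sector_t is 32 or 64 bits wide: if the
                 * shift wrapped, shifting back cannot recover the input */
                sector_t sectors = blocks << shift;

                return (sectors >> shift) != blocks;
        }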
     
  • The discard limits that get established for a thin-pool or thin device
    may be incompatible with the pool's data device. Avoid this by checking
    the discard limits of the pool's data device. If an incompatibility is
    found then the pool's 'discard passdown' feature is disabled.

    Change thin_io_hints to ensure that a thin device always uses the same
    queue limits as its pool device.

    Introduce requested_pf to track whether or not the table line originally
    contained the no_discard_passdown flag and use this directly for table
    output. We prepare the correct setting for discard_passdown directly in
    bind_control_target (called from pool_io_hints) and store it in
    adjusted_pf rather than waiting until we have access to pool->pf in
    pool_preresume.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • A little thin discard code refactoring to make the next patch (dm thin:
    fix discard support for data devices) more readable.
    Pull out a couple of functions (and use bools instead of unsigned for
    features).

    No functional changes.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Add a safety net that will re-use the DM device's existing limits in the
    event that the DM device has a temporary table that doesn't have any
    component devices. This is to reduce the chance that requests not
    respecting the hardware limits will reach the device.

    DM recalculates queue limits based only on devices which currently exist
    in the table. This creates a problem in the event all devices are
    temporarily removed such as all paths being lost in multipath. DM will
    reset the limits to the maximum permissible, which can then assemble
    requests which exceed the limits of the paths when the paths are
    restored. The request will fail the blk_rq_check_limits() test when
    sent to a path with lower limits, and will be retried without end by
    multipath. This became a much bigger issue after v3.6 commit fe86cdcef
    ("block: do not artificially constrain max_sectors for stacking
    drivers").

    Reported-by: David Jeffery
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Always clear QUEUE_FLAG_ADD_RANDOM if any underlying device does not
    have it set. Otherwise devices with predictable characteristics may
    contribute entropy.

    QUEUE_FLAG_ADD_RANDOM specifies whether or not queue IO timings
    contribute to the random pool.

    For bio-based targets this flag is always 0 because such devices have no
    real queue.

    For request-based devices this flag was always set to 1 by default.

    Now set it according to the flags on underlying devices. If there is at
    least one device which should not contribute, set the flag to zero: If a
    device, such as fast SSD storage, is not suitable for supplying entropy,
    a request-based queue stacked over it will not be either.

    Because the checking logic is exactly the same as for the rotational flag,
    share the iteration function with device_is_nonrot().

    Signed-off-by: Milan Broz
    Cc: stable@vger.kernel.org
    Signed-off-by: Alasdair G Kergon

    Milan Broz
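
    Sketch of the rule only (the my_* helpers are invented and the real
    dm-table.c iteration plumbing is not shown), assuming the
    QUEUE_FLAG_ADD_RANDOM flag and queue_flag_clear_unlocked() helper as
    found in this era's blkdev.h:

        #include <linux/blkdev.h>

        static bool my_dev_adds_randomness(struct block_device *bdev)
        {
                struct request_queue *q = bdev_get_queue(bdev);

                return q && test_bit(QUEUE_FLAG_ADD_RANDOM, &q->queue_flags);
        }

        /* keep the flag on the stacked queue only if every underlying
         * device has it set */
        static void my_stack_add_random(struct request_queue *top,
                                        bool all_devices_have_it)
        {
                if (!all_devices_have_it)
                        queue_flag_clear_unlocked(QUEUE_FLAG_ADD_RANDOM, top);
        }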
     
  • The BUG_ON for access beyond the end of the device that was introduced
    into dm_request_fn by commit 29e4013de7ad950280e4b2208 ("dm: implement
    REQ_FLUSH/FUA support for request-based dm") was an overly
    drastic (but simple) response to this situation.

    I have received a report that this BUG_ON was hit and now think
    it would be better to use dm_kill_unmapped_request() to fail the clone
    and original request with -EIO.

    map_request() will assign the valid target returned by
    dm_table_find_target to tio->ti. But when the target
    isn't valid tio->ti is never assigned (because map_request isn't
    called); so add a check for tio->ti != NULL to dm_done().

    Reported-by: Mike Christie
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jun'ichi Nomura
    Cc: stable@vger.kernel.org # v2.6.37+
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • When there are no paths and multipath receives an ioctl, it waits until
    a path becomes available. This behaviour is incorrect if the
    "queue_if_no_path" setting was not specified, as then the ioctl should
    be rejected immediately, which this patch now does.

    commit 35991652b ("dm mpath: allow ioctls to trigger pg init") should
    have checked if queue_if_no_path was configured before queueing IO.

    Checking for the queue_if_no_path feature, as is done in map_io(),
    allows the following table load to work without blocking in the
    multipath_ioctl retry loop:

    echo "0 1024 multipath 0 0 0 0" | dmsetup create mpath_nodevs

    Without this fix the multipath_ioctl will block with the following stack
    trace:

    blkid D 0000000000000002 0 23936 1 0x00000000
    ffff8802b89e5cd8 0000000000000082 ffff8802b89e5fd8 0000000000012440
    ffff8802b89e4010 0000000000012440 0000000000012440 0000000000012440
    ffff8802b89e5fd8 0000000000012440 ffff88030c2aab30 ffff880325794040
    Call Trace:
    [] schedule+0x29/0x70
    [] schedule_timeout+0x182/0x2e0
    [] ? lock_timer_base+0x70/0x70
    [] schedule_timeout_uninterruptible+0x1e/0x20
    [] msleep+0x20/0x30
    [] multipath_ioctl+0x109/0x170 [dm_multipath]
    [] dm_blk_ioctl+0xbc/0xd0 [dm_mod]
    [] __blkdev_driver_ioctl+0x28/0x30
    [] blkdev_ioctl+0xce/0x730
    [] block_ioctl+0x3c/0x40
    [] do_vfs_ioctl+0x8c/0x340
    [] ? sys_newfstat+0x33/0x40
    [] sys_ioctl+0xa1/0xb0
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org # 3.5+
    Acked-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
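
    The intended decision, as an illustrative sketch only (not the dm-mpath
    code; the helper name and error values are made up):

        #include <linux/types.h>
        #include <linux/errno.h>

        static int my_mpath_ioctl_check(unsigned int nr_valid_paths,
                                        bool queue_if_no_path)
        {
                if (!nr_valid_paths && !queue_if_no_path)
                        return -EIO;    /* reject immediately, no retry loop */
                if (!nr_valid_paths)
                        return -EAGAIN; /* queueing allowed; a path may appear */
                return 0;               /* a path exists, pass the ioctl on */
        }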
     
  • The dm thin pool target claims to support the zeroing of discarded
    data areas. This turns out to be incorrect when processing discards
    that do not exactly cover a complete number of blocks, so the target
    must always set discard_zeroes_data_unsupported.

    The thin pool target will zero blocks when they are allocated if the
    skip_block_zeroing feature is not specified. The block layer
    may send a discard that only partly covers a block. If a thin pool
    block is partially discarded then there is no guarantee that the
    discarded data will get zeroed before it is accessed again.
    Due to this, thin devices cannot claim discards will always zero data.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Joe Thornber
    Cc: stable@vger.kernel.org # 3.4+
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
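
    A worked sketch of the partial-coverage problem (the names and the
    power-of-two block-size assumption are mine, not dm-thin's):

        #include <linux/types.h>

        static bool my_discard_covers_whole_blocks(sector_t start, sector_t len,
                                                   sector_t sectors_per_block)
        {
                /* assumes a power-of-two block size for this sketch */
                sector_t mask = sectors_per_block - 1;

                return !(start & mask) && !(len & mask);
        }

        /*
         * e.g. with 128-sector blocks, a 64-sector discard starting at
         * sector 96 touches blocks 0 and 1 but fully covers neither, so
         * neither block is guaranteed to read back as zeroes.
         */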
     

24 Sep, 2012

1 commit

  • commit b17459c05000fdbe8d10946570a26510f86ec0f
    raid5: add a per-stripe lock

    added a spin_lock to the 'stripe_head' struct.
    Unfortunately there are two places where this struct is allocated
    but the spin lock was only initialised in one of them.

    So add the missing spin_lock_init.

    Signed-off-by: NeilBrown

    NeilBrown
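
    The general shape of the fix, as a hedged sketch (my_stripe stands in
    for raid5's stripe_head; this is not the raid5.c hunk itself):

        #include <linux/spinlock.h>
        #include <linux/slab.h>

        struct my_stripe {
                spinlock_t stripe_lock;         /* the per-stripe lock */
                /* ... */
        };

        static struct my_stripe *my_alloc_stripe(gfp_t gfp)
        {
                struct my_stripe *sh = kzalloc(sizeof(*sh), gfp);

                if (sh)
                        /* must happen in *every* allocation path */
                        spin_lock_init(&sh->stripe_lock);
                return sh;
        }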
     

19 Sep, 2012

3 commits

  • It isn't always necessary to update the metadata when spares are
    removed as the presence-or-not of a spare isn't really important to
    the integrity of an array.
    Also activating a spare doesn't always require updating the metadata
    as the update on 'recovery-completed' is usually sufficient.

    However the introduction of 'replacement' devices has made these
    transitions sometimes more important. For example the 'Replacement'
    flag isn't cleared until the original device is removed, so we need
    to ensure a metadata update after that 'spare' is removed.

    So set MD_CHANGE_DEVS whenever a spare is activated or removed, to
    complement the current situation where it is set when a spare is added
    or a device is failed (or a number of other less common situations).

    This is suitable for -stable as out-of-date metadata could lead
    to data corruption.
    This is only relevant for 3.3 and later, when 'replacement' was
    introduced.

    Cc: stable@vger.kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     
  • When a replacement device becomes active, we mark the device that it
    replaces as 'faulty' so that it can subsequently get removed.
    However 'calc_degraded' only pays attention to the primary device, not
    the replacement, so the array appears to become degraded, which is
    wrong.

    So teach 'calc_degraded' to consider any replacement if a primary
    device is faulty.

    This is suitable for -stable as an incorrect 'degraded' value can
    confuse md and could lead to data corruption.
    This is only relevant for 3.3 and later.

    Cc: stable@vger.kernel.org
    Reported-by: Robin Hill
    Reported-by: John Drescher
    Signed-off-by: NeilBrown

    NeilBrown
     
  • This reverts commit 895e3c5c58a80bb9e4e05d9ac38b4f30e0f97d80.

    While this patch seemed like a good idea and did help some workloads,
    it hurts other workloads.
    Large sequential O_DIRECT writes were faster;
    small random O_DIRECT writes were slower.

    Other changes (batching RAID5 writes) have improved the sequential
    writes using a different mechanism, so the net result of this patch
    is definitely negative. So revert it.

    Reported-by: Shaohua Li
    Tested-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    NeilBrown
     

21 Aug, 2012

1 commit

  • flush[_delayed]_work_sync() are now spurious. Mark them deprecated
    and convert all users to flush[_delayed]_work().

    If you're cc'd and wondering what's going on: Now all workqueues are
    non-reentrant and the regular flushes guarantee that the work item is
    not pending or running on any CPU on return, so there's no reason to
    use the sync flushes at all and they're going away.

    This patch doesn't make any functional difference.

    Signed-off-by: Tejun Heo
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Ian Campbell
    Cc: Jens Axboe
    Cc: Mattia Dongili
    Cc: Kent Yoder
    Cc: David Airlie
    Cc: Jiri Kosina
    Cc: Karsten Keil
    Cc: Bryan Wu
    Cc: Benjamin Herrenschmidt
    Cc: Alasdair Kergon
    Cc: Mauro Carvalho Chehab
    Cc: Florian Tobias Schandinat
    Cc: David Woodhouse
    Cc: "David S. Miller"
    Cc: linux-wireless@vger.kernel.org
    Cc: Anton Vorontsov
    Cc: Sangbeom Kim
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: Eric Van Hensbergen
    Cc: Takashi Iwai
    Cc: Steven Whitehouse
    Cc: Petr Vandrovec
    Cc: Mark Fasheh
    Cc: Christoph Hellwig
    Cc: Avi Kivity

    Tejun Heo
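
    The conversion asked of the cc'd maintainers is mechanical; a minimal
    sketch (the my_* names are invented):

        #include <linux/workqueue.h>

        static void my_work_fn(struct work_struct *work)
        {
                /* work body elided */
        }

        static DECLARE_WORK(my_work, my_work_fn);
        static DECLARE_DELAYED_WORK(my_dwork, my_work_fn);

        static void my_teardown(void)
        {
                /* was: flush_work_sync(&my_work);
                 *      flush_delayed_work_sync(&my_dwork); */
                flush_work(&my_work);           /* now just as strong */
                flush_delayed_work(&my_dwork);
        }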
     

18 Aug, 2012

1 commit

  • A 'struct r10bio' has an array of per-copy information at the end.
    This array is declared with size [0] and r10bio_pool_alloc allocates
    enough extra space to store the per-copy information depending on the
    number of copies needed.

    So declaring a 'struct r10bio' on the stack isn't going to work. It
    won't allocate enough space, and memory corruption will ensue.

    So in the two places where this is done, declare a sufficiently large
    structure and use that instead.

    The two call-sites of this bug were introduced in 3.4 and 3.5
    so this is suitable for both those kernels. The patch will have to
    be modified for 3.4 as it only has one bug.

    Cc: stable@vger.kernel.org
    Reported-by: Ivan Vasilyev
    Tested-by: Ivan Vasilyev
    Signed-off-by: NeilBrown

    NeilBrown
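
    A generic illustration of the pitfall and the fix (my_bio stands in for
    'struct r10bio'; the names and the MY_MAX_COPIES bound are invented):

        struct my_copy_info {
                int devnum;
        };

        struct my_bio {
                int ncopies;
                struct my_copy_info devs[0];    /* sized at allocation time */
        };

        #define MY_MAX_COPIES 8                 /* made-up upper bound */

        static void my_on_stack_use(void)
        {
                /* BROKEN: "struct my_bio b;" leaves no room for b.devs[] */
                struct {
                        struct my_bio bio;
                        struct my_copy_info devs[MY_MAX_COPIES];
                } on_stack;

                on_stack.bio.ncopies = MY_MAX_COPIES;
                on_stack.bio.devs[0].devnum = 0; /* backed by on_stack.devs[] */
        }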
     

16 Aug, 2012

1 commit

  • commit 27a7b260f71439c40546b43588448faac01adb93
    md: Fix handling for devices from 2TB to 4TB in 0.90 metadata.

    changed 0.90 metadata handling to truncate the size to 4TB, as that is
    all that 0.90 can record.
    However for RAID0 and Linear, 0.90 doesn't need to record the size, so
    this truncation is not needed and causes working arrays to become too small.

    So avoid the truncation for RAID0 and Linear.

    This bug was introduced in 3.1 and is suitable for any stable kernels
    from then onwards.
    As the offending commit was tagged for 'stable', any stable kernel
    that it was applied to should also get this patch. That includes
    at least 2.6.32, 2.6.33 and 3.0. (Thanks to Ben Hutchings for
    providing that list).

    Cc: stable@vger.kernel.org
    Signed-off-by: Neil Brown

    NeilBrown
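
    The rule being restored, as a hedged sketch (not the md.c code; the
    helper name is invented, and -1 stands for LEVEL_LINEAR):

        #include <linux/types.h>

        static u64 my_090_cap_size(int level, u64 sectors)
        {
                const u64 four_tb_sectors = 4ULL * 1024 * 1024 * 1024 * 2;

                if (level == 0 /* RAID0 */ || level == -1 /* LINEAR */)
                        return sectors; /* size is never recorded: don't cap */
                return sectors > four_tb_sectors ? four_tb_sectors : sectors;
        }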
     

03 Aug, 2012

1 commit

  • Pull additional md update from NeilBrown:
    "This contains a few patches that depend on plugging changes in the
    block layer so needed to wait for those.

    It also contains a Kconfig fix for the new RAID10 support in dm-raid."

    * tag 'md-3.6' of git://neil.brown.name/md:
    md/dm-raid: DM_RAID should select MD_RAID10
    md/raid1: submit IO from originating thread instead of md thread.
    raid5: raid5d handle stripe in batch way
    raid5: make_request use batch stripe release

    Linus Torvalds
     

02 Aug, 2012

6 commits

  • Now that DM_RAID supports raid10, it needs to select that code
    to ensure it is included.

    Cc: Jonathan Brassow
    Reported-by: Fengguang Wu
    Signed-off-by: NeilBrown

    NeilBrown
     
  • Queuing writes to the md thread means that all requests go through the
    one processor, which may not be able to keep up with very high request
    rates.

    So use the plugging infrastructure to submit all requests on unplug.
    If a 'schedule' is needed, we fall back on the old approach of handing
    the requests to the thread for it to handle.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Let raid5d handle stripes in a batched way to reduce conf->device_lock locking.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • make_request() does a stripe release for every stripe, and the stripe
    usually has count 1, which makes the previous release_stripe()
    optimization not work. In my test, this release_stripe() becomes the
    heaviest place to take conf->device_lock after the previous patches are
    applied.

    This patch batches stripe release: all the stripes are released on
    unplug. The STRIPE_ON_UNPLUG_LIST bit protects concurrent access to the
    stripe lru.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • Pull block driver changes from Jens Axboe:

    - Making the plugging support for drivers a bit more sane from Neil.
    This supersedes the plugging change from Shaohua as well.

    - The usual round of drbd updates.

    - Using a tail add instead of a head add in the request completion for
    nbd, making us find the most completed request more quickly.

    - A few floppy changes, getting rid of a duplicated flag and also
    running the floppy init async (since it takes forever in boot terms)
    from Andi.

    * 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
    floppy: remove duplicated flag FD_RAW_NEED_DISK
    blk: pass from_schedule to non-request unplug functions.
    block: stack unplug
    blk: centralize non-request unplug handling.
    md: remove plug_cnt feature of plugging.
    block/nbd: micro-optimization in nbd request completion
    drbd: announce FLUSH/FUA capability to upper layers
    drbd: fix max_bio_size to be unsigned
    drbd: flush drbd work queue before invalidate/invalidate remote
    drbd: fix potential access after free
    drbd: call local-io-error handler early
    drbd: do not reset rs_pending_cnt too early
    drbd: reset congestion information before reporting it in /proc/drbd
    drbd: report congestion if we are waiting for some userland callback
    drbd: differentiate between normal and forced detach
    drbd: cleanup, remove two unused global flags
    floppy: Run floppy initialization asynchronous

    Linus Torvalds
     
  • Pull md updates from NeilBrown.

    * 'for-next' of git://neil.brown.name/md:
    DM RAID: Add support for MD RAID10
    md/RAID1: Add missing case for attempting to repair known bad blocks.
    md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE.
    md/raid1: don't abort a resync on the first badblock.
    md: remove duplicated test on ->openers when calling do_md_stop()
    raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer
    md/raid1: prevent merging too large request
    md/raid1: read balance chooses idlest disk for SSD
    md/raid1: make sequential read detection per disk based
    MD RAID10: Export md_raid10_congested
    MD: Move macros from raid1*.h to raid1*.c
    MD RAID1: rename mirror_info structure
    MD RAID10: rename mirror_info structure
    MD RAID10: Fix compiler warning.
    raid5: add a per-stripe lock
    raid5: remove unnecessary bitmap write optimization
    raid5: lockless access raid5 overrided bi_phys_segments
    raid5: reduce chance release_stripe() taking device_lock

    Linus Torvalds
     

01 Aug, 2012

2 commits


31 Jul, 2012

13 commits

  • This will allow md/raid to know why the unplug was called, and it will
    be able to act accordingly: if !from_schedule it is safe to perform
    tasks which could themselves schedule.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
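
    What a callback gains from the new argument, as a sketch (my_unplug is
    invented; only the (struct blk_plug_cb *, bool) callback shape comes
    from this series):

        #include <linux/blkdev.h>

        static void my_unplug(struct blk_plug_cb *cb, bool from_schedule)
        {
                if (from_schedule) {
                        /* called from inside the scheduler:
                         * punt the work to a helper thread */
                        return;
                }
                /* normal process context: safe to submit IO directly,
                 * even if that might itself schedule */
        }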
     
  • Both md and umem have similar code for getting notified on a
    blk_finish_plug event.
    Centralize this code in block/ and allow each driver to
    provide its distinctive difference.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     
  • This seemed like a good idea at the time, but after further thought I
    cannot see it making a difference other than very occasionally and
    testing to try to exercise the case it is most likely to help did not
    show any performance difference by removing it.

    So remove the counting of active plugs and allow 'pending writes' to
    be activated at any time, not just when no plugs are active.

    This is only relevant when there is a write-intent bitmap, and the
    updating of the bitmap will likely introduce enough delay that
    the single-threading of bitmap updates will be enough to collect large
    numbers of updates together.

    Removing this will make it easier to centralise the unplug code, and
    will clear the way for other unplug enhancements which have a
    measurable effect.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     
  • When doing resync or repair, attempt to correct bad blocks, according
    to the WriteErrorSeen policy.

    Signed-off-by: Alex Lyakas
    Signed-off-by: NeilBrown

    Alexander Lyakas
     
  • Merge Andrew's first set of patches:
    "Non-MM patches:

    - lots of misc bits

    - tree-wide have_clk() cleanups

    - quite a lot of printk tweaks. I draw your attention to "printk:
    convert the format for KERN_<LEVEL> to a 2 byte pattern" which
    looks a bit scary. But afaict it's solid.

    - backlight updates

    - lib/ feature work (notably the addition and use of memweight())

    - checkpatch updates

    - rtc updates

    - nilfs updates

    - fatfs updates (partial, still waiting for acks)

    - kdump, proc, fork, IPC, sysctl, taskstats, pps, etc

    - new fault-injection feature work"

    * Merge emailed patches from Andrew Morton: (128 commits)
    drivers/misc/lkdtm.c: fix missing allocation failure check
    lib/scatterlist: do not re-write gfp_flags in __sg_alloc_table()
    fault-injection: add tool to run command with failslab or fail_page_alloc
    fault-injection: add selftests for cpu and memory hotplug
    powerpc: pSeries reconfig notifier error injection module
    memory: memory notifier error injection module
    PM: PM notifier error injection module
    cpu: rewrite cpu-notifier-error-inject module
    fault-injection: notifier error injection
    c/r: fcntl: add F_GETOWNER_UIDS option
    resource: make sure requested range is included in the root range
    include/linux/aio.h: cpp->C conversions
    fs: cachefiles: add support for large files in filesystem caching
    pps: return PTR_ERR on error in device_create
    taskstats: check nla_reserve() return
    sysctl: suppress kmemleak messages
    ipc: use Kconfig options for __ARCH_WANT_[COMPAT_]IPC_PARSE_VERSION
    ipc: compat: use signed size_t types for msgsnd and msgrcv
    ipc: allow compat IPC version field parsing if !ARCH_WANT_OLD_COMPAT_IPC
    ipc: add COMPAT_SHMLBA support
    ...

    Linus Torvalds
     
  • Use memweight() to count the total number of bits set in a memory area.

    Signed-off-by: Akinobu Mita
    Cc: Alasdair Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
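
    A hedged sketch of the kind of open-coded loop memweight() replaces
    (the wrapper and its arguments are invented):

        #include <linux/string.h>
        #include <linux/types.h>

        static size_t my_count_set_bits(const void *buf, size_t nbytes)
        {
                /*
                 * open-coded equivalent:
                 *
                 *      size_t i, n = 0;
                 *      for (i = 0; i < nbytes; i++)
                 *              n += hweight8(((const u8 *)buf)[i]);
                 *      return n;
                 */
                return memweight(buf, nbytes);
        }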
     
  • 'sync' writes set both REQ_SYNC and REQ_NOIDLE.
    O_DIRECT writes set REQ_SYNC but not REQ_NOIDLE.

    We currently assume that a REQ_SYNC request will not be followed by
    more requests and so set STRIPE_PREREAD_ACTIVE to expedite the
    request.
    This is appropriate for sync requests, but not for O_DIRECT requests.

    So make the setting of STRIPE_PREREAD_ACTIVE conditional on REQ_NOIDLE
    rather than REQ_SYNC. This is consistent with the documented meaning
    of REQ_NOIDLE:

    __REQ_NOIDLE, /* don't anticipate more IO after this one */

    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    majianpeng
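
    Sketch of the changed condition only (the helper is invented; rw_flags
    stands in for bio->bi_rw):

        #include <linux/blk_types.h>

        static bool my_expedite_stripe(unsigned long rw_flags)
        {
                /* before: rw_flags & REQ_SYNC -- also set by O_DIRECT writes
                 * after:  only "don't anticipate more IO" requests qualify */
                return rw_flags & REQ_NOIDLE;
        }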
     
  • If a resync of a RAID1 array with 2 devices finds a known bad block on
    one device, it will neither read from, nor write to, that device for
    this block offset.
    So there will be one read_target (the other device) and zero write
    targets.
    This condition causes md/raid1 to abort the resync assuming that it
    has finished - without known bad blocks this would be true.

    When there are no write targets because of the presence of bad blocks
    we should only skip over the area covered by the bad block.
    RAID10 already gets this right, raid1 doesn't. Or didn't.

    As this can cause a 'sync' to abort early and appear to have succeeded,
    it could lead to some data corruption, so it is suitable for -stable.

    Cc: stable@vger.kernel.org
    Reported-by: Alexander Lyakas
    Signed-off-by: NeilBrown

    NeilBrown
     
  • do_md_stop tests mddev->openers while holding ->open_mutex,
    and fails if this count is too high.
    So callers do not need to check mddev->openers and doing so isn't
    very meaningful as they don't hold ->open_mutex so the number could
    change.

    So remove the unnecessary tests on mddev->openers.
    These are not called often enough for there to be any gain in
    an early test on ->open_mutex to avoid the need for a slightly more
    costly mutex_lock call.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Because bios are merged at the block layer, a bio error may be caused by
    another bio that was merged into the same request.
    Using this flag, we can find the exact error sector and avoid redundant
    operations like re-writing and re-reading.

    V0->V1: Use REQ_FLUSH instead of REQ_NOMERGE to avoid bio merging at the
    block layer.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    majianpeng
     
  • For SSD, if the request size exceeds a specific value (the optimal io
    size), request size isn't important for bandwidth. In such a condition,
    if making the request size bigger will leave some disks idle, the total
    throughput will actually drop. A good example is doing readahead in a
    two-disk raid1 setup.

    So when should we split big requests? We absolutely don't want to split
    a big request into very small requests. Even on SSD, a big request
    transfer is more efficient. This patch only considers requests with a
    size above the optimal io size.

    If all disks are busy, is it worth doing a split? Say the optimal io
    size is 16k, with two 32k requests and two disks. We can let each disk
    run one 32k request, or split the requests into four 16k requests so
    that each disk runs two. It's hard to say which case is better; it
    depends on the hardware.

    So only consider the case where there are idle disks. For readahead,
    splitting is always better in this case. And in my test, the patch
    below can improve throughput by > 30%. Hmm, not 100%, because the disks
    aren't 100% busy.

    Such a case can happen not just in readahead but also, for example, in
    directio. But I suppose directio usually has a bigger IO depth and
    makes all disks busy, so I ignored it.

    Note: if the raid uses any hard disk, we don't prevent merging. That
    would make performance worse.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • An SSD has no spindle, so the distance between requests means nothing.
    And the original distance-based algorithm can sometimes cause severe
    performance issues for SSD raid.

    Consider two thread groups: one accesses file A, the other accesses
    file B. The first group will access one disk and the second will access
    the other disk, because requests are near within one group and far
    between groups. In this case, read balance might keep one disk very
    busy while the other is relatively idle. For SSD, we should try our
    best to distribute requests to as many disks as possible. There is no
    spindle move penalty anyway.

    With the patch below, I can sometimes see more than 50% throughput
    improvement, depending on the workload.

    The only exception is small requests that can be merged into a big
    request, which typically drives higher throughput for SSD too. Such
    small requests are sequential reads. Unlike on hard disk, a sequential
    read which can't be merged (for example direct IO, or a read without
    readahead) can be ignored for SSD; again there is no spindle move
    penalty. Readahead dispatches small requests and such requests can be
    merged.

    The last patch helps detect sequential reads well, at least if the
    number of concurrent reads isn't greater than the number of raid disks.
    In that case, the distance-based algorithm doesn't work well either.

    V2: For a mixed hard disk and SSD raid, don't use the distance-based
    algorithm for random IO either. This makes the algorithm generic for
    raid with SSD.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • Currently the sequential read detection is global. It's natural to make
    it per-disk based, which can improve the detection of multiple
    concurrent sequential reads. And the next patch will make SSD read
    balance not use the distance-based algorithm, where this change helps
    detect truly sequential reads for SSD.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li