03 Oct, 2012

1 commit

  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky, leading to a confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use the new IRQ-safe timer, and cancelation now works as
    expected.

    * Another deficiency of delayed_work was the lack of a counterpart to
    mod_timer(), which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide an interface
    and behave like a timer which executes in process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While the non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs, and even in a simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
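
    A minimal sketch of the cancel+queue to mod_delayed_work() conversion
    this series enables (the my_* names and the 100ms delay are invented
    for illustration; the workqueue calls themselves are the real API):

        #include <linux/workqueue.h>
        #include <linux/jiffies.h>

        static void my_timeout_fn(struct work_struct *work)
        {
                /* timeout handling elided */
        }

        static DECLARE_DELAYED_WORK(my_dwork, my_timeout_fn);

        static void my_rearm_old_style(void)
        {
                /* pre-3.7 pattern: racy cancel followed by requeue */
                cancel_delayed_work(&my_dwork);
                queue_delayed_work(system_wq, &my_dwork, msecs_to_jiffies(100));
        }

        static void my_rearm_new_style(void)
        {
                /* 3.7+: behaves like mod_timer(), but the callback runs
                 * in process context */
                mod_delayed_work(system_wq, &my_dwork, msecs_to_jiffies(100));
        }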
     

29 Sep, 2012

1 commit

  • Pull dm fixes from Alasdair G Kergon:
    "A few fixes for problems discovered during the 3.6 cycle.

    Of particular note are fixes to the thin target's discard support,
    which I hope is finally working correctly, and fixes for multipath
    ioctls and device limits when there are no paths."

    * tag 'dm-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
    dm verity: fix overflow check
    dm thin: fix discard support for data devices
    dm thin: tidy discard support
    dm: retain table limits when swapping to new table with no devices
    dm table: clear add_random unless all devices have it set
    dm: handle requests beyond end of device instead of using BUG_ON
    dm mpath: only retry ioctl when no paths if queue_if_no_path set
    dm thin: do not set discard_zeroes_data

    Linus Torvalds
     

27 Sep, 2012

9 commits

  • The 'enough' function is written to work with 'near' arrays only
    in that it implicitly assumes that the offset from one 'group' of
    devices to the next is the same as the number of copies.
    In reality it is the number of 'near' copies.

    So change it to make this number explicit.

    This bug makes it possible to run arrays without enough drives
    present, which is dangerous.
    It is appropriate for an -stable kernel, but will almost certainly
    need to be modified for some of them.

    Cc: stable@vger.kernel.org
    Reported-by: Jakub Husák
    Signed-off-by: NeilBrown

    NeilBrown
     
  • This patch fixes sector_t overflow checking in dm-verity.

    Without this patch, the code checks for overflow only if sector_t is
    smaller than long long, not if sector_t and long long have the same size.

    Signed-off-by: Mikulas Patocka
    Cc: stable@vger.kernel.org
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
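
    A width-independent way to express such a check, as a hedged sketch
    (not the dm-verity code; my_shift_overflows is invented):

        #include <linux/types.h>

        static inline bool my_shift_overflows(sector_t blocks, unsigned int shift)
        {
                /* works whether sector_t is 32 or 64 bits wide: if the
                 * shift wrapped, shifting back cannot recover the input */
                sector_t sectors = blocks << shift;

                return (sectors >> shift) != blocks;
        }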
     
  • The discard limits that get established for a thin-pool or thin device
    may be incompatible with the pool's data device. Avoid this by checking
    the discard limits of the pool's data device. If an incompatibility is
    found then the pool's 'discard passdown' feature is disabled.

    Change thin_io_hints to ensure that a thin device always uses the same
    queue limits as its pool device.

    Introduce requested_pf to track whether or not the table line originally
    contained the no_discard_passdown flag and use this directly for table
    output. We prepare the correct setting for discard_passdown directly in
    bind_control_target (called from pool_io_hints) and store it in
    adjusted_pf rather than waiting until we have access to pool->pf in
    pool_preresume.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • A little thin discard code refactoring to make the next patch (dm thin:
    fix discard support for data devices) more readable.
    Pull out a couple of functions (and use bools instead of unsigned for
    features).

    No functional changes.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Add a safety net that will re-use the DM device's existing limits in the
    event that the DM device has a temporary table that doesn't have any
    component devices. This is to reduce the chance that requests not
    respecting the hardware limits will reach the device.

    DM recalculates queue limits based only on devices which currently exist
    in the table. This creates a problem in the event all devices are
    temporarily removed such as all paths being lost in multipath. DM will
    reset the limits to the maximum permissible, which can then assemble
    requests which exceed the limits of the paths when the paths are
    restored. The request will fail the blk_rq_check_limits() test when
    sent to a path with lower limits, and will be retried without end by
    multipath. This became a much bigger issue after v3.6 commit fe86cdcef
    ("block: do not artificially constrain max_sectors for stacking
    drivers").

    Reported-by: David Jeffery
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Always clear QUEUE_FLAG_ADD_RANDOM if any underlying device does not
    have it set. Otherwise devices with predictable characteristics may
    contribute entropy.

    QUEUE_FLAG_ADD_RANDOM specifies whether or not queue IO timings
    contribute to the random pool.

    For bio-based targets this flag is always 0 because such devices have no
    real queue.

    For request-based devices this flag was always set to 1 by default.

    Now set it according to the flags on underlying devices. If there is at
    least one device which should not contribute, set the flag to zero: If a
    device, such as fast SSD storage, is not suitable for supplying entropy,
    a request-based queue stacked over it will not be either.

    Because the checking logic is exactly the same as for the rotational flag,
    share the iteration function with device_is_nonrot().

    Signed-off-by: Milan Broz
    Cc: stable@vger.kernel.org
    Signed-off-by: Alasdair G Kergon

    Milan Broz
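
    Sketch of the rule only (the my_* helpers are invented and the real
    dm-table.c iteration plumbing is not shown), assuming the
    QUEUE_FLAG_ADD_RANDOM flag and queue_flag_clear_unlocked() helper as
    found in this era's blkdev.h:

        #include <linux/blkdev.h>

        static bool my_dev_adds_randomness(struct block_device *bdev)
        {
                struct request_queue *q = bdev_get_queue(bdev);

                return q && test_bit(QUEUE_FLAG_ADD_RANDOM, &q->queue_flags);
        }

        /* keep the flag on the stacked queue only if every underlying
         * device has it set */
        static void my_stack_add_random(struct request_queue *top,
                                        bool all_devices_have_it)
        {
                if (!all_devices_have_it)
                        queue_flag_clear_unlocked(QUEUE_FLAG_ADD_RANDOM, top);
        }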
     
  • The BUG_ON for access beyond the end of the device that was introduced
    into dm_request_fn by commit 29e4013de7ad950280e4b2208 ("dm: implement
    REQ_FLUSH/FUA support for request-based dm") was an overly
    drastic (but simple) response to this situation.

    I have received a report that this BUG_ON was hit and now think
    it would be better to use dm_kill_unmapped_request() to fail the clone
    and original request with -EIO.

    map_request() will assign the valid target returned by
    dm_table_find_target to tio->ti. But when the target
    isn't valid tio->ti is never assigned (because map_request isn't
    called); so add a check for tio->ti != NULL to dm_done().

    Reported-by: Mike Christie
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jun'ichi Nomura
    Cc: stable@vger.kernel.org # v2.6.37+
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • When there are no paths and multipath receives an ioctl, it waits until
    a path becomes available. This behaviour is incorrect if the
    "queue_if_no_path" setting was not specified, as then the ioctl should
    be rejected immediately, which this patch now does.

    commit 35991652b ("dm mpath: allow ioctls to trigger pg init") should
    have checked if queue_if_no_path was configured before queueing IO.

    Checking for the queue_if_no_path feature, as is done in map_io(),
    allows the following table load to work without blocking in the
    multipath_ioctl retry loop:

    echo "0 1024 multipath 0 0 0 0" | dmsetup create mpath_nodevs

    Without this fix the multipath_ioctl will block with the following stack
    trace:

    blkid D 0000000000000002 0 23936 1 0x00000000
    ffff8802b89e5cd8 0000000000000082 ffff8802b89e5fd8 0000000000012440
    ffff8802b89e4010 0000000000012440 0000000000012440 0000000000012440
    ffff8802b89e5fd8 0000000000012440 ffff88030c2aab30 ffff880325794040
    Call Trace:
    [] schedule+0x29/0x70
    [] schedule_timeout+0x182/0x2e0
    [] ? lock_timer_base+0x70/0x70
    [] schedule_timeout_uninterruptible+0x1e/0x20
    [] msleep+0x20/0x30
    [] multipath_ioctl+0x109/0x170 [dm_multipath]
    [] dm_blk_ioctl+0xbc/0xd0 [dm_mod]
    [] __blkdev_driver_ioctl+0x28/0x30
    [] blkdev_ioctl+0xce/0x730
    [] block_ioctl+0x3c/0x40
    [] do_vfs_ioctl+0x8c/0x340
    [] ? sys_newfstat+0x33/0x40
    [] sys_ioctl+0xa1/0xb0
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org # 3.5+
    Acked-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
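
    The intended decision, as an illustrative sketch only (not the dm-mpath
    code; the helper name and error values are made up):

        #include <linux/types.h>
        #include <linux/errno.h>

        static int my_mpath_ioctl_check(unsigned int nr_valid_paths,
                                        bool queue_if_no_path)
        {
                if (!nr_valid_paths && !queue_if_no_path)
                        return -EIO;    /* reject immediately, no retry loop */
                if (!nr_valid_paths)
                        return -EAGAIN; /* queueing allowed; a path may appear */
                return 0;               /* a path exists, pass the ioctl on */
        }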
     
  • The dm thin pool target claims to support the zeroing of discarded
    data areas. This turns out to be incorrect when processing discards
    that do not exactly cover a complete number of blocks, so the target
    must always set discard_zeroes_data_unsupported.

    The thin pool target will zero blocks when they are allocated if the
    skip_block_zeroing feature is not specified. The block layer
    may send a discard that only partly covers a block. If a thin pool
    block is partially discarded then there is no guarantee that the
    discarded data will get zeroed before it is accessed again.
    Due to this, thin devices cannot claim discards will always zero data.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Joe Thornber
    Cc: stable@vger.kernel.org # 3.4+
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
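
    A worked sketch of the partial-coverage problem (the names and the
    power-of-two block-size assumption are mine, not dm-thin's):

        #include <linux/types.h>

        static bool my_discard_covers_whole_blocks(sector_t start, sector_t len,
                                                   sector_t sectors_per_block)
        {
                /* assumes a power-of-two block size for this sketch */
                sector_t mask = sectors_per_block - 1;

                return !(start & mask) && !(len & mask);
        }

        /*
         * e.g. with 128-sector blocks, a 64-sector discard starting at
         * sector 96 touches blocks 0 and 1 but fully covers neither, so
         * neither block is guaranteed to read back as zeroes.
         */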
     

24 Sep, 2012

1 commit

  • commit b17459c05000fdbe8d10946570a26510f86ec0f
    raid5: add a per-stripe lock

    added a spin_lock to the 'stripe_head' struct.
    Unfortunately there are two places where this struct is allocated
    but the spin lock was only initialised in one of them.

    So add the missing spin_lock_init.

    Signed-off-by: NeilBrown

    NeilBrown
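
    The general shape of the fix, as a hedged sketch (my_stripe stands in
    for raid5's stripe_head; this is not the raid5.c hunk itself):

        #include <linux/spinlock.h>
        #include <linux/slab.h>

        struct my_stripe {
                spinlock_t stripe_lock;         /* the per-stripe lock */
                /* ... */
        };

        static struct my_stripe *my_alloc_stripe(gfp_t gfp)
        {
                struct my_stripe *sh = kzalloc(sizeof(*sh), gfp);

                if (sh)
                        /* must happen in *every* allocation path */
                        spin_lock_init(&sh->stripe_lock);
                return sh;
        }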
     

19 Sep, 2012

3 commits

  • It isn't always necessary to update the metadata when spares are
    removed as the presence-or-not of a spare isn't really important to
    the integrity of an array.
    Also activating a spare doesn't always require updating the metadata
    as the update on 'recovery-completed' is usually sufficient.

    However the introduction of 'replacement' devices has made these
    transitions sometimes more important. For example the 'Replacement'
    flag isn't cleared until the original device is removed, so we need
    to ensure a metadata update after that 'spare' is removed.

    So set MD_CHANGE_DEVS whenever a spare is activated or removed, to
    complement the current situation where it is set when a spare is added
    or a device is failed (or a number of other less common situations).

    This is suitable for -stable as out-of-date metadata could lead
    to data corruption.
    This is only relevant for 3.3 and later, when 'replacement' was
    introduced.

    Cc: stable@vger.kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     
  • When a replacement device becomes active, we mark the device that it
    replaces as 'faulty' so that it can subsequently get removed.
    However 'calc_degraded' only pays attention to the primary device, not
    the replacement, so the array appears to become degraded, which is
    wrong.

    So teach 'calc_degraded' to consider any replacement if a primary
    device is faulty.

    This is suitable for -stable as an incorrect 'degraded' value can
    confuse md and could lead to data corruption.
    This is only relevant for 3.3 and later.

    Cc: stable@vger.kernel.org
    Reported-by: Robin Hill
    Reported-by: John Drescher
    Signed-off-by: NeilBrown

    NeilBrown
     
  • This reverts commit 895e3c5c58a80bb9e4e05d9ac38b4f30e0f97d80.

    While this patch seemed like a good idea and did help some workloads,
    it hurts other workloads.
    Large sequential O_DIRECT writes were faster;
    small random O_DIRECT writes were slower.

    Other changes (batching RAID5 writes) have improved the sequential
    writes using a different mechanism, so the net result of this patch
    is definitely negative. So revert it.

    Reported-by: Shaohua Li
    Tested-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    NeilBrown
     

21 Aug, 2012

1 commit

  • flush[_delayed]_work_sync() are now spurious. Mark them deprecated
    and convert all users to flush[_delayed]_work().

    If you're cc'd and wondering what's going on: Now all workqueues are
    non-reentrant and the regular flushes guarantee that the work item is
    not pending or running on any CPU on return, so there's no reason to
    use the sync flushes at all and they're going away.

    This patch doesn't make any functional difference.

    Signed-off-by: Tejun Heo
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Ian Campbell
    Cc: Jens Axboe
    Cc: Mattia Dongili
    Cc: Kent Yoder
    Cc: David Airlie
    Cc: Jiri Kosina
    Cc: Karsten Keil
    Cc: Bryan Wu
    Cc: Benjamin Herrenschmidt
    Cc: Alasdair Kergon
    Cc: Mauro Carvalho Chehab
    Cc: Florian Tobias Schandinat
    Cc: David Woodhouse
    Cc: "David S. Miller"
    Cc: linux-wireless@vger.kernel.org
    Cc: Anton Vorontsov
    Cc: Sangbeom Kim
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: Eric Van Hensbergen
    Cc: Takashi Iwai
    Cc: Steven Whitehouse
    Cc: Petr Vandrovec
    Cc: Mark Fasheh
    Cc: Christoph Hellwig
    Cc: Avi Kivity

    Tejun Heo
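
    The conversion asked of the cc'd maintainers is mechanical; a minimal
    sketch (the my_* names are invented):

        #include <linux/workqueue.h>

        static void my_work_fn(struct work_struct *work)
        {
                /* work body elided */
        }

        static DECLARE_WORK(my_work, my_work_fn);
        static DECLARE_DELAYED_WORK(my_dwork, my_work_fn);

        static void my_teardown(void)
        {
                /* was: flush_work_sync(&my_work);
                 *      flush_delayed_work_sync(&my_dwork); */
                flush_work(&my_work);           /* now just as strong */
                flush_delayed_work(&my_dwork);
        }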
     

18 Aug, 2012

1 commit

  • A 'struct r10bio' has an array of per-copy information at the end.
    This array is declared with size [0] and r10bio_pool_alloc allocates
    enough extra space to store the per-copy information depending on the
    number of copies needed.

    So declaring a 'struct r10bio' on the stack isn't going to work. It
    won't allocate enough space, and memory corruption will ensue.

    So in the two places where this is done, declare a sufficiently large
    structure and use that instead.

    The two call-sites of this bug were introduced in 3.4 and 3.5
    so this is suitable for both those kernels. The patch will have to
    be modified for 3.4 as it only has one bug.

    Cc: stable@vger.kernel.org
    Reported-by: Ivan Vasilyev
    Tested-by: Ivan Vasilyev
    Signed-off-by: NeilBrown

    NeilBrown
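
    A generic illustration of the pitfall and the fix (my_bio stands in for
    'struct r10bio'; the names and the MY_MAX_COPIES bound are invented):

        struct my_copy_info {
                int devnum;
        };

        struct my_bio {
                int ncopies;
                struct my_copy_info devs[0];    /* sized at allocation time */
        };

        #define MY_MAX_COPIES 8                 /* made-up upper bound */

        static void my_on_stack_use(void)
        {
                /* BROKEN: "struct my_bio b;" leaves no room for b.devs[] */
                struct {
                        struct my_bio bio;
                        struct my_copy_info devs[MY_MAX_COPIES];
                } on_stack;

                on_stack.bio.ncopies = MY_MAX_COPIES;
                on_stack.bio.devs[0].devnum = 0; /* backed by on_stack.devs[] */
        }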
     

16 Aug, 2012

1 commit

  • commit 27a7b260f71439c40546b43588448faac01adb93
    md: Fix handling for devices from 2TB to 4TB in 0.90 metadata.

    changed 0.90 metadata handling to truncate the size to 4TB, as that is
    all that 0.90 can record.
    However for RAID0 and Linear, 0.90 doesn't need to record the size, so
    this truncation is not needed and causes working arrays to become too small.

    So avoid the truncation for RAID0 and Linear.

    This bug was introduced in 3.1 and is suitable for any stable kernels
    from then onwards.
    As the offending commit was tagged for 'stable', any stable kernel
    that it was applied to should also get this patch. That includes
    at least 2.6.32, 2.6.33 and 3.0. (Thanks to Ben Hutchings for
    providing that list).

    Cc: stable@vger.kernel.org
    Signed-off-by: Neil Brown

    NeilBrown
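
    The rule being restored, as a hedged sketch (not the md.c code; the
    helper name is invented, and -1 stands for LEVEL_LINEAR):

        #include <linux/types.h>

        static u64 my_090_cap_size(int level, u64 sectors)
        {
                const u64 four_tb_sectors = 4ULL * 1024 * 1024 * 1024 * 2;

                if (level == 0 /* RAID0 */ || level == -1 /* LINEAR */)
                        return sectors; /* size is never recorded: don't cap */
                return sectors > four_tb_sectors ? four_tb_sectors : sectors;
        }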
     

03 Aug, 2012

1 commit

  • Pull additional md update from NeilBrown:
    "This contains a few patches that depend on plugging changes in the
    block layer so needed to wait for those.

    It also contains a Kconfig fix for the new RAID10 support in dm-raid."

    * tag 'md-3.6' of git://neil.brown.name/md:
    md/dm-raid: DM_RAID should select MD_RAID10
    md/raid1: submit IO from originating thread instead of md thread.
    raid5: raid5d handle stripe in batch way
    raid5: make_request use batch stripe release

    Linus Torvalds
     

02 Aug, 2012

6 commits

  • Now that DM_RAID supports raid10, it needs to select that code
    to ensure it is included.

    Cc: Jonathan Brassow
    Reported-by: Fengguang Wu
    Signed-off-by: NeilBrown

    NeilBrown
     
  • Queuing writes to the md thread means that all requests go through the
    one processor, which may not be able to keep up with very high request
    rates.

    So use the plugging infrastructure to submit all requests on unplug.
    If a 'schedule' is needed, we fall back on the old approach of handing
    the requests to the thread for it to handle.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Let raid5d handle stripes in a batched way to reduce conf->device_lock locking.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • make_request() does a stripe release for every stripe, and the stripe
    usually has count 1, which makes the previous release_stripe()
    optimization not work. In my test, this release_stripe() becomes the
    heaviest place to take conf->device_lock after the previous patches are
    applied.

    This patch batches stripe release: all the stripes are released on
    unplug. The STRIPE_ON_UNPLUG_LIST bit protects concurrent access to the
    stripe lru.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • Pull block driver changes from Jens Axboe:

    - Making the plugging support for drivers a bit more sane from Neil.
    This supersedes the plugging change from Shaohua as well.

    - The usual round of drbd updates.

    - Using a tail add instead of a head add in the request completion for
    nbd, making us find the most completed request more quickly.

    - A few floppy changes, getting rid of a duplicated flag and also
    running the floppy init async (since it takes forever in boot terms)
    from Andi.

    * 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
    floppy: remove duplicated flag FD_RAW_NEED_DISK
    blk: pass from_schedule to non-request unplug functions.
    block: stack unplug
    blk: centralize non-request unplug handling.
    md: remove plug_cnt feature of plugging.
    block/nbd: micro-optimization in nbd request completion
    drbd: announce FLUSH/FUA capability to upper layers
    drbd: fix max_bio_size to be unsigned
    drbd: flush drbd work queue before invalidate/invalidate remote
    drbd: fix potential access after free
    drbd: call local-io-error handler early
    drbd: do not reset rs_pending_cnt too early
    drbd: reset congestion information before reporting it in /proc/drbd
    drbd: report congestion if we are waiting for some userland callback
    drbd: differentiate between normal and forced detach
    drbd: cleanup, remove two unused global flags
    floppy: Run floppy initialization asynchronous

    Linus Torvalds
     
  • Pull md updates from NeilBrown.

    * 'for-next' of git://neil.brown.name/md:
    DM RAID: Add support for MD RAID10
    md/RAID1: Add missing case for attempting to repair known bad blocks.
    md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE.
    md/raid1: don't abort a resync on the first badblock.
    md: remove duplicated test on ->openers when calling do_md_stop()
    raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer
    md/raid1: prevent merging too large request
    md/raid1: read balance chooses idlest disk for SSD
    md/raid1: make sequential read detection per disk based
    MD RAID10: Export md_raid10_congested
    MD: Move macros from raid1*.h to raid1*.c
    MD RAID1: rename mirror_info structure
    MD RAID10: rename mirror_info structure
    MD RAID10: Fix compiler warning.
    raid5: add a per-stripe lock
    raid5: remove unnecessary bitmap write optimization
    raid5: lockless access raid5 overrided bi_phys_segments
    raid5: reduce chance release_stripe() taking device_lock

    Linus Torvalds
     

01 Aug, 2012

2 commits


31 Jul, 2012

13 commits

  • This will allow md/raid to know why the unplug was called, and it will
    be able to act accordingly: if !from_schedule it is safe to perform
    tasks which could themselves schedule.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
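
    What a callback gains from the new argument, as a sketch (my_unplug is
    invented; only the (struct blk_plug_cb *, bool) callback shape comes
    from this series):

        #include <linux/blkdev.h>

        static void my_unplug(struct blk_plug_cb *cb, bool from_schedule)
        {
                if (from_schedule) {
                        /* called from inside the scheduler:
                         * punt the work to a helper thread */
                        return;
                }
                /* normal process context: safe to submit IO directly,
                 * even if that might itself schedule */
        }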
     
  • Both md and umem have similar code for getting notified on a
    blk_finish_plug event.
    Centralize this code in block/ and allow each driver to
    provide its distinctive difference.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     
  • This seemed like a good idea at the time, but after further thought I
    cannot see it making a difference other than very occasionally and
    testing to try to exercise the case it is most likely to help did not
    show any performance difference by removing it.

    So remove the counting of active plugs and allow 'pending writes' to
    be activated at any time, not just when no plugs are active.

    This is only relevant when there is a write-intent bitmap, and the
    updating of the bitmap will likely introduce enough delay that
    the single-threading of bitmap updates will be enough to collect large
    numbers of updates together.

    Removing this will make it easier to centralise the unplug code, and
    will clear the way for other unplug enhancements which have a
    measurable effect.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     
  • When doing resync or repair, attempt to correct bad blocks, according
    to the WriteErrorSeen policy.

    Signed-off-by: Alex Lyakas
    Signed-off-by: NeilBrown

    Alexander Lyakas
     
  • Merge Andrew's first set of patches:
    "Non-MM patches:

    - lots of misc bits

    - tree-wide have_clk() cleanups

    - quite a lot of printk tweaks. I draw your attention to "printk:
    convert the format for KERN_<LEVEL> to a 2 byte pattern" which
    looks a bit scary. But afaict it's solid.

    - backlight updates

    - lib/ feature work (notably the addition and use of memweight())

    - checkpatch updates

    - rtc updates

    - nilfs updates

    - fatfs updates (partial, still waiting for acks)

    - kdump, proc, fork, IPC, sysctl, taskstats, pps, etc

    - new fault-injection feature work"

    * Merge emailed patches from Andrew Morton: (128 commits)
    drivers/misc/lkdtm.c: fix missing allocation failure check
    lib/scatterlist: do not re-write gfp_flags in __sg_alloc_table()
    fault-injection: add tool to run command with failslab or fail_page_alloc
    fault-injection: add selftests for cpu and memory hotplug
    powerpc: pSeries reconfig notifier error injection module
    memory: memory notifier error injection module
    PM: PM notifier error injection module
    cpu: rewrite cpu-notifier-error-inject module
    fault-injection: notifier error injection
    c/r: fcntl: add F_GETOWNER_UIDS option
    resource: make sure requested range is included in the root range
    include/linux/aio.h: cpp->C conversions
    fs: cachefiles: add support for large files in filesystem caching
    pps: return PTR_ERR on error in device_create
    taskstats: check nla_reserve() return
    sysctl: suppress kmemleak messages
    ipc: use Kconfig options for __ARCH_WANT_[COMPAT_]IPC_PARSE_VERSION
    ipc: compat: use signed size_t types for msgsnd and msgrcv
    ipc: allow compat IPC version field parsing if !ARCH_WANT_OLD_COMPAT_IPC
    ipc: add COMPAT_SHMLBA support
    ...

    Linus Torvalds
     
  • Use memweight() to count the total number of bits set in a memory area.

    Signed-off-by: Akinobu Mita
    Cc: Alasdair Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
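
    A hedged sketch of the kind of open-coded loop memweight() replaces
    (the wrapper and its arguments are invented):

        #include <linux/string.h>
        #include <linux/types.h>

        static size_t my_count_set_bits(const void *buf, size_t nbytes)
        {
                /*
                 * open-coded equivalent:
                 *
                 *      size_t i, n = 0;
                 *      for (i = 0; i < nbytes; i++)
                 *              n += hweight8(((const u8 *)buf)[i]);
                 *      return n;
                 */
                return memweight(buf, nbytes);
        }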
     
  • 'sync' writes set both REQ_SYNC and REQ_NOIDLE.
    O_DIRECT writes set REQ_SYNC but not REQ_NOIDLE.

    We currently assume that a REQ_SYNC request will not be followed by
    more requests and so set STRIPE_PREREAD_ACTIVE to expedite the
    request.
    This is appropriate for sync requests, but not for O_DIRECT requests.

    So make the setting of STRIPE_PREREAD_ACTIVE conditional on REQ_NOIDLE
    rather than REQ_SYNC. This is consistent with the documented meaning
    of REQ_NOIDLE:

    __REQ_NOIDLE, /* don't anticipate more IO after this one */

    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    majianpeng
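
    Sketch of the changed condition only (the helper is invented; rw_flags
    stands in for bio->bi_rw):

        #include <linux/blk_types.h>

        static bool my_expedite_stripe(unsigned long rw_flags)
        {
                /* before: rw_flags & REQ_SYNC -- also set by O_DIRECT writes
                 * after:  only "don't anticipate more IO" requests qualify */
                return rw_flags & REQ_NOIDLE;
        }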
     
  • If a resync of a RAID1 array with 2 devices finds a known bad block on
    one device, it will neither read from, nor write to, that device for
    this block offset.
    So there will be one read_target (the other device) and zero write
    targets.
    This condition causes md/raid1 to abort the resync assuming that it
    has finished - without known bad blocks this would be true.

    When there are no write targets because of the presence of bad blocks
    we should only skip over the area covered by the bad block.
    RAID10 already gets this right, raid1 doesn't. Or didn't.

    As this can cause a 'sync' to abort early and appear to have succeeded,
    it could lead to some data corruption, so it is suitable for -stable.

    Cc: stable@vger.kernel.org
    Reported-by: Alexander Lyakas
    Signed-off-by: NeilBrown

    NeilBrown
     
  • do_md_stop tests mddev->openers while holding ->open_mutex,
    and fails if this count is too high.
    So callers do not need to check mddev->openers and doing so isn't
    very meaningful as they don't hold ->open_mutex so the number could
    change.

    So remove the unnecessary tests on mddev->openers.
    These are not called often enough for there to be any gain in
    an early test on ->open_mutex to avoid the need for a slightly more
    costly mutex_lock call.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Because bios are merged at the block layer, a bio error may be caused by
    another bio that was merged into the same request.
    Using this flag, we can find the exact error sector and avoid redundant
    operations like re-writing and re-reading.

    V0->V1: Use REQ_FLUSH instead of REQ_NOMERGE to avoid bio merging at the
    block layer.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    majianpeng
     
  • For SSD, if the request size exceeds a specific value (the optimal io
    size), request size isn't important for bandwidth. In such a condition,
    if making the request size bigger will leave some disks idle, the total
    throughput will actually drop. A good example is doing readahead in a
    two-disk raid1 setup.

    So when should we split big requests? We absolutely don't want to split
    a big request into very small requests. Even on SSD, a big request
    transfer is more efficient. This patch only considers requests with a
    size above the optimal io size.

    If all disks are busy, is it worth doing a split? Say the optimal io
    size is 16k, with two 32k requests and two disks. We can let each disk
    run one 32k request, or split the requests into four 16k requests so
    that each disk runs two. It's hard to say which case is better; it
    depends on the hardware.

    So only consider the case where there are idle disks. For readahead,
    splitting is always better in this case. And in my test, the patch
    below can improve throughput by > 30%. Hmm, not 100%, because the disks
    aren't 100% busy.

    Such a case can happen not just in readahead but also, for example, in
    directio. But I suppose directio usually has a bigger IO depth and
    makes all disks busy, so I ignored it.

    Note: if the raid uses any hard disk, we don't prevent merging. That
    would make performance worse.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • An SSD has no spindle, so the distance between requests means nothing.
    And the original distance-based algorithm can sometimes cause severe
    performance issues for SSD raid.

    Consider two thread groups: one accesses file A, the other accesses
    file B. The first group will access one disk and the second will access
    the other disk, because requests are near within one group and far
    between groups. In this case, read balance might keep one disk very
    busy while the other is relatively idle. For SSD, we should try our
    best to distribute requests to as many disks as possible. There is no
    spindle move penalty anyway.

    With the patch below, I can sometimes see more than 50% throughput
    improvement, depending on the workload.

    The only exception is small requests that can be merged into a big
    request, which typically drives higher throughput for SSD too. Such
    small requests are sequential reads. Unlike on hard disk, a sequential
    read which can't be merged (for example direct IO, or a read without
    readahead) can be ignored for SSD; again there is no spindle move
    penalty. Readahead dispatches small requests and such requests can be
    merged.

    The last patch helps detect sequential reads well, at least if the
    number of concurrent reads isn't greater than the number of raid disks.
    In that case, the distance-based algorithm doesn't work well either.

    V2: For a mixed hard disk and SSD raid, don't use the distance-based
    algorithm for random IO either. This makes the algorithm generic for
    raid with SSD.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • Currently the sequential read detection is global. It's natural to make
    it per-disk based, which can improve the detection of multiple
    concurrent sequential reads. And the next patch will make SSD read
    balance not use the distance-based algorithm, where this change helps
    detect truly sequential reads for SSD.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li