29 Dec, 2008

1 commit

  • Instead of having a global bio slab cache, add a reference to one
    in each bio_set that is created. This allows for personalized slabs
    in each bio_set, so that they can have bios of different sizes.

    This means we can personalize the bios we return. File systems may
    want to embed the bio inside another structure, to avoid allocating
    more items (and stuffing them into ->bi_private) after they get a bio.
    Or we may want to embed a number of bio_vecs directly at the end
    of a bio, to avoid doing two allocations to return a bio. This is now
    possible.

    Signed-off-by: Jens Axboe

    Jens Axboe
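    A minimal sketch of what this enables, assuming a bio_set created with
    enough front padding for the embedding structure; the struct and helper
    below are illustrative, not taken from the patch:

        #include <linux/bio.h>

        /* One allocation yields both the private state and the bio. */
        struct my_io {
                void            *private_data;  /* no need for ->bi_private */
                struct bio      bio;            /* must be the last member */
        };

        static struct my_io *my_io_alloc(struct bio_set *bs, int nr_vecs)
        {
                struct bio *bio = bio_alloc_bioset(GFP_NOIO, nr_vecs, bs);

                if (!bio)
                        return NULL;
                return container_of(bio, struct my_io, bio);
        }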
     

25 Dec, 2008

1 commit


19 Dec, 2008

1 commit

  • When we read the write-intent-bitmap off the device, we currently
    read a whole number of pages.
    When PAGE_SIZE is 4K, this works due to the alignment we enforce
    on the superblock and bitmap.
    When PAGE_SIZE is 64K, this can read past the end of the device,
    which causes an error.

    When we write the superblock, we make sure to clip the last page
    to just the required size. Copy that code into the read path so it
    reads only the required number of sectors.

    Signed-off-by: Neil Brown
    Cc: stable@kernel.org

    NeilBrown
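    A minimal sketch of the clipping idea; the variable names are
    illustrative rather than md's actual bitmap fields:

        /* Read whole pages of the bitmap, but clip the final page so the
         * transfer covers only the sectors the bitmap actually occupies. */
        unsigned long count = PAGE_SIZE;

        if (index == last_page_index)
                count = roundup(bitmap_bytes - index * PAGE_SIZE, 512);
        /* ...issue a read of 'count' bytes instead of a full page... */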
     

05 Dec, 2008

1 commit


03 Dec, 2008

1 commit

  • Fix setting of max_segment_size and seg_boundary mask for stacked md/dm
    devices.

    When stacking devices (LVM over MD over SCSI), some of the request queue
    parameters are not set up correctly in some cases by default, namely
    max_segment_size and the seg_boundary mask.

    If you create an MD device over SCSI, these attributes are zeroed.

    The problem appears when a further device-mapper mapping is stacked on
    top of this one - the queue attributes are then set up in DM this way:

    request_queue   max_segment_size   seg_boundary_mask
    SCSI                       65536          0xffffffff
    MD RAID1                       0                   0
    LVM                        65536          -1 (64bit)

    Unfortunately bio_add_page() (resp. bio_phys_segments()) calculates the
    number of physical segments according to these parameters.

    During generic_make_request() the segment count is recalculated and can
    increase bio->bi_phys_segments above the allowed limit (after
    bio_clone() in the stacking operation).

    This is especially a problem in the CCISS driver, where it produces an
    oops here:

    BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);

    (MAXSGENTRIES is 31 by default.)

    Sometimes even this command is enough to cause an oops:

    dd iflag=direct if=/dev// of=/dev/null bs=128000 count=10

    This command generates bios with 250 sectors, allocated in 32 4k-pages
    (last page uses only 1024 bytes).

    At the LVM layer, the bio is allocated with 31 segments (still OK for
    CCISS); unfortunately, at the lower layer it is recalculated to 32
    segments, which violates the CCISS restriction and triggers the BUG_ON().

    The patch tries to fix it by:

    * initializing the attributes above in the queue request constructor,
    blk_queue_make_request()

    * making sure that blk_queue_stack_limits() inherits these settings

    (DM uses its own function to set the limits because
    blk_queue_stack_limits() was introduced later. It should probably
    switch to the generic stack limits function too.)

    * setting the default seg_boundary value in one place (blkdev.h)

    * using this mask as the default in DM (instead of -1, which differs
    on 64-bit)

    Bugs related to this:
    https://bugzilla.redhat.com/show_bug.cgi?id=471639
    http://bugzilla.kernel.org/show_bug.cgi?id=8672

    Signed-off-by: Milan Broz
    Reviewed-by: Alasdair G Kergon
    Cc: Neil Brown
    Cc: FUJITA Tomonori
    Cc: Tejun Heo
    Cc: Mike Miller
    Signed-off-by: Jens Axboe

    Milan Broz
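    A minimal sketch of the default-initialisation part of the fix; it
    assumes the MAX_SEGMENT_SIZE constant already in blkdev.h and the
    BLK_SEG_BOUNDARY_MASK value the patch adds there:

        /* Give every newly constructed queue non-zero limits so that
         * stacking drivers never inherit zeroes from an unset queue;
         * blk_queue_make_request() would do this once per queue. */
        blk_queue_max_segment_size(q, MAX_SEGMENT_SIZE);
        blk_queue_segment_boundary(q, BLK_SEG_BOUNDARY_MASK);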
     

26 Nov, 2008

2 commits

  • Port to the new tracepoints API: split DEFINE_TRACE() and DECLARE_TRACE()
    sites. Spread them out to the usage sites, as suggested by
    Mathieu Desnoyers.

    Signed-off-by: Ingo Molnar
    Acked-by: Mathieu Desnoyers

    Ingo Molnar
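    A minimal sketch of the split with an illustrative block tracepoint; the
    prototype wrapper macros changed spelling across kernel versions
    (TPPROTO/TPARGS vs. TP_PROTO/TP_ARGS), so the exact form here is an
    assumption:

        /* In a header (e.g. include/trace/block.h): declared once for all users. */
        DECLARE_TRACE(block_bio_queue,
                TP_PROTO(struct request_queue *q, struct bio *bio),
                TP_ARGS(q, bio));

        /* In exactly one .c file near the usage site: the definition. */
        DEFINE_TRACE(block_bio_queue);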
     
    This was a forward port of work done by Mathieu Desnoyers; I changed it
    to encode the 'what' parameter in the tracepoint name, so that one can
    register interest in specific events rather than registering for a class
    of events and then checking the 'what' parameter.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: Jens Axboe
    Signed-off-by: Ingo Molnar

    Arnaldo Carvalho de Melo
     

14 Nov, 2008

6 commits

    dm_any_congested() just checks for the DMF_BLOCK_IO flag and has no
    code to make sure that suspend waits for dm_any_congested() to
    complete. This patch adds such a check.

    Without it, a race can occur with dm_table_put() attempting to
    destroy the table in the wrong thread - the one running
    dm_any_congested(), which is meant to be quick and return
    immediately.

    Two examples of problems:
    1. Sleeping functions called from congested code, the caller
    of which holds a spin lock.
    2. An ABBA deadlock between pdflush and multipathd. The two locks
    in contention are inode lock and kernel lock.

    Signed-off-by: Chandra Seetharaman
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Chandra Seetharaman
     
  • This doesn't fix any bug, just moves wake_up immediately after decrementing
    md->pending, for better code readability.

    It must be clear to anyone manipulating md->pending that the queue
    needs to be woken when md->pending reaches zero, so move the wakeup as
    close to the decrement as possible.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
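    A minimal sketch of the pairing this makes obvious (md->pending and
    md->wait as in dm's struct mapped_device; condensed, not the literal
    patch):

        if (atomic_dec_and_test(&md->pending))
                /* the last outstanding I/O is done: wake dm_suspend() */
                wake_up(&md->wait);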
     
  • Currently dm ignores the parameters provided to hardware handlers
    without providing any notifications to the user.

    This patch just prints a warning message so that the user knows that
    the arguments are ignored.

    Signed-off-by: Chandra Seetharaman
    Signed-off-by: Alasdair G Kergon

    Chandra Seetharaman
     
    Path activation code is called even when the pgpath is NULL. This can
    lead to a panic in activate_path(); such a panic has been seen on the
    -rt kernel.

    This problem existed before pg_init() was moved to a workqueue.

    Signed-off-by: Chandra Seetharaman
    Signed-off-by: Alasdair G Kergon

    Chandra Seetharaman
     
  • Don't proceed if dm_stripe_init() fails to register itself as a dm target.

    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Alasdair G Kergon

    Heinz Mauelshagen
     
    We queue work on the keventd queue, so that queue must be flushed in the
    destructor. Otherwise, keventd could access mirror_set after it has been
    freed.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon
    Cc: stable@kernel.org

    Mikulas Patocka
     

06 Nov, 2008

3 commits

  • We currently oops with a divide error on starting a linear software
    raid array consisting of at least two very small (< 500K) devices.

    The bug is caused by the calculation of the hash table size which
    tries to compute sector_div(sz, base) with "base" being zero due to
    the small size of the component devices of the array.

    Fix this by requiring the hash spacing to be at least one, which
    ensures that "base" is also non-zero.

    This bug has existed since about 2.6.14.

    Cc: stable@kernel.org
    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • Adding a spare to a raid10 doesn't cause recovery to start.
    This is due to a silly typo in
    commit 6c2fce2ef6b4821c21b5c42c7207cb9cf8c87eda
    and so is a bug in 2.6.27 and 2.6.28-rc.

    Thanks to Thomas Backlund for bisecting to find this.

    Cc: Thomas Backlund
    Cc: stable@kernel.org

    Signed-off-by: NeilBrown

    NeilBrown
     
  • It turns out that it is only safe to call blkdev_ioctl when the device
    is actually open (as ->bd_disk is set to NULL on last close). And it
    is quite possible for do_md_stop to be called when the device is not
    open. So discard the call to blkdev_ioctl(BLKRRPART) which was
    added in
    commit 934d9c23b4c7e31840a895ba4b7e88d6413c81f3

    It is just as easy to call this ioctl from userspace when needed (on
    mdadm -S), so leave it out of the kernel.

    Signed-off-by: NeilBrown

    NeilBrown
     

31 Oct, 2008

1 commit


30 Oct, 2008

3 commits

  • If there are several snapshots sharing an origin and one is removed
    while the origin is being written to, the snapshot's mempool may get
    deleted while elements are still referenced.

    Prior to dm-snapshot-use-per-device-mempools.patch the pending
    exceptions may still have been referenced after the snapshot was
    destroyed, but this was not a problem because the shared mempool
    was still there.

    This patch fixes the problem by tracking the number of mempool elements
    in use.

    The scenario:
    - You have an origin and two snapshots 1 and 2.
    - Someone writes to the origin.
    - This creates two exceptions in the snapshots; snapshot 1's will be the
    primary exception, and snapshot 2's pending_exception->primary_pe will
    point to the exception in snapshot 1.
    - The exceptions are being relocated; relocation of exception 1 finishes
    (but its pending_exception is still allocated, because it is referenced
    by an exception from snapshot 2).
    - The user lvremoves snapshot 1 --- it just calls suspend (which does
    nothing) and the destructor. md->pending is zero (there is no I/O
    submitted to the snapshot by the md layer), so it won't help us.
    - The destructor waits for kcopyd jobs to finish on snapshot 1 --- but
    there are none.
    - The destructor on snapshot 1 cleans up everything.
    - The relocation of the exception on snapshot 2 finishes and drops its
    reference on primary_pe, freeing it. primary_pe points to the pending
    exception created for snapshot 1, so the memory is freed into a mempool
    that no longer exists.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • register_snapshot() performs a GFP_KERNEL allocation while holding
    _origins_lock for write, but that could write out dirty pages onto a
    device that attempts to acquire _origins_lock for read, resulting in
    deadlock.

    So move the allocation up before taking the lock.

    This path is not performance-critical, so it doesn't matter that we
    allocate memory and free it if we find that we won't need it.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
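    A minimal sketch of the reordering; the lock follows dm-snap's
    _origins_lock, while the allocation and its use are illustrative:

        /* Allocate with GFP_KERNEL *before* taking the lock: the allocation
         * may trigger writeback, and that writeback must never need the
         * lock we would otherwise already be holding. */
        o = kmalloc(sizeof(*o), GFP_KERNEL);
        if (!o)
                return -ENOMEM;

        down_write(&_origins_lock);
        /* ... register the snapshot, consuming 'o' only if needed ... */
        up_write(&_origins_lock);

        if (!o_was_used)
                kfree(o);    /* cheap: this path is not performance-critical */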
     
  • Missing braces. Commit 1f965b1943 (dm raid1: separate region_hash interface
    part1) broke it.

    Signed-off-by: Ilpo Jarvinen
    Signed-off-by: Alasdair G Kergon
    Cc: Heinz Mauelshagen

    Ilpo Jarvinen
     

28 Oct, 2008

1 commit

  • md arrays are not currently destroyed when they are stopped - they
    remain in /sys/block. Last time I tried this I tripped over locking
    too much.

    A consequence of this is that udev doesn't remove anything from /dev.
    This is rather ugly.

    As an interim measure until proper device removal can be achieved,
    make sure all partitions are removed using the BLKRRPART ioctl, and
    send a KOBJ_CHANGE when an md array is stopped.

    Signed-off-by: NeilBrown

    NeilBrown
     

27 Oct, 2008

1 commit


24 Oct, 2008

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits)
    [PATCH] kill the rest of struct file propagation in block ioctls
    [PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET
    [PATCH] get rid of blkdev_locked_ioctl()
    [PATCH] get rid of blkdev_driver_ioctl()
    [PATCH] sanitize blkdev_get() and friends
    [PATCH] remember mode of reiserfs journal
    [PATCH] propagate mode through swsusp_close()
    [PATCH] propagate mode through open_bdev_excl/close_bdev_excl
    [PATCH] pass fmode_t to blkdev_put()
    [PATCH] kill the unused bsize on the send side of /dev/loop
    [PATCH] trim file propagation in block/compat_ioctl.c
    [PATCH] end of methods switch: remove the old ones
    [PATCH] switch sr
    [PATCH] switch sd
    [PATCH] switch ide-scsi
    [PATCH] switch tape_block
    [PATCH] switch dcssblk
    [PATCH] switch dasd
    [PATCH] switch mtd_blkdevs
    [PATCH] switch mmc
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (46 commits)
    [PATCH] fs: add a sanity check in d_free
    [PATCH] i_version: remount support
    [patch] vfs: make security_inode_setattr() calling consistent
    [patch 1/3] FS_MBCACHE: don't needlessly make it built-in
    [PATCH] move executable checking into ->permission()
    [PATCH] fs/dcache.c: update comment of d_validate()
    [RFC PATCH] touch_mnt_namespace when the mount flags change
    [PATCH] reiserfs: add missing llseek method
    [PATCH] fix ->llseek for more directories
    [PATCH vfs-2.6 6/6] vfs: add LOOKUP_RENAME_TARGET intent
    [PATCH vfs-2.6 5/6] vfs: remove LOOKUP_PARENT from non LOOKUP_PARENT lookup
    [PATCH vfs-2.6 4/6] vfs: remove unnecessary fsnotify_d_instantiate()
    [PATCH vfs-2.6 3/6] vfs: add __d_instantiate() helper
    [PATCH vfs-2.6 2/6] vfs: add d_ancestor()
    [PATCH vfs-2.6 1/6] vfs: replace parent == dentry->d_parent by IS_ROOT()
    [PATCH] get rid of on-stack dentry in udf
    [PATCH 2/2] anondev: switch to IDA
    [PATCH 1/2] anondev: init IDR statically
    [JFFS2] Use d_splice_alias() not d_add() in jffs2_lookup()
    [PATCH] Optimise NFS readdir hack slightly.
    ...

    Linus Torvalds
     

23 Oct, 2008

1 commit


22 Oct, 2008

14 commits

  • This patch tidies local_init() in preparation for request-based dm.
    No functional change.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • This patch removes the DM_WQ_FLUSH_ALL state, which is unnecessary.

    The dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL) in dm_suspend()
    is never invoked because:
    - 'goto flush_and_out' is the same as 'goto out' because
    the 'goto flush_and_out' is called only when '!noflush'
    - If r is non-zero, then the code above will invoke 'goto out'
    and skip this code.

    No functional change.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • Separate the region hash code from raid1 so it can be shared by forthcoming
    targets. Use BUG_ON() for failed async dm_io() calls.

    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Alasdair G Kergon

    Heinz Mauelshagen
     
  • When a bio gets split, mark its fragments with the BIO_CLONED flag.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Alasdair G Kergon

    Martin K. Petersen
     
  • Remove waitqueue no longer needed with the async crypto interface.

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • When writing io, dm-crypt has to allocate a new cloned bio
    and encrypt the data into newly-allocated pages attached to this bio.
    In rare cases, because of hw restrictions (e.g. the physical segment
    limit) or memory pressure, more than one cloned bio has to be used,
    each processing a different fragment of the original.

    Currently there is one waitqueue which waits for one fragment to finish
    and continues processing the next fragment.

    But when using asynchronous crypto this doesn't work, because several
    fragments may be processed asynchronously or in parallel and there is
    only one crypt context that cannot be shared between the bio fragments.
    The result may be corruption of the data contained in the encrypted bio.

    The patch fixes this by allocating new dm_crypt_io structs (with new
    crypto contexts) and running them independently.

    Each fragment contains a pointer to the base dm_crypt_io struct to
    handle reference counting, so the base one is properly deallocated
    after all the fragments have finished.

    In a low-memory situation, this only uses one additional object from the
    mempool. If the mempool is empty, the next allocation simply waits for
    previous fragments to complete.

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
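    A minimal sketch of the reference-counting scheme described above; the
    struct and helper are illustrative, loosely modelled on dm_crypt_io
    rather than copied from the patch:

        struct frag_io {
                atomic_t         pending;   /* outstanding work on this io  */
                struct frag_io  *base_io;   /* NULL for the base io itself  */
        };

        static void frag_io_put(struct frag_io *io)
        {
                if (!atomic_dec_and_test(&io->pending))
                        return;
                if (io->base_io)
                        frag_io_put(io->base_io); /* drop our ref on the base */
                /* last reference gone: end the bio and free 'io' here */
        }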
     
  • Prepare local sector variable (offset) for later patch.
    Do not update io->sector for still-running I/O.

    No functional change.

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • Change #include "dm.h" to #include <linux/device-mapper.h> in all targets.
    Targets should not need direct access to internal DM structures.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Move array_too_big to include/linux/device-mapper.h because it is
    used by targets.

    Remove the test from dm-raid1 as the number of mirror legs is limited
    such that it can never fail. (Even for stripes it seems rather
    unlikely.)

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • We must zero the next chunk on disk *before* writing out the current chunk, not
    after. Otherwise if the machine crashes at the wrong time, the "end of metadata"
    marker may be missing.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon
    Cc: stable@kernel.org

    Mikulas Patocka
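    A minimal sketch of the required ordering; the helper names are
    illustrative, not the exception store's actual functions:

        /* The zeroed next area is the "end of metadata" marker, so it must
         * reach the disk before the current area is filled in. */
        r = zero_area(ps, ps->current_area + 1);      /* marker first     */
        if (!r)
                r = write_area(ps, ps->current_area); /* then the payload */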
     
  • Use a separate buffer for writing zeroes to the on-disk snapshot
    exception store, make the updating of ps->current_area explicit and
    refactor the code in preparation for the fix in the next patch.

    No functional change.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Mikulas Patocka
    Cc: stable@kernel.org

    Alasdair G Kergon
     
  • The last_percent field is unused - remove it.
    (It dates from when events were triggered as each X% filled up.)

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Fix a race condition with primary_pe ref_count handling.

    put_pending_exception runs under dm_snapshot->lock; it does
    atomic_dec_and_test on primary_pe->ref_count, and later does atomic_read
    on primary_pe->ref_count.

    __origin_write does atomic_dec_and_test on primary_pe->ref_count without
    holding dm_snapshot->lock.

    This opens the following race condition:
    Assume two CPUs, CPU1 is executing put_pending_exception (and holding
    dm_snapshot->lock). CPU2 is executing __origin_write in parallel.
    primary_pe->ref_count == 2.

    CPU1:
        if (primary_pe && atomic_dec_and_test(&primary_pe->ref_count))
                origin_bios = bio_list_get(&primary_pe->origin_bios);
    ... decrements primary_pe->ref_count to 1. Doesn't load origin_bios

    CPU2:
        if (first && atomic_dec_and_test(&primary_pe->ref_count)) {
                flush_bios(bio_list_get(&primary_pe->origin_bios));
                free_pending_exception(primary_pe);
                /* If we got here, pe_queue is necessarily empty. */
                return r;
        }
    ... decrements primary_pe->ref_count to 0, submits pending bios, frees
    primary_pe.

    CPU1:
        if (!primary_pe || primary_pe != pe)
                free_pending_exception(pe);
    ... this has no effect.
        if (primary_pe && !atomic_read(&primary_pe->ref_count))
                free_pending_exception(primary_pe);
    ... sees ref_count == 0 (written by CPU 2), does double free !!

    This bug can happen only if someone is simultaneously writing to both the
    origin and the snapshot.

    If someone is writing only to the origin, __origin_write will submit the
    kcopyd request after it decrements primary_pe->ref_count (so the finished
    copy cannot race with the decrement of primary_pe->ref_count).

    If someone is writing only to the snapshot, __origin_write isn't invoked at all
    and the race can't happen.

    The race happens when someone writes to the snapshot --- this creates a
    pending_exception with primary_pe == NULL and starts copying. Then,
    someone writes to the same chunk in the origin, and __origin_write races
    with the termination of the already-submitted request in
    pending_complete (which calls put_pending_exception).

    This race may be reason for bugs:
    http://bugzilla.kernel.org/show_bug.cgi?id=11636
    https://bugzilla.redhat.com/show_bug.cgi?id=465825

    The patch fixes the code to make sure that:
    1. If atomic_dec_and_test(&primary_pe->ref_count) returns false, the process
    must no longer dereference primary_pe (because someone else may free it under
    us).
    2. If atomic_dec_and_test(&primary_pe->ref_count) returns true, the process
    is responsible for freeing primary_pe.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon
    Cc: stable@kernel.org

    Mikulas Patocka
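    A minimal sketch of the ownership rule the fix enforces (condensed and
    illustrative, reusing the helpers shown in the race above):

        /* Only the caller whose atomic_dec_and_test() returns true may touch
         * primary_pe afterwards; every other path must not dereference it. */
        if (primary_pe && atomic_dec_and_test(&primary_pe->ref_count)) {
                flush_bios(bio_list_get(&primary_pe->origin_bios));
                free_pending_exception(primary_pe);  /* we held the last ref */
        }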
     
  • Write throughput to an LVM snapshot origin volume is an order
    of magnitude slower than that to an LV without snapshots or to
    snapshot target volumes, especially in the case of sequential
    writes with O_SYNC on.

    The following patch, originally written by Kevin Jamieson and
    Jan Blunck and slightly modified for the current RCs by myself,
    tries to improve the performance by modifying the behaviour
    of kcopyd so that it pushes an I/O job back to the head of
    the job queue, instead of to the tail as process_jobs() currently
    does, when it has to wait for free pages. This way, write
    requests aren't shuffled in a way that causes extra seeks.

    I tested the patch against 2.6.27-rc5 and got the following results.
    The test is a dd command writing to snapshot origin followed by fsync
    to the file just created/updated. A couple of filesystem benchmarks
    gave me similar results in case of sequential writes, while random
    writes didn't suffer much.

    dd if=/dev/zero of= bs=4096 count=...
    [conv=notrunc when updating]

    1) linux 2.6.27-rc5 without the patch, write to snapshot origin,
       average throughput (MB/s)

                            10M      100M     1000M
       create,dd         511.46    610.72     11.81
       create,dd+fsync     7.10      6.77      8.13
       update,dd         431.63    917.41     12.75
       update,dd+fsync     7.79      7.43      8.12

    Compared with write throughput to an LV without any snapshots (shown
    below), all dd+fsync and 1000 MiB writes perform very poorly.

                            10M      100M     1000M
       create,dd         555.03    608.98    123.29
       create,dd+fsync   114.27     72.78     76.65
       update,dd         152.34   1267.27    124.04
       update,dd+fsync   130.56     77.81     77.84

    2) linux 2.6.27-rc5 with the patch, write to snapshot origin,
       average throughput (MB/s)

                            10M      100M     1000M
       create,dd         537.06    589.44     46.21
       create,dd+fsync    31.63     29.19     29.23
       update,dd         487.59    897.65     37.76
       update,dd+fsync    34.12     30.07     26.85

    Although still not on par with plain LV performance - which cannot be
    avoided because it's copy-on-write anyway - this simple patch
    successfully improves the throughput of dd+fsync while not affecting
    the rest.

    Signed-off-by: Jan Blunck
    Signed-off-by: Kazuo Ito
    Signed-off-by: Alasdair G Kergon
    Cc: stable@kernel.org

    Kazuo Ito
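    A minimal sketch of the kcopyd change; the list and lock names are
    illustrative, loosely following dm-kcopyd's job lists:

        /* When a job must wait for free pages, requeue it at the *head* of
         * its list so the order of already-issued writes is preserved and
         * requests are not shuffled into extra seeks. */
        static void push_head(struct list_head *jobs, struct kcopyd_job *job)
        {
                unsigned long flags;

                spin_lock_irqsave(&job_lock, flags);
                list_add(&job->list, jobs);           /* head, not tail */
                spin_unlock_irqrestore(&job_lock, flags);
        }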
     

21 Oct, 2008

1 commit