29 Dec, 2008

1 commit

  • Instead of having a global bio slab cache, add a reference to one
    in each bio_set that is created. This allows for personalized slabs
    in each bio_set, so that they can have bios of different sizes.

    This means we can personalize the bios we return. File systems may
    want to embed the bio inside another structure, to avoid allocating
    more items (and stuffing them into ->bi_private) after they get a bio.
    Or we may want to embed a number of bio_vecs directly at the end
    of a bio, to avoid doing two allocations to return a bio. This is now
    possible.

    Signed-off-by: Jens Axboe

    Jens Axboe
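    A minimal sketch of what this enables, assuming a bio_set created with
    enough front padding for the embedding structure; the struct and helper
    below are illustrative, not taken from the patch:

        #include <linux/bio.h>

        /* One allocation yields both the private state and the bio. */
        struct my_io {
                void            *private_data;  /* no need for ->bi_private */
                struct bio      bio;            /* must be the last member */
        };

        static struct my_io *my_io_alloc(struct bio_set *bs, int nr_vecs)
        {
                struct bio *bio = bio_alloc_bioset(GFP_NOIO, nr_vecs, bs);

                if (!bio)
                        return NULL;
                return container_of(bio, struct my_io, bio);
        }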
     

25 Dec, 2008

1 commit


19 Dec, 2008

1 commit

  • When we read the write-intent-bitmap off the device, we currently
    read a whole number of pages.
    When PAGE_SIZE is 4K, this works due to the alignment we enforce
    on the superblock and bitmap.
    When PAGE_SIZE is 64K, this can read past the end of the device,
    which causes an error.

    When we write the superblock, we make sure to clip the last page
    to just the required size. Copy that code into the read path so it
    reads only the required number of sectors.

    Signed-off-by: Neil Brown
    Cc: stable@kernel.org

    NeilBrown
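    A minimal sketch of the clipping idea; the variable names are
    illustrative rather than md's actual bitmap fields:

        /* Read whole pages of the bitmap, but clip the final page so the
         * transfer covers only the sectors the bitmap actually occupies. */
        unsigned long count = PAGE_SIZE;

        if (index == last_page_index)
                count = roundup(bitmap_bytes - index * PAGE_SIZE, 512);
        /* ...issue a read of 'count' bytes instead of a full page... */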
     

05 Dec, 2008

1 commit


03 Dec, 2008

1 commit

  • Fix setting of max_segment_size and seg_boundary mask for stacked md/dm
    devices.

    When stacking devices (LVM over MD over SCSI), some of the request queue
    parameters are not set up correctly in some cases by default, namely
    max_segment_size and the seg_boundary mask.

    If you create an MD device over SCSI, these attributes are zeroed.

    The problem appears when a further device-mapper mapping is stacked on
    top of this one - the queue attributes are then set up in DM this way:

    request_queue   max_segment_size   seg_boundary_mask
    SCSI                       65536          0xffffffff
    MD RAID1                       0                   0
    LVM                        65536          -1 (64bit)

    Unfortunately bio_add_page() (resp. bio_phys_segments()) calculates the
    number of physical segments according to these parameters.

    During generic_make_request() the segment count is recalculated and can
    increase bio->bi_phys_segments above the allowed limit (after
    bio_clone() in the stacking operation).

    This is especially a problem in the CCISS driver, where it produces an
    oops here:

    BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);

    (MAXSGENTRIES is 31 by default.)

    Sometimes even this command is enough to cause an oops:

    dd iflag=direct if=/dev// of=/dev/null bs=128000 count=10

    This command generates bios with 250 sectors, allocated in 32 4k-pages
    (last page uses only 1024 bytes).

    At the LVM layer, the bio is allocated with 31 segments (still OK for
    CCISS); unfortunately, at the lower layer it is recalculated to 32
    segments, which violates the CCISS restriction and triggers the BUG_ON().

    The patch tries to fix it by:

    * initializing the attributes above in the queue request constructor,
    blk_queue_make_request()

    * making sure that blk_queue_stack_limits() inherits these settings

    (DM uses its own function to set the limits because
    blk_queue_stack_limits() was introduced later. It should probably
    switch to the generic stack limits function too.)

    * setting the default seg_boundary value in one place (blkdev.h)

    * using this mask as the default in DM (instead of -1, which differs
    on 64-bit)

    Bugs related to this:
    https://bugzilla.redhat.com/show_bug.cgi?id=471639
    http://bugzilla.kernel.org/show_bug.cgi?id=8672

    Signed-off-by: Milan Broz
    Reviewed-by: Alasdair G Kergon
    Cc: Neil Brown
    Cc: FUJITA Tomonori
    Cc: Tejun Heo
    Cc: Mike Miller
    Signed-off-by: Jens Axboe

    Milan Broz
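    A minimal sketch of the default-initialisation part of the fix; it
    assumes the MAX_SEGMENT_SIZE constant already in blkdev.h and the
    BLK_SEG_BOUNDARY_MASK value the patch adds there:

        /* Give every newly constructed queue non-zero limits so that
         * stacking drivers never inherit zeroes from an unset queue;
         * blk_queue_make_request() would do this once per queue. */
        blk_queue_max_segment_size(q, MAX_SEGMENT_SIZE);
        blk_queue_segment_boundary(q, BLK_SEG_BOUNDARY_MASK);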
     

26 Nov, 2008

2 commits

  • Port to the new tracepoints API: split DEFINE_TRACE() and DECLARE_TRACE()
    sites. Spread them out to the usage sites, as suggested by
    Mathieu Desnoyers.

    Signed-off-by: Ingo Molnar
    Acked-by: Mathieu Desnoyers

    Ingo Molnar
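    A minimal sketch of the split with an illustrative block tracepoint; the
    prototype wrapper macros changed spelling across kernel versions
    (TPPROTO/TPARGS vs. TP_PROTO/TP_ARGS), so the exact form here is an
    assumption:

        /* In a header (e.g. include/trace/block.h): declared once for all users. */
        DECLARE_TRACE(block_bio_queue,
                TP_PROTO(struct request_queue *q, struct bio *bio),
                TP_ARGS(q, bio));

        /* In exactly one .c file near the usage site: the definition. */
        DEFINE_TRACE(block_bio_queue);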
     
    This was a forward port of work done by Mathieu Desnoyers; I changed it
    to encode the 'what' parameter in the tracepoint name, so that one can
    register interest in specific events rather than registering for a class
    of events and then checking the 'what' parameter.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: Jens Axboe
    Signed-off-by: Ingo Molnar

    Arnaldo Carvalho de Melo
     

14 Nov, 2008

6 commits

    dm_any_congested() just checks for the DMF_BLOCK_IO flag and has no
    code to make sure that suspend waits for dm_any_congested() to
    complete. This patch adds such a check.

    Without it, a race can occur with dm_table_put() attempting to
    destroy the table in the wrong thread - the one running
    dm_any_congested(), which is meant to be quick and return
    immediately.

    Two examples of problems:
    1. Sleeping functions called from congested code, the caller
    of which holds a spin lock.
    2. An ABBA deadlock between pdflush and multipathd. The two locks
    in contention are inode lock and kernel lock.

    Signed-off-by: Chandra Seetharaman
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Chandra Seetharaman
     
  • This doesn't fix any bug, just moves wake_up immediately after decrementing
    md->pending, for better code readability.

    It must be clear to anyone manipulating md->pending that the queue
    needs to be woken when md->pending reaches zero, so move the wakeup as
    close to the decrement as possible.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
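    A minimal sketch of the pairing this makes obvious (md->pending and
    md->wait as in dm's struct mapped_device; condensed, not the literal
    patch):

        if (atomic_dec_and_test(&md->pending))
                /* the last outstanding I/O is done: wake dm_suspend() */
                wake_up(&md->wait);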
     
  • Currently dm ignores the parameters provided to hardware handlers
    without providing any notifications to the user.

    This patch just prints a warning message so that the user knows that
    the arguments are ignored.

    Signed-off-by: Chandra Seetharaman
    Signed-off-by: Alasdair G Kergon

    Chandra Seetharaman
     
    Path activation code is called even when the pgpath is NULL. This can
    lead to a panic in activate_path(); such a panic has been seen on the
    -rt kernel.

    This problem existed before pg_init() was moved to a workqueue.

    Signed-off-by: Chandra Seetharaman
    Signed-off-by: Alasdair G Kergon

    Chandra Seetharaman
     
  • Don't proceed if dm_stripe_init() fails to register itself as a dm target.

    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Alasdair G Kergon

    Heinz Mauelshagen
     
    We queue work on the keventd queue, so that queue must be flushed in the
    destructor. Otherwise, keventd could access mirror_set after it has been
    freed.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon
    Cc: stable@kernel.org

    Mikulas Patocka
     

06 Nov, 2008

3 commits

  • We currently oops with a divide error on starting a linear software
    raid array consisting of at least two very small (< 500K) devices.

    The bug is caused by the calculation of the hash table size which
    tries to compute sector_div(sz, base) with "base" being zero due to
    the small size of the component devices of the array.

    Fix this by requiring the hash spacing to be at least one, which
    ensures that "base" is also non-zero.

    This bug has existed since about 2.6.14.

    Cc: stable@kernel.org
    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • Adding a spare to a raid10 doesn't cause recovery to start.
    This is due to a silly typo in
    commit 6c2fce2ef6b4821c21b5c42c7207cb9cf8c87eda
    and so is a bug in 2.6.27 and 2.6.28-rc.

    Thanks to Thomas Backlund for bisecting to find this.

    Cc: Thomas Backlund
    Cc: stable@kernel.org

    Signed-off-by: NeilBrown

    NeilBrown
     
  • It turns out that it is only safe to call blkdev_ioctl when the device
    is actually open (as ->bd_disk is set to NULL on last close). And it
    is quite possible for do_md_stop to be called when the device is not
    open. So discard the call to blkdev_ioctl(BLKRRPART) which was
    added in
    commit 934d9c23b4c7e31840a895ba4b7e88d6413c81f3

    It is just as easy to call this ioctl from userspace when needed (on
    mdadm -S), so leave it out of the kernel.

    Signed-off-by: NeilBrown

    NeilBrown
     

31 Oct, 2008

1 commit


30 Oct, 2008

3 commits

  • If there are several snapshots sharing an origin and one is removed
    while the origin is being written to, the snapshot's mempool may get
    deleted while elements are still referenced.

    Prior to dm-snapshot-use-per-device-mempools.patch the pending
    exceptions may still have been referenced after the snapshot was
    destroyed, but this was not a problem because the shared mempool
    was still there.

    This patch fixes the problem by tracking the number of mempool elements
    in use.

    The scenario:
    - You have an origin and two snapshots 1 and 2.
    - Someone writes to the origin.
    - This creates two exceptions in the snapshots; snapshot 1's will be the
    primary exception, and snapshot 2's pending_exception->primary_pe will
    point to the exception in snapshot 1.
    - The exceptions are being relocated; relocation of exception 1 finishes
    (but its pending_exception is still allocated, because it is referenced
    by an exception from snapshot 2).
    - The user lvremoves snapshot 1 --- it just calls suspend (which does
    nothing) and the destructor. md->pending is zero (there is no I/O
    submitted to the snapshot by the md layer), so it won't help us.
    - The destructor waits for kcopyd jobs to finish on snapshot 1 --- but
    there are none.
    - The destructor on snapshot 1 cleans up everything.
    - The relocation of the exception on snapshot 2 finishes and drops its
    reference on primary_pe, freeing it. primary_pe points to the pending
    exception created for snapshot 1, so the memory is freed into a mempool
    that no longer exists.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • register_snapshot() performs a GFP_KERNEL allocation while holding
    _origins_lock for write, but that could write out dirty pages onto a
    device that attempts to acquire _origins_lock for read, resulting in
    deadlock.

    So move the allocation up before taking the lock.

    This path is not performance-critical, so it doesn't matter that we
    allocate memory and free it if we find that we won't need it.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
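    A minimal sketch of the reordering; the lock follows dm-snap's
    _origins_lock, while the allocation and its use are illustrative:

        /* Allocate with GFP_KERNEL *before* taking the lock: the allocation
         * may trigger writeback, and that writeback must never need the
         * lock we would otherwise already be holding. */
        o = kmalloc(sizeof(*o), GFP_KERNEL);
        if (!o)
                return -ENOMEM;

        down_write(&_origins_lock);
        /* ... register the snapshot, consuming 'o' only if needed ... */
        up_write(&_origins_lock);

        if (!o_was_used)
                kfree(o);    /* cheap: this path is not performance-critical */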
     
  • Missing braces. Commit 1f965b1943 (dm raid1: separate region_hash interface
    part1) broke it.

    Signed-off-by: Ilpo Jarvinen
    Signed-off-by: Alasdair G Kergon
    Cc: Heinz Mauelshagen

    Ilpo Jarvinen
     

28 Oct, 2008

1 commit

  • md arrays are not currently destroyed when they are stopped - they
    remain in /sys/block. Last time I tried this I tripped over locking
    too much.

    A consequence of this is that udev doesn't remove anything from /dev.
    This is rather ugly.

    As an interim measure until proper device removal can be achieved,
    make sure all partitions are removed using the BLKRRPART ioctl, and
    send a KOBJ_CHANGE when an md array is stopped.

    Signed-off-by: NeilBrown

    NeilBrown
     

27 Oct, 2008

1 commit


24 Oct, 2008

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits)
    [PATCH] kill the rest of struct file propagation in block ioctls
    [PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET
    [PATCH] get rid of blkdev_locked_ioctl()
    [PATCH] get rid of blkdev_driver_ioctl()
    [PATCH] sanitize blkdev_get() and friends
    [PATCH] remember mode of reiserfs journal
    [PATCH] propagate mode through swsusp_close()
    [PATCH] propagate mode through open_bdev_excl/close_bdev_excl
    [PATCH] pass fmode_t to blkdev_put()
    [PATCH] kill the unused bsize on the send side of /dev/loop
    [PATCH] trim file propagation in block/compat_ioctl.c
    [PATCH] end of methods switch: remove the old ones
    [PATCH] switch sr
    [PATCH] switch sd
    [PATCH] switch ide-scsi
    [PATCH] switch tape_block
    [PATCH] switch dcssblk
    [PATCH] switch dasd
    [PATCH] switch mtd_blkdevs
    [PATCH] switch mmc
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (46 commits)
    [PATCH] fs: add a sanity check in d_free
    [PATCH] i_version: remount support
    [patch] vfs: make security_inode_setattr() calling consistent
    [patch 1/3] FS_MBCACHE: don't needlessly make it built-in
    [PATCH] move executable checking into ->permission()
    [PATCH] fs/dcache.c: update comment of d_validate()
    [RFC PATCH] touch_mnt_namespace when the mount flags change
    [PATCH] reiserfs: add missing llseek method
    [PATCH] fix ->llseek for more directories
    [PATCH vfs-2.6 6/6] vfs: add LOOKUP_RENAME_TARGET intent
    [PATCH vfs-2.6 5/6] vfs: remove LOOKUP_PARENT from non LOOKUP_PARENT lookup
    [PATCH vfs-2.6 4/6] vfs: remove unnecessary fsnotify_d_instantiate()
    [PATCH vfs-2.6 3/6] vfs: add __d_instantiate() helper
    [PATCH vfs-2.6 2/6] vfs: add d_ancestor()
    [PATCH vfs-2.6 1/6] vfs: replace parent == dentry->d_parent by IS_ROOT()
    [PATCH] get rid of on-stack dentry in udf
    [PATCH 2/2] anondev: switch to IDA
    [PATCH 1/2] anondev: init IDR statically
    [JFFS2] Use d_splice_alias() not d_add() in jffs2_lookup()
    [PATCH] Optimise NFS readdir hack slightly.
    ...

    Linus Torvalds
     

23 Oct, 2008

1 commit


22 Oct, 2008

14 commits

  • This patch tidies local_init() in preparation for request-based dm.
    No functional change.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • This patch removes the DM_WQ_FLUSH_ALL state, which is unnecessary.

    The dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL) in dm_suspend()
    is never invoked because:
    - 'goto flush_and_out' is the same as 'goto out' because
    the 'goto flush_and_out' is called only when '!noflush'
    - If r is non-zero, then the code above will invoke 'goto out'
    and skip this code.

    No functional change.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • Separate the region hash code from raid1 so it can be shared by forthcoming
    targets. Use BUG_ON() for failed async dm_io() calls.

    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Alasdair G Kergon

    Heinz Mauelshagen
     
  • When a bio gets split, mark its fragments with the BIO_CLONED flag.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Alasdair G Kergon

    Martin K. Petersen
     
  • Remove waitqueue no longer needed with the async crypto interface.

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • When writing io, dm-crypt has to allocate a new cloned bio
    and encrypt the data into newly-allocated pages attached to this bio.
    In rare cases, because of hw restrictions (e.g. the physical segment
    limit) or memory pressure, more than one cloned bio has to be used,
    each processing a different fragment of the original.

    Currently there is one waitqueue which waits for one fragment to finish
    and continues processing the next fragment.

    But when using asynchronous crypto this doesn't work, because several
    fragments may be processed asynchronously or in parallel and there is
    only one crypt context that cannot be shared between the bio fragments.
    The result may be corruption of the data contained in the encrypted bio.

    The patch fixes this by allocating new dm_crypt_io structs (with new
    crypto contexts) and running them independently.

    Each fragment contains a pointer to the base dm_crypt_io struct to
    handle reference counting, so the base one is properly deallocated
    after all the fragments have finished.

    In a low-memory situation, this only uses one additional object from the
    mempool. If the mempool is empty, the next allocation simply waits for
    previous fragments to complete.

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
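    A minimal sketch of the reference-counting scheme described above; the
    struct and helper are illustrative, loosely modelled on dm_crypt_io
    rather than copied from the patch:

        struct frag_io {
                atomic_t         pending;   /* outstanding work on this io  */
                struct frag_io  *base_io;   /* NULL for the base io itself  */
        };

        static void frag_io_put(struct frag_io *io)
        {
                if (!atomic_dec_and_test(&io->pending))
                        return;
                if (io->base_io)
                        frag_io_put(io->base_io); /* drop our ref on the base */
                /* last reference gone: end the bio and free 'io' here */
        }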
     
  • Prepare local sector variable (offset) for later patch.
    Do not update io->sector for still-running I/O.

    No functional change.

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • Change #include "dm.h" to #include <linux/device-mapper.h> in all targets.
    Targets should not need direct access to internal DM structures.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Move array_too_big to include/linux/device-mapper.h because it is
    used by targets.

    Remove the test from dm-raid1 as the number of mirror legs is limited
    such that it can never fail. (Even for stripes it seems rather
    unlikely.)

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • We must zero the next chunk on disk *before* writing out the current chunk, not
    after. Otherwise if the machine crashes at the wrong time, the "end of metadata"
    marker may be missing.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon
    Cc: stable@kernel.org

    Mikulas Patocka
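    A minimal sketch of the required ordering; the helper names are
    illustrative, not the exception store's actual functions:

        /* The zeroed next area is the "end of metadata" marker, so it must
         * reach the disk before the current area is filled in. */
        r = zero_area(ps, ps->current_area + 1);      /* marker first     */
        if (!r)
                r = write_area(ps, ps->current_area); /* then the payload */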
     
  • Use a separate buffer for writing zeroes to the on-disk snapshot
    exception store, make the updating of ps->current_area explicit and
    refactor the code in preparation for the fix in the next patch.

    No functional change.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Mikulas Patocka
    Cc: stable@kernel.org

    Alasdair G Kergon
     
  • The last_percent field is unused - remove it.
    (It dates from when events were triggered as each X% filled up.)

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Fix a race condition with primary_pe ref_count handling.

    put_pending_exception runs under dm_snapshot->lock; it does
    atomic_dec_and_test on primary_pe->ref_count, and later does atomic_read
    on primary_pe->ref_count.

    __origin_write does atomic_dec_and_test on primary_pe->ref_count without
    holding dm_snapshot->lock.

    This opens the following race condition:
    Assume two CPUs, CPU1 is executing put_pending_exception (and holding
    dm_snapshot->lock). CPU2 is executing __origin_write in parallel.
    primary_pe->ref_count == 2.

    CPU1:
        if (primary_pe && atomic_dec_and_test(&primary_pe->ref_count))
                origin_bios = bio_list_get(&primary_pe->origin_bios);
    ... decrements primary_pe->ref_count to 1. Doesn't load origin_bios

    CPU2:
        if (first && atomic_dec_and_test(&primary_pe->ref_count)) {
                flush_bios(bio_list_get(&primary_pe->origin_bios));
                free_pending_exception(primary_pe);
                /* If we got here, pe_queue is necessarily empty. */
                return r;
        }
    ... decrements primary_pe->ref_count to 0, submits pending bios, frees
    primary_pe.

    CPU1:
        if (!primary_pe || primary_pe != pe)
                free_pending_exception(pe);
    ... this has no effect.
        if (primary_pe && !atomic_read(&primary_pe->ref_count))
                free_pending_exception(primary_pe);
    ... sees ref_count == 0 (written by CPU 2), does double free !!

    This bug can happen only if someone is simultaneously writing to both the
    origin and the snapshot.

    If someone is writing only to the origin, __origin_write will submit the
    kcopyd request after it decrements primary_pe->ref_count (so the finished
    copy cannot race with the decrement of primary_pe->ref_count).

    If someone is writing only to the snapshot, __origin_write isn't invoked at all
    and the race can't happen.

    The race happens when someone writes to the snapshot --- this creates a
    pending_exception with primary_pe == NULL and starts copying. Then,
    someone writes to the same chunk in the origin, and __origin_write races
    with the termination of the already-submitted request in
    pending_complete (which calls put_pending_exception).

    This race may be reason for bugs:
    http://bugzilla.kernel.org/show_bug.cgi?id=11636
    https://bugzilla.redhat.com/show_bug.cgi?id=465825

    The patch fixes the code to make sure that:
    1. If atomic_dec_and_test(&primary_pe->ref_count) returns false, the process
    must no longer dereference primary_pe (because someone else may free it under
    us).
    2. If atomic_dec_and_test(&primary_pe->ref_count) returns true, the process
    is responsible for freeing primary_pe.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon
    Cc: stable@kernel.org

    Mikulas Patocka
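    A minimal sketch of the ownership rule the fix enforces (condensed and
    illustrative, reusing the helpers shown in the race above):

        /* Only the caller whose atomic_dec_and_test() returns true may touch
         * primary_pe afterwards; every other path must not dereference it. */
        if (primary_pe && atomic_dec_and_test(&primary_pe->ref_count)) {
                flush_bios(bio_list_get(&primary_pe->origin_bios));
                free_pending_exception(primary_pe);  /* we held the last ref */
        }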
     
  • Write throughput to an LVM snapshot origin volume is an order
    of magnitude slower than that to an LV without snapshots or to
    snapshot target volumes, especially in the case of sequential
    writes with O_SYNC on.

    The following patch, originally written by Kevin Jamieson and
    Jan Blunck and slightly modified for the current RCs by myself,
    tries to improve the performance by modifying the behaviour
    of kcopyd so that it pushes an I/O job back to the head of
    the job queue, instead of to the tail as process_jobs() currently
    does, when it has to wait for free pages. This way, write
    requests aren't shuffled in a way that causes extra seeks.

    I tested the patch against 2.6.27-rc5 and got the following results.
    The test is a dd command writing to snapshot origin followed by fsync
    to the file just created/updated. A couple of filesystem benchmarks
    gave me similar results in case of sequential writes, while random
    writes didn't suffer much.

    dd if=/dev/zero of= bs=4096 count=...
    [conv=notrunc when updating]

    1) linux 2.6.27-rc5 without the patch, write to snapshot origin,
       average throughput (MB/s)

                            10M      100M     1000M
       create,dd         511.46    610.72     11.81
       create,dd+fsync     7.10      6.77      8.13
       update,dd         431.63    917.41     12.75
       update,dd+fsync     7.79      7.43      8.12

    Compared with write throughput to an LV without any snapshots (shown
    below), all dd+fsync and 1000 MiB writes perform very poorly.

                            10M      100M     1000M
       create,dd         555.03    608.98    123.29
       create,dd+fsync   114.27     72.78     76.65
       update,dd         152.34   1267.27    124.04
       update,dd+fsync   130.56     77.81     77.84

    2) linux 2.6.27-rc5 with the patch, write to snapshot origin,
       average throughput (MB/s)

                            10M      100M     1000M
       create,dd         537.06    589.44     46.21
       create,dd+fsync    31.63     29.19     29.23
       update,dd         487.59    897.65     37.76
       update,dd+fsync    34.12     30.07     26.85

    Although still not on par with plain LV performance - which cannot be
    avoided because it's copy-on-write anyway - this simple patch
    successfully improves the throughput of dd+fsync while not affecting
    the rest.

    Signed-off-by: Jan Blunck
    Signed-off-by: Kazuo Ito
    Signed-off-by: Alasdair G Kergon
    Cc: stable@kernel.org

    Kazuo Ito
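    A minimal sketch of the kcopyd change; the list and lock names are
    illustrative, loosely following dm-kcopyd's job lists:

        /* When a job must wait for free pages, requeue it at the *head* of
         * its list so the order of already-issued writes is preserved and
         * requests are not shuffled into extra seeks. */
        static void push_head(struct list_head *jobs, struct kcopyd_job *job)
        {
                unsigned long flags;

                spin_lock_irqsave(&job_lock, flags);
                list_add(&job->list, jobs);           /* head, not tail */
                spin_unlock_irqrestore(&job_lock, flags);
        }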
     

21 Oct, 2008

1 commit