24 Oct, 2011

1 commit

  • Fix memory leak introduced by commit a6e50b409d3f9e0833e69c3c9cca822e8fa4adbb
    (dm snapshot: skip reading origin when overwriting complete chunk).

    When allocating a set of jobs from kc->job_pool, job->master_job must be
    set (to point to itself) so that the mempool item gets freed when the
    master_job completes.

    master_job was introduced by commit c6ea41fbbe08f270a8edef99dc369faf809d1bd6
    (dm kcopyd: preallocate sub jobs to avoid deadlock)

    Reported-by: Michael Leun
    Cc: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     

06 Oct, 2011

1 commit


26 Sep, 2011

4 commits

  • If optional discard support in dm-crypt is enabled, discards requests
    bypass the crypt queue and blocks of the underlying device are discarded.
    For the read path, discarded blocks are handled the same as normal
    ciphertext blocks, thus decrypted.

    So if the underlying device announces discarded regions return zeroes,
    dm-crypt must disable this flag because after decryption there is just
    random noise instead of zeroes.

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • Fix off-by-one error in validation of write_mostly.

    The user-supplied value given for the 'write_mostly' argument must be an
    index starting at 0. The validation of the supplied argument failed to
    check for 'N' ('>' vs '>='), which would have caused an access beyond the
    end of the array.

    Reported-by: Doug Ledford
    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonthan Brassow
     
  • Commit a63a5cf (dm: improve block integrity support) introduced a
    two-phase initialization of a DM device's integrity profile. This
    patch avoids dereferencing a NULL 'template_disk' pointer in
    blk_integrity_register() if there is an integrity profile mismatch in
    dm_table_set_integrity().

    This can occur if the integrity profiles for stacked devices in a DM
    table are changed between the call to dm_table_prealloc_integrity() and
    dm_table_set_integrity().

    Reported-by: Zdenek Kabelac
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Cc: stable@kernel.org # 2.6.39

    Mike Snitzer
     
  • If no arguments were provided to the corrupt_bio_byte feature an error
    should be returned immediately.

    Reported-by: Zdenek Kabelac
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     

21 Sep, 2011

1 commit

  • Two related problems:

    1/ some error paths call "md_unregister_thread(mddev->thread)"
    without subsequently clearing ->thread. A subsequent call
    to mddev_unlock will try to wake the thread, and crash.

    2/ Most calls to md_wakeup_thread are protected against the thread
    disappeared either by:
    - holding the ->mutex
    - having an active request, so something else must be keeping
    the array active.
    However mddev_unlock calls md_wakeup_thread after dropping the
    mutex and without any certainty of an active request, so the
    ->thread could theoretically disappear.
    So we need a spinlock to provide some protections.

    So change md_unregister_thread to take a pointer to the thread
    pointer, and ensure that it always does the required locking, and
    clears the pointer properly.

    Reported-by: "Moshe Melnikov"
    Signed-off-by: NeilBrown
    cc: stable@kernel.org

    NeilBrown
     

10 Sep, 2011

3 commits

  • 0.90 metadata uses an unsigned 32bit number to count the number of
    kilobytes used from each device.
    This should allow up to 4TB per device.
    However we multiply this by 2 (to get sectors) before casting to a
    larger type, so sizes above 2TB get truncated.

    Also we allow rdev->sectors to be larger than 4TB, so it is possible
    for the array to be resized larger than the metadata can handle.
    So make sure rdev->sectors never exceeds 4TB when 0.90 metadata is in
    used.

    Also the sanity check at the end of super_90_load should include level
    1 as it used ->size too. (RAID0 and Linear don't use ->size at all).

    Reported-by: Pim Zandbergen
    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     
  • A single request to RAID1 or RAID10 might result in multiple
    requests if there are known bad blocks that need to be avoided.

    To detect if we need to submit another write request we test:
    if (sectors_handled < (bio->bi_size >> 9)) {

    However this is after we call **_write_done() so the 'bio' no longer
    belongs to us - the writes could have completed and the bio freed.

    So move the **_write_done call until after the test against
    bio->bi_size.

    This addresses https://bugzilla.kernel.org/show_bug.cgi?id=41862

    Reported-by: Bruno Wolff III
    Tested-by: Bruno Wolff III
    Signed-off-by: NeilBrown

    NeilBrown
     
  • A write can complete at two different places:
    1/ when the last member-device write completes, through
    raid10_end_write_request
    2/ in make_request() when we remove the initial bias from ->remaining.

    These two should do exactly the same thing and the comment says they
    do, but they don't.

    So factor the correct code out into a function and call it in both
    places. This makes the code much more similar to RAID1.

    The difference is only significant if there is an error, and they
    usually take a while, so it is unlikely that there will be an error
    already when make_request is completing, so this is unlikely to cause
    real problems.

    Signed-off-by: NeilBrown

    NeilBrown
     

31 Aug, 2011

1 commit

  • Waiting for a 'blocked' rdev to become unblocked in the raid5d thread
    cannot work with internal metadata as it is the raid5d thread which
    will clear the blocked flag.
    This wasn't a problem in 3.0 and earlier as we only set the blocked
    flag when external metadata was used then.
    However we now set it always, so we need to be more careful.

    Signed-off-by: NeilBrown

    NeilBrown
     

30 Aug, 2011

1 commit


25 Aug, 2011

4 commits


02 Aug, 2011

24 commits

  • DM has always advertised both REQ_FLUSH and REQ_FUA flush capabilities
    regardless of whether or not a given DM device's underlying devices
    also advertised a need for them.

    Block's flush-merge changes from 2.6.39 have proven to be more costly
    for DM devices. Performance regressions have been reported even when
    DM's underlying devices do not advertise that they have a write cache.

    Fix the performance regressions by configuring a DM device's flushing
    capabilities based on those of the underlying devices' capabilities.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Add optional parameter field to dmcrypt table and support
    "allow_discards" option.

    Discard requests bypass crypt queue processing. Bio is simple remapped
    to underlying device.

    Note that discard will be never enabled by default because of security
    consequences. It is up to the administrator to enable it for encrypted
    devices.

    (Note that userspace cryptsetup does not understand new optional
    parameters yet. Support for this will come later. Until then, you
    should use 'dmsetup' to enable and disable this.)

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • Support the MD RAID1 personality through dm-raid.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • Add the ability to parse and use metadata devices to dm-raid. Although
    not strictly required, without the metadata devices, many features of
    RAID are unavailable. They are used to store a superblock and bitmap.

    The role, or position in the array, of each device must be recorded in
    its superblock. This is to help with fault handling, array reshaping,
    and sanity checks. RAID 4/5/6 devices must be loaded in a specific order:
    in this way, the 'array_position' field helps validate the correctness
    of the mapping when it is loaded. It can be used during reshaping to
    identify which devices are added/removed. Fault handling is impossible
    without this field. For example, when a device fails it is recorded in
    the superblock. If this is a RAID1 device and the offending device is
    removed from the array, there must be a way during subsequent array
    assembly to determine that the failed device was the one removed. This
    is done by correlating the 'array_position' field and the bit-field
    variable 'failed_devices'.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • Add the write_mostly parameter to RAID1 dm-raid tables.

    This allows the user to set the WriteMostly flag on a RAID1 device that
    should normally be avoided for read I/O.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • Allow the user to specify the region_size.

    Ensures that the supplied value meets md's constraints, viz. the number of
    regions does not exceed 2^21.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • Exactly one of name, uuid or device must be specified when referencing
    an existing device. This removes the ambiguity (risking the wrong
    device being updated) if two conflicting parameters were specified.
    Previously one parameter got used and any others were ignored silently.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Move logic to find device based on major/minor number to a separate
    function __get_dev_cell (similar to __get_uuid_cell and __get_name_cell).
    This makes the function __find_device_hash_cell more straightforward.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Move parameter filling from find_device to __find_device_hash_cell.

    This patch causes ioctls using __find_device_hash_cell
    (DM_DEV_REMOVE_CMD, DM_DEV_SUSPEND_CMD - resume, DM_TABLE_CLEAR_CMD)
    to return device parameters, bringing them into line with the other
    ioctls.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Add corrupt_bio_byte feature to simulate corruption by overwriting a byte at a
    specified position with a specified value during intervals when the device is
    "down".

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Add 'drop_writes' option to drop writes silently while the
    device is 'down'. Reads are not touched.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Add the ability to specify arbitrary feature flags when creating a
    flakey target. This code uses the same target argument helpers that
    the multipath target does.

    Also remove the superfluous 'dm-flakey' prefixes from the error messages,
    as they already contain the prefix 'flakey'.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Use dm_target_offset() and support discards.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Move multipath target argument parsing code into dm-table so other
    targets can share it.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • If we write a full chunk in the snapshot, skip reading the origin device
    because the whole chunk will be overwritten anyway.

    This patch changes the snapshot write logic when a full chunk is written.
    In this case:
    1. allocate the exception
    2. dispatch the bio (but don't report the bio completion to device mapper)
    3. write the exception record
    4. report bio completed

    Callbacks must be done through the kcopyd thread, because callbacks must not
    race with each other. So we create two new functions:

    dm_kcopyd_prepare_callback: allocate a job structure and prepare the callback.
    (This function must not be called from interrupt context.)

    dm_kcopyd_do_callback: submit callback.
    (This function may be called from interrupt context.)

    Performance test (on snapshots with 4k chunk size):
    without the patch:
    non-direct-io sequential write (dd): 17.7MB/s
    direct-io sequential write (dd): 20.9MB/s
    non-direct-io random write (mkfs.ext2): 0.44s

    with the patch:
    non-direct-io sequential write (dd): 26.5MB/s
    direct-io sequential write (dd): 33.2MB/s
    non-direct-io random write (mkfs.ext2): 0.27s

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Add a new flag DMF_MERGE_IS_OPTIONAL to struct mapped_device to indicate
    whether the device can accept bios larger than the size its merge
    function returns. When set, use this to send large bios to snapshots
    which can split them if necessary. Snapshot I/O may be significantly
    fragmented and this approach seems to improve peformance.

    Before the patch, dm_set_device_limits restricted bio size to page size
    if the underlying device had a merge function and the target didn't
    provide a merge function. After the patch, dm_set_device_limits
    restricts bio size to page size if the underlying device has a merge
    function, doesn't have DMF_MERGE_IS_OPTIONAL flag and the target doesn't
    provide a merge function.

    The snapshot target can't provide a merge function because when the merge
    function is called, it is impossible to determine where the bio will be
    remapped. Previously this led us to impose a 4k limit, which we can
    now remove if the snapshot store is located on a device without a merge
    function. Together with another patch for optimizing full chunk writes,
    it improves performance from 29MB/s to 40MB/s when writing to the
    filesystem on snapshot store.

    If the snapshot store is placed on a non-dm device with a merge function
    (such as md-raid), device mapper still limits all bios to page size.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • There is no need for __table_get_device to be factored out.
    Also move the exports to the end of their respective functions.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • A dm target only needs to use include/linux dm headers.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Detect invalid empty messages in core dm instead of requiring every target to
    check this.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Re-order the parameters so they are handled consistently in the same order
    where defined, parsed and output.

    Only include rebuild parameters in the STATUSTYPE_TABLE output if they were
    supplied in the original table line.

    Correct the parameter count when outputting rebuild: there are two words,
    not one.

    Use case-independent checks for keywords (as in other device-mapper targets).

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • Coding style cleanups.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Jonathan Brassow

    Jonathan Brassow
     
  • Remove a couple of unused #defines.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • The nr_pages field in struct kcopyd_job is only used temporarily in
    run_pages_job() to count the number of required pages.
    We can use a local variable instead.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • The offset field in struct kcopyd_job is always zero so remove it.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka