17 Sep, 2010

2 commits

  • If an array with 1.x metadata is assembled with the last disk missing,
    md doesn't properly record the fact that the disk was missing.

    This is unlikely to cause a real problem as the event count will be
    different to the count on the missing disk so it won't be included in
    the array. However it could still cause confusion.

    So make sure we clear all the relevant slots, not just the early ones.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Now that we depend on md_update_sb to clear variable bits in
    mddev->flags (rather than trying not to set them) it is important to
    always call md_update_sb when appropriate.

    md_check_recovery has this job but explicitly avoids it for ->external
    metadata arrays. This is no longer appropriate, or needed.

    However we do want to avoid taking the mddev lock if only
    MD_CHANGE_PENDING is set as that is not cleared by md_update_sb for
    external-metadata arrays.

    Reported-by: "Kwolek, Adam"
    Signed-off-by: NeilBrown

    NeilBrown
     

30 Aug, 2010

3 commits

  • MD_CHANGE_CLEAN is used for two different purposes and this leads to
    confusion.
    One of the purposes is largely mirrored by MD_CHANGE_PENDING which is
    not used for anything else, so have MD_CHANGE_PENDING take over that
    purpose fully.

    The two purposes are:
    1/ tell md_update_sb that an update is needed and that it is just a
    clean/dirty transition.
    2/ tell user-space that a transition from clean to dirty is pending
    (something wants to write), and tell the kernel (by clearing the
    flag) that the transition is OK.

    The first purpose remains with MD_CHANGE_CLEAN, the second is moved
    fully to MD_CHANGE_PENDING.

    This means that various places which conditionally set or cleared
    MD_CHANGE_CLEAN no longer need to be conditional.
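
    A rough sketch (not the kernel's exact code) of what a clean array does
    when a write is about to arrive, once the two roles are separated:

        set_bit(MD_CHANGE_CLEAN, &mddev->flags);   /* sb update needed: just a
                                                      clean/dirty transition */
        set_bit(MD_CHANGE_PENDING, &mddev->flags); /* tell user-space a clean
                                                      to dirty change is pending */
        /* ... the write then waits until MD_CHANGE_PENDING is cleared. */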

    Signed-off-by: NeilBrown

    NeilBrown
     
  • If this bit is cleared in md_update_sb() the kernel will allow writes to the
    array if userspace triggers md_allow_write(), e.g. through stripe_cache_size,
    when mdmon is not active. When mdmon is active the array transitions to
    active-idle bypassing write-pending, setting up a race for mdmon to set the
    array clean before a write arrives.

    Signed-off-by: Dan Williams
    Signed-off-by: NeilBrown

    Dan Williams
     
  • Another missing bit of the raid6 -> /lib move.

    Reported-by: Andreas Schwab
    Signed-off-by: NeilBrown

    NeilBrown
     

18 Aug, 2010

4 commits

  • commit 7b6d91daee5cac6402186ff224c3af39d79f4a0e changed the behaviour
    of a few variables in raid1 and raid10 from flags to bit-sets, but
    left them as type 'bool' so they did not work.

    Change them (back) to unsigned long.
    (historical note: see 1ef04fefe2241087d9db7e9615c3f11b516e36cf)

    Signed-off-by: NeilBrown
    Reported-by: Jiri Slaby and many others

    NeilBrown
     
  • md_check_recovery expects ->spare_active to return 'true' if any
    spares were activated, but none of them do, so the consequent change
    in 'degraded' is not notified through sysfs.

    So count the number of spares activated, subtract it from 'degraded'
    just once, and return it.
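
    A rough sketch of the counting pattern (raid1 shown schematically;
    locking and sysfs notification omitted):

        int count = 0;

        for (i = 0; i < conf->raid_disks; i++) {
                mdk_rdev_t *rdev = conf->mirrors[i].rdev;

                if (rdev && !test_bit(Faulty, &rdev->flags)
                    && !test_and_set_bit(In_sync, &rdev->flags))
                        count++;
        }
        mddev->degraded -= count;  /* adjust 'degraded' exactly once */
        return count;              /* non-zero => caller notifies sysfs */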

    Reported-by: Adrian Drzewiecki
    Signed-off-by: NeilBrown

    NeilBrown
     
  • When RAID1 is done syncing disks, it'll update the state
    of synced rdevs to In_sync. But it neglected to notify
    sysfs that the attribute changed. So any programs that
    are waiting for an rdev's state to change will not be
    woken.

    (raid5/raid10 added by neilb)
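
    A minimal sketch of the missing notification (call site simplified):

        if (!test_and_set_bit(In_sync, &rdev->flags))
                /* wake anyone polling this rdev's sysfs "state" attribute */
                sysfs_notify_dirent(rdev->sysfs_state);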

    Signed-off-by: Adrian Drzewiecki
    Signed-off-by: NeilBrown

    Adrian Drzewiecki
     
    The update of ->recovery_offset in sync_sbs is appropriate even when external
    metadata is in use. However sync_sbs is only called when native
    metadata is used.

    So move that update in to the top of md_update_sb (which is the only
    caller of sync_sbs) before the test on ->external.

    This moves the update out of ->write_lock protection, but those fields
    only need ->reconfig_mutex protection which they still have.

    Also move the test on ->persistent up to where ->external is set as
    for metadata update purposes they are the same.

    Clear MD_CHANGE_DEVS and MD_CHANGE_CLEAN as they can only be confusing
    if ->external is set or ->persistent isn't.

    Finally move the update of ->utime down as it is only relevant (like
    the ->events update) for native metadata.

    Signed-off-by: NeilBrown
    Reported-by: "Kwolek, Adam"

    NeilBrown
     

12 Aug, 2010

31 commits

  • Enable discard support in the DM multipath target.

    This discard support depends on a few discard-specific fixes to the
    block layer's request stacking driver methods.

    Discard requests are optional so don't allow a failed discard to trigger
    path failures. If there is a real problem with a given path the
    barriers associated with the discard (either before or after the
    discard) will cause path failure. That said, unconditionally passing
    discard failures up the stack is not ideal. This must be fixed once DM
    has more information about the nature of the underlying storage failure.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Cc: Kiyoshi Ueda

    Mike Snitzer
     
  • The DM core will submit a discard bio to the stripe target for each
    stripe in a striped DM device. The stripe target will determine
    stripe-specific portions of the supplied bio to be remapped into
    individual extents (at most 'num_discard_requests' of them). If a given
    stripe-specific discard bio doesn't touch a particular stripe the bio
    will be dropped.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Update __clone_and_map_discard to loop across all targets in a DM
    device's table when it processes a discard bio. If a discard crosses a
    target boundary it must be split accordingly.

    Update __issue_target_requests and __issue_target_request to allow a
    cloned discard bio to have a custom start sector and size.
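
    A rough sketch of the resulting loop (simplified from dm.c; signatures
    approximate):

        static int __clone_and_map_discard(struct clone_info *ci)
        {
                struct dm_target *ti;
                sector_t len;

                do {
                        ti = dm_table_find_target(ci->map, ci->sector);
                        if (!dm_target_is_valid(ti))
                                return -EIO;

                        if (!ti->num_discard_requests)
                                return -EOPNOTSUPP;

                        /* never let one clone cross a target boundary */
                        len = min(ci->sector_count,
                                  max_io_len_target_boundary(ci->sector, ti));

                        __issue_target_requests(ci, ti, ti->num_discard_requests, len);

                        ci->sector += len;
                } while (ci->sector_count -= len);

                return 0;
        }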

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Optimize sector division: If the number of stripes is a power of two,
    we can do shift and mask instead of division.
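
    A minimal illustration of the optimisation ('stripes_shift' stands for a
    precomputed log2 of the stripe count; names are for the example only):

        if (stripes & (stripes - 1)) {
                /* general case: 64-bit division and remainder */
                stripe_index = sector_div(chunk, stripes);
        } else {
                /* power of two: mask and shift instead of division */
                stripe_index = chunk & (stripes - 1);
                chunk >>= stripes_shift;
        }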

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Move sector to stripe translation into a function.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Have the error target respond to a discard request with a hard -EIO
    rather than fail the request with -EOPNOTSUPP.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Enable discard support for the delay target.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Have the zero target silently drop a discard rather than fail the
    request with -EOPNOTSUPP.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Use new dm_target_offset() macro to avoid most references to ti->begin
    in dm targets.
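
    The macro just rebases a sector to the start of the target, so the
    open-coded subtraction disappears from the targets (sketch of the
    definition):

        #define dm_target_offset(ti, sector) ((sector) - (ti)->begin)

        /* e.g. "bio->bi_sector - ti->begin" becomes: */
        dm_target_offset(ti, bio->bi_sector);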

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Split max_io_len_target_boundary out of max_io_len so that the discard
    support can make use of it without duplicating max_io_len code.

    Avoiding max_io_len's split_io logic enables DM's discard support to
    submit the entire discard request to a target. But discards must still
    be split on target boundaries.
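
    A sketch of the split-out helper (details may differ slightly):

        static sector_t max_io_len_target_boundary(sector_t sector, struct dm_target *ti)
        {
                sector_t target_offset = dm_target_offset(ti, sector);

                /* room left before the end of this target */
                return ti->len - target_offset;
        }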

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Rename __flush_target to __issue_target_request now that it is used to
    issue both flush and discard requests.

    Introduce __issue_target_requests as a convenient wrapper to
    __issue_target_request 'num_flush_requests' or 'num_discard_requests'
    times per target.
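
    A sketch of the wrapper (signatures approximate):

        static void __issue_target_requests(struct clone_info *ci, struct dm_target *ti,
                                            unsigned num_requests, sector_t len)
        {
                unsigned request_nr;

                for (request_nr = 0; request_nr < num_requests; request_nr++)
                        __issue_target_request(ci, ti, request_nr, len);
        }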

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Allow discards to be passed through to linear mappings if at least one
    underlying device supports it. Discards will be forwarded only to
    devices that support them.

    A target that supports discards should set num_discard_requests to
    indicate how many times each discard request must be submitted to it.

    Verify table's underlying devices support discards prior to setting the
    associated DM device as capable of discards (via QUEUE_FLAG_DISCARD).
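
    A rough sketch of the per-device test used when deciding whether to set
    QUEUE_FLAG_DISCARD on the DM device (callback name is illustrative):

        static int device_discard_capable(struct dm_target *ti, struct dm_dev *dev,
                                          sector_t start, sector_t len, void *data)
        {
                struct request_queue *q = bdev_get_queue(dev->bdev);

                return q && blk_queue_discard(q);
        }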

    Signed-off-by: Mike Snitzer
    Signed-off-by: Mikulas Patocka
    Reviewed-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Allocate cipher strings independently of struct crypt_config and move
    cipher parsing and allocation into a separate function to prepare for
    supporting the cryptoapi format e.g. "xts(aes)".

    No functional change in this patch.

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • Use just one label and reuse common destructor for crypt target.

    Parse remaining argv arguments in logical order.

    Also do not ignore error values from IV init and set key functions.

    No functional change in this patch except changed return codes
    based on above.

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • Add devname:mapper/control and MAPPER_CTRL_MINOR module alias
    to support dm-mod module autoloading.
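
    The aliases look roughly like this in dm-ioctl.c (sketch):

        MODULE_ALIAS_MISCDEV(MAPPER_CTRL_MINOR);
        MODULE_ALIAS("devname:mapper/control");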

    Signed-off-by: Kay Sievers
    Signed-off-by: Peter Rajnoha
    Signed-off-by: Alasdair G Kergon

    Peter Rajnoha
     
  • 'target_request_nr' is a more generic name that reflects the fact that
    it will be used for both flush and discard support.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • This change unifies the various checks and finalization that occur on a
    table prior to use. By doing so, it allows table construction without
    traversing the dm-ioctl interface.

    Signed-off-by: Will Drewry
    Signed-off-by: Alasdair G Kergon

    Will Drewry
     
  • Implement merge method for the snapshot origin to improve read
    performance.

    Without a merge method, dm asks the upper layers to submit the smallest
    possible bios --- one page. Submitting such small bios impacts performance
    negatively when reading or writing the origin device.

    Without this patch, CPU consumption when reading the origin on lvm on md-raid0
    was 6 to 12%, with this patch, it drops to 1 to 4%.

    Note: in my testing, it actually degraded performance in some settings; I
    traced that to Maxtor disks having problems with > 512-sector requests.
    Setting /sys/block/sd*/queue/max_sectors_kb to 256 fixed the read
    performance. I think we don't have to care about weird
    disks that actually degrade performance because of large requests being
    sent to them.
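
    A rough sketch of such a merge method, modelled on the linear target's
    (details may differ from the actual patch):

        static int origin_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
                                struct bio_vec *biovec, int max_size)
        {
                struct dm_dev *dev = ti->private;
                struct request_queue *q = bdev_get_queue(dev->bdev);

                if (!q->merge_bvec_fn)
                        return max_size;

                bvm->bi_bdev = dev->bdev;    /* origin maps 1:1, sector unchanged */

                return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
        }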

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Change bio-based mapped devices so that they no longer have a fully
    initialized request_queue (request_fn, elevator, etc). This means bio-based
    DM devices no longer register elevator sysfs attributes ('iosched/' tree
    or 'scheduler' other than "none").

    In contrast, a request-based DM device will continue to have a full
    request_queue and will register elevator sysfs attributes. Therefore
    a user can determine a DM device's type by checking if elevator sysfs
    attributes exist.

    First allocate a minimalist request_queue structure for a DM device
    (needed for both bio and request-based DM).

    Initialization of a full request_queue is deferred until it is known
    that the DM device is request-based, at the end of the table load
    sequence.

    Factor DM device's request_queue initialization:
    - common to both request-based and bio-based into dm_init_md_queue().
    - specific to request-based into dm_init_request_based_queue().

    The md->type_lock mutex is used to protect md->queue, in addition to
    md->type, during table_load().

    A DM device's first table_load will establish the immutable md->type.
    But md->queue initialization, based on md->type, may fail at that time
    (because blk_init_allocated_queue cannot allocate memory). Therefore
    any subsequent table_load must (re)try dm_setup_md_queue independently of
    establishing md->type.
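
    A sketch of the deferred step (simplified; error reporting trimmed):

        int dm_setup_md_queue(struct mapped_device *md)
        {
                /* bio-based devices keep the minimalist queue; only
                 * request-based devices get the full initialisation */
                if (dm_get_md_type(md) == DM_TYPE_REQUEST_BASED &&
                    !dm_init_request_based_queue(md))
                        return -EINVAL;

                return 0;
        }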

    Signed-off-by: Mike Snitzer
    Acked-by: Kiyoshi Ueda
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Determine whether a mapped device is bio-based or request-based when
    loading its first (inactive) table and don't allow that to be changed
    later.

    This patch performs different device initialisation in each of the two
    cases. (We don't think it's necessary to add code to support changing
    between the two types.)

    Allowed md->type transitions:
    DM_TYPE_NONE to DM_TYPE_BIO_BASED
    DM_TYPE_NONE to DM_TYPE_REQUEST_BASED

    We now prevent table_load from replacing the inactive table with a
    conflicting type of table even after an explicit table_clear.

    Introduce 'type_lock' into the struct mapped_device to protect md->type
    and to prepare for the next patch that will change the queue
    initialization and allocate memory while md->type_lock is held.

    Signed-off-by: Mike Snitzer
    Acked-by: Kiyoshi Ueda
    Signed-off-by: Alasdair G Kergon

    drivers/md/dm-ioctl.c | 15 +++++++++++++++
    drivers/md/dm.c | 37 ++++++++++++++++++++++++++++++-------
    drivers/md/dm.h | 5 +++++
    include/linux/dm-ioctl.h | 4 ++--
    4 files changed, 52 insertions(+), 9 deletions(-)

    Mike Snitzer
     
  • When processing barriers, skip the second flush if processing the bio
    failed with -EOPNOTSUPP. This can happen with discard+barrier requests.
    If the device doesn't support discard, there would be two useless
    SYNCHRONIZE CACHE commands. The first dm_flush cannot be so easily
    optimized out, so we leave it there.

    Previously, -EOPNOTSUPP could be received in dec_pending only with empty
    barriers and we ignored that error, assuming the device not supporting
    cache flushes has cache always consistent. With the addition of discard
    barriers, this -EOPNOTSUPP can also be generated by discards and we
    must record it in md->barrier_error for process_barrier.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • This patch fixes a hard-coded value for the size of the chunk that includes
    the disk header of a persistent snapshot: use the existing macro
    NUM_SNAPSHOT_HDR_CHUNKS instead of the hard-coded value 1.

    Signed-off-by: Tomohiro Kusumi
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Tomohiro Kusumi
     
  • Use kstrdup when the goal of an allocation is to copy a string into the
    allocated region.

    The semantic patch that makes this change is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @@
    expression from,to;
    expression flag,E1,E2;
    statement S;
    @@

    - to = kmalloc(strlen(from) + 1,flag);
    + to = kstrdup(from, flag);
    ... when != \(from = E1 \| to = E1 \)
    if (to==NULL || ...) S
    ... when != \(from = E2 \| to = E2 \)
    - strcpy(to, from);
    //
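
    Concretely, the transformation turns this pattern:

        to = kmalloc(strlen(from) + 1, GFP_KERNEL);
        if (!to)
                return -ENOMEM;
        strcpy(to, from);

    into:

        to = kstrdup(from, GFP_KERNEL);
        if (!to)
                return -ENOMEM;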

    Signed-off-by: Julia Lawall
    Signed-off-by: Alasdair G Kergon

    Julia Lawall
     
  • The dm control device does not implement read/write, so it has no use for
    seeking. Using no_llseek prevents falling back to default_llseek, which
    requires the BKL.
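
    A sketch of the resulting file_operations for the control device (other
    handlers omitted):

        static const struct file_operations _ctl_fops = {
                .owner  = THIS_MODULE,
                .llseek = no_llseek,   /* never fall back to default_llseek */
        };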

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Alasdair G Kergon

    Arnd Bergmann
     
  • This patch separates the device deletion code from dm_put()
    to make sure the deletion happens in the process context.

    With this patch, device deletion always occurs in an ioctl (process)
    context and dm_put() can be called in interrupt context.
    As a result, the request-based dm's bad dm_put() usage pointed out
    by Mikulas below disappears.
    http://marc.info/?l=dm-devel&m=126699981019735&w=2

    Without this patch, I confirmed there is a case that crashes the system:
    dm_put() => dm_table_destroy() => vfree() => BUG_ON(in_interrupt())

    Some more background and details:
    In request-based dm, a device opener can remove a mapped_device
    while the last request is still completing, because bios in the last
    request complete first and then the device opener can close and remove
    the mapped_device before the last request completes:
    CPU0                                          CPU1
    =================================================================
    <>
    blk_end_request_all(clone_rq)
      blk_update_request(clone_rq)
        bio_endio(clone_bio) == end_clone_bio
          blk_update_request(orig_rq)
            bio_endio(orig_bio)
                                                  <>
                                                  dm_blk_close()
                                                  dev_remove()
                                                    dm_put(md)
                                                      <>
    blk_finish_request(clone_rq)
    ....
      dm_end_request(clone_rq)
        free_rq_clone(clone_rq)
        blk_end_request_all(orig_rq)
        rq_completed(md)

    So request-based dm used dm_get()/dm_put() to hold md for each I/O
    until its request completion handling is fully done.
    However, the final dm_put() can call the device deletion code which
    must not be run in interrupt context and may cause kernel panic.

    To solve the problem, this patch moves the device deletion code,
    dm_destroy(), to predetermined places that actually delete
    the mapped_device in ioctl (process) context, and changes dm_put()
    to just decrement the reference count of the mapped_device.
    By this change, dm_put() can be used in any context and the symmetric
    model below is introduced:
    dm_create(): create a mapped_device
    dm_destroy(): destroy a mapped_device
    dm_get(): increment the reference count of a mapped_device
    dm_put(): decrement the reference count of a mapped_device

    dm_destroy() waits for all references of the mapped_device to disappear,
    then deletes the mapped_device.

    dm_destroy() uses active waiting with msleep(1), since deleting
    the mapped_device isn't a performance-critical task.
    And since at this point, nobody opens the mapped_device and no new
    reference will be taken, the pending counts are just for racing
    completing activity and will eventually decrease to zero.

    For the unlikely case of the forced module unload, dm_destroy_immediate(),
    which doesn't wait and forcibly deletes the mapped_device, is also
    introduced and used in dm_hash_remove_all(). Otherwise, "rmmod -f"
    may be stuck and never return.
    And now, because the mapped_device is deleted at this point, subsequent
    accesses to the mapped_device may cause NULL pointer references.
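
    A rough sketch of the waiting behaviour (not the exact kernel code;
    'holders' is the count that dm_get()/dm_put() manipulate):

        static void dm_destroy_sketch(struct mapped_device *md)
        {
                /* the device is already unregistered, so no new references
                 * can appear; just wait for the existing ones to drain */
                while (atomic_read(&md->holders))
                        msleep(1);

                free_dev(md);  /* final teardown (helper name illustrative) */
        }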

    Cc: stable@kernel.org
    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • This patch changes dm_hash_remove_all() to release _hash_lock when
    removing a device. After removing the device, dm_hash_remove_all()
    takes _hash_lock and searches the hash from scratch again.

    This patch is a preparation for the next patch, which changes the device
    deletion code to wait for the md reference count to reach 0. Without this
    patch, the wait in the next patch could cause an AB-BA deadlock:
    CPU0                                          CPU1
    -----------------------------------------------------------------------
    dm_hash_remove_all()
      down_write(_hash_lock)
                                                  table_status()
                                                    md = find_device()
                                                    dm_get(md)
                                                      <md->holders incremented>
                                                    dm_get_live_or_inactive_table()
                                                      dm_get_inactive_table()
                                                        down_write(_hash_lock)

      <waiting for md->holders to be 0>
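
    A rough sketch of the restructured loop (helper names are illustrative):

        static void dm_hash_remove_all_sketch(void)
        {
                struct hash_cell *hc;

        retry:
                down_write(&_hash_lock);
                hc = find_any_device();         /* illustrative helper */
                if (hc) {
                        up_write(&_hash_lock);  /* drop the lock before the
                                                   potentially blocking removal */
                        remove_device(hc);      /* may wait for md->holders == 0 */
                        goto retry;             /* hash may have changed: rescan */
                }
                up_write(&_hash_lock);
        }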

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: stable@kernel.org
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • This patch prevents access to mapped_device which is being deleted.

    Currently, even after a mapped_device has been removed from the hash,
    it can still be accessed through idr_find() using its minor number.
    That can cause a race and a NULL pointer dereference, as shown below:
    CPU0                                          CPU1
    ------------------------------------------------------------------
    dev_remove(param)
      down_write(_hash_lock)
      dm_lock_for_deletion(md)
        spin_lock(_minor_lock)
        set_bit(DMF_DELETING)
        spin_unlock(_minor_lock)
      __hash_remove(hc)
      up_write(_hash_lock)
                                                  dev_status(param)
                                                    md = find_device(param)
                                                      down_read(_hash_lock)
                                                      __find_device_hash_cell(param)
                                                        dm_get_md(param->dev)
                                                          md = dm_find_md(dev)
                                                            spin_lock(_minor_lock)
                                                            md = idr_find(MINOR(dev))
                                                            spin_unlock(_minor_lock)
      dm_put(md)
        free_dev(md)
                                                          dm_get(md)
                                                      up_read(_hash_lock)
                                                    __dev_status(md, param)
                                                    dm_put(md)

    This patch fixes such problems.
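
    A rough sketch of the lookup-side guard (simplified):

        spin_lock(&_minor_lock);
        md = idr_find(&_minor_idr, minor);
        if (md && (test_bit(DMF_DELETING, &md->flags) ||
                   test_bit(DMF_FREEING, &md->flags)))
                md = NULL;   /* device is going away: don't hand it out */
        else if (md)
                dm_get(md);  /* take the reference while still locked */
        spin_unlock(&_minor_lock);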

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: stable@kernel.org
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • All the dm ioctls that generate uevents set the DM_UEVENT_GENERATED flag so
    that userspace knows whether or not to wait for a uevent to be processed
    before continuing.

    The dm rename ioctl sets this flag but was not structured to return it
    to userspace. This patch restructures the rename ioctl processing to
    behave like the other ioctls that return data and so fix this.

    Signed-off-by: Peter Rajnoha
    Signed-off-by: Alasdair G Kergon

    Peter Rajnoha
     
  • __dev_status() cannot fail so make it void and simplify callers.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Remove the useless __dev_status call while processing an ioctl that sets
    device geometry or sends a target message. The data is not returned to
    userspace, so there is no point collecting it; and in the case of
    target_message it is collected before the message is processed, so if it
    were returned it might be stale.

    Signed-off-by: Peter Rajnoha
    Signed-off-by: Alasdair G Kergon

    Peter Rajnoha
     
  • Validate chunk size against both origin and snapshot sector size

    Don't allow chunk size smaller than either origin or snapshot logical
    sector size. Reading or writing data not aligned to sector size is not
    allowed and causes immediate errors.

    This requires us to open the origin before initialising the
    exception store and to export dm_snap_origin.
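
    A rough sketch of the check ('origin_bdev'/'snap_bdev' are illustrative;
    chunk_size is in 512-byte sectors, bdev_logical_block_size() returns bytes):

        if (chunk_size % (bdev_logical_block_size(origin_bdev) >> 9) ||
            chunk_size % (bdev_logical_block_size(snap_bdev) >> 9)) {
                *error = "Chunk size is not a multiple of device blocksize";
                return -EINVAL;
        }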

    Cc: stable@kernel.org
    Signed-off-by: Mikulas Patocka
    Reviewed-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka