06 Oct, 2014

1 commit

  • Until this change, when loading a new DM table, DM core would re-open
    all of the devices in the DM table. Now, DM core will avoid redundant
    device opens (and closes when destroying the old table) if the old
    table already has a device open using the same mode. This is achieved
    by managing reference counts on the table_devices that DM core now
    stores in the mapped_device structure (rather than in the dm_table
    structure). So a mapped_device's active and inactive dm_tables' dm_dev
    lists now just point to the dm_devs stored in the mapped_device's
    table_devices list.

    This improvement in DM core's device reference counting has the
    side-effect of fixing a long-standing limitation of the multipath
    target: a DM multipath table couldn't include any paths that were unusable
    (failed). For example, if all paths have failed and you add a new,
    working path to the table, you cannot use it, because the table load
    would fail while the table still contains failed paths. Now a re-load of
    a multipath table can include failed devices, and when those devices
    become active again they can be used instantly.

    The device list code in dm.c isn't a straight copy/paste from the code in
    dm-table.c, but it's very close (aside from some variable renames). One
    subtle difference is that find_table_device for the table_devices list
    will only match devices with the same name and mode. This is because we
    don't want to upgrade a device's mode in the active table when an
    inactive table is loaded.

    Access to the mapped_device structure's table_devices list requires a
    mutex (table_devices_lock), so that tables cannot be created and
    destroyed concurrently. (A sketch of this reference counting follows
    this entry.)

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Mike Snitzer

    Benjamin Marzinski
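
    A minimal sketch of the reference counting described above, written as if
    inside drivers/md/dm.c (structure layout and helper names are simplified,
    not the exact code):

    /* Each opened underlying device is stored once per mapped_device and
     * shared by the active and inactive tables through a reference count. */
    struct table_device {
            struct list_head list;          /* linked on md->table_devices */
            atomic_t count;                 /* how many tables use this dm_dev */
            struct dm_dev dm_dev;           /* the open block device and mode */
    };

    /* Simplified helpers: look up / really open an entry for md. */
    static struct table_device *find_table_device(struct mapped_device *md,
                                                  dev_t dev, fmode_t mode);
    static struct table_device *open_table_device(struct mapped_device *md,
                                                  dev_t dev, fmode_t mode);

    /* Called while loading a table; md->table_devices_lock is assumed held. */
    static struct table_device *get_table_device(struct mapped_device *md,
                                                 dev_t dev, fmode_t mode)
    {
            struct table_device *td = find_table_device(md, dev, mode);

            if (td) {
                    atomic_inc(&td->count);         /* reuse the existing open */
                    return td;
            }

            td = open_table_device(md, dev, mode);  /* first user: really open */
            if (!IS_ERR(td)) {
                    atomic_set(&td->count, 1);
                    list_add(&td->list, &md->table_devices);
            }
            return td;
    }

    /* The matching put (not shown) drops td->count and closes and frees the
     * device once the last table that references it has been destroyed. */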
     

10 Nov, 2013

1 commit

  • This patch allows the removal of an open device to be deferred until
    it is closed. (Previously such a removal attempt would fail.)

    The deferred remove functionality is enabled by setting the flag
    DM_DEFERRED_REMOVE in the ioctl structure passed to the DM_DEV_REMOVE or
    DM_REMOVE_ALL ioctl. (A minimal raw-ioctl sketch follows this entry.)

    On return from DM_DEV_REMOVE, the flag DM_DEFERRED_REMOVE indicates if
    the device was removed immediately or flagged to be removed on close -
    if the flag is clear, the device was removed.

    On return from DM_DEV_STATUS and other ioctls, the flag
    DM_DEFERRED_REMOVE is set if the device is scheduled to be removed on
    closure.

    A device that is scheduled to be deleted can be revived using the
    message "@cancel_deferred_remove". This message clears the
    DMF_DEFERRED_REMOVE flag so that the device won't be deleted on close.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Mike Snitzer

    Mikulas Patocka
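
    A hypothetical userspace sketch of requesting deferred removal through
    the raw ioctl interface (dmsetup/libdevmapper normally does this for you;
    error handling is kept minimal):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/dm-ioctl.h>

    /* Returns 1 if removal was deferred, 0 if the device was removed
     * immediately, -1 on error. */
    int dm_remove_deferred(const char *name)
    {
            struct dm_ioctl io;
            int r, fd = open("/dev/mapper/control", O_RDWR);

            if (fd < 0)
                    return -1;

            memset(&io, 0, sizeof(io));
            io.version[0] = DM_VERSION_MAJOR;
            io.version[1] = DM_VERSION_MINOR;
            io.version[2] = DM_VERSION_PATCHLEVEL;
            io.data_size = sizeof(io);
            io.flags = DM_DEFERRED_REMOVE;          /* remove on last close */
            strncpy(io.name, name, sizeof(io.name) - 1);

            r = ioctl(fd, DM_DEV_REMOVE, &io);
            close(fd);
            if (r < 0)
                    return -1;

            /* Flag still set: removal deferred until close; clear: removed. */
            return (io.flags & DM_DEFERRED_REMOVE) ? 1 : 0;
    }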
     

06 Sep, 2013

4 commits

  • Support the collection of I/O statistics on user-defined regions of
    a DM device. If no regions are defined, no statistics are collected, so
    there is no performance impact. Only bio-based DM devices are
    currently supported.

    Each user-defined region specifies a starting sector, length and step.
    Individual statistics will be collected for each step-sized area within
    the range specified, as illustrated in the sketch after this entry.

    The I/O statistics counters for each step-sized area of a region are
    in the same format as /sys/block/*/stat or /proc/diskstats but extra
    counters (12 and 13) are provided: total time spent reading and
    writing in milliseconds. All these counters may be accessed by sending
    the @stats_print message to the appropriate DM device via dmsetup.

    The creation of DM statistics will allocate memory via kmalloc or
    fall back to using vmalloc space. At most, 1/4 of the overall system
    memory may be allocated by DM statistics. The admin can see how much
    memory is used by reading
    /sys/module/dm_mod/parameters/stats_current_allocated_bytes.

    See Documentation/device-mapper/statistics.txt for more details.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
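
    An illustrative bit of arithmetic showing how a region's step carves it
    into areas, each with its own set of counters (the names are made up; the
    real bookkeeping lives in the DM statistics code):

    #include <stdint.h>

    struct stats_region {
            uint64_t start;     /* first sector of the region */
            uint64_t len;       /* region length in sectors */
            uint64_t step;      /* size of each step-sized area in sectors */
    };

    /* Number of per-area counter sets kept for a region (rounded up). */
    static uint64_t n_areas(const struct stats_region *r)
    {
            return (r->len + r->step - 1) / r->step;
    }

    /* Index of the area whose counters a given sector is accounted to. */
    static uint64_t area_index(const struct stats_region *r, uint64_t sector)
    {
            return (sector - r->start) / r->step;
    }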
     
  • Make use of common cleanup code.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Hold the mapped device's type_lock before calling populate_table() since
    it is where the table's type is determined based on the specified
    targets. There is no need to allow concurrent table loads to race to
    establish the table's targets or type.

    This eliminates the need to grab the lock in dm_table_set_type(). (A
    locking sketch follows this entry.)

    Also verify that the type_lock is held in both dm_set_md_type() and
    dm_get_md_type().

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
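
    A rough sketch of the locking order described above, written as if inside
    the dm-ioctl table_load path (simplified, not the exact code):

    static int table_load_sketch(struct mapped_device *md, struct dm_table *t,
                                 struct dm_ioctl *param, size_t param_size)
    {
            int r;

            dm_lock_md_type(md);                    /* takes md->type_lock */

            /* populate_table() adds the targets and thereby determines the
             * table's type, so it must run under the same lock. */
            r = populate_table(t, param, param_size);
            if (!r) {
                    if (dm_get_md_type(md) == DM_TYPE_NONE)
                            dm_set_md_type(md, dm_table_get_type(t));
                    else if (dm_get_md_type(md) != dm_table_get_type(t))
                            r = -EINVAL;    /* don't change an established type */
            }

            dm_unlock_md_type(md);
            return r;
    }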
     
  • A device-mapper device must always have a name consisting of a non-empty
    string. If the device also has a uuid, this similarly must not be an
    empty string.

    The DM_DEV_CREATE ioctl enforces these rules when the device is created,
    but this patch is needed to enforce them when DM_DEV_RENAME is used to
    change the name or uuid.

    Reported-by: Zdenek Kabelac
    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Mike Snitzer
    Acked-by: Mikulas Patocka

    Alasdair Kergon
     

11 Jul, 2013

3 commits

  • This patch removes "io_lock" and "map_lock" in struct mapped_device and
    "holders" in struct dm_table and replaces these mechanisms with
    sleepable RCU (SRCU).

    Previously, the code would call "dm_get_live_table" and "dm_table_put" to
    get and release a table. Now, the code is changed to call "dm_get_live_table"
    and "dm_put_live_table": dm_get_live_table takes the SRCU read lock and
    dm_put_live_table releases it.

    dm_get_live_table_fast/dm_put_live_table_fast can be used instead of
    dm_get_live_table/dm_put_live_table. These *_fast functions use
    non-sleepable RCU, so the caller must not block between them.

    If the code changes the active or inactive dm table, it must call
    dm_sync_table before destroying the old table. (A usage sketch follows
    this entry.)

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
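
    A sketch of the caller pattern introduced here, written as if inside
    drivers/md/dm.c (it follows the description above):

    static void inspect_live_table(struct mapped_device *md)
    {
            struct dm_table *map;
            int srcu_idx;

            map = dm_get_live_table(md, &srcu_idx);  /* SRCU read lock; the
                                                      * caller is allowed to
                                                      * sleep while holding it */
            if (map) {
                    /* Use the table: it cannot be freed underneath us because
                     * writers call dm_sync_table() before destroying the old
                     * table. */
            }
            dm_put_live_table(md, srcu_idx);         /* SRCU read unlock */
    }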
     
  • Use __GFP_HIGHMEM in __vmalloc.

    Pages allocated with __vmalloc can be allocated in high memory that is not
    directly mapped to kernel space, so use __GFP_HIGHMEM just like vmalloc
    does. This patch reduces memory pressure slightly because pages can be
    allocated in the high zone.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Set the noio flag while calling __vmalloc(), because __vmalloc() doesn't
    fully respect gfp flags, to avoid a possible deadlock (see commit
    502624bdad3dba45dfaacaf36b7d83e39e74b2d2).

    This should be backported to stable kernels 3.8 and newer. Kernel 3.8
    doesn't have memalloc_noio_save(), so the process flag PF_MEMALLOC should
    be set and restored there instead. (The resulting allocation pattern,
    together with the __GFP_HIGHMEM change above, is sketched after this
    entry.)

    Signed-off-by: Mikulas Patocka
    Cc: stable@vger.kernel.org
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
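
    A sketch of the allocation pattern that results from this and the previous
    __vmalloc patch (the function name is made up, and the three-argument
    __vmalloc of this era is assumed):

    #include <linux/vmalloc.h>
    #include <linux/sched.h>

    static void *dm_vmalloc_params(size_t len)
    {
            unsigned noio_flag;
            void *p;

            /* Forbid I/O from any allocation __vmalloc performs internally,
             * since it does not fully honour the gfp flags it is given. */
            noio_flag = memalloc_noio_save();
            p = __vmalloc(len, GFP_NOIO | __GFP_HIGHMEM, PAGE_KERNEL);
            memalloc_noio_restore(noio_flag);

            return p;
    }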
     

02 Mar, 2013

4 commits

  • This patch introduces enhanced message support that allows the
    device-mapper core to recognise messages that are common to all devices,
    and for messages to return data to userspace.

    Core messages are processed by the function "message_for_md". If the
    device mapper doesn't support the message, it is passed to the target
    driver.

    If the message returns data, the kernel sets the flag
    DM_MESSAGE_OUT_FLAG. (A hypothetical dispatch sketch follows this entry.)

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
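
    A hypothetical sketch of the dispatch described above (the helper
    send_message_to_target and the -ENOTSUPP convention are illustrative, not
    the exact dm-ioctl.c code):

    /* The core handler named above (signature assumed here) and a made-up
     * helper that routes a message to the target owning the device. */
    static int message_for_md(struct mapped_device *md, unsigned argc,
                              char **argv, char *result, unsigned maxlen);
    static int send_message_to_target(struct mapped_device *md,
                                      unsigned argc, char **argv);

    static int dispatch_message(struct mapped_device *md, struct dm_ioctl *param,
                                unsigned argc, char **argv,
                                char *result, unsigned maxlen)
    {
            int r;

            /* Messages that DM core understands for every device. */
            r = message_for_md(md, argc, argv, result, maxlen);
            if (r == -ENOTSUPP)
                    /* Not a core message: hand it to the target driver. */
                    r = send_message_to_target(md, argc, argv);

            if (!r && result[0])
                    param->flags |= DM_MESSAGE_OUT_FLAG;  /* data for userspace */

            return r;
    }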
     
  • Device-mapper ioctls receive and send data in a buffer supplied
    by userspace. The buffer has two parts. The first part contains
    a 'struct dm_ioctl' and has a fixed size. The second part depends
    on the ioctl and has a variable size.

    This patch recognises the specific ioctls that do not use the variable
    part of the buffer and skips allocating memory for it.

    In particular, when a device is suspended and a resume ioctl is sent,
    this now avoids memory allocation completely.

    The variable "struct dm_ioctl tmp" is moved from the function
    copy_params to its caller ctl_ioctl and renamed to param_kernel.
    It is used directly when the ioctl function doesn't need any arguments.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • This patch introduces flags for each ioctl function.

    So far, one flag is defined, IOCTL_FLAGS_NO_PARAMS. It is set if the
    function processing the ioctl neither takes nor produces any parameters
    in the variable-size section of the data buffer. (A sketch combining this
    entry and the previous one follows.)

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
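
    A sketch combining this entry and the previous one (the table layout and
    helper signatures are simplified, not the real dm-ioctl.c code):

    #include <linux/dm-ioctl.h>
    #include <linux/err.h>
    #include <linux/uaccess.h>

    struct ioctl_entry {
            unsigned int cmd;
            int flags;                      /* e.g. IOCTL_FLAGS_NO_PARAMS */
            int (*fn)(struct dm_ioctl *param, size_t param_size);
    };

    /* Simplified stand-ins for copy_params()/free_params(). */
    static struct dm_ioctl *copy_params_sketch(struct dm_ioctl __user *user,
                                               struct dm_ioctl *param_kernel);
    static void release_params_sketch(struct dm_ioctl *param);

    static int ctl_ioctl_sketch(const struct ioctl_entry *e,
                                struct dm_ioctl __user *user)
    {
            struct dm_ioctl param_kernel;   /* fixed-size part, on the stack */
            struct dm_ioctl *param = &param_kernel;
            int r;

            if (copy_from_user(&param_kernel, user, sizeof(param_kernel)))
                    return -EFAULT;

            if (!(e->flags & IOCTL_FLAGS_NO_PARAMS)) {
                    /* The ioctl uses the variable part: allocate a buffer big
                     * enough for data_size and copy the whole thing. */
                    param = copy_params_sketch(user, &param_kernel);
                    if (IS_ERR(param))
                            return PTR_ERR(param);
            }

            r = e->fn(param, param->data_size);

            if (param != &param_kernel)
                    release_params_sketch(param);
            return r;
    }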
     
  • Avoid returning a truncated table or status string instead of setting
    the DM_BUFFER_FULL_FLAG when the last target of a table fills the
    buffer.

    When processing a table or status request, the function retrieve_status
    calls ti->type->status. If ti->type->status returns non-zero,
    retrieve_status assumes that the buffer overflowed and sets
    DM_BUFFER_FULL_FLAG.

    However, targets don't return non-zero values from their status method
    on overflow. Most targets always return zero.

    If a buffer overflow happens in a target that is not the last in the
    table, it gets noticed during the next iteration of the loop in
    retrieve_status; but if a buffer overflow happens in the last target, it
    goes unnoticed and erroneously truncated data is returned.

    In the current code, the targets behave in the following way:
    * dm-crypt returns -ENOMEM if there is not enough space to store the
    key, but it returns 0 on all other overflows.
    * dm-thin returns errors from the status method if a disk error happened.
    This is incorrect because retrieve_status doesn't check the error
    code; it assumes that all non-zero values mean buffer overflow.
    * all the other targets always return 0.

    This patch changes the ti->type->status function to return void (because
    most targets don't use the return code). Overflow is detected in
    retrieve_status: if the status method fills up the remaining space
    completely, a buffer overflow is assumed. (See the sketch after this
    entry.)

    Cc: stable@vger.kernel.org
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
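
    A sketch of the overflow convention after this change (simplified; the
    real loop in retrieve_status iterates over every target in the table):

    #include <linux/device-mapper.h>
    #include <linux/dm-ioctl.h>
    #include <linux/string.h>

    static void mark_buffer_full(struct dm_ioctl *param);  /* hypothetical:
                                                             * sets DM_BUFFER_FULL_FLAG */

    static void one_target_status(struct dm_target *ti, struct dm_ioctl *param,
                                  status_type_t type, unsigned status_flags,
                                  char *result, unsigned maxlen)
    {
            /* ti->type->status now returns void; it only writes into result. */
            ti->type->status(ti, type, status_flags, result, maxlen);

            /* Heuristic from the patch: if the target used every last byte of
             * the remaining space, assume its output was truncated. */
            if (strlen(result) + 1 >= maxlen)
                    mark_buffer_full(param);
    }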
     

22 Dec, 2012

3 commits

  • If the parameter buffer is small enough, try to allocate it with kmalloc()
    rather than vmalloc().

    vmalloc is noticeably slower than kmalloc because it has to manipulate
    page tables.

    In my tests, on PA-RISC this patch speeds up activation 13 times.
    On Opteron this patch speeds up activation by 5%.

    This patch introduces a new function free_params() to free the
    parameters, along with new flags that record whether vmalloc() was used
    and whether the input buffer must be wiped after use. (The size-based
    allocator choice is sketched after this entry.)

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
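
    A sketch of the size-based allocator choice (the threshold and the flag
    names are illustrative, not the exact dm-ioctl.c values):

    #include <linux/slab.h>
    #include <linux/string.h>
    #include <linux/vmalloc.h>

    #define PARAM_FLAG_VMALLOC 0x1          /* records which allocator was used */

    static void *alloc_params_sketch(size_t size, int *param_flags)
    {
            void *p = NULL;

            *param_flags = 0;
            if (size <= PAGE_SIZE)          /* small buffer: fast kmalloc path */
                    p = kmalloc(size, GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN);
            if (!p) {                       /* large buffer, or kmalloc failed */
                    p = __vmalloc(size, GFP_NOIO, PAGE_KERNEL);
                    *param_flags |= PARAM_FLAG_VMALLOC;
            }
            return p;
    }

    static void free_params_sketch(void *p, size_t size, int param_flags, int wipe)
    {
            if (wipe)                       /* e.g. when DM_SECURE_DATA_FLAG is set */
                    memset(p, 0, size);
            if (param_flags & PARAM_FLAG_VMALLOC)
                    vfree(p);
            else
                    kfree(p);
    }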
     
  • When allocating memory for the userspace ioctl data, set some
    appropriate GFP flags directly instead of using PF_MEMALLOC.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Abort dm ioctl processing if userspace changes the data_size parameter
    after we validated it but before we finished copying the data buffer
    from userspace.

    The dm ioctl parameters are processed in the following sequence:
    1. ctl_ioctl() calls copy_params();
    2. copy_params() makes a first copy of the fixed-sized portion of the
    userspace parameters into the local variable "tmp";
    3. copy_params() then validates tmp.data_size and allocates a new
    structure big enough to hold the complete data and copies the whole
    userspace buffer there;
    4. ctl_ioctl() reads the userspace data a second time and copies the whole
    buffer into the pointer "param";
    5. ctl_ioctl() reads param->data_size without any validation and stores it
    in the variable "input_param_size";
    6. "input_param_size" is further used as the authoritative size of the
    kernel buffer.

    The problem is that userspace code could change the contents of user
    memory between steps 2 and 4. In particular, the data_size parameter
    can be changed to an invalid value after the kernel has validated it.
    This lets userspace force the kernel to access invalid kernel memory.

    The fix is to ensure that the size has not changed at step 4. (A sketch
    of this check follows this entry.)

    This patch shouldn't have a security impact because CAP_SYS_ADMIN is
    required to run this code, but it should be fixed anyway.

    Reported-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon
    Cc: stable@kernel.org

    Alasdair G Kergon
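
    A sketch of the check added at step 4 (names follow the numbered
    description above; the real check lives inside the parameter-copying
    path):

    #include <linux/dm-ioctl.h>
    #include <linux/errno.h>

    /* 'param' is the full second copy, 'param_kernel' the first, already
     * validated copy of the fixed-size part. */
    static int check_data_size(const struct dm_ioctl *param,
                               const struct dm_ioctl *param_kernel)
    {
            if (param->data_size != param_kernel->data_size)
                    return -EINVAL;         /* buffer changed while being copied */
            return 0;
    }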
     

27 Jul, 2012

1 commit

  • Commit outstanding metadata before returning the status for a dm thin
    pool so that the numbers reported are as up-to-date as possible.

    The commit is not performed if the device is suspended or if
    the DM_NOFLUSH_FLAG is supplied by userspace and passed to the target
    through a new 'status_flags' parameter in the target's dm_status_fn.

    The userspace dmsetup tool will support the --noflush flag with the
    'dmsetup status' and 'dmsetup wait' commands from version 1.02.76
    onwards.

    Tested-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     

29 Mar, 2012

1 commit

  • Device mapper uses sscanf to convert arguments to numbers. The problem is that
    the way we use it ignores additional unmatched characters in the scanned string.

    For example, this `if (sscanf(string, "%d", &number) == 1)' will match a
    number, but it will also match a number with some garbage appended, like
    "123abc".

    As a result, device mapper accepts garbage after some numbers. For example
    the command `dmsetup create vg1-new --table "0 16384 linear 254:1bla 34816bla"'
    will pass without an error.

    This patch fixes all sscanf uses in device mapper. It appends "%c" with
    a pointer to a dummy character variable to every sscanf statement.

    The construct `if (sscanf(string, "%d%c", &number, &dummy) == 1)' succeeds
    only if string is a null-terminated number (optionally preceded by some
    whitespace characters). If there is some character appended after the number,
    sscanf matches "%c", writes the character to the dummy variable and returns 2.
    We check the return value for 1 and consequently reject numbers with some
    garbage appended. (A self-contained example follows this entry.)

    Signed-off-by: Mikulas Patocka
    Acked-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
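
    A self-contained illustration of the pattern (ordinary userspace C; sscanf
    behaves the same way in the kernel):

    #include <stdio.h>

    /* Returns 1 and stores the value if 'string' is exactly one number,
     * 0 if anything else (including trailing garbage) is present. */
    static int parse_exact_int(const char *string, int *number)
    {
            char dummy;

            /* "%c" only matches if something follows the number, which makes
             * sscanf return 2 instead of 1. */
            return sscanf(string, "%d%c", number, &dummy) == 1;
    }

    int main(void)
    {
            int n;

            printf("%d\n", parse_exact_int("123", &n));     /* 1: accepted */
            printf("%d\n", parse_exact_int("123abc", &n));  /* 0: rejected */
            printf("%d\n", parse_exact_int(" 42", &n));     /* 1: leading space ok */
            return 0;
    }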
     

08 Mar, 2012

1 commit


01 Nov, 2011

1 commit

  • Introduce DM_TARGET_IMMUTABLE to indicate that the target type cannot be mixed
    with any other target type, and once loaded into a device, it cannot be
    replaced with a table containing a different type.

    The thin provisioning pool device will use this. (An illustrative target
    declaration follows this entry.)

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
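
    An illustrative target declaration using the new flag (the target itself
    is made up; the thin provisioning pool is the real first user):

    #include <linux/device-mapper.h>
    #include <linux/module.h>

    static int example_pool_ctr(struct dm_target *ti, unsigned argc, char **argv);
    static void example_pool_dtr(struct dm_target *ti);

    static struct target_type example_pool_target = {
            .name     = "example-pool",
            .version  = {1, 0, 0},
            .module   = THIS_MODULE,
            .features = DM_TARGET_IMMUTABLE,  /* cannot be mixed with other
                                               * target types or replaced by a
                                               * table of a different type */
            .ctr      = example_pool_ctr,
            .dtr      = example_pool_dtr,
            /* .map, .status, ... as for any other target */
    };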
     

02 Aug, 2011

4 commits

  • Exactly one of name, uuid or device must be specified when referencing
    an existing device. This removes the ambiguity (risking the wrong
    device being updated) when two conflicting parameters are specified.
    Previously one parameter was used and any others were silently ignored.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Move logic to find device based on major/minor number to a separate
    function __get_dev_cell (similar to __get_uuid_cell and __get_name_cell).
    This makes the function __find_device_hash_cell more straightforward.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Move parameter filling from find_device to __find_device_hash_cell.

    This patch causes ioctls using __find_device_hash_cell
    (DM_DEV_REMOVE_CMD, DM_DEV_SUSPEND_CMD - resume, DM_TABLE_CLEAR_CMD)
    to return device parameters, bringing them into line with the other
    ioctls.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Detect invalid empty messages in core dm instead of requiring every target to
    check this.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     

24 Mar, 2011

2 commits

  • Add DM_SECURE_DATA_FLAG which userspace can use to ensure
    that all buffers allocated for dm-ioctl are wiped
    immediately after use.

    The user buffer is wiped as well (we do not want to return sensitive
    data back to userspace if the flag is set).

    Wiping is useful for cryptsetup to ensure that the key
    is present in memory only in defined places and only
    for the time needed.

    (For dm-crypt, the key can be present in the table during load and
    during the table status, wait and message commands.)

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • Prepare code for implementing buffer wipe flag.
    No functional change in this patch.

    Signed-off-by: Milan Broz
    Acked-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     

14 Jan, 2011

2 commits

  • The device-mapper should not send warning messages to syslog
    if a device is not found; userspace can report this itself based on
    the returned dm-ioctl error code.

    So move these messages to debug level and use rate limiting
    so they do not flood syslog.

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • Allow the uuid of a mapped device to be set after device creation.
    Previously the uuid (which is optional) could only be set by
    DM_DEV_CREATE. If no uuid was supplied it could not be set later.

    Sometimes it's necessary to create the device before the uuid is known,
    and in such cases the uuid must be filled in after the creation.

    This patch extends DM_DEV_RENAME to accept a uuid accompanied by
    a new flag DM_UUID_FLAG. This can be done only once, and only if no
    uuid was previously supplied. It cannot be used to change an
    existing uuid.

    DM_VERSION_MINOR is also bumped to 19 to indicate this interface
    extension is available.

    Signed-off-by: Peter Jones
    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Peter Jones
     

15 Oct, 2010

1 commit

  • All file_operations should get a .llseek operation so we can make
    nonseekable_open the default for future file operations without a
    .llseek pointer.

    The three cases that we can automatically detect are no_llseek, seq_lseek
    and default_llseek. For cases where we can automatically prove that
    the file offset is always ignored, we use noop_llseek, which maintains
    the current behavior of not returning an error from a seek.

    New drivers should normally not use noop_llseek but instead use no_llseek
    and call nonseekable_open at open time. Existing drivers can be converted
    to do the same when the maintainer knows for certain that no user code
    relies on calling seek on the device file.

    The generated code is often incorrectly indented and right now contains
    comments that clarify for each added line why a specific variant was
    chosen. In the version that gets submitted upstream, the comments will
    be gone and I will manually fix the indentation, because there does not
    seem to be a way to do that using coccinelle.

    Some amount of new code is currently sitting in linux-next that should get
    the same modifications, which I will do at the end of the merge window.

    Many thanks to Julia Lawall for helping me learn to write a semantic
    patch that does all this.

    ===== begin semantic patch =====
    // This adds an llseek= method to all file operations,
    // as a preparation for making no_llseek the default.
    //
    // The rules are
    // - use no_llseek explicitly if we do nonseekable_open
    // - use seq_lseek for sequential files
    // - use default_llseek if we know we access f_pos
    // - use noop_llseek if we know we don't access f_pos,
    // but we still want to allow users to call lseek
    //
    @ open1 exists @
    identifier nested_open;
    @@
    nested_open(...)
    {
    <+...
    nonseekable_open(...)
    ...+>
    }

    @ open exists@
    identifier open_f;
    identifier i, f;
    identifier open1.nested_open;
    @@
    int open_f(struct inode *i, struct file *f)
    {
    <+...
    nested_open(...)
    ...+>
    }

    @ read disable optional_qualifier exists @
    identifier read_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    expression E;
    identifier func;
    @@
    ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
    {
    <+...
    (
       *off = E
    |
       *off += E
    |
       func(..., off, ...)
    |
       E = *off
    )
    ...+>
    }

    @ read_no_fpos disable optional_qualifier exists @
    identifier read_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    @@
    ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
    {
    ... when != off
    }

    @ write @
    identifier write_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    expression E;
    identifier func;
    @@
    ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
    {
    <+...
    (
       *off = E
    |
       *off += E
    |
       func(..., off, ...)
    |
       E = *off
    )
    ...+>
    }

    @ write_no_fpos @
    identifier write_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    @@
    ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
    {
    ... when != off
    }

    @ fops0 @
    identifier fops;
    @@
    struct file_operations fops = {
    ...
    };

    @ has_llseek depends on fops0 @
    identifier fops0.fops;
    identifier llseek_f;
    @@
    struct file_operations fops = {
    ...
    .llseek = llseek_f,
    ...
    };

    @ has_read depends on fops0 @
    identifier fops0.fops;
    identifier read_f;
    @@
    struct file_operations fops = {
    ...
    .read = read_f,
    ...
    };

    @ has_write depends on fops0 @
    identifier fops0.fops;
    identifier write_f;
    @@
    struct file_operations fops = {
    ...
    .write = write_f,
    ...
    };

    @ has_open depends on fops0 @
    identifier fops0.fops;
    identifier open_f;
    @@
    struct file_operations fops = {
    ...
    .open = open_f,
    ...
    };

    // use no_llseek if we call nonseekable_open
    ////////////////////////////////////////////
    @ nonseekable1 depends on !has_llseek && has_open @
    identifier fops0.fops;
    identifier nso ~= "nonseekable_open";
    @@
    struct file_operations fops = {
    ... .open = nso, ...
    +.llseek = no_llseek, /* nonseekable */
    };

    @ nonseekable2 depends on !has_llseek @
    identifier fops0.fops;
    identifier open.open_f;
    @@
    struct file_operations fops = {
    ... .open = open_f, ...
    +.llseek = no_llseek, /* open uses nonseekable */
    };

    // use seq_lseek for sequential files
    /////////////////////////////////////
    @ seq depends on !has_llseek @
    identifier fops0.fops;
    identifier sr ~= "seq_read";
    @@
    struct file_operations fops = {
    ... .read = sr, ...
    +.llseek = seq_lseek, /* we have seq_read */
    };

    // use default_llseek if there is a readdir
    ///////////////////////////////////////////
    @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier readdir_e;
    @@
    // any other fop is used that changes pos
    struct file_operations fops = {
    ... .readdir = readdir_e, ...
    +.llseek = default_llseek, /* readdir is present */
    };

    // use default_llseek if at least one of read/write touches f_pos
    /////////////////////////////////////////////////////////////////
    @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read.read_f;
    @@
    // read fops use offset
    struct file_operations fops = {
    ... .read = read_f, ...
    +.llseek = default_llseek, /* read accesses f_pos */
    };

    @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier write.write_f;
    @@
    // write fops use offset
    struct file_operations fops = {
    ... .write = write_f, ...
    + .llseek = default_llseek, /* write accesses f_pos */
    };

    // Use noop_llseek if neither read nor write accesses f_pos
    ///////////////////////////////////////////////////////////

    @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read_no_fpos.read_f;
    identifier write_no_fpos.write_f;
    @@
    // write fops use offset
    struct file_operations fops = {
    ...
    .write = write_f,
    .read = read_f,
    ...
    +.llseek = noop_llseek, /* read and write both use no f_pos */
    };

    @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier write_no_fpos.write_f;
    @@
    struct file_operations fops = {
    ... .write = write_f, ...
    +.llseek = noop_llseek, /* write uses no f_pos */
    };

    @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read_no_fpos.read_f;
    @@
    struct file_operations fops = {
    ... .read = read_f, ...
    +.llseek = noop_llseek, /* read uses no f_pos */
    };

    @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    @@
    struct file_operations fops = {
    ...
    +.llseek = noop_llseek, /* no read or write fn */
    };
    ===== End semantic patch =====

    Signed-off-by: Arnd Bergmann
    Cc: Julia Lawall
    Cc: Christoph Hellwig

    Arnd Bergmann
     

12 Aug, 2010

10 commits

  • Add devname:mapper/control and MAPPER_CTRL_MINOR module aliases
    to support dm-mod module autoloading. (See the sketch after this entry.)

    Signed-off-by: Kay Sievers
    Signed-off-by: Peter Rajnoha
    Signed-off-by: Alasdair G Kergon

    Peter Rajnoha
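
    The aliases take roughly the following form (MAPPER_CTRL_MINOR comes from
    linux/miscdevice.h, DM_DIR and DM_CONTROL_NODE from linux/dm-ioctl.h;
    shown as a sketch rather than a quote of dm-ioctl.c):

    #include <linux/module.h>
    #include <linux/miscdevice.h>
    #include <linux/dm-ioctl.h>

    /* Lets udev create /dev/mapper/control up front and lets the kernel
     * autoload dm-mod the first time that node is opened. */
    MODULE_ALIAS_MISCDEV(MAPPER_CTRL_MINOR);
    MODULE_ALIAS("devname:" DM_DIR "/" DM_CONTROL_NODE);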
     
  • This change unifies the various checks and finalization that occur on a
    table prior to use. By doing so, it allows table construction without
    traversing the dm-ioctl interface.

    Signed-off-by: Will Drewry
    Signed-off-by: Alasdair G Kergon

    Will Drewry
     
  • Change bio-based mapped devices so they no longer have a fully initialized
    request_queue (request_fn, elevator, etc). This means bio-based DM
    devices no longer register elevator sysfs attributes (the 'iosched/' tree
    or a 'scheduler' value other than "none").

    In contrast, a request-based DM device will continue to have a full
    request_queue and will register elevator sysfs attributes. Therefore
    a user can determine a DM device's type by checking if elevator sysfs
    attributes exist.

    First allocate a minimalist request_queue structure for a DM device
    (needed for both bio and request-based DM).

    Initialization of a full request_queue is deferred until it is known
    that the DM device is request-based, at the end of the table load
    sequence.

    Factor DM device's request_queue initialization:
    - common to both request-based and bio-based into dm_init_md_queue().
    - specific to request-based into dm_init_request_based_queue().

    The md->type_lock mutex is used to protect md->queue, in addition to
    md->type, during table_load().

    A DM device's first table_load will establish the immutable md->type.
    But md->queue initialization, based on md->type, may fail at that time
    (because blk_init_allocated_queue cannot allocate memory). Therefore
    any subsequent table_load must (re)try dm_setup_md_queue independently of
    establishing md->type.

    Signed-off-by: Mike Snitzer
    Acked-by: Kiyoshi Ueda
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Determine whether a mapped device is bio-based or request-based when
    loading its first (inactive) table and don't allow that to be changed
    later.

    This patch performs different device initialisation in each of the two
    cases. (We don't think it's necessary to add code to support changing
    between the two types.)

    Allowed md->type transitions:
    DM_TYPE_NONE to DM_TYPE_BIO_BASED
    DM_TYPE_NONE to DM_TYPE_REQUEST_BASED

    We now prevent table_load from replacing the inactive table with a
    conflicting type of table even after an explicit table_clear.

    Introduce 'type_lock' into the struct mapped_device to protect md->type
    and to prepare for the next patch that will change the queue
    initialization and allocate memory while md->type_lock is held.

    Signed-off-by: Mike Snitzer
    Acked-by: Kiyoshi Ueda
    Signed-off-by: Alasdair G Kergon

    drivers/md/dm-ioctl.c | 15 +++++++++++++++
    drivers/md/dm.c | 37 ++++++++++++++++++++++++++++++-------
    drivers/md/dm.h | 5 +++++
    include/linux/dm-ioctl.h | 4 ++--
    4 files changed, 52 insertions(+), 9 deletions(-)

    Mike Snitzer
     
  • The dm control device does not implement read/write, so it has no use for
    seeking. Using no_llseek prevents falling back to default_llseek, which
    requires the BKL.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Alasdair G Kergon

    Arnd Bergmann
     
  • This patch separates the device deletion code from dm_put()
    to make sure the deletion happens in process context.

    With this patch, device deletion always occurs in an ioctl (process)
    context and dm_put() can be called in interrupt context.
    As a result, the request-based dm's bad dm_put() usage pointed out
    by Mikulas below disappears.
    http://marc.info/?l=dm-devel&m=126699981019735&w=2

    Without this patch, I confirmed there is a case that crashes the system:
    dm_put() => dm_table_destroy() => vfree() => BUG_ON(in_interrupt())

    Some more backgrounds and details:
    In request-based dm, a device opener can remove a mapped_device
    while the last request is still completing, because bios in the last
    request complete first and then the device opener can close and remove
    the mapped_device before the last request completes:
    CPU0                                          CPU1
    =================================================================
    <<interrupt>>
    blk_end_request_all(clone_rq)
      blk_update_request(clone_rq)
        bio_endio(clone_bio) == end_clone_bio
          blk_update_request(orig_rq)
            bio_endio(orig_bio)
                                                  <<I/O completed>>
                                                  dm_blk_close()
                                                  dev_remove()
                                                    dm_put(md)
                                                      <<free md>>
    blk_finish_request(clone_rq)
      ....
      dm_end_request(clone_rq)
        free_rq_clone(clone_rq)
        blk_end_request_all(orig_rq)
        rq_completed(md)

    So request-based dm used dm_get()/dm_put() to hold md for each I/O
    until its request completion handling is fully done.
    However, the final dm_put() can call the device deletion code which
    must not be run in interrupt context and may cause kernel panic.

    To solve the problem, this patch moves the device deletion code,
    dm_destroy(), to predetermined places that actually delete
    the mapped_device in ioctl (process) context, and changes dm_put()
    just to decrement the reference count of the mapped_device.
    By this change, dm_put() can be used in any context and the symmetric
    model below is introduced:
    dm_create(): create a mapped_device
    dm_destroy(): destroy a mapped_device
    dm_get(): increment the reference count of a mapped_device
    dm_put(): decrement the reference count of a mapped_device

    dm_destroy() waits for all references of the mapped_device to disappear,
    then deletes the mapped_device.

    dm_destroy() uses active waiting with msleep(1), since deleting
    the mapped_device isn't a performance-critical task.
    And since at this point nobody opens the mapped_device and no new
    reference will be taken, the pending counts are just due to racing
    completion activity and will eventually decrease to zero.

    For the unlikely case of a forced module unload, dm_destroy_immediate(),
    which doesn't wait and forcibly deletes the mapped_device, is also
    introduced and used in dm_hash_remove_all(). Otherwise, "rmmod -f"
    may get stuck and never return.
    And now, because the mapped_device is deleted at this point, subsequent
    accesses to the mapped_device may cause NULL pointer references.

    Cc: stable@kernel.org
    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • This patch changes dm_hash_remove_all() to release _hash_lock when
    removing a device. After removing the device, dm_hash_remove_all()
    takes _hash_lock and searches the hash from scratch again.

    This patch is a preparation for the next patch, which changes the device
    deletion code to wait for the md reference count to reach 0. Without this
    patch, the wait in the next patch may cause an AB-BA deadlock:
    CPU0                                    CPU1
    -----------------------------------------------------------------------
    dm_hash_remove_all()
      down_write(_hash_lock)
                                            table_status()
                                              md = find_device()
                                                dm_get(md)
                                              <increment md->holders>
                                              dm_get_live_or_inactive_table()
                                                dm_get_inactive_table()
                                                  down_write(_hash_lock)

      <waiting for md->holders to be 0>

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Cc: stable@kernel.org
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • All the dm ioctls that generate uevents set the DM_UEVENT_GENERATED flag so
    that userspace knows whether or not to wait for a uevent to be processed
    before continuing.

    The dm rename ioctl sets this flag but was not structured to return it
    to userspace. This patch restructures the rename ioctl processing to
    behave like the other ioctls that return data and so fixes this.

    Signed-off-by: Peter Rajnoha
    Signed-off-by: Alasdair G Kergon

    Peter Rajnoha
     
  • __dev_status() cannot fail so make it void and simplify callers.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Remove the useless __dev_status call while processing the ioctls that set
    device geometry and send a target message. The data is not returned to
    userspace, so there is no point collecting it, and in the case of
    target_message it is collected before processing the message, so if it
    were returned it might be stale.

    Signed-off-by: Peter Rajnoha
    Signed-off-by: Alasdair G Kergon

    Peter Rajnoha
     

06 Mar, 2010

1 commit