11 Jun, 2016

1 commit

  • Add some separation between bio-based and request-based DM core code.

    'struct mapped_device' and other DM core only structures and functions
    have been moved to dm-core.h and all relevant DM core .c files have been
    updated to include dm-core.h rather than dm.h.

    DM targets should _never_ include dm-core.h!
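
    As a rough illustration of the split (a simplified sketch, not the
    actual header contents): the private definition now lives in
    dm-core.h, which only DM core files include, while targets continue
    to see only the opaque type through include/linux/device-mapper.h.

    /* dm-core.h -- DM core only (sketch, fields trimmed) */
    struct mapped_device {
            struct request_queue *queue;
            /* ... bio-based and request-based internals ... */
    };

    /* include/linux/device-mapper.h -- what targets may use */
    struct mapped_device;                   /* opaque to targets */
    void dm_put(struct mapped_device *md);  /* accessor-style API only */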

    [block core merge conflict resolution from Stephen Rothwell]
    Signed-off-by: Mike Snitzer
    Signed-off-by: Stephen Rothwell

    Mike Snitzer
     

11 Mar, 2016

1 commit

  • smq seems to be performing better than the old mq policy in all
    situations, as well as using a quarter of the memory.

    Make 'mq' an alias for 'smq' when choosing a cache policy. The tunables
    that were present for the old mq are faked, and have no effect. mq
    should be considered deprecated now.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Joe Thornber
     

10 Dec, 2015

2 commits

  • Add support for correcting corrupted blocks using Reed-Solomon.

    This code uses RS(255, N) interleaved across data and hash
    blocks. Each error-correcting block covers N bytes evenly
    distributed across the combined total data, so that each byte is a
    maximum distance away from the others. This makes it possible to
    recover from several consecutive corrupted blocks with relatively
    small space overhead.

    In addition, using verity hashes to locate erasures nearly doubles
    the effectiveness of error correction. Being able to detect
    corrupted blocks also improves performance, because only corrupted
    blocks need to be corrected.

    For a 2 GiB partition, RS(255, 253) (two parity bytes for each
    253-byte block) can correct up to 16 MiB of consecutive corrupted
    blocks if erasures can be located, and 8 MiB if they cannot, with
    16 MiB space overhead.
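
    A back-of-the-envelope check of those numbers (a hypothetical
    standalone program, not dm-verity code), assuming a 2 GiB data area
    and RS(255, 253): a burst shorter than the interleaving stride hits
    each codeword at most once, which is correctable even without
    erasure information; with erasures located, two bytes per codeword
    can be corrected, doubling the tolerable burst.

    #include <stdio.h>

    int main(void)
    {
            const double MiB = 1024.0 * 1024.0;
            double data = 2048.0 * MiB;          /* 2 GiB of covered data  */
            double parity = data * 2.0 / 253.0;  /* 2 parity bytes per 253 */
            double stride = data / 253.0;        /* distance between bytes
                                                    of one codeword after
                                                    interleaving           */

            printf("space overhead:          ~%.1f MiB\n", parity / MiB);
            printf("burst, erasures unknown: ~%.1f MiB\n", stride / MiB);
            printf("burst, erasures located: ~%.1f MiB\n", 2.0 * stride / MiB);
            return 0;
    }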

    Signed-off-by: Sami Tolvanen
    Signed-off-by: Mike Snitzer

    Sami Tolvanen
     
  • Prepare for extending dm-verity with an optional object. Follows the
    naming convention used by other DM targets (e.g. dm-cache and dm-era).

    Signed-off-by: Sami Tolvanen
    Signed-off-by: Mike Snitzer

    Sami Tolvanen
     

24 Oct, 2015

1 commit

  • This introduces a simple log for raid5. Data/parity writes to the raid
    array first go to the log, then to the raid array disks. If a crash
    happens, we can recover data from the log. This can speed up raid
    resync and fix the write hole issue.

    The log structure is pretty simple. Data/metadata is stored in block
    units, generally 4k. There is only one type of metadata block. A
    metadata block can track 3 types of data: stripe data, stripe parity
    and flush blocks. The MD superblock points to the last valid metadata
    block. Each metadata block has a checksum/sequence number, so recovery
    can scan the log correctly. We store a checksum of the stripe
    data/parity in the metadata block, so metadata and stripe data/parity
    can be written to the log disk together; otherwise, the metadata write
    would have to wait until the stripe data/parity is finished.

    For stripe data, the metadata block records the stripe data sector and
    size. Currently the size is always 4k. This metadata record could be
    made simpler if we only wanted to fix the write hole (e.g., we could
    record the data of a stripe's different disks together), but this
    format can be extended to support caching in the future, which must
    record data address/size.

    For stripe parity, the metadata block records the stripe sector. Its
    size should be 4k (for raid5) or 8k (for raid6). We always store the P
    parity first. This format should work for caching too.

    A flush block indicates that a stripe is on the raid array disks.
    Fixing the write hole doesn't need this type of metadata; it is for the
    caching extension.
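
    A purely illustrative sketch of such a metadata block (names and
    layout are hypothetical and do not match the real on-disk format; it
    only mirrors the description above):

    #include <linux/types.h>

    enum log_record_type {
            LOG_STRIPE_DATA,    /* sector + size (currently always 4k)       */
            LOG_STRIPE_PARITY,  /* sector; 4k (raid5) or 8k (raid6), P first */
            LOG_FLUSH,          /* stripe has reached the raid array disks;
                                   only needed by the caching extension      */
    };

    struct log_meta_block {     /* one 4k block in the log */
            __le32 checksum;    /* covers this block and the data/parity
                                   written to the log together with it       */
            __le64 seq;         /* monotonically increasing, so recovery can
                                   scan forward from the last valid block
                                   pointed to by the MD superblock           */
            /* followed by packed records, each tagged with a
               log_record_type and carrying sector/size/checksum as needed   */
    };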

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     

12 Jun, 2015

1 commit

  • The stochastic-multi-queue (smq) policy addresses some of the problems
    with the current multiqueue (mq) policy.

    Memory usage
    ------------

    The mq policy uses a lot of memory; 88 bytes per cache block on a 64
    bit machine.

    SMQ uses 28-bit indexes to implement its data structures rather than
    pointers. It avoids storing an explicit hit count for each block. It
    has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
    the entries (each hotspot block covers a larger area than a single
    cache block).

    All this means smq uses ~25 bytes per cache block. Still a lot of
    memory, but a substantial improvement nonetheless.

    Level balancing
    ---------------

    MQ places entries in different levels of the multiqueue structures
    based on their hit count (~ln(hit count)). This means the bottom
    levels generally have the most entries, and the top ones have very
    few. Having unbalanced levels like this reduces the efficacy of the
    multiqueue.

    SMQ does not maintain a hit count; instead it swaps hit entries with
    the least recently used entry from the level above. The overall
    ordering is a side effect of this stochastic process. With this
    scheme we can decide how many entries occupy each multiqueue level,
    resulting in better promotion/demotion decisions.
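
    A minimal sketch of that swap (illustrative only; the real smq code
    uses 28-bit indexes and its own queue structures rather than plain
    list_head pointers):

    #include <linux/list.h>

    struct entry {
            struct list_head lru;   /* position within its level, MRU at head */
            unsigned level;
    };

    /* 'levels' is an array of nr_levels list heads, coldest level first. */
    static void requeue_on_hit(struct list_head *levels, unsigned nr_levels,
                               struct entry *e)
    {
            if (e->level + 1 < nr_levels && !list_empty(&levels[e->level + 1])) {
                    /* Demote the least recently used entry of the level above... */
                    struct entry *victim = list_last_entry(&levels[e->level + 1],
                                                           struct entry, lru);
                    victim->level--;
                    list_move(&victim->lru, &levels[victim->level]);

                    /* ...and promote the entry that was just hit. */
                    e->level++;
            }
            list_move(&e->lru, &levels[e->level]);
    }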

    Adaptability
    ------------

    The MQ policy maintains a hit count for each cache block. For a
    different block to get promoted to the cache, its hit count has to
    exceed the lowest currently in the cache. This means it can take a
    long time for the cache to adapt between varying IO patterns.
    Periodically degrading the hit counts could help with this, but I
    haven't found a nice general solution.

    SMQ doesn't maintain hit counts, so a lot of this problem just goes
    away. In addition it tracks performance of the hotspot queue, which
    is used to decide which blocks to promote. If the hotspot queue is
    performing badly then it starts moving entries more quickly between
    levels. This lets it adapt to new IO patterns very quickly.

    Performance
    -----------

    In my tests SMQ shows substantially better performance than MQ. Once
    this matures a bit more I'm sure it'll become the default policy.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Joe Thornber
     

25 Apr, 2015

1 commit

  • Pull md updates from Neil Brown:
    "More updates that usual this time. A few have performance impacts
    which hould mostly be positive, but RAID5 (in particular) can be very
    work-load ensitive... We'll have to wait and see.

    Highlights:

    - "experimental" code for managing md/raid1 across a cluster using
    DLM. Code is not ready for general use and triggers a WARNING if
    used. However it is looking good and mostly done, and having it in
    mainline will help co-ordinate development.

    - RAID5/6 can now batch multiple (4K wide) stripe_heads so as to
    handle a full (chunk wide) stripe as a single unit.

    - RAID6 can now perform read-modify-write cycles which should help
    performance on larger arrays: 6 or more devices.

    - RAID5/6 stripe cache now grows and shrinks dynamically. The value
    set is used as a minimum.

    - Resync is now allowed to go a little faster than the 'minimum' when
    there is competing IO. How much faster depends on the speed of the
    devices, so the effective minimum should scale with device speed to
    some extent"

    * tag 'md/4.1' of git://neil.brown.name/md: (58 commits)
    md/raid5: don't do chunk aligned read on degraded array.
    md/raid5: allow the stripe_cache to grow and shrink.
    md/raid5: change ->inactive_blocked to a bit-flag.
    md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe
    md/raid5: pass gfp_t arg to grow_one_stripe()
    md/raid5: introduce configuration option rmw_level
    md/raid5: activate raid6 rmw feature
    md/raid6 algorithms: xor_syndrome() for SSE2
    md/raid6 algorithms: xor_syndrome() for generic int
    md/raid6 algorithms: improve test program
    md/raid6 algorithms: delta syndrome functions
    raid5: handle expansion/resync case with stripe batching
    raid5: handle io error of batch list
    RAID5: batch adjacent full stripe write
    raid5: track overwrite disk count
    raid5: add a new flag to track if a stripe can be batched
    raid5: use flex_array for scribble data
    md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid
    md: allow resync to go faster when there is competing IO.
    md: remove 'go_faster' option from ->sync_request()
    ...

    Linus Torvalds
     

16 Apr, 2015

1 commit

  • Introduce a new target that is meant for file system developers to test file
    system integrity at particular points in the life of a file system. We capture
    all write requests and associated data and log them to a separate device
    for later replay. There is a userspace utility to do this replay. The
    idea behind this is to give file system developers a tool to verify that
    the file system is always consistent.

    Signed-off-by: Josef Bacik
    Reviewed-by: Zach Brown
    Signed-off-by: Mike Snitzer

    Josef Bacik
     

23 Feb, 2015

1 commit


28 Mar, 2014

1 commit

  • dm-era is a target that behaves similarly to the linear target. In
    addition it keeps track of which blocks were written within a user
    defined period of time called an 'era'. Each era target instance
    maintains the current era as a monotonically increasing 32-bit
    counter.

    Use cases include tracking changed blocks for backup software, and
    partially invalidating the contents of a cache to restore cache
    coherency after rolling back a vendor snapshot.

    dm-era is primarily expected to be paired with the dm-cache target.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Joe Thornber
     

15 Jan, 2014

1 commit

  • This reverts commit be35f48610 ("dm: wait until embedded kobject is
    released before destroying a device") and provides an improved fix.

    The kobject release code that calls the completion must be placed in a
    non-module file, otherwise there is a module unload race (if the process
    calling dm_kobject_release is preempted and the DM module unloaded after
    the completion is triggered, but before dm_kobject_release returns).

    To fix this race, this patch moves the completion code to dm-builtin.c
    which is always compiled directly into the kernel if BLK_DEV_DM is
    selected.

    The patch introduces a new dm_kobject_holder structure; its purpose is
    to keep the completion and kobject in one place, so that they can be
    accessed from non-module code without the need to export the layout of
    struct mapped_device to that code.
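
    The holder itself is small; roughly (simplified, and the accessor
    name below is only illustrative):

    #include <linux/kernel.h>
    #include <linux/kobject.h>
    #include <linux/completion.h>

    struct dm_kobject_holder {
            struct kobject kobj;
            struct completion completion;
    };

    /* Built-in (non-module) code can reach the completion from the
       kobject without knowing the layout of struct mapped_device: */
    static inline struct completion *completion_from_kobject(struct kobject *kobj)
    {
            return &container_of(kobj, struct dm_kobject_holder, kobj)->completion;
    }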

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org

    Mikulas Patocka
     

06 Sep, 2013

1 commit

  • Support the collection of I/O statistics on user-defined regions of
    a DM device. If no regions are defined no statistics are collected so
    there isn't any performance impact. Only bio-based DM devices are
    currently supported.

    Each user-defined region specifies a starting sector, length and step.
    Individual statistics will be collected for each step-sized area within
    the range specified.

    The I/O statistics counters for each step-sized area of a region are
    in the same format as /sys/block/*/stat or /proc/diskstats but extra
    counters (12 and 13) are provided: total time spent reading and
    writing in milliseconds. All these counters may be accessed by sending
    the @stats_print message to the appropriate DM device via dmsetup.

    The creation of DM statistics will allocate memory via kmalloc or
    fall back to using vmalloc space. At most, 1/4 of the overall system
    memory may be allocated by DM statistics. The admin can see how much
    memory is used by reading
    /sys/module/dm_mod/parameters/stats_current_allocated_bytes.

    See Documentation/device-mapper/statistics.txt for more details.
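
    For reference, a sketch of the per-area counters in that format
    (field names invented for this illustration; the kernel returns them
    as one text line per area rather than a struct):

    struct area_counters {
            unsigned long long reads_completed;      /*  1 */
            unsigned long long reads_merged;         /*  2 */
            unsigned long long sectors_read;         /*  3 */
            unsigned long long read_time_ms;         /*  4 */
            unsigned long long writes_completed;     /*  5 */
            unsigned long long writes_merged;        /*  6 */
            unsigned long long sectors_written;      /*  7 */
            unsigned long long write_time_ms;        /*  8 */
            unsigned long long ios_in_progress;      /*  9 */
            unsigned long long io_time_ms;           /* 10 */
            unsigned long long weighted_io_time_ms;  /* 11 */
            unsigned long long total_read_time_ms;   /* 12: extra counter */
            unsigned long long total_write_time_ms;  /* 13: extra counter */
    };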

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

11 Jul, 2013

1 commit

  • dm-switch is a new target that maps IO to underlying block devices
    efficiently when there is a large number of fixed-sized address regions
    but there is no simple pattern that would allow for a compact mapping
    representation such as dm-stripe's.

    Though we have developed this target for a specific storage device, Dell
    EqualLogic, we have made an effort to keep it as general purpose as
    possible in the hope that others may benefit.

    Originally developed by Jim Ramsay. Simplified by Mikulas Patocka.

    Signed-off-by: Jim Ramsay
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Jim Ramsay
     

24 Mar, 2013

1 commit

  • Does writethrough and writeback caching, handles unclean shutdown, and
    has a bunch of other nifty features motivated by real world usage.

    See the wiki at http://bcache.evilpiepirate.org for more.

    Signed-off-by: Kent Overstreet

    Kent Overstreet
     

02 Mar, 2013

3 commits

  • A simple cache policy that writes back all data to the origin.

    This is used to decommission a dm cache by emptying it.

    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Heinz Mauelshagen
     
  • A cache policy that uses a multiqueue ordered by recent hit
    count to select which blocks should be promoted and demoted.
    This is meant to be a general purpose policy. It prioritises
    reads over writes.

    Signed-off-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • Add a target that allows a fast device such as an SSD to be used as a
    cache for a slower device such as a disk.

    A plug-in architecture was chosen so that the decisions about which data
    to migrate and when are delegated to interchangeable tunable policy
    modules. The first general purpose module we have developed, called
    "mq" (multiqueue), follows in the next patch. Other modules are
    under development.

    Signed-off-by: Joe Thornber
    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     

13 Oct, 2012

1 commit


29 Mar, 2012

1 commit

  • This device-mapper target creates a read-only device that transparently
    validates the data on one underlying device against a pre-generated tree
    of cryptographic checksums stored on a second device.

    Two checksum device formats are supported: version 0 which is already
    shipping in Chromium OS and version 1 which incorporates some
    improvements.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mandeep Singh Baines
    Signed-off-by: Will Drewry
    Signed-off-by: Elly Jones
    Cc: Milan Broz
    Cc: Olof Johansson
    Cc: Steffen Klassert
    Cc: Andrew Morton
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

01 Nov, 2011

2 commits

  • Initial EXPERIMENTAL implementation of device-mapper thin provisioning
    with snapshot support. The 'thin' target is used to create instances of
    the virtual devices that are hosted in the 'thin-pool' target. The
    thin-pool target provides data sharing among devices. This sharing is
    made possible using the persistent-data library in the previous patch.

    The main highlight of this implementation, compared to the previous
    implementation of snapshots, is that it allows many virtual devices to
    be stored on the same data volume, simplifying administration and
    allowing sharing of data between volumes (thus reducing disk usage).

    Another big feature is support for arbitrary depth of recursive
    snapshots (snapshots of snapshots of snapshots ...). The previous
    implementation of snapshots did this by chaining together lookup tables,
    and so performance was O(depth). This new implementation uses a single
    data structure so we don't get this degradation with depth.

    For further information and examples of how to use this, please read
    Documentation/device-mapper/thin-provisioning.txt

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • The dm-bufio interface allows you to do cached I/O on devices,
    holding recently-read blocks in memory and performing delayed writes.

    We don't use buffer cache or page cache already present in the kernel, because:
    * we need to handle block sizes larger than a page
    * we can't allocate memory to perform reads or we'd have deadlocks

    Currently, when a cache is required, we limit its size to a fraction of
    available memory. Usage can be viewed and changed in
    /sys/module/dm_bufio/parameters/.

    The first user is thin provisioning, but more dm users are planned.
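
    A minimal usage sketch (error handling trimmed, assuming an
    already-created client 'c'; see dm-bufio.h for the full interface):

    #include "dm-bufio.h"
    #include <linux/err.h>
    #include <linux/string.h>

    static int zero_start_of_block(struct dm_bufio_client *c, sector_t block)
    {
            struct dm_buffer *b;
            void *data;

            data = dm_bufio_read(c, block, &b);     /* read through the cache */
            if (IS_ERR(data))
                    return PTR_ERR(data);

            memset(data, 0, 8);                     /* modify the cached copy */
            dm_bufio_mark_buffer_dirty(b);          /* queue a delayed write  */
            dm_bufio_release(b);

            return dm_bufio_write_dirty_buffers(c); /* force writeback now    */
    }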

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

24 Mar, 2011

1 commit

  • This target is the same as the linear target except that it returns I/O
    errors periodically. It's been found useful in simulating failing
    devices for testing purposes.

    I needed a dm target to do some failure testing on btrfs's raid code, and
    Mike pointed me at this.

    Signed-off-by: Josef Bacik
    Signed-off-by: Alasdair G Kergon

    Josef Bacik
     

14 Jan, 2011

1 commit

  • This patch is the skeleton for the DM target that will be
    the bridge from DM to MD (initially RAID456 and later RAID1). It
    provides a way to use device-mapper interfaces to the MD RAID456
    drivers.

    As with all device-mapper targets, the nominal public interfaces are the
    constructor (CTR) tables and the status outputs (both STATUSTYPE_INFO
    and STATUSTYPE_TABLE). The CTR table looks like the following:

    1: <start> <length> raid \
    2:     <raid_type> <#raid_params> <raid_params> \
    3:     <#raid_devs> <meta_dev1> <dev1> .. <meta_devN> <devN>

    Line 1 contains the standard first three arguments to any device-mapper
    target - the start, length, and target type fields. The target type in
    this case is "raid".

    Line 2 contains the arguments that define the particular raid
    type/personality/level, the required arguments for that raid type, and
    any optional arguments. Possible raid types include: raid4, raid5_la,
    raid5_ls, raid5_rs, raid6_zr, raid6_nr, and raid6_nc. (again, raid1 is
    planned for the future.) The list of required and optional parameters
    is the same for all the current raid types. The required parameters are
    positional, while the optional parameters are given as key/value pairs.
    The possible parameters are as follows:
    <chunk_size>                       Chunk size in sectors
    [[no]sync]                         Force/Prevent RAID initialization
    [rebuild <idx>]                    Rebuild the drive indicated by the index
    [daemon_sleep <ms>]                Time between bitmap daemon work to clear bits
    [min_recovery_rate <kB/sec/disk>]  Throttle RAID initialization
    [max_recovery_rate <kB/sec/disk>]  Throttle RAID initialization
    [max_write_behind <sectors>]       See '--write-behind=' (man mdadm)
    [stripe_cache <sectors>]           Stripe cache size for higher RAIDs

    Line 3 contains the list of devices that compose the array in
    metadata/data device pairs. If the metadata is stored separately, a '-'
    is given for the metadata device position. If a drive has failed or is
    missing at creation time, a '-' can be given for both the metadata and
    data drives for a given position.

    Examples:
    # RAID4 - 4 data drives, 1 parity
    # No metadata devices specified to hold superblock/bitmap info
    # Chunk size of 1MiB
    # (Lines separated for easy reading)
    0 1960893648 raid \
    raid4 1 2048 \
    5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

    # RAID4 - 4 data drives, 1 parity (no metadata devices)
    # Chunk size of 1MiB, force RAID initialization,
    # min recovery rate at 20 kiB/sec/disk
    0 1960893648 raid \
    raid4 4 2048 min_recovery_rate 20 sync \
    5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

    Performing a 'dmsetup table' should display the CTR table used to
    construct the mapping (with possible reordering of optional
    parameters).

    Performing a 'dmsetup status' will yield information on the state and
    health of the array. The output is as follows:
    1: <start> <length> raid \
    2:     <raid_type> <#devices> <1 health char per device> <resync_ratio>

    Line 1 is standard DM output. Line 2 is best shown by example:
    0 1960893648 raid raid4 5 AAAAA 2/490221568
    Here we can see the RAID type is raid4, there are 5 devices - all of
    which are 'A'live, and the array is 2/490221568 complete with recovery.

    Cc: linux-raid@vger.kernel.org
    Signed-off-by: NeilBrown
    Signed-off-by: Jonathan Brassow
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    NeilBrown
     

09 Aug, 2010

1 commit


29 Oct, 2009

1 commit


16 Oct, 2009

1 commit


22 Jun, 2009

3 commits

  • This patch contains a device-mapper mirror log module that forwards
    requests to userspace for processing.

    The structures used for communication between kernel and userspace are
    located in include/linux/dm-log-userspace.h. Due to the frequency,
    diversity, and 2-way communication nature of the exchanges between
    kernel and userspace, 'connector' was chosen as the interface for
    communication.

    The first log implementations written in userspace - "clustered-disk"
    and "clustered-core" - support clustered shared storage. A userspace
    daemon (in the LVM2 source code repository) uses openAIS/corosync to
    process requests in an ordered fashion with the rest of the nodes in the
    cluster so as to prevent log state corruption. Other implementations
    with no association to LVM or openAIS/corosync are certainly possible.

    (Imagine if two machines are writing to the same region of a mirror.
    They would both mark the region dirty, but you need a cluster-aware
    entity that can handle properly marking the region clean when they are
    done. Otherwise, you might clear the region when the first machine is
    done, not the second.)

    Signed-off-by: Jonathan Brassow
    Cc: Evgeniy Polyakov
    Signed-off-by: Alasdair G Kergon

    Jonthan Brassow
     
  • This patch adds a service time oriented dynamic load balancer,
    dm-service-time, which selects the path with the shortest estimated
    service time for the incoming I/O.
    The service time is estimated by dividing the in-flight I/O size
    by a performance value of each path.

    The performance value can be given as a table argument at the table
    loading time. If no performance value is given, all paths are
    considered equal.
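
    The core of the selection rule, reduced to a comparison (illustrative
    only; the real selector in dm-service-time also handles repeat_count
    and the case where no performance value is given):

    struct path_info {
            unsigned long long in_flight_size;  /* bytes currently in flight */
            unsigned relative_throughput;       /* performance value from the
                                                   table line                 */
    };

    /* Prefer 'a' over 'b' when its estimated service time
       (in_flight_size / relative_throughput) is smaller.  Cross-multiply
       to avoid integer division. */
    static int prefer_path(const struct path_info *a, const struct path_info *b)
    {
            return a->in_flight_size * b->relative_throughput <
                   b->in_flight_size * a->relative_throughput;
    }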

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • This patch adds a dynamic load balancer, dm-queue-length, which
    balances the number of in-flight I/Os across the paths.

    The code is based on the patch posted by Stefan Bader:
    https://www.redhat.com/archives/dm-devel/2005-October/msg00050.html

    Signed-off-by: Stefan Bader
    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     

31 Mar, 2009

2 commits

  • Move the raid6 data processing routines into a standalone module
    (raid6_pq) to prepare them to be called from async_tx wrappers and other
    non-md drivers/modules. This precludes a circular dependency of raid456
    needing the async modules for data processing while those modules in
    turn depend on raid456 for the base level synchronous raid6 routines.

    To support this move:
    1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h
    2/ The raid6_call, recovery calls, and table symbols are exported
    3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to
    compile

    Signed-off-by: Dan Williams
    Signed-off-by: NeilBrown

    Dan Williams
     
  • Use the -y variables instead of the old -objs so we can easily add
    conditional objects to the modules. Also always use += to add
    subobjects to avoid problems when placing additional objects in
    some place in the file.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: NeilBrown

    Christoph Hellwig
     

06 Jan, 2009

2 commits

  • Move the existing snapshot exception store implementations out into
    separate files. Later patches will place these behind a new
    interface in preparation for alternative implementations.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Implement simple read-only sysfs entry for device-mapper block device.

    This patch adds a simple sysfs directory named "dm" under block device
    properties and implements
    - name attribute (string containing mapped device name)
    - uuid attribute (string containing UUID, or empty string if not set)

    The kobject is embedded in the mapped_device struct, so no additional
    memory allocation is needed for initializing the sysfs entry.

    During the processing of a sysfs attribute we need to lock the mapped
    device; this is done by a new function, dm_get_from_kobj, which returns
    the md associated with the kobject and increases the usage count.

    Each 'show attribute' function is responsible for its own locking.
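
    Roughly, the locking helper works as sketched below (simplified; the
    name of the embedded kobject field is illustrative):

    struct mapped_device *dm_get_from_kobj(struct kobject *kobj)
    {
            /* The kobject is embedded in struct mapped_device, so
               container_of() recovers the owning device... */
            struct mapped_device *md = container_of(kobj, struct mapped_device,
                                                    kobj);

            /* ...and the usage count is taken before the attribute
               handler dereferences it. */
            dm_get(md);
            return md;
    }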

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     

22 Oct, 2008

1 commit


05 Jun, 2008

2 commits


25 Apr, 2008

2 commits


20 Oct, 2007

2 commits