27 Jun, 2006

40 commits

  • The md/dev-XXX/state file can now be written:

    "faulty" simulates an error on the device
    "remove" removes the device from the array (if it is not busy)

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
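
As a rough illustration of the commit above, the sketch below writes one of the two accepted words to a member device's state file. It assumes the file lives under /sys/block/<array>/md/dev-<device>/state; the array name "md0" and device name "sda1" used in the usage note are hypothetical, and the helper names are not part of any kernel or mdadm API.

```python
from pathlib import PurePosixPath

# Words the commit message says can be written to md/dev-XXX/state.
DEVICE_STATE_COMMANDS = ("faulty", "remove")


def device_state_path(array, device, sysfs_root="/sys/block"):
    """Build the path of the per-device state file (assumed sysfs layout)."""
    return PurePosixPath(sysfs_root) / array / "md" / f"dev-{device}" / "state"


def write_device_state(array, device, command, sysfs_root="/sys/block"):
    """Write "faulty" or "remove" to the state file of one member device."""
    if command not in DEVICE_STATE_COMMANDS:
        raise ValueError(f"unsupported command: {command!r}")
    path = device_state_path(array, device, sysfs_root)
    with open(path, "w") as handle:
        handle.write(command)
    return path
```

For example, write_device_state("md0", "sda1", "faulty") would simulate an error on that device, and a later write of "remove" would detach it if it is not busy. Both require root privileges on a real system.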
     
  • This allows the state of an md/array to be directly controlled via sysfs and
    adds the ability to stop an array without tearing it down.

    Array states/settings:

    clear
        No devices, no size, no level
        Equivalent to STOP_ARRAY ioctl
    inactive
        May have some settings, but array is not active
        all IO results in error
        When written, doesn't tear down array, but just stops it
    suspended (not supported yet)
        All IO requests will block. The array can be reconfigured.
        Writing this, if accepted, will block until array is quiescent
    readonly
        no resync can happen. no superblocks get written.
        write requests fail
    read-auto
        like readonly, but behaves like 'clean' on a write request.

    clean - no pending writes, but otherwise active.
        When written to inactive array, starts without resync
        If a write request arrives then
            if metadata is known, mark 'dirty' and switch to 'active'.
            if not known, block and switch to write-pending
        If written to an active array that has pending writes, then fails.
    active
        fully active: IO and resync can be happening.
        When written to inactive array, starts with resync

    write-pending (not supported yet)
        clean, but writes are blocked waiting for 'active' to be written.

    active-idle
        like active, but no writes have been seen for a while (100msec).

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
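
The state list in the commit above can be read as a small table of what happens when a write request arrives. The sketch below is only a model of that description, not the kernel's state machine. The outcomes for "clear" and "active-idle" are assumptions (the message does not spell them out), and the function name is hypothetical.

```python
# States named in the commit message.
ARRAY_STATES = (
    "clear", "inactive", "suspended", "readonly", "read-auto",
    "clean", "active", "write-pending", "active-idle",
)

# States the message marks as "not supported yet".
NOT_YET_SUPPORTED = frozenset({"suspended", "write-pending"})


def on_write_request(state, metadata_known=True):
    """Return what a write request leads to when the array is in `state`.

    The result is "error", "blocked", or the name of the resulting state.
    """
    if state not in ARRAY_STATES:
        raise ValueError(f"unknown array state: {state!r}")
    if state in ("clear", "inactive", "readonly"):
        # inactive: all IO results in error; readonly: write requests fail.
        # Treating "clear" the same way is an assumption.
        return "error"
    if state in ("suspended", "write-pending"):
        # Both descriptions say requests block.
        return "blocked"
    if state in ("read-auto", "clean"):
        # read-auto behaves like 'clean' on a write request.
        return "active" if metadata_known else "write-pending"
    # "active" stays active; "active-idle" is assumed to return to active.
    return "active"
```

Under this model a write to a 'clean' array with known metadata yields 'active', while the same write with unknown metadata yields 'write-pending'.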
     
  • - record the 'event' count on each individual device (they
    might sometimes be slightly different now)
    - add a new value for 'sb_dirty': '3' means that the super
    block only needs to be updated to record a clean<->dirty
    transition.
    - Prefer odd event numbers for dirty states and even numbers
    for clean states
    - Using all the above, don't update the superblock on
    a spare device if the update is just doing a clean-dirty
    transition. To accommodate this, a transition from
    dirty back to clean might now decrement the events counter
    if nothing else has changed.

    The net effect of this is that spare drives will not see any IO requests
    during normal running of the array, so they can go to sleep if that is what
    they want to do.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
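
The event-count rules in the commit above (odd for dirty, even for clean, with a possible decrement on a pure dirty-to-clean change) can be sketched as follows. This is a simplified model written from the message alone; the function names and the boolean parameters are hypothetical and do not mirror the kernel's variables.

```python
def next_event_count(events, becoming_clean, only_clean_dirty_change):
    """Return the event count to record for the next superblock update.

    becoming_clean: True if the array will be clean after the update.
    only_clean_dirty_change: True if nothing else changed, i.e. the
    update only records a clean<->dirty transition.
    """
    if only_clean_dirty_change and becoming_clean and events % 2 == 1:
        # Dirty back to clean with nothing else changed: step back to the
        # previous (even) count, which the spares still carry.
        return events - 1
    events += 1
    want_odd = not becoming_clean
    if (events % 2 == 1) != want_odd:
        # Skip one value so parity matches the state (odd = dirty).
        events += 1
    return events


def spare_needs_update(only_clean_dirty_change):
    """Spares are skipped when the update is only a clean<->dirty change."""
    return not only_clean_dirty_change
```

Starting from a clean array at count 10, going dirty gives 11 and returning to clean gives 10 again, so a spare that still holds 10 never had to be written.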
     
  • When an array has a bitmap, a device can be removed and re-added and only
    blocks changed since the removal (as recorded in the bitmap) will be resynced.

    It should be possible to do a similar thing to arrays without bitmaps. i.e.
    if a device is removed and re-added and *no* changes have been made in the
    interim, then the add should not require a resync.

    This patch allows that option. This means that when assembling an array one
    device at a time (e.g. during device discovery) the array can be enabled
    read-only as soon as enough devices are available, but extra devices can still
    be added without causing a resync.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • As data_disks is *less* than raid_disks, the current test here is obviously
    wrong. And as the difference is already available in conf->max_degraded, it
    makes much more sense to use that.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • RAID5 recently changed to RAID456

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
     
  • I was experimenting with Linux SW raid today and found a spelling error when
    reading the help menus... (and fly spell found more).

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Justin Piszcz
     
    The size calculation made assumptions which the new offset mode didn't
    follow. This gets the size right in all cases.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Fix problems with new bmap based access to bitmap files.

    1/ When not using a file based bitmap, attach a NULL list of buffers
    to each page so the common free_buffer routine can cope.
    2/ Use submit_bh to read as well as write, rather than vfs_read.
    This makes read and write more symmetric.
    3/ sync the file before reading, to ensure that the page cache has no
    dirty pages that might get written out later.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • If md is asked to store a bitmap in a file, it tries to hold onto the page
    cache pages for that file, manipulate them directly, and call a cocktail of
    operations to write the file out. I don't believe this is a supportable
    approach.

    This patch changes the approach to use the same approach as swap files. i.e.
    bmap is used to enumerate all the block addresses of parts of the file and we
    write directly to those blocks of the device.

    swapfile only uses parts of the file that provide a full page at contiguous
    addresses. We don't have that luxury so we have to cope with pages that are
    non-contiguous in storage. To handle this we attach buffers to each page, and
    store the addresses in those buffers.

    With this approach the pagecache may contain data which is inconsistent with
    what is on disk. To alleviate the problems this can cause, md invalidates the
    pagecache when releasing the file. If the file is to be examined while the
    array is active (a non-critical but occasionally useful function), O_DIRECT io
    must be used. A new version of mdadm will have support for this.

    This approach simplifies a lot of code:
    - we no longer need to keep a list of pages which we need to wait for,
    as the b_endio function can keep track of how many outstanding
    writes there are. This saves a mempool.
    - -EAGAIN returns from write_page are no longer possible (not sure if
    they ever were actually).

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • md/bitmap modifies i_writecount of a bitmap file to make sure that no-one else
    writes to it. The reverting of the change is sometimes done twice, and there
    is one error path where it is omitted.

    This patch tidies that up.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • bitmap_active is never called, and the BITMAP_ACTIVE flag is never used or
    tested, so discard them both.

    Also remove some out-of-date 'todo' comments.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • md/bitmap gets a collection of pages representing the bitmap when it
    initialises the bitmap, and puts all the references when discarding the
    bitmap.

    It also occasionally takes extra references without any good reason, and
    sometimes drops them ... though it doesn't always drop them, which can result
    in a memory leak.

    This patch removes the unnecessary 'get_page' calls, and the corresponding
    'put_page' calls.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • In particular, this means that we use 4 bits per page instead of a whole
    unsigned long.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • md/bitmap has some attributes per-page. Handling of these attributes is
    largely abstracted in set_page_attr and clear_page_attr. However
    get_page_attr exposes the format used to store them. So prior to changing
    that format, introduce test_page_attr instead of get_page_attr, and make
    appropriate usage changes.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
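
To illustrate why hiding the storage format behind set/clear/test helpers matters, here is a small model that packs 4 attribute bits per page into a byte array, in the spirit of the two commits above. The class name, the attribute numbering, and the packing order are assumptions made for this sketch; only the helper names set_page_attr, clear_page_attr and test_page_attr come from the commit message.

```python
ATTR_BITS_PER_PAGE = 4  # per the commit: 4 bits per page


class PageAttrs:
    """Per-page attribute flags packed 4 bits per page (illustrative)."""

    def __init__(self, pages):
        self.pages = pages
        # Round up to whole bytes.
        self.bits = bytearray((pages * ATTR_BITS_PER_PAGE + 7) // 8)

    def _locate(self, page, attr):
        """Return (byte index, bit mask) for one attribute of one page."""
        if not 0 <= page < self.pages or not 0 <= attr < ATTR_BITS_PER_PAGE:
            raise IndexError("page or attribute out of range")
        bit = page * ATTR_BITS_PER_PAGE + attr
        return bit >> 3, 1 << (bit & 7)

    def set_page_attr(self, page, attr):
        byte, mask = self._locate(page, attr)
        self.bits[byte] |= mask

    def clear_page_attr(self, page, attr):
        byte, mask = self._locate(page, attr)
        self.bits[byte] &= ~mask & 0xFF

    def test_page_attr(self, page, attr):
        # Callers only learn "set or not", never the storage layout.
        byte, mask = self._locate(page, attr)
        return bool(self.bits[byte] & mask)
```

Because callers only ever ask test_page_attr a yes/no question, the packing can change (for example from one word per page to 4 bits per page) without touching them.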
     
  • md/bitmap currently has a separate thread to wait for writes to the bitmap
    file to complete (as we cannot get a callback on that action).

    However this isn't needed as bitmap_unplug is called from process context and
    waits for the writeback thread to do its work. The same result can be
    achieved by doing the waiting directly in bitmap_unplug.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • When "mdadm --grow /dev/mdX --bitmap=none" is used to remove a file-backed
    bitmap, the bitmap was disconnected from the array, but the file wasn't closed
    (until the array was stopped).

    The file also wasn't closed if adding the bitmap file failed.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • ... as raid5 sync_request is WAY too big.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • This patch makes the needlessly global md_print_devices() static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • The "industry standard" DDF format allows for a stripe/offset layout where
    data is duplicated on different stripes. e.g.

    A B C D
    D A B C
    E F G H
    H E F G

    (columns are drives, rows are stripes, LETTERS are chunks of data).

    This is similar to raid10's 'far' mode, but not quite the same. So enhance
    'far' mode with a 'far/offset' option which follows the layout of DDF's
    stripe/offset.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
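
The stripe/offset diagram in the commit above can be reproduced with a few lines of code. The sketch below only models that diagram (each extra copy of a stripe is written on the next row, rotated by one drive); it is not the raid10 address-mapping code, and the function name and parameters are hypothetical.

```python
import string


def offset_layout(drives, stripes, copies=2):
    """Return rows of chunk labels; columns are drives, rows are stripes."""
    labels = string.ascii_uppercase
    if drives * stripes > len(labels):
        raise ValueError("too many chunks to label with single letters")
    rows = []
    for stripe in range(stripes):
        data = [labels[stripe * drives + d] for d in range(drives)]
        for copy in range(copies):
            row = [None] * drives
            for d, chunk in enumerate(data):
                # Copy number `copy` is shifted right by `copy` drives.
                row[(d + copy) % drives] = chunk
            rows.append(row)
    return rows


if __name__ == "__main__":
    for row in offset_layout(drives=4, stripes=2):
        print(" ".join(row))
```

With 4 drives and 2 stripes this prints the same four rows as the diagram: A B C D, D A B C, E F G H, H E F G.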
     
  • Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • For a while we have had checkpointing of resync. The version-1 superblock
    allows recovery to be checkpointed as well, and this patch implements that.

    Due to early carelessness we need to add a feature flag to signal that the
    recovery_offset field is in use, otherwise older kernels would assume that a
    partially recovered array is in fact fully recovered.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • At shutdown, we switch all arrays to read-only, which creates a message for
    every instantiated array, even those which aren't actually active.

    So remove the message for non-active arrays.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • There is a lot of commonality between raid5.c and raid6main.c. This patch
    merges both into one module called raid456. This saves a lot of code, and
    paves the way for online raid5->raid6 migrations.

    There is still duplication, e.g. between handle_stripe5 and handle_stripe6.
    This will probably be cleaned up later.

    Cc: "H. Peter Anvin"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • When a md array has been idle (no writes) for 20msecs it is marked as 'clean'.
    This delay turns out to be too short for some real workloads. So increase it
    to 200msec (the time to update the metadata should be a tiny fraction of that)
    and make it sysfs-configurable.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • This warning was slightly useful back in 2.2 days, but is more an annoyance
    now. It makes it awkward to add new ioctls (not that we are likely to do that
    in the current climate, but it is possible).

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The largest chunk size the code can support without substantial surgery is
    2^30 bytes, so make that the limit instead of an arbitrary 4Meg. Some day,
    the 'chunksize' should change to a sector-shift instead of a byte-count. Then
    no limit would be needed.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
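
The remark about switching 'chunksize' from a byte count to a sector shift can be made concrete with a little arithmetic. The sketch below assumes 512-byte sectors and power-of-two chunk sizes; both are assumptions of this example, and the names are hypothetical.

```python
SECTOR_SIZE = 512                       # assumed sector size in bytes
OLD_MAX_CHUNK_BYTES = 4 * 1024 * 1024   # the former "arbitrary 4Meg" limit
MAX_CHUNK_BYTES = 1 << 30               # the new limit from the commit


def chunk_sector_shift(chunk_bytes):
    """Return s such that chunk_bytes == SECTOR_SIZE << s."""
    if chunk_bytes < SECTOR_SIZE or chunk_bytes & (chunk_bytes - 1):
        raise ValueError("chunk size must be a power of two >= one sector")
    if chunk_bytes > MAX_CHUNK_BYTES:
        raise ValueError("chunk size exceeds 2^30 bytes")
    return (chunk_bytes // SECTOR_SIZE).bit_length() - 1
```

A 64 KiB chunk corresponds to a shift of 7 and the 2^30-byte maximum to a shift of 21, so a shift stored in even a single byte could describe chunk sizes far beyond what a byte count in a 32-bit integer allows.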
     
  • A recent change made this goto unnecessary, so reformat the code to make it
    clearer what is happening.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Tidy device-mapper error messages to include context information
    automatically.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alasdair G Kergon
     
  • If you misuse the device-mapper interface (or there's a bug in your userspace
    tools) it's possible to end up with 'unlinked' mapped devices that cannot be
    removed until you reboot (along with uninterruptible processes).

    This patch prevents you from removing a device that is still open.

    It introduces dm_lock_for_deletion() which is called when a device is about to
    be removed to ensure that nothing has it open and nothing further can open it.
    It uses a private open_count for this which also lets us remove one of the
    problematic bdget_disk() calls elsewhere.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alasdair G Kergon
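
The idea behind dm_lock_for_deletion() in the commit above can be modelled with a private open count and a "deleting" flag. This is a single-threaded sketch only: the real code must make the check and the flag update atomic under a lock, and the class and method names here are hypothetical.

```python
class MappedDevice:
    """Toy model of a device guarded by a private open count."""

    def __init__(self):
        self.open_count = 0
        self.deleting = False

    def open(self):
        """Refuse new opens once deletion has been locked in."""
        if self.deleting:
            return False
        self.open_count += 1
        return True

    def close(self):
        if self.open_count > 0:
            self.open_count -= 1

    def lock_for_deletion(self):
        """Succeed only if nothing has the device open."""
        if self.open_count > 0:
            return False
        self.deleting = True
        return True
```

In this model, removal is refused while any opener remains, and once the lock succeeds no further open can slip in before the device goes away.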
     
  • Add a library function dm_create_error_table() to create a table that rejects
    any I/O sent to a device with EIO.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Teigland
     
  • Move definitions of core device-mapper functions for manipulating mapped
    devices and their tables to device-mapper.h, advertising their
    availability for use elsewhere in the kernel.

    Protect the contents of device-mapper.h with ifdef __KERNEL__. And throw
    in a few formatting clean-ups and extra comments.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alasdair G Kergon
     
  • Merge dm_create() and dm_create_with_minor() by introducing the special value
    DM_ANY_MINOR to request the allocation of the next available minor number.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alasdair G Kergon
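
Merging two creation functions behind a sentinel value, as the commit above describes, might look like the following model. The value -1 for DM_ANY_MINOR and the first-free search are assumptions of this sketch, and the allocator class is hypothetical.

```python
DM_ANY_MINOR = -1  # sentinel meaning "pick the next free minor" (assumed value)


class MinorAllocator:
    """Toy allocator with a single create() entry point."""

    def __init__(self):
        self.used = set()

    def create(self, minor=DM_ANY_MINOR):
        """Reserve a specific minor, or the lowest free one for DM_ANY_MINOR."""
        if minor == DM_ANY_MINOR:
            minor = 0
            while minor in self.used:
                minor += 1
        elif minor < 0:
            raise ValueError("invalid minor number")
        elif minor in self.used:
            raise ValueError(f"minor {minor} is already in use")
        self.used.add(minor)
        return minor
```

A caller that does not care about the number passes DM_ANY_MINOR (or nothing), while one that needs a fixed minor passes it explicitly; both go through the same function.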
     
  • Return sensibly if dm_split_args is called with a NULL input parameter.

    Signed-off-by: David Teigland
    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Teigland
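
A rough model of the behaviour this commit asks for: an argument splitter that returns an empty list when given no input, rather than misbehaving. Splitting on whitespace with a backslash escaping the next character is an assumption of this sketch, not a statement about dm_split_args itself, and the function name is hypothetical.

```python
def split_args(text):
    """Split `text` on whitespace; None yields an empty argument list."""
    if text is None:
        return []
    args, current, in_arg = [], [], False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # A backslash keeps the next character, even whitespace.
            current.append(text[i + 1])
            in_arg = True
            i += 2
            continue
        if ch.isspace():
            if in_arg:
                args.append("".join(current))
                current, in_arg = [], False
        else:
            current.append(ch)
            in_arg = True
        i += 1
    if in_arg:
        args.append("".join(current))
    return args
```

Callers can then treat "no parameters given" and "empty parameter string" identically, since both produce an empty list.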
     
  • kcopyd should accumulate errors - otherwise I/O failures may be ignored
    unintentionally.

    And invert 'success' (used in a future patch), using a more intuitive
    !(read_err || write_err).

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jonathan Brassow
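
The point of accumulating errors can be shown with a job that is completed in several pieces. In the sketch below a later successful piece does not erase an earlier failure, and success is derived as !(read_err || write_err) as the commit suggests. Treating write_err as a bitmask with one bit per destination is an assumption, and the class is hypothetical.

```python
class CopyJob:
    """Toy model of a copy job finished in several sub-jobs."""

    def __init__(self):
        self.read_err = False
        self.write_err = 0  # bitmask, one bit per destination (assumption)

    def complete_sub_job(self, read_err=False, write_err=0):
        # Accumulate with OR instead of overwriting, so that an early
        # failure is not lost when a later sub-job succeeds.
        self.read_err = self.read_err or bool(read_err)
        self.write_err |= write_err

    @property
    def success(self):
        return not (self.read_err or self.write_err)
```

If the first sub-job reports a write error on destination 1 and the second reports none, the job still ends with write_err set and success False.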
     
  • When a mirror is reduced in size, clear the part of the bitmap that is no
    longer used.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alasdair G Kergon
     
  • Fix the 'sizeof' in the region log bitmap size calculation: it's uint32_t, not
    unsigned long - this breaks on some archs.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alasdair G Kergon
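
Why the element size matters in this calculation: a bitset is rounded up to a whole number of words, so the word size used in the rounding changes the result. The sketch below is a generic round-up computation, not the dm-log code; the function name is hypothetical, and the 8-byte case stands in for an architecture where unsigned long is 64 bits.

```python
def bitset_bytes(region_count, word_bytes):
    """Bytes needed for `region_count` bits, rounded up to whole words."""
    bits_per_word = word_bytes * 8
    words = (region_count + bits_per_word - 1) // bits_per_word
    return words * word_bytes
```

For a single region, 4-byte words give 4 bytes while 8-byte words give 8 bytes, so code that sizes the bitset with one type and accesses it as another will disagree about its length on some architectures.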
     
  • Refactor the code that creates the core and disk log contexts to avoid the
    repeated allocation of clean_bits introduced by the last patch.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alasdair G Kergon
     
  • On-disk logs for dm-mirror devices are currently hard-coded to use 512 byte
    hard-sector-sizes. This patch fixes dm-log so it will work with devices with
    non-512-byte hard-sector-sizes.

    To maintain full compatibility, the clean-bits bitset is not moved. Instead,
    the patch enlarges the disk-header buffer to encompass both the header and
    the bitset. The I/O routines for the bitset are removed, and the I/O routines
    for the disk-header now also read/write the bitset.

    Signed-off-by: Kevin Corry
    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kevin Corry
     
  • The table is indexed from 0, so an index equal to t->num_targets should be
    rejected.

    (There is no code in the current tree that would exercise this bug.)

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Milan Broz
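
The off-by-one described above is the classic difference between ">" and ">=" in a bounds check. A minimal model, with a hypothetical function name and a plain list standing in for the target table:

```python
def get_target(targets, index):
    """Return targets[index], or None if index is out of range.

    With zero-based indexing the valid range is 0 .. len(targets) - 1,
    so an index equal to len(targets) must be rejected as well.
    """
    if index < 0 or index >= len(targets):
        return None
    return targets[index]
```

A check written as "index > len(targets)" would let index == len(targets) through and read one element past the end of the table.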