10 May, 2007

2 commits

  • This reverts commit 5b479c91da90eef605f851508744bfe8269591a0.

    Quoth Neil Brown:

    "It causes an oops when auto-detecting raid arrays, and it doesn't
    seem easy to fix.

    The array may not be 'open' when do_md_run is called, so
    bdev->bd_disk might be NULL, so bd_set_size can oops.

    This whole approach of opening an md device before it has been
    assembled just seems to get more and more painful. I think I'm going
    to have to come up with something clever to provide both backward
    compatibility with usage expectations, and sane integration into the
    rest of the kernel."

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • md currently uses ->media_changed to make sure rescan_partitions
    is called on md arrays after they are assembled.

    However that doesn't happen until the array is opened, which is later
    than some people would like.

    So use blkdev_ioctl to do the rescan immediately after the
    array has been assembled.

    This means we can remove all the ->change infrastructure as it was only used
    to trigger a partition rescan.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

05 Apr, 2007

1 commit

  • A device can be removed from an md array via e.g.
    echo remove > /sys/block/md3/md/dev-sde/state

    This will try to remove the 'dev-sde' subtree, which will deadlock
    since commit e7b0d26a86943370c04d6833c6edba2a72a6e240.

    With this patch we run the kobject_del via schedule_work so as to
    avoid the deadlock.

    Cc: Alan Stern
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

10 Feb, 2007

1 commit

  • md/bitmap tracks how many active write requests are pending on blocks
    associated with each bit in the bitmap, so that it knows when it can clear
    the bit (when count hits zero).

    The counter has 14 bits of space, so if there are ever more than 16383, we
    cannot cope.

    Currently the code just calls BUG_ON, as "all" drivers have request queue
    limits much smaller than this.

    However it seems that some don't. Apparently some multipath configurations
    can allow more than 16383 concurrent write requests.

    So, in this unlikely situation, instead of calling BUG_ON we now wait
    for the count to drop down a bit. This requires a new wait_queue_head,
    some waiting code, and a wakeup call.

    Tested by limiting the counter to 20 instead of 16383 (writes go a lot slower
    in that case...).

    Signed-off-by: Neil Brown
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
     

27 Jan, 2007

1 commit

  • If a GFP_KERNEL allocation is attempted in md while the mddev_lock is held,
    it is possible for a deadlock to eventuate.

    This happens if the array was marked 'clean', and the memalloc triggers a
    write-out to the md device.

    For the writeout to succeed, the array must be marked 'dirty', and that
    requires getting the mddev_lock.

    So, before attempting a GFP_KERNEL allocation while holding the lock, make
    sure the array is marked 'dirty' (unless it is currently read-only).

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

11 Dec, 2006

1 commit

  • If a bypass-the-cache read fails, we simply try again through the cache. If
    it fails again it will trigger normal recovery procedures.

    update 1:

    From: NeilBrown

    1/ chunk_aligned_read and retry_aligned_read assume that
    data_disks == raid_disks - 1,
    which is not true for raid6. So when an aligned read request bypasses
    the cache, we can get the wrong data.

    2/ The cloned bio is being used-after-free in raid5_align_endio
    (to test BIO_UPTODATE).

    3/ We forgot to add rdev->data_offset when submitting
    a bio for aligned-read

    4/ clone_bio calls blk_recount_segments and then we change bi_bdev,
    so we need to invalidate the segment counts.

    5/ We don't de-reference the rdev when the read completes.
    This means we need to record the rdev so it is still
    available in the end_io routine. Fortunately
    bi_next in the original bio is unused at this point, so
    we can stuff it in there.

    6/ We leak a cloned bio if the target rdev is not usable.

    From: NeilBrown

    update 2:

    1/ When aligned requests fail (read error) they need to be retried
    via the normal method (stripe cache). As we cannot be sure that
    we can process a single read in one go (we may not be able to
    allocate all the stripes needed) we store a bio-being-retried
    and a list of bios-that-still-need-to-be-retried.
    When we find a bio that needs to be retried, we should add it to
    the list, not to the single bio...

    2/ We were never incrementing 'scnt' when resubmitting failed
    aligned requests.

    [akpm@osdl.org: build fix]
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raz Ben-Jehuda(caro)
     

08 Dec, 2006

1 commit

  • Replace all uses of kmem_cache_t with struct kmem_cache.

    The patch was generated using the following script:

    #!/bin/sh
    #
    # Replace one string by another in all the kernel sources.
    #

    set -e

    for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
        quilt add $file
        sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
        mv /tmp/$$ $file
        quilt refresh
    done

    The script was run like this

    sh replace kmem_cache_t "struct kmem_cache"

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

22 Oct, 2006

2 commits


03 Oct, 2006

8 commits

  • Once upon a time we needed a fixed limit on the number of md devices,
    probably because we preallocated some array. This need no longer exists, but
    we still have an arbitrary limit.

    So remove MAX_MD_DEVS and allow as many devices as we can fit into the 'minor'
    part of a device number.

    Also remove some useless noise at init time (which reports MAX_MD_DEVS) and
    remove MD_THREAD_NAME_MAX which hasn't been used for a while.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • It is equivalent to conf->raid_disks - conf->mddev->degraded.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Add a new sysfs interface that allows the bitmap of an array to be dirtied.
    The interface is write-only, and is used as follows:

    echo "1000" > /sys/block/md2/md/bitmap

    (dirty the bit for chunk 1000 [offset 0] in the in-memory and on-disk
    bitmaps of array md2)

    echo "1000-2000" > /sys/block/md1/md/bitmap

    (dirty the bits for chunks 1000-2000 in md1's bitmap)

    This is useful, for example, in cluster environments where you may need to
    combine two disjoint bitmaps into one (following a server failure, after a
    secondary server has taken over the array). By combining the bitmaps on
    the two servers, a full resync can be avoided (This was discussed on the
    list back on March 18, 2005, "[PATCH 1/2] md bitmap bug fixes" thread).

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Clements
     
  • It isn't needed as mddev->degraded contains equivalent info.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • They are not needed. conf->failed_disks is the same as mddev->degraded and
    conf->working_disks is conf->raid_disks - mddev->degraded.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Instead of magic numbers (0,1,2,3) in sb_dirty, we have
    some flags instead:
    MD_CHANGE_DEVS
        Some device state has changed, requiring a superblock update
        on all devices.
    MD_CHANGE_CLEAN
        The array has transitioned from 'clean' to 'dirty' or back,
        requiring a superblock update on active devices, but possibly
        not on spares.
    MD_CHANGE_PENDING
        A superblock update is underway.

    We wait for an update to complete by waiting for all flags to be clear. A
    flag can be set at any time, even during an update, without risk that the
    change will be lost.

    Stop exporting md_update_sb - it isn't needed.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • This patch contains the scheduled removal of the START_ARRAY ioctl for md.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

01 Oct, 2006

1 commit

  • Make it possible to disable the block layer. Not all embedded devices require
    it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
    the block layer to be present.

    This patch does the following:

    (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
    support.

    (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
    an item that uses the block layer. This includes:

    (*) Block I/O tracing.

    (*) Disk partition code.

    (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.

    (*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
    block layer to do scheduling. Some drivers that use SCSI facilities -
    such as USB storage - end up disabled indirectly from this.

    (*) Various block-based device drivers, such as IDE and the old CDROM
    drivers.

    (*) MTD blockdev handling and FTL.

    (*) JFFS - which uses set_bdev_super(), something it could avoid doing by
    taking a leaf out of JFFS2's book.

    (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
    linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
    however, still used in places, and so is still available.

    (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
    parts of linux/fs.h.

    (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.

    (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.

    (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
    is not enabled.

    (*) fs/no-block.c is created to hold out-of-line stubs and things that are
    required when CONFIG_BLOCK is not set:

    (*) Default blockdev file operations (to give error ENODEV on opening).

    (*) Makes some /proc changes:

    (*) /proc/devices does not list any blockdevs.

    (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.

    (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.

    (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
    given a command other than Q_SYNC, or if a special device is specified.

    (*) In init/do_mounts.c, no reference is made to the blockdev routines if
    CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.

    (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
    error ENOSYS by way of cond_syscall if so).

    (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
    CONFIG_BLOCK is not set, since they can't then happen.

    Signed-Off-By: David Howells
    Signed-off-by: Jens Axboe

    David Howells
     

19 Sep, 2006

1 commit


11 Jul, 2006

1 commit

  • We introduced 'io_sectors' recently so we could count the sectors that cause
    IO during resync separately from sectors which don't cause IO - there can be a
    difference if a bitmap is being used to accelerate resync.

    However when a speed is reported, we find the number of sectors processed
    recently by subtracting an oldish io_sectors count from a current
    'curr_resync' count. This is wrong because curr_resync counts all sectors,
    not just io sectors.

    So, add a field to mddev to store the current io_sectors separately from
    curr_resync, and use that in the calculations.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

05 Jul, 2006

1 commit

  • * git://git.infradead.org/hdrinstall-2.6:
    Remove export of include/linux/isdn/tpam.h
    Remove and from userspace export
    Restrict headers exported to userspace for SPARC and SPARC64
    Add empty Kbuild files for 'make headers_install' in remaining arches.
    Add Kbuild file for Alpha 'make headers_install'
    Add Kbuild file for SPARC 'make headers_install'
    Add Kbuild file for IA64 'make headers_install'
    Add Kbuild file for S390 'make headers_install'
    Add Kbuild file for i386 'make headers_install'
    Add Kbuild file for x86_64 'make headers_install'
    Add Kbuild file for PowerPC 'make headers_install'
    Add generic Kbuild files for 'make headers_install'
    Basic implementation of 'make headers_check'
    Basic implementation of 'make headers_install'

    Linus Torvalds
     

27 Jun, 2006

9 commits

  • - record the 'event' count on each individual device (they
    might sometimes be slightly different now)
    - add a new value for 'sb_dirty': '3' means that the super
    block only needs to be updated to record a clean/dirty
    transition.
    - Prefer odd event numbers for dirty states and even numbers
    for clean states
    - Using all the above, don't update the superblock on
    a spare device if the update is just doing a clean/dirty
    transition. To accommodate this, a transition from
    dirty back to clean might now decrement the events counter
    if nothing else has changed.

    The net effect of this is that spare drives will not see any IO requests
    during normal running of the array, so they can go to sleep if that is what
    they want to do.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • If md is asked to store a bitmap in a file, it tries to hold onto the page
    cache pages for that file, manipulate them directly, and call a cocktail of
    operations to write the file out. I don't believe this is a supportable
    approach.

    This patch changes the approach to use the same approach as swap files. i.e.
    bmap is used to enumerate all the block addresses of parts of the file, and we
    write directly to those blocks of the device.

    swapfile only uses parts of the file that provide full pages at contiguous
    addresses. We don't have that luxury, so we have to cope with pages that are
    non-contiguous in storage. To handle this we attach buffers to each page, and
    store the addresses in those buffers.

    With this approach the pagecache may contain data which is inconsistent with
    what is on disk. To alleviate the problems this can cause, md invalidates the
    pagecache when releasing the file. If the file is to be examined while the
    array is active (a non-critical but occasionally useful function), O_DIRECT IO
    must be used. A new version of mdadm will have support for this.

    This approach simplifies a lot of code:
    - we no longer need to keep a list of pages which we need to wait for,
    as the b_endio function can keep track of how many outstanding
    writes there are. This saves a mempool.
    - -EAGAIN returns from write_page are no longer possible (not sure if
    they ever were actually).

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • md/bitmap currently has a separate thread to wait for writes to the bitmap
    file to complete (as we cannot get a callback on that action).

    However this isn't needed as bitmap_unplug is called from process context and
    waits for the writeback thread to do its work. The same result can be
    achieved by doing the waiting directly in bitmap_unplug.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • This patch makes the needlessly global md_print_devices() static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • The "industry standard" DDF format allows for a stripe/offset layout where
    data is duplicated on different stripes. e.g.

    A B C D
    D A B C
    E F G H
    H E F G

    (columns are drives, rows are stripes, LETTERS are chunks of data).

    This is similar to raid10's 'far' mode, but not quite the same. So enhance
    'far' mode with a 'far/offset' option which follows the layout of DDF's
    stripe/offset.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • For a while we have had checkpointing of resync. The version-1 superblock
    allows recovery to be checkpointed as well, and this patch implements that.

    Due to early carelessness we need to add a feature flag to signal that the
    recovery_offset field is in use, otherwise older kernels would assume that a
    partially recovered array is in fact fully recovered.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • There is a lot of commonality between raid5.c and raid6main.c. This patch
    merges both into one module called raid456. This saves a lot of code, and
    paves the way for online raid5->raid6 migrations.

    There is still duplication, e.g. between handle_stripe5 and handle_stripe6.
    This will probably be cleaned up later.

    Cc: "H. Peter Anvin"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The largest chunk size the code can support without substantial surgery is
    2^30 bytes, so make that the limit instead of an arbitrary 4Meg. Some day,
    the 'chunksize' should change to a sector-shift instead of a byte-count. Then
    no limit would be needed.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

18 Jun, 2006

1 commit


11 Apr, 2006

1 commit

  • reshape_position is a 64-bit field that was not 64-bit aligned, so swap it
    with new_level.

    NOTE: this is a user-visible change. However:
    - The bad code has not appeared in a released kernel
    - This code is still marked 'experimental'
    - This only affects version-1 superblocks, which are not in wide use
    - These fields are only used (rather than simply reported) by user-space
    tools in extremely rare circumstances: after a reshape crashes in the
    first second of the reshape process.

    So I believe that, at this stage, the change is safe. Especially if people
    heed the 'help' message and use mdadm-2.4.1.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

28 Mar, 2006

8 commits

  • ... being careful that mutex_trylock is inverted wrt down_trylock

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • This allows user-space to access data safely. This is needed for raid5
    reshape as user-space needs to take a backup of the first few stripes before
    allowing reshape to commence.

    It will also be useful in cluster-aware raid1 configurations so that all
    cluster members can leave a section of the array untouched while a
    resync/recovery happens.

    A 'start' and 'end' of the suspended range are written to 2 sysfs attributes.
    Note that only one range can be suspended at a time.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • check_reshape checks validity and does things that can be done instantly -
    like adding devices to raid1. start_reshape initiates a restriping process to
    convert the whole array.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Instead of checkpointing at each stripe, only checkpoint when a new write
    would overwrite uncheckpointed data. Block any write to the uncheckpointed
    area. Arbitrarily checkpoint at least every 3Meg.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • We allow the superblock to record an 'old' and a 'new' geometry, and a
    position where any conversion is up to. The geometry allows for changing
    chunksize, layout and level as well as number of devices.

    When using a version-0.90 superblock, we convert the version to 0.91 while the
    conversion is happening so that an old kernel will refuse to assemble the
    array. For version-1, we use a feature bit for the same effect.

    When starting an array we check for an incomplete reshape and restart the
    reshape process if needed. If the reshape stopped at an awkward time (like
    when updating the first stripe) we refuse to assemble the array, and let
    user-space worry about it.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • This patch adds raid5_reshape and end_reshape which will start and finish the
    reshape processes.

    raid5_reshape is only enabled if CONFIG_MD_RAID5_RESHAPE is set, to discourage
    accidental use.

    Read the 'help' for the CONFIG_MD_RAID5_RESHAPE entry,
    and make sure that you have backups, just in case.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • This patch provides the core of the resize/expand process.

    sync_request notices if a 'reshape' is happening and acts accordingly.

    It allocates new stripe_heads for the next chunk-wide stripe in the target
    geometry, marking them STRIPE_EXPANDING.

    Then it finds which stripe heads in the old geometry can provide data needed
    by these and marks them STRIPE_EXPAND_SOURCE. This causes stripe_handle to
    read all blocks on those stripes.

    Once all blocks on a STRIPE_EXPAND_SOURCE stripe_head are read, any that are
    needed are copied into the corresponding STRIPE_EXPANDING stripe_head. Once a
    STRIPE_EXPANDING stripe_head is full, it is marked STRIPE_EXPAND_READY and then
    is written out and released.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • We need to allow that different stripes are of different effective sizes, and
    use the appropriate size. Also, when a stripe is being expanded, we must
    block any IO attempts until the stripe is stable again.

    Key elements in this change are:
    - each stripe_head gets a 'disk' field which is part of the key,
    thus there can sometimes be two stripe heads of the same area of
    the array, but covering different numbers of devices. One of these
    will be marked STRIPE_EXPANDING and so won't accept new requests.
    - conf->expand_progress tracks how the expansion is progressing and
    is used to determine whether the target part of the array has been
    expanded yet or not.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown