09 May, 2007

1 commit

  • Remove do_sync_file_range() and convert callers to just use
    do_sync_mapping_range().

    Signed-off-by: Mark Fasheh
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Fasheh
     

08 May, 2007

1 commit

  • Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't
    been used in 6 years (so akpm says).

    find * -name \*.[ch] | xargs grep -l invalidate_bdev |
    while read file; do
            quilt add $file;
            sed -ie 's/invalidate_bdev(\([^,]*\),[^)]*)/invalidate_bdev(\1)/g' $file;
    done

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

30 Apr, 2007

1 commit

  • Currently we scale the mempool sizes depending on memory installed
    in the machine, except for the bio pool itself which sits at a fixed
    256 entry pre-allocation.

    There's really no point in "optimizing" this OOM path; we just need
    enough preallocated to make progress. A single unit is enough, so let's
    scale it down to 2 just to be on the safe side.

    This patch saves ~150kb of pinned kernel memory on a 32-bit box.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

13 Apr, 2007

1 commit


05 Apr, 2007

1 commit

  • A device can be removed from an md array via e.g.
    echo remove > /sys/block/md3/md/dev-sde/state

    This will try to remove the 'dev-sde' subtree which will deadlock
    since
    commit e7b0d26a86943370c04d6833c6edba2a72a6e240

    With this patch we run the kobject_del via schedule_work so as to
    avoid the deadlock.

    Cc: Alan Stern
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

28 Mar, 2007

3 commits


17 Mar, 2007

1 commit

  • When iterating through an array, one must be careful to test one's index
    variable rather than another similarly-named variable.

    The loop will read off the end of conf->disks[] in the following
    (pathological) case:

    % dd bs=1 seek=840716287 if=/dev/zero of=d1 count=1
    % for i in 2 3 4; do dd if=/dev/zero of=d$i bs=1k count=$(($i+150)); done
    % ./vmlinux ubd0=root ubd1=d1 ubd2=d2 ubd3=d3 ubd4=d4
    # mdadm -C /dev/md0 --level=linear --raid-devices=4 /dev/ubd[1234]

    Adding some printks, I saw this:

    [42949374.960000] hash_spacing = 821120
    [42949374.960000] cnt = 4
    [42949374.960000] min_spacing = 801
    [42949374.960000] j=0 size=820928 sz=820928
    [42949374.960000] i=0 sz=820928 hash_spacing=820928
    [42949374.960000] j=1 size=64 sz=64
    [42949374.960000] j=2 size=64 sz=128
    [42949374.960000] j=3 size=64 sz=192
    [42949374.960000] j=4 size=1515870810 sz=1515871002
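
The trace above can be reproduced with a small userspace sketch (names and
layout are illustrative, not the real md code): the loop bound tests the
outer index while the inner index does the walking, so the sum runs one
entry past the end of the array and picks up whatever follows it in memory.

```python
min_spacing = 801
sizes = [820928, 64, 64, 64]        # the four devices from the trace above
garbage = 1515870810                # whatever happens to follow the array
mem = sizes + [garbage]             # simulate conf->disks[] plus adjacent memory
cnt = len(sizes)

def spacing_from(start, test_j):
    """Sum device sizes from `start` until min_spacing is reached."""
    sz, j = 0, start
    # buggy code bounds the loop on the *outer* index (start/i), which
    # never moves; the fix bounds it on j, the index actually advancing
    while sz < min_spacing and (j if test_j else start) < cnt:
        sz += mem[j]
        j += 1
    return sz

buggy = spacing_from(1, test_j=False)   # walks to j=4: 64+64+64+1515870810
fixed = spacing_from(1, test_j=True)    # stops at the real end: 64+64+64
```

With the wrong bound, the run starting at disk 1 reaches the sz=1515871002
seen at j=4 in the printk trace; with the fix it stops at 192.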

    Cc: Gautham R Shenoy
    Acked-by: Neil Brown
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Isaacson
     

05 Mar, 2007

1 commit

  • Recent patch for raid6 reshape had a change missing that showed up in
    subsequent review.

    Many places in the raid5 code used "conf->raid_disks-1" to mean "number of
    data disks". With raid6 that had to be changed to "conf->raid_disks -
    conf->max_degraded" or similar. One place was missed.

    This bug means that if a raid6 reshape were aborted in the middle the
    recorded position would be wrong. On restart it would either fail (as the
    position wasn't on an appropriate boundary) or would leave a section of the
    array unreshaped, causing data corruption.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

02 Mar, 2007

6 commits

  • i.e. one or more drives can be added and the array will re-stripe
    while on-line.

    Most of the interesting work was already done for raid5. This just extends it
    to raid6.

    mdadm newer than 2.6 is needed for complete safety; however, any version of
    mdadm which supports raid5 reshape will do a good enough job in almost all
    cases (an 'echo repair > /sys/block/mdX/md/sync_action' is recommended after a
    reshape that was aborted and had to be restarted with such a version of
    mdadm).

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • An error always aborts any resync/recovery/reshape on the understanding that
    it will immediately be restarted if that still makes sense. However a reshape
    currently doesn't get restarted. With this patch it does.

    To avoid restarting when it is not possible to do work, we call into the
    personality to check that a reshape is ok, and strengthen raid5_check_reshape
    to fail if there are too many failed devices.

    We also break some code out into a separate function: remove_and_add_spares as
    the indent level for that code was getting crazy.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The mddev and queue might be used for another array which does not set these,
    so they need to be cleared.

    Signed-off-by: NeilBrown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • md tries to warn the user if they e.g. create a raid1 using two partitions of
    the same device, as this does not provide true redundancy.

    However it also warns if a raid0 is created like this, and there is nothing
    wrong with that.

    At the place where the warning is currently printed, we don't necessarily know
    what level the array will be, so move the warning from the point where the
    device is added to the point where the array is started.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • - Use kernel_fpu_begin() and kernel_fpu_end()
    - Use boot_cpu_has() for feature testing even in userspace

    Signed-off-by: H. Peter Anvin
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H. Peter Anvin
     
    There are two errors that can lead to recovery problems with raid10
    when used in 'far' mode (not the default).

    Due to a '>' instead of '>=' the wrong block is located, which would result in
    garbage being written to some random location, quite possibly outside the
    range of the device, causing the newly reconstructed device to fail.

    The device size calculation had some rounding errors (it didn't round when it
    should) and so recovery would go a few blocks too far, which would again cause
    a write to a random block address and probably a device error.

    The code for working with device sizes was fairly confused and spread out, so
    this has been tidied up a bit.
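
The rounding part of the fix amounts to trimming the usable device size down
to a whole number of chunks, so recovery never walks past the last complete
chunk. A sketch of that arithmetic (the chunk size is a hypothetical example
value, not taken from the patch):

```python
def usable_sectors(dev_sectors, chunk_sectors=64):
    """Round the device size down to a chunk boundary (sketch of the fix)."""
    return dev_sectors - dev_sectors % chunk_sectors

# Without the rounding, recovery of the final, partial chunk would issue
# writes beyond usable_sectors() - i.e. to a block address past the end.
```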

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

15 Feb, 2007

2 commits

    The semantic effect of insert_at_head is that it would allow newly registered
    sysctl entries to override existing sysctl entries of the same name, which is
    a pain for caching and which the proc interface never implemented.

    I have done an audit and discovered that none of the current users of
    register_sysctl care, as (except for directories) they do not register
    duplicate sysctl entries.

    So this patch simply removes the support for overriding existing entries in
    the sys_sysctl interface since no one uses it or cares, and it makes future
    enhancements harder.

    Signed-off-by: Eric W. Biederman
    Acked-by: Ralf Baechle
    Acked-by: Martin Schwidefsky
    Cc: Russell King
    Cc: David Howells
    Cc: "Luck, Tony"
    Cc: Ralf Baechle
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Andi Kleen
    Cc: Jens Axboe
    Cc: Corey Minyard
    Cc: Neil Brown
    Cc: "John W. Linville"
    Cc: James Bottomley
    Cc: Jan Kara
    Cc: Trond Myklebust
    Cc: Mark Fasheh
    Cc: David Chinner
    Cc: "David S. Miller"
    Cc: Patrick McHardy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
    The sysctls used by the md driver have unique binary numbers, so remove the
    insert_at_head flag as it serves no useful purpose.

    Signed-off-by: Eric W. Biederman
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

13 Feb, 2007

1 commit

  • Many struct file_operations in the kernel can be "const". Marking them const
    moves these to the .rodata section, which avoids false sharing with potential
    dirty data. In addition it'll catch accidental writes at compile time to
    these shared resources.

    [akpm@osdl.org: dvb fix]
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

12 Feb, 2007

1 commit


10 Feb, 2007

2 commits

  • md/bitmap tracks how many active write requests are pending on blocks
    associated with each bit in the bitmap, so that it knows when it can clear
    the bit (when count hits zero).

    The counter has 14 bits of space, so if there are ever more than 16383, we
    cannot cope.

    Currently the code just calls BUG_ON, as "all" drivers have request queue
    limits much smaller than this.

    However it seems that some don't. Apparently some multipath configurations
    can allow more than 16383 concurrent write requests.

    So, in this unlikely situation, instead of calling BUG_ON we now wait
    for the count to drop down a bit. This requires a new wait_queue_head,
    some waiting code, and a wakeup call.

    Tested by limiting the counter to 20 instead of 16383 (writes go a lot slower
    in that case...).
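
The new behaviour can be sketched in userspace terms, with a condition
variable standing in for the kernel's wait_queue_head (names below are
illustrative, not the driver's own):

```python
import threading

COUNTER_MAX = (1 << 14) - 1   # the 14-bit per-bit pending-write counter

class PendingWrites:
    """Sketch of the fix: block new writers at the ceiling instead of BUG_ON."""
    def __init__(self, limit=COUNTER_MAX):
        self.limit = limit
        self.count = 0
        self.cond = threading.Condition()   # stands in for the wait_queue_head

    def start_write(self):
        with self.cond:
            while self.count >= self.limit:  # old code: BUG_ON at this point
                self.cond.wait()             # new code: wait for the count to drop
            self.count += 1

    def end_write(self):
        with self.cond:
            self.count -= 1
            self.cond.notify()               # the wakeup call
```

Running it with a tiny limit (as the commit was tested with 20 instead of
16383) shows extra writers stalling rather than crashing the machine.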

    Signed-off-by: Neil Brown
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
     
    It is possible for raid5 to be sent a bio that is too big for an underlying
    device. So if it is a READ that we pass straight down to a device, it will
    fail and confuse RAID5.

    So in 'chunk_aligned_read' we check that the bio fits within the parameters
    for the target device and if it doesn't fit, fall back on reading through
    the stripe cache and making lots of one-page requests.

    Note that this is the earliest time we can check against the device because
    earlier we don't have a lock on the device, so it could change underneath
    us.

    Also, the code for handling a retry through the cache when a read fails has
    not been tested and was badly broken. This patch fixes that code.
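
The routing decision can be sketched as follows (sector counts and the page
size are illustrative; the real check consults the target queue's limits
while holding the lock on the device):

```python
PAGE_SECTORS = 8   # one 4K page in 512-byte sectors

def route_read(bio_sectors, dev_max_sectors):
    """Pass the read straight down only if the device can take it whole."""
    return "direct" if bio_sectors <= dev_max_sectors else "stripe_cache"

def split_for_cache(sector, bio_sectors):
    """Fallback: turn one big read into page-sized requests for the cache."""
    reqs, end = [], sector + bio_sectors
    while sector < end:
        # take up to one page, never crossing a page boundary
        n = min(PAGE_SECTORS - sector % PAGE_SECTORS, end - sector)
        reqs.append((sector, n))
        sector += n
    return reqs
```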

    Signed-off-by: Neil Brown
    Cc: "Kai"
    Cc:
    Cc:
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
     

27 Jan, 2007

6 commits

  • raid5_mergeable_bvec tries to ensure that raid5 never sees a read request
    that does not fit within just one chunk. However as we must always accept
    a single-page read, that is not always possible.

    So when "in_chunk_boundary" fails, it might be unusual, but it is not a
    problem and printing a message every time is a bad idea.
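
The check itself is a one-liner in spirit: a read fits iff its first and
last sectors land in the same chunk (the chunk size below is an arbitrary
example value):

```python
def in_chunk_boundary(sector, sectors, chunk_sectors=128):
    """True if [sector, sector+sectors) lies within a single chunk."""
    return sector // chunk_sectors == (sector + sectors - 1) // chunk_sectors

# A single-page read can still straddle a chunk edge, so an occasional
# False here is expected and is not worth a log message.
```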

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • If a GFP_KERNEL allocation is attempted in md while the mddev_lock is held,
    it is possible for a deadlock to eventuate.

    This happens if the array was marked 'clean', and the memalloc triggers a
    write-out to the md device.

    For the writeout to succeed, the array must be marked 'dirty', and that
    requires getting the mddev_lock.

    So, before attempting a GFP_KERNEL allocation while holding the lock, make
    sure the array is marked 'dirty' (unless it is currently read-only).

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Allow noflush suspend/resume of device-mapper device only for the case
    where the device size is unchanged.

    Otherwise, dm-multipath devices can stall when resumed if noflush was used
    when suspending them, all paths have failed and queue_if_no_path is set.

    Explanation:
    1. Something is doing fsync() on the block dev,
    holding inode->i_sem
    2. The fsync write is blocked by all-paths-down and queue_if_no_path
    3. Someone requests to suspend the dm device with noflush.
    Pending writes are left in queue.
    4. In the middle of dm_resume(), __bind() tries to get
    inode->i_sem to do __set_size() and waits forever.

    'noflush suspend' is a new device-mapper feature introduced in
    early 2.6.20. So I hope the fix can be included before 2.6.20 is
    released.

    Example of reproducer:
    1. Create a multipath device by dmsetup
    2. Fail all paths during mkfs
    3. Do dmsetup suspend --noflush and load new map with healthy paths
    4. Do dmsetup resume

    Signed-off-by: Jun'ichi Nomura
    Acked-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jun'ichi Nomura
     
  • In most cases we check the size of the bitmap file before reading data from
    it. However when reading the superblock, we always read the first PAGE_SIZE
    bytes, which might not always be appropriate. So limit that read to the size
    of the file if appropriate.

    Also, we get the count of available bytes wrong in one place, so that too can
    read past the end of the file.

    Cc: "yang yin"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
    Now that we sometimes step the array events count backwards (when
    transitioning dirty->clean where nothing else interesting has happened - so
    that we don't need to write to spares all the time), it is possible for the
    event count to return to zero, which is potentially confusing and triggers an
    MD_BUG.

    We could possibly remove the MD_BUG, but it is just as easy, and probably
    safer, to make sure we never return to zero.
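
The guard is trivial but worth stating (a sketch; the function name is
illustrative): decrement the event count on a benign dirty->clean
transition, but refuse the step if it would reach zero.

```python
def step_events_back(events):
    """Decrement the event count, but never let it return to zero."""
    return events - 1 if events > 1 else events
```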

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
    When 'repair' finds a block that is different on the various parts of the
    mirror, it is meant to write a chosen good version to the others. However it
    currently writes out the original data to each; the memcpy to make all the
    data the same is missing.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

12 Jan, 2007

1 commit

  • md raidX make_request functions strip off the BIO_RW_SYNC flag, thus
    introducing additional latency.

    Fixing this in raid1 and raid10 seems to be straightforward enough.

    For our particular usage case in DRBD, passing this flag improved some
    initialization time from ~5 minutes to ~5 seconds.

    Acked-by: NeilBrown
    Signed-off-by: Lars Ellenberg
    Acked-by: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lars Ellenberg
     

23 Dec, 2006

1 commit

  • While developing more functionality in mdadm I found some bugs in md...

    - When we remove a device from an inactive array (write 'remove' to
    the 'state' sysfs file - see 'state_store') we should not
    update the superblock information, as we may not have
    read and processed it all properly yet.

    - Initialise all raid_disk entries to '-1', else the 'slot' sysfs file
    will claim '0' for all devices in an array before the array is
    started.

    - Allow '\n' to be present at the end of words written to
    sysfs files.

    - When we use SET_ARRAY_INFO to set the md metadata version,
    set the flag to say that there is persistent metadata.

    - Allow GET_BITMAP_FILE to be called on an array that hasn't
    been started yet.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

14 Dec, 2006

1 commit


11 Dec, 2006

9 commits

  • As CBC is the default chaining method for cryptoloop, we should select
    it from cryptoloop to ease the transition. Spotted by Rene Herman.

    Signed-off-by: Herbert Xu
    Signed-off-by: Linus Torvalds

    Herbert Xu
     
    Fix a few bugs that meant that:
    - superblocks weren't always written at exactly the right time (this
    could show up if the array was not written to - writing to the array
    causes lots of superblock updates and so hides these errors).

    - restarting device recovery after a clean shutdown (version-1 metadata
    only) didn't work as intended (or at all).

    1/ Ensure superblock is updated when a new device is added.
    2/ Remove an inappropriate test on MD_RECOVERY_SYNC in md_do_sync.
    The body of this if takes one of two branches depending on whether
    MD_RECOVERY_SYNC is set, so testing it in the clause of the if
    is wrong.
    3/ Flag superblock for updating after a resync/recovery finishes.
    4/ If we find the need to restart a recovery in the middle (version-1
    metadata only) make sure a full recovery (not just as guided by
    bitmaps) does get done.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Currently raid5 depends on clearing the BIO_UPTODATE flag to signal an error
    to higher levels. While this should be sufficient, it is safer to explicitly
    set the error code as well - less room for confusion.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • There are some vestiges of old code that was used for bypassing the stripe
    cache on reads in raid5.c. This was never updated after the change from
    buffer_heads to bios, but was left as a reminder.

    That functionality has now been implemented in a completely different way, so
    the old code can go.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The autorun code is only used if this module is built into the static
    kernel image. Adjust #ifdefs accordingly.

    Signed-off-by: Jeff Garzik
    Acked-by: NeilBrown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Garzik
     
  • stripe_to_pdidx finds the index of the parity disk for a given stripe. It
    assumes raid5 in that it uses "disks-1" to determine the number of data disks.

    This is incorrect for raid6 but fortunately the two usages cancel each other
    out. The only way that 'data_disks' affects the calculation of pd_idx in
    raid5_compute_sector is when it is divided into the sector number. But as
    that sector number is calculated by multiplying in the wrong value of
    'data_disks' the division produces the right value.

    So it is innocuous but needs to be fixed.

    Also change the calculation of raid_disks in compute_blocknr to make it
    more obviously correct (it seems at first to always use disks-1 too).
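
The cancellation is easy to check numerically. In a toy model of the two
functions (the layout is reduced to its bare arithmetic; the real geometry
is more involved), using the wrong data_disks in both the multiply and the
divide recovers the same stripe number, hence the same parity disk:

```python
disks = 6                 # a raid6 array: 4 data disks + 2 parity
right = disks - 2         # correct data_disks for raid6
wrong = disks - 1         # the raid5-style value stripe_to_pdidx used

def pd_idx(stripe, data_disks):
    sector = stripe * data_disks          # stripe_to_pdidx's multiply
    chunk_number = sector // data_disks   # raid5_compute_sector's divide
    return chunk_number % disks           # toy parity placement

# the two errors cancel: no stripe maps to a different parity disk
mismatches = [s for s in range(1000) if pd_idx(s, wrong) != pd_idx(s, right)]
```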

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
    Call chunk_aligned_read where appropriate.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raz Ben-Jehuda(caro)
     
    If a bypass-the-cache read fails, we simply try again through the cache. If
    it fails again it will trigger normal recovery procedures.

    update 1:

    From: NeilBrown

    1/
    chunk_aligned_read and retry_aligned_read assume that
    data_disks == raid_disks - 1
    which is not true for raid6.
    So when an aligned read request bypasses the cache, we can get the wrong data.

    2/ The cloned bio is being used-after-free in raid5_align_endio
    (to test BIO_UPTODATE).

    3/ We forgot to add rdev->data_offset when submitting
    a bio for aligned-read

    4/ clone_bio calls blk_recount_segments and then we change bi_bdev,
    so we need to invalidate the segment counts.

    5/ We don't de-reference the rdev when the read completes.
    This means we need to record the rdev so it is still
    available in the end_io routine. Fortunately
    bi_next in the original bio is unused at this point so
    we can stuff it in there.

    6/ We leak a cloned bio if the target rdev is not usable.

    From: NeilBrown

    update 2:

    1/ When aligned requests fail (read error) they need to be retried
    via the normal method (stripe cache). As we cannot be sure that
    we can process a single read in one go (we may not be able to
    allocate all the stripes needed) we store a bio-being-retried
    and a list of bios-that-still-need-to-be-retried.
    When we find a bio that needs to be retried, we should add it to
    the list, not to the single bio-being-retried...

    2/ We were never incrementing 'scnt' when resubmitting failed
    aligned requests.

    [akpm@osdl.org: build fix]
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raz Ben-Jehuda(caro)
     
  • Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raz Ben-Jehuda(caro)