07 Jan, 2006

40 commits

  • Replace multiple kmalloc/memset pairs with kzalloc calls.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
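
As a rough user-space analogy (not the kernel code itself), the pattern being replaced resembles the first helper below and the replacement resembles the second. Here malloc()/memset() and calloc() stand in for the kernel's kmalloc()/memset() and kzalloc(); the structure and helper names are made up for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical structure, standing in for any driver-private state. */
struct demo_conf {
    int raid_disks;
    void *private_data;
};

/* Old pattern: allocate, then clear by hand (kmalloc + memset in the kernel). */
struct demo_conf *demo_alloc_old(void)
{
    struct demo_conf *conf = malloc(sizeof(*conf));

    if (!conf)
        return NULL;
    memset(conf, 0, sizeof(*conf));
    return conf;
}

/* New pattern: one call that returns zeroed memory (kzalloc in the kernel). */
struct demo_conf *demo_alloc_new(void)
{
    return calloc(1, sizeof(struct demo_conf));
}
```

Besides being shorter, the single call removes the risk of passing a different size to the allocation and to the memset.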
     
  • Substitute:

    page_cache_get -> get_page
    page_cache_release -> put_page
    PAGE_CACHE_SHIFT -> PAGE_SHIFT
    PAGE_CACHE_SIZE -> PAGE_SIZE
    PAGE_CACHE_MASK -> PAGE_MASK
    __free_page -> put_page

    because we aren't using the page cache, we are just using pages.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • With this patch it is possible to poll /proc/mdstat to detect arrays appearing
    or disappearing, to detect failures, recovery starting, recovery completing,
    and devices being added and removed.

    It is similar to the poll-ability of /proc/mounts, though different in that:

    We always report that the file is readable (because face it, it is, even if
    only for EOF).

    We report POLLPRI when there is a change so that select() can detect
    it as an exceptional event. Not only are these exceptional events, but
    that is the mechanism that the current 'mdadm' uses to watch for events
    (It also polls after a timeout).
    (We also report POLLERR like /proc/mounts).

    Finally, we only reset the per-file event counter when the start of the file
    is read, rather than when poll() returns an event. This is more robust as it
    means that an fd will continue to report activity to poll/select until the
    program clearly responds to that activity.

    md_new_event takes an 'mddev' which isn't currently used, but it will be soon.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
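
A minimal user-space sketch of how a monitoring program might use this, assuming the behaviour described in the commit message (POLLPRI on change, event counter reset when the file is read from the start). The function name and return convention are hypothetical; only standard open(), read(), poll() and close() are used.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait for a change to be reported on an mdstat-like file.
 * Returns 1 if POLLPRI or POLLERR was reported, 0 if the timeout expired
 * without a change being reported, and -1 on error.
 */
int demo_wait_for_md_change(const char *path, int timeout_ms)
{
    char buf[4096];
    struct pollfd pfd;
    int fd, ret;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Read the whole file from offset 0 so earlier events are acknowledged. */
    while (read(fd, buf, sizeof(buf)) > 0)
        ;

    pfd.fd = fd;
    pfd.events = POLLPRI;   /* changes are reported as exceptional events */
    pfd.revents = 0;

    ret = poll(&pfd, 1, timeout_ms);
    if (ret > 0)
        ret = (pfd.revents & (POLLPRI | POLLERR)) ? 1 : 0;

    close(fd);
    return ret;
}
```

A caller would typically pass "/proc/mdstat" and, on a return of 1, reopen or rewind the file and parse it again.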
     
  • Add in correct read-error handling for resync and read-only situations.

    When read-only, we don't over-write, so we need to mark the failed drive in
    the r10_bio so we don't re-try it. During resync, we always read all blocks,
    so if there is a read error, we simply over-write it with the good block that
    we found (assuming we found one).

    Note that the recovery case still isn't handled in an interesting way. There
    is nothing useful to do for the 2-copies case. If there are 3 or more copies,
    then we could try reading from one of the non-missing copies, but this is a
    bit complicated and very rarely would be used, so I'm leaving it for now.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Largely just a cross-port from raid1.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • We are inadvertently setting the R1BIO_Uptodate bit on read errors when we
    decide not to try correcting (because there are no other working devices).
    This means that the read error is reported to the client as success.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • When performing a user-requested 'check' or 'repair', we read all readable
    devices, and compare the contents. We only write to blocks which had read
    errors, or blocks with content that differs from the first good device found.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
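
The rule stated above (write only to blocks that had read errors or that differ from the first good device) can be sketched as a toy model. This is not the kernel code; the function name, arguments and return convention are assumptions made for illustration.

```c
#include <string.h>

/*
 * copies[i] is the block read from device i, read_ok[i] says whether that
 * read succeeded.  On return, need_write[i] is 1 for every device that
 * should be rewritten.  Returns the index of the first good device, or -1
 * if nothing was readable.
 */
int demo_check_mirrors(const unsigned char *const copies[], const int read_ok[],
                       int ndevs, size_t len, int need_write[])
{
    int first_good = -1;
    int i;

    for (i = 0; i < ndevs; i++) {
        need_write[i] = 0;
        if (first_good < 0 && read_ok[i])
            first_good = i;
    }
    if (first_good < 0)
        return -1;

    for (i = 0; i < ndevs; i++) {
        if (i == first_good)
            continue;
        /* Rewrite on a read error, or when the content differs. */
        if (!read_ok[i] || memcmp(copies[i], copies[first_good], len) != 0)
            need_write[i] = 1;
    }
    return first_good;
}
```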
     
  • Also keep count of the number of errors found.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • There is this "FIXME" comment with a typo in it!! that has been annoying me for
    days, so I just had to remove it.

    conf->disks[i].rdev should only be accessed if
    - we know we hold a reference or
    - the mddev->reconfig_sem is down or
    - we have a rcu_readlock

    handle_stripe was referencing rdev in three places without any of these. For
    the first two, get an rcu_readlock. For the last, the same access
    (md_sync_acct call) is made a little later after the rdev has been claimed
    under an rcu_readlock, if R5_Syncio is set. So just use that access...
    However R5_Syncio isn't really needed as the 'syncing' variable contains the
    same information. So use that instead.

    Issues, comment, and fix are identical in raid5 and raid6.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Handling of read errors during resync is separate from handling of read errors
    during normal IO in raid1. A previous patch added support for read errors
    during normal IO. This one adds support for read errors during resync or
    recovery.

    The key differences are that we don't need to freeze the array, because the
    normal handling of resync means that this part of the array will be idle
    except for resync, and the read/overwrite/re-read is needed in a separate
    piece of code.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • We are dereferencing ->rdev without an rcu lock!

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • On a read-error we suspend the array, then synchronously read the block from
    other devices until we find one where we can read it. Then we try writing the
    good data back everywhere and make sure it works. If any write or subsequent
    read fails, only then do we fail the device out of the array.

    To be able to suspend the array, we need to also keep track of how many
    requests are queued for handling by raid1d.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
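
The recovery sequence described above can be modelled in a few lines. This is a toy simulation under stated assumptions, not raid1.c: each mirror leg is reduced to flags saying whether a read or a rewrite would succeed, and every name is hypothetical.

```c
/* Toy model of one block on one mirror leg; all fields are illustrative. */
struct demo_leg {
    int can_read;        /* a read of this block currently succeeds */
    int can_write;       /* a rewrite of this block would succeed */
    int faulty;          /* set when the leg is failed out of the array */
    unsigned char data;  /* the block content, one byte for brevity */
};

/*
 * Returns the index of the leg that supplied good data, or -1 if no leg
 * could be read.  A leg is failed only when the rewrite does not work.
 */
int demo_fix_read_error(struct demo_leg legs[], int nlegs)
{
    int good = -1;
    int i;

    /* Read synchronously from each leg until one succeeds. */
    for (i = 0; i < nlegs; i++) {
        if (!legs[i].faulty && legs[i].can_read) {
            good = i;
            break;
        }
    }
    if (good < 0)
        return -1;

    /* Write the good data back everywhere else and verify. */
    for (i = 0; i < nlegs; i++) {
        if (i == good || legs[i].faulty)
            continue;
        if (legs[i].can_write) {
            /* In this model a successful rewrite makes the block readable. */
            legs[i].data = legs[good].data;
            legs[i].can_read = 1;
        } else {
            legs[i].faulty = 1;
        }
    }
    return good;
}
```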
     
  • This is a simple port of matching functionality across from raid5. If we get a
    read error, we don't kick the drive straight away, but try to over-write with
    good data first.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • raid6 currently does not check the P/Q syndromes when doing a resync; it just
    calculates the correct value and writes it. Doing the check can reduce writes
    (often to 0) for a resync, and it is needed to properly implement the

    echo check > sync_action

    operation.

    This patch implements the appropriate checks and tidies up some related code.

    It also allows raid6 user-requested resync to bypass the intent bitmap.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
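
To illustrate what checking the P/Q syndromes involves, here is a small stand-alone sketch. It uses the usual RAID-6 arithmetic (P is the XOR of the data blocks, Q is the sum of 2^i * D_i in GF(2^8) with polynomial 0x11d); the function names and the flat buffer layout are assumptions for the example, not the kernel's interfaces.

```c
#include <stddef.h>

/* Multiply by 2 in GF(2^8) with the polynomial 0x11d. */
static unsigned char demo_gf_mul2(unsigned char x)
{
    return (unsigned char)((x << 1) ^ ((x & 0x80) ? 0x1d : 0));
}

/*
 * data holds ndisks blocks of len bytes laid end to end.
 * Recomputes P and Q (Q via Horner's rule, highest disk first) and
 * returns 1 if they equal the stored p and q, 0 if a write is needed.
 */
int demo_raid6_check(const unsigned char *data, int ndisks, size_t len,
                     const unsigned char *p, const unsigned char *q)
{
    size_t off;
    int d;

    for (off = 0; off < len; off++) {
        unsigned char wp = 0, wq = 0;

        for (d = ndisks - 1; d >= 0; d--) {
            unsigned char byte = data[(size_t)d * len + off];

            wp ^= byte;
            wq = demo_gf_mul2(wq) ^ byte;
        }
        if (wp != p[off] || wq != q[off])
            return 0;
    }
    return 1;
}
```

When the function returns 1 the stripe is already consistent and no parity write is needed, which is how a check pass can reduce resync writes to zero.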
     
  • Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • This is important because bitmap_create uses
    mddev->resync_max_sectors
    and that doesn't have a valid value until after the array
    has been initialised (with pers->run()).
    [It doesn't make a difference for current personalities that
    support bitmaps, but will make a difference for raid10]

    This has the added advantage of meaning we can move the thread->timeout
    manipulation inside the bitmap.c code instead of sprinkling identical code
    throughout all personalities.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • See patch to md.txt for more details

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Resync code:
    A test that isn't needed,
    a 'compute_block' that makes more sense
    elsewhere (And then doesn't need a test),
    a couple of BUG_ONs to confirm the change makes sense.

    Printks:
    A few were missing KERN_*

    Also fix a typo in a comment.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • raid10 needs to put up a barrier to new requests while it does resync or other
    background recovery. The code for this is currently open-coded, slightly
    obscure by its use of two waitqueues, and not documented.

    This patch gathers all the related code into 4 functions, and includes a
    comment which (hopefully) explains what is happening.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • raid1 needs to put up a barrier to new requests while it does resync or other
    background recovery. The code for this is currently open-coded, slightly
    obscure by its use of two waitqueues, and not documented.

    This patch gathers all the related code into 4 functions, and includes a
    comment which (hopefully) explains what is happening.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
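
The idea behind the barrier in the raid10 and raid1 commits above can be sketched in user space with a mutex and a condition variable. This is only a model of the concept: the four function names, the structure and its fields are assumptions chosen for the example, and the sketch uses a single wait queue rather than claiming to reproduce the patched code.

```c
#include <pthread.h>

/* Illustrative state; names are assumptions, not copied from the drivers. */
struct demo_barrier {
    pthread_mutex_t lock;
    pthread_cond_t wait;   /* one wait queue shared by both sides */
    int barrier;           /* resync passes currently holding the barrier */
    int nr_pending;        /* normal requests in flight */
};

/* Resync side: stop new requests, then wait for in-flight ones to drain. */
void demo_raise_barrier(struct demo_barrier *b)
{
    pthread_mutex_lock(&b->lock);
    b->barrier++;
    while (b->nr_pending > 0)
        pthread_cond_wait(&b->wait, &b->lock);
    pthread_mutex_unlock(&b->lock);
}

/* Resync side: drop the barrier and wake any waiting requests. */
void demo_lower_barrier(struct demo_barrier *b)
{
    pthread_mutex_lock(&b->lock);
    b->barrier--;
    pthread_cond_broadcast(&b->wait);
    pthread_mutex_unlock(&b->lock);
}

/* Normal I/O side: wait until no barrier is up, then count the request. */
void demo_wait_barrier(struct demo_barrier *b)
{
    pthread_mutex_lock(&b->lock);
    while (b->barrier > 0)
        pthread_cond_wait(&b->wait, &b->lock);
    b->nr_pending++;
    pthread_mutex_unlock(&b->lock);
}

/* Normal I/O side: request finished, let a waiting resync proceed. */
void demo_allow_barrier(struct demo_barrier *b)
{
    pthread_mutex_lock(&b->lock);
    b->nr_pending--;
    pthread_cond_broadcast(&b->wait);
    pthread_mutex_unlock(&b->lock);
}
```

Normal requests bracket their work with demo_wait_barrier()/demo_allow_barrier(), and the resync thread brackets each pass with demo_raise_barrier()/demo_lower_barrier().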
     
  • I've been attempting to set up a (Host)RAID mirror with dm_mirror on
    2.6.14.3, and I've been having a strange little problem. The configuration
    in question is a set of 9GB SCSI disks that have 17942584 sectors. I set
    up the dm_mirror table as such:

    0 17942528 mirror core 2 2048 nosync 2 8:48 0 8:64 0

    If I'm not mistaken, this sets up a 9GB RAID1 mirror with 1MB stripes
    across both SCSI disks. The sector count of the dm device is less than the
    size of the disks, so we shouldn't fall off the end. However, I always get
    the messages like this in dmesg when I set up the dm table:

    attempt to access beyond end of device
    sdd: rw=0, want=17958656, limit=17942584

    Clearly, something is trying to read sectors past the end of the drive. I
    traced it down to the __rh_recovery_prepare function in dm-raid1.c, which
    gets called when we're putting the mirror set together. This function
    calls the dirty region log's get_resync_work function to see if there's any
    resync that needs to be done, and queues up any areas that are out of sync.
    The log's get_resync_work function is actually a pointer to the
    core_get_resync_work function in dm-log.c.

    The core_get_resync_work function queries a bitset lc->sync_bits to find
    out if there are any regions that are out of date (i.e. the bit is 0),
    which is where the problem occurs. If every bit in lc->sync_bits is 1
    (which is the case when we've just configured a new RAID1 with the nosync
    option), the find_next_zero_bit does NOT return the size parameter
    (lc->region_count in this case), it returns the size parameter rounded up
    to the nearest multiple of 32! I don't know if this is intentional, but
    i386 and x86_64 both exhibit this behavior.

    In any case, the statement "if (*region == lc->region_count)" looks like
    it's supposed to catch the case where there are no regions to resync and
    return 0. Since find_next_zero_bit apparently has a habit of returning
    a value that's larger than lc->region_count, the enclosed patch changes
    the equality test to a greater-than test so that we don't try to resync
    areas outside of the RAID1 region. Seeing as the HostRAID metadata
    lives just past the end of the RAID1 data, mucking around in that area
    is not a good idea.

    I suppose another way to fix this would be to amend find_next_zero_bit so
    that it doesn't return values larger than "size", but I don't know if
    there's a reason for the current behavior.

    Signed-Off-By: Darrick J. Wong
    Acked-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
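
The failure mode described above can be reproduced with a toy bit scanner. The scanner below is an assumption written for this example (it is not the kernel's find_next_zero_bit); it simply mimics the reported behaviour of returning the size rounded up to a multiple of 32 when every bit is set. With the table from the message, 17942528 sectors in 2048-sector regions gives 8761 regions, which rounds up to 8768.

```c
#include <stdint.h>

/*
 * Toy scanner: walks whole 32-bit words, so when no zero bit is found it
 * returns size rounded up to the next multiple of 32, not size itself.
 */
unsigned long demo_find_next_zero_bit(const uint32_t *bits, unsigned long size,
                                      unsigned long start)
{
    unsigned long limit = (size + 31) / 32 * 32;
    unsigned long i;

    for (i = start; i < limit; i++)
        if (!(bits[i / 32] & (UINT32_C(1) << (i % 32))))
            return i;
    return limit;
}

/* Equality test as in the original code: misses the rounded-up value. */
int demo_has_work_eq(const uint32_t *bits, unsigned long region_count)
{
    return demo_find_next_zero_bit(bits, region_count, 0) != region_count;
}

/* Range test: anything at or past region_count means nothing to resync. */
int demo_has_work_ge(const uint32_t *bits, unsigned long region_count)
{
    return demo_find_next_zero_bit(bits, region_count, 0) < region_count;
}
```

With all bits set, the equality version reports resync work at region 8768, which lies past the end of the mirror, while the range version correctly reports none.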
     
  • Zap the memory before freeing it so we don't leave crypto information
    around in memory.

    Signed-off-by: Stefan Rompf
    Acked-by: Clemens Fruhwirth
    Acked-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefan Rompf
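
A user-space sketch of the same idea, with hypothetical helper names. One assumption worth flagging: in ordinary user-space C a plain memset() immediately before free() may be removed by the optimiser, so this example clears the buffer through a volatile pointer instead.

```c
#include <stddef.h>
#include <stdlib.h>

/* Clear through a volatile pointer so the stores are not optimised away. */
void demo_wipe(void *buf, size_t len)
{
    volatile unsigned char *p = buf;

    while (len--)
        *p++ = 0;
}

/* Wipe key material, then free it; does nothing when buf is NULL. */
void demo_wipe_and_free(void *buf, size_t len)
{
    if (!buf)
        return;
    demo_wipe(buf, len);
    free(buf);
}
```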
     
  • This patch #if 0's the not yet implemented global function kcopyd_cancel().

    Signed-off-by: Adrian Bunk
    Acked-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Add ioctl DM_SKIP_LOCKFS_FLAG for userspace to request that lock_fs is
    bypassed when suspending a device.

    There's no change to the behaviour of existing code that doesn't know about
    the new flag.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alasdair G Kergon
     
  • Devices only need syncing when creating snapshots, so make this optional when
    suspending a device.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alasdair G Kergon
     
  • Rename frozen_bdev to suspended_bdev and move the bdget outside lockfs. (This
    prepares for making lockfs optional.)

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alasdair G Kergon
     
  • This patch introduces a new field to the mirror_set (default_mirror) to store
    the default mirror.

    (A subsequent patch will allow us to change the default mirror in the event of
    a failure.)

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jonathan E Brassow
     
  • Use %llu not %Lu in sscanf/printf format strings.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alasdair G Kergon
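
For illustration, here is the portable form in user-space C. In ISO C the "L" length modifier belongs to long double conversions, while "ll" is the modifier for long long, so "%llu" is the spelling that matches an unsigned long long argument. The helper names are hypothetical.

```c
#include <stdio.h>

/* Parse a decimal sector count; returns 1 on success, 0 otherwise. */
int demo_parse_sectors(const char *s, unsigned long long *out)
{
    return sscanf(s, "%llu", out) == 1;
}

/* Format a sector count into buf; returns snprintf()'s result. */
int demo_format_sectors(char *buf, size_t len, unsigned long long sectors)
{
    return snprintf(buf, len, "%llu", sectors);
}
```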
     
  • This patch removes an unused #define.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Stribblehill
     
  • Move snapshot metadata reading into separate function, to prepare for changing
    the place it gets called from.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alasdair G Kergon
     
  • After changing the name of a mapped device, trigger a dm event. (For
    userspace multipath tools.)

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    goggin, edward
     
  • Add dm_get_dev() to get a mapped device given its dev_t.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Teigland
     
  • Abstract dm_find_md() from dm_get_mdptr() to allow use elsewhere.

    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Teigland
     
  • A typical nfsd call trace is
    nfsd -> svc_process -> nfsd_dispatch -> nfsd3_proc_write ->
    nfsd_write -> nfsd_vfs_write -> vfs_writev

    These add up to over 300 bytes on the stack.
    Looking at each of these, I see that nfsd_write (which includes
    nfsd_vfs_write) contributes 0x8c to stack usage itself!!

    It turns out this is because it puts a 'struct iattr' on the stack so
    it can kill suid if needed. The following patch saves about 50 bytes
    off the stack in this call path.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
     
  • Both vfs_getattr and i_op->fsync return error statuses which nfsd was
    largely ignoring. This was noticed when exporting directories using fuse.

    This patch cleans up most of the offences, which involves moving the call
    to vfs_getattr out of the xdr encoding routines (where it is too late to
    report an error) into the main NFS procedure handling routines.

    There is still a call to vfs_getattr (related to the ACL code) where the
    status is ignored, and calls to nfsd_sync_dir don't check the return status
    either.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Shaw
     
  • I submitted this one previously - svc_tcp_recvfrom currently returns
    any errors to the caller, including ECONNRESET and the like.

    This is something svc_recv isn't able to deal with:

    len = svsk->sk_recvfrom(rqstp);
    [...]
    if (len == 0 || len == -EAGAIN) {
            [...]
            return -EAGAIN;
    }

    [...]
    return len;

    The nfsd main loop will exit when it sees an error code other than
    EAGAIN.

    The following patch fixes this problem.

    svc_recv is not equipped to deal with error codes other than EAGAIN,
    and will propagate anything else (such as ECONNRESET) up to nfsd,
    causing it to exit.

    Signed-off-by: Olaf Kirch
    Cc: Trond Myklebust
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Olaf Kirch
     
  • Split the checkpoint list of the transaction into two lists. In the first
    list we keep the buffers that need to be submitted for IO. In the second
    list are kept buffers that were already submitted and we just have to wait
    for the IO to complete. This should simplify the handling of checkpoint
    lists a bit and may eventually bring a performance gain as well.

    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • This patch fixes an issue reported by Coverity in kernel/module.c

    Error reported: Cannot reach this line of code "else return ptr;"

    Patch description:
    This is the error path, so 'err' will be negative, the else case
    is not required, this patch removes it.

    Signed-off-by: Jayachandran C.
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jayachandran C
     
  • I missed a use of list_for_each_rcu_safe() in -mm tree. Here is an updated
    patch to fix it. This time tested on a machine that actually uses IPMI...
    (Thanks to Serge Hallyn for spotting this.)

    Signed-off-by: "Paul E. McKenney"
    Cc: Corey Minyard
    Cc: Matt Domsch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney