23 Sep, 2009

1 commit


12 Jun, 2009

1 commit

  • Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
    block: add request clone interface (v2)
    floppy: fix hibernation
    ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
    fs/bio.c: add missing __user annotation
    block: prevent possible io_context->refcount overflow
    Add serial number support for virtio_blk, V4a
    block: Add missing bounce_pfn stacking and fix comments
    Revert "block: Fix bounce limit setting in DM"
    cciss: decode unit attention in SCSI error handling code
    cciss: Remove no longer needed sendcmd reject processing code
    cciss: change SCSI error handling routines to work with interrupts enabled.
    cciss: separate error processing and command retrying code in sendcmd_withirq_core()
    cciss: factor out fix target status processing code from sendcmd functions
    cciss: simplify interface of sendcmd() and sendcmd_withirq()
    cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
    cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
    block: needs to set the residual length of a bidi request
    Revert "block: implement blkdev_readpages"
    block: Fix bounce limit setting in DM
    Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
    ...

    Manually fix conflicts with tracing updates in:
    block/blk-sysfs.c
    drivers/ide/ide-atapi.c
    drivers/ide/ide-cd.c
    drivers/ide/ide-floppy.c
    drivers/ide/ide-tape.c
    include/trace/events/block.h
    kernel/trace/blktrace.c

    Linus Torvalds
     

26 May, 2009

1 commit

  • The code for checking which bits in the bitmap can be cleared
    has 2 problems:
    1/ it repeatedly takes and drops a spinlock, where it would make
    more sense to just hold on to it most of the time.
    2/ it doesn't make use of some opportunities to skip large sections
    of the bitmap

    This patch fixes those. It will only affect CPU consumption, not
    correctness.

    Signed-off-by: NeilBrown

    NeilBrown
     

23 May, 2009

1 commit

  • Until now we have had a 1:1 mapping between the storage device's physical
    block size and the logical block size used when addressing the device.
    With SATA 4KB drives coming out, that will no longer be the case. The
    physical sector size will be 4KB but the logical block size will remain
    512 bytes. Hence we need to distinguish between the physical block size
    and the logical block size.

    This patch renames hardsect_size to logical_block_size.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
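
    A small user-space sketch of the distinction, separate from the patch
    itself; the 4 KB physical / 512-byte logical figures are the ones quoted
    above, everything else is illustrative:

        #include <stdio.h>

        /* Figures quoted above: 4 KB physical sectors addressed in
         * 512-byte logical blocks. */
        #define PHYSICAL_BLOCK_SIZE 4096u
        #define LOGICAL_BLOCK_SIZE   512u

        int main(void)
        {
            unsigned int logical_per_physical =
                    PHYSICAL_BLOCK_SIZE / LOGICAL_BLOCK_SIZE;   /* 8 */

            /* A request starting at logical block 3 does not begin on a
             * physical-block boundary, so the drive must read-modify-write. */
            unsigned long long lba = 3;

            printf("%u logical blocks per physical block; LBA %llu aligned: %s\n",
                   logical_per_physical, lba,
                   lba % logical_per_physical == 0 ? "yes" : "no");
            return 0;
        }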
     

07 May, 2009

2 commits

  • If a write-intent bitmap covers more than 2TB, we sometimes work with
    values beyond 32 bits, so these need to be sector_t. This patch adds
    the required casts to some unsigned longs that are being shifted up.

    This will affect any raid10 larger than 2TB, or any raid1/4/5/6 with
    member devices that are larger than 2TB.

    Signed-off-by: NeilBrown
    Reported-by: "Mario 'BitKoenig' Holbe"
    Cc: stable@kernel.org

    NeilBrown
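
    The class of overflow being fixed can be reproduced in a few lines of
    user-space C; the chunk index and shift below are invented values, not
    md's actual fields, and the 64-bit type stands in for sector_t on the
    affected configurations:

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            uint32_t chunk = 5000000;    /* invented bitmap chunk index      */
            unsigned int shift = 10;     /* invented sectors-per-chunk shift */

            /* 32-bit arithmetic: the high bits of the shifted value are lost. */
            uint32_t lost = chunk << shift;

            /* Widen before shifting, which is what the sector_t cast achieves. */
            uint64_t kept = (uint64_t)chunk << shift;

            printf("32-bit: %lu   64-bit: %llu\n",
                   (unsigned long)lost, (unsigned long long)kept);
            return 0;
        }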
     
  • When md is loading a bitmap which it knows is out of date, it fills
    each page with 1s and writes it back out again. However the
    write_page call makes use of bitmap->file_pages and
    bitmap->last_page_size which haven't been set correctly yet. So this
    can sometimes fail.

    Move the setting of file_pages and last_page_size to before the call
    to write_page.

    This bug can cause the assembly of an array to fail, thus making the
    data inaccessible. Hence I think it is a suitable candidate for
    -stable.

    Cc: stable@kernel.org
    Reported-by: Vojtech Pavlik
    Signed-off-by: NeilBrown

    NeilBrown
     

20 Apr, 2009

1 commit


14 Apr, 2009

1 commit

  • The sync_completed file reports how much of a resync (or recovery or
    reshape) has been completed.
    However due to the possibility of out-of-order completion of writes,
    it is not certain to be accurate.

    We have an internal value - mddev->curr_resync_completed - which is an
    accurate value (though it might not always be quite up to date).

    So:
    - make curr_resync_completed be up to date a little more often,
    particularly when a raid5 reshape updates its status in the metadata
    - report curr_resync_completed in the sysfs file
    - allow poll/select to report all updates to md/sync_completed.

    This makes sync_completed usable by any external metadata
    handler that wants to record this status information in its metadata.

    Signed-off-by: NeilBrown

    NeilBrown
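
    From user space, the new poll/select support can be consumed roughly as
    sketched below; the md0 device name is only an example, and the usual
    sysfs convention of re-reading from offset 0 after a POLLPRI wakeup
    applies:

        #include <fcntl.h>
        #include <poll.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            char buf[64];
            int fd = open("/sys/block/md0/md/sync_completed", O_RDONLY);

            if (fd < 0)
                return 1;

            for (;;) {
                ssize_t n;
                struct pollfd pfd = { .fd = fd, .events = POLLPRI | POLLERR };

                /* sysfs attributes are re-read from the start each time. */
                lseek(fd, 0, SEEK_SET);
                n = read(fd, buf, sizeof(buf) - 1);
                if (n > 0) {
                    buf[n] = '\0';
                    printf("sync_completed: %s", buf);
                }

                /* Blocks until the kernel notifies an update to the file. */
                if (poll(&pfd, 1, -1) < 0)
                    break;
            }
            close(fd);
            return 0;
        }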
     

31 Mar, 2009

8 commits

  • This patch renames the "size" field of struct mddev_s to "dev_sectors"
    and stores the number of 512-byte sectors instead of the number of
    1K-blocks in it.

    All users of that field, including raid levels 1, 4-6 and 10, are adjusted
    accordingly. This simplifies the code a bit because it allows us to get
    rid of a couple of divisions/multiplications by two.

    In order to make checkpatch happy, some minor coding style issues
    have also been addressed. In particular, size_store() now uses
    strict_strtoull() instead of simple_strtoull().

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • Version 1.x metadata has the ability to record the status of a
    partially completed drive recovery.
    However we only update that record on a clean shutdown.
    It would be nice to update it on unclean shutdowns too, particularly
    when using a bitmap that removes much of the 'sync' effort after an
    unclean shutdown.

    One complication with checkpointing recovery is that we only know
    where we are up to in terms of IO requests started, not which ones
    have completed. And we need to know what has completed to record
    how much is recovered. So occasionally pause the recovery until all
    submitted requests are completed, then update the record of where
    we are up to.

    When we have a bitmap, we already do that pause occasionally to keep
    the bitmap up-to-date. So enhance that code to record the recovery
    offset and schedule a superblock update.
    And when there is no bitmap, just pause 16 times during the resync to
    do a checkpoint.
    '16' is a fairly arbitrary number. But we don't really have any good
    way to judge how often is acceptable, and it seems like a reasonable
    number for now.

    Signed-off-by: NeilBrown

    NeilBrown
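
    A minimal sketch of that 'pause 16 times' policy with invented numbers:
    checkpoint whenever the position has advanced another sixteenth of the
    whole resync:

        #include <stdio.h>

        int main(void)
        {
            unsigned long long max_sectors = 1953125000ULL;  /* invented size  */
            unsigned long long step = max_sectors / 16;      /* 16 checkpoints */
            unsigned long long last = 0, pos;

            for (pos = 0; pos < max_sectors; pos += 3141592) { /* fake progress */
                if (pos - last >= step) {
                    /* Here the real code waits for outstanding requests,
                     * records the recovery offset and schedules a
                     * superblock update. */
                    printf("checkpoint at sector %llu\n", pos);
                    last = pos;
                }
            }
            return 0;
        }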
     
  • It really is nicer to keep related code together..

    Signed-off-by: NeilBrown

    NeilBrown
     
  • This makes the includes more explicit, and is preparation for moving
    md_k.h to drivers/md/md.h

    Remove include/raid/md.h as its only remaining use was to #include
    other files.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Move the headers with the local structures for the disciplines and
    bitmap.h into drivers/md/ so that they are more easily grepable for
    hacking and not far away. md.h is left where it is for now as there
    are some uses from the outside.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: NeilBrown

    Christoph Hellwig
     
  • When we add some spares to an array and start recovery, and we have
    a bitmap which is stored 'internally' on all devices, we call
    bitmap_write_all to make sure the bitmap is correct on the new
    device(s).
    However that doesn't work as write_sb_page only writes to
    'In_sync' devices, and devices undergoing recovery are not
    'In_sync' until recovery finishes.

    So extend write_sb_page (actually next_active_rdev) to include devices
    that are under recovery.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • It is safe to clear a bit from the write-intent bitmap for a raid1
    if we know the data has been written to all devices, which is
    what the current test does.

    But it is not always safe to update the 'events_cleared' counter in
    that case. This is because one request could complete successfully
    after some other request has partially failed.

    So simply disable the clearing and updating of events_cleared whenever
    the array is degraded. This might end up not clearing some bits that
    could safely be cleared, but it is the safest approach.

    Note that the bug fixed here did not risk corrupting data by letting
    the array get out-of-sync. Rather it meant that when a device is
    removed and re-added to the array, it might incorrectly require a full
    recovery rather than just recovering based on the bitmap.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • md currently insists that the chunk size used for write-intent
    bitmaps (the amount of data that corresponds to one chunk)
    be at least one page.

    The reason for this restriction is lost in the mists of time,
    but a review of the code (and a vague memory) suggests that the only
    problem would be related to resync. Resync tries very hard to
    work in multiples of a page, but also needs to sync in units
    of a bitmap chunk.

    This connection comes out in the bitmap_start_sync call.

    So change bitmap_start_sync to always work in multiples of a page.
    If the bitmap chunk size is less than one page, we flag multiple
    chunks as 'syncing' and generally make them all appear to the
    resync routines like one chunk.

    All other code either already works with data ranges that could
    span multiple chunks, or explicitly only cares about a single chunk.

    Signed-off-by: Neil Brown

    NeilBrown
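
    An illustration of that bookkeeping with invented sizes: when one chunk
    covers less data than a page, every chunk touched by the page-sized sync
    window is flagged together:

        #include <stdio.h>

        int main(void)
        {
            unsigned long page_bytes  = 4096;  /* resync granularity        */
            unsigned long chunk_bytes = 1024;  /* chunk smaller than a page */
            unsigned long i, chunks_per_window;
            unsigned long long offset = 40960; /* invented resync offset    */
            unsigned long long first_chunk = offset / chunk_bytes;

            chunks_per_window =
                    chunk_bytes < page_bytes ? page_bytes / chunk_bytes : 1;

            for (i = 0; i < chunks_per_window; i++)
                printf("flagging chunk %llu as syncing\n", first_chunk + i);
            return 0;
        }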
     

09 Jan, 2009

2 commits

  • The rdev_for_each macro is identical to list_for_each_entry_safe from
    <linux/list.h>; it should be defined in terms of list_for_each_entry_safe
    instead of reinventing the wheel.

    But some of its uses don't really need the safe version; a plain
    list_for_each_entry is enough, which saves a temporary variable (tmp)
    in every function that used rdev_for_each.

    In this patch, most rdev_for_each loops are replaced by list_for_each_entry,
    saving many tmp variables; the safe version is kept only where the loop
    calls list_del to delete an entry.

    Signed-off-by: Cheng Renquan
    Signed-off-by: NeilBrown

    Cheng Renquan
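
    A kernel-style sketch (not md's actual code) of the distinction the patch
    relies on: the plain iterator suffices for read-only walks, and the _safe
    variant's extra cursor is only needed when the loop may list_del() the
    current entry:

        #include <linux/kernel.h>
        #include <linux/list.h>

        struct item {                  /* stand-in for an rdev on ->same_set */
            struct list_head node;
            int keep;
        };

        static void walk(struct list_head *head)
        {
            struct item *it;

            /* Read-only walk: no deletions, so no scratch cursor is needed. */
            list_for_each_entry(it, head, node)
                pr_info("item %d\n", it->keep);
        }

        static void prune(struct list_head *head)
        {
            struct item *it, *tmp;

            /* The loop deletes entries, so the safe variant keeps a second
             * cursor (tmp) that survives the list_del(). */
            list_for_each_entry_safe(it, tmp, head, node)
                if (!it->keep)
                    list_del(&it->node);
        }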
     
  • commit a2ed9615e3222645007fc19991aedf30eed3ecfd
    fixed a bug with 'internal' bitmaps, but in the process broke
    'in a file' bitmaps. So they are broken in 2.6.28

    This fixes it, and needs to go in 2.6.28-stable.

    Signed-off-by: NeilBrown
    Cc: stable@kernel.org

    NeilBrown
     

19 Dec, 2008

1 commit

  • When we read the write-intent-bitmap off the device, we currently
    read a whole number of pages.
    When PAGE_SIZE is 4K, this works due to the alignment we enforce
    on the superblock and bitmap.
    When PAGE_SIZE is 64K, this can read past the end of the device,
    which causes an error.

    When we write the superblock, we ensure to clip the last page
    to just be the required size. Copy that code into the read path
    to just read the required number of sectors.

    Signed-off-by: Neil Brown
    Cc: stable@kernel.org

    NeilBrown
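
    The clipping amounts to a min(); a user-space illustration with invented
    offsets:

        #include <stdio.h>

        int main(void)
        {
            unsigned long long page_size    = 65536;  /* 64K pages, as above */
            unsigned long long bitmap_start = 8192;   /* invented offsets    */
            unsigned long long device_size  = 40960;

            unsigned long long remaining = device_size - bitmap_start;
            unsigned long long to_read   =
                    remaining < page_size ? remaining : page_size;

            printf("reading %llu bytes rather than a full %llu-byte page\n",
                   to_read, page_size);
            return 0;
        }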
     

01 Sep, 2008

1 commit

  • A recent patch to protect the rdev list with rcu locking leaves us
    with a problem because we can sleep on memalloc while holding the
    rcu lock.

    The rcu lock is only needed while walking the linked list as
    uninteresting devices (failed or spares) can be removed at any time.

    So only take the rcu lock while actually walking the linked list.
    Take a refcount on the rdev during the time when we drop the lock
    and do the memalloc to start IO.
    When we return to the locked code, all the interesting devices
    on the list will not have moved, so we can simply use
    list_for_each_continue_rcu to pick up where we left off.

    Signed-off-by: NeilBrown

    NeilBrown
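
    A kernel-style sketch of that pattern, using a stand-in structure and
    refcount rather than md's rdev and nr_pending; as in md, the refcount is
    assumed to keep the entry alive and on the list while the lock is dropped:

        #include <linux/atomic.h>
        #include <linux/kernel.h>
        #include <linux/rculist.h>

        struct dev_entry {
            struct list_head node;
            atomic_t pending;              /* stand-in for rdev->nr_pending */
        };

        static void start_io_on_all(struct list_head *head)
        {
            struct dev_entry *d;

            rcu_read_lock();
            list_for_each_entry_rcu(d, head, node) {
                atomic_inc(&d->pending);   /* pin the entry                 */
                rcu_read_unlock();

                /* May sleep here: allocate (GFP_NOIO) and start the IO.    */

                rcu_read_lock();
                atomic_dec(&d->pending);   /* unpin; the entry is still on
                                            * the list, so the walk simply
                                            * continues from it             */
            }
            rcu_read_unlock();
        }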
     

02 Aug, 2008

1 commit


21 Jul, 2008

1 commit

  • All modifications and most access to the mddev->disks list are made
    under the reconfig_mutex lock. However there are three places where
    the list is walked without any locking. If a reconfig happens at this
    time, havoc (and oops) can ensue.

    So use RCU to protect these accesses:
    - wrap them in rcu_read_{,un}lock()
    - use list_for_each_entry_rcu
    - add to the list with list_add_rcu
    - delete from the list with list_del_rcu
    - delay the 'free' with call_rcu rather than schedule_work

    Note that export_rdev did a list_del_init on this list. In almost all
    cases the entry was not in the list anymore so it was a no-op and so
    safe. It is no longer safe as after list_del_rcu we may not touch
    the list_head.
    An audit shows that export_rdev is called:
    - after unbind_rdev_from_array, in which case the delete has
    already been done,
    - after bind_rdev_to_array fails, in which case the delete isn't needed.
    - before the device has been put on a list at all (e.g. in
    add_new_disk where reading the superblock fails).
    - and in autorun devices after a failure when the device is on a
    different list.

    So remove the list_del_init call from export_rdev, and add it back
    immediately before the call to export_rdev for that last case.

    Note also that ->same_set is sometimes used for lists other than
    mddev->disks (e.g. candidates). In these cases rcu is not needed.

    Signed-off-by: NeilBrown

    NeilBrown
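
    The conversion pattern itself, sketched with a stand-in structure;
    writers are assumed to serialize on their own mutex (reconfig_mutex in
    md), with RCU protecting only the lockless readers:

        #include <linux/kernel.h>
        #include <linux/rculist.h>
        #include <linux/slab.h>

        struct dev_entry {
            struct list_head node;
            struct rcu_head rcu;
        };

        static LIST_HEAD(devices);

        static void free_entry(struct rcu_head *head)
        {
            kfree(container_of(head, struct dev_entry, rcu));
        }

        static void add_entry(struct dev_entry *d)    /* writer, under mutex */
        {
            list_add_rcu(&d->node, &devices);
        }

        static void del_entry(struct dev_entry *d)    /* writer, under mutex */
        {
            list_del_rcu(&d->node);
            call_rcu(&d->rcu, free_entry);  /* delay the free past readers   */
        }

        static void walk_entries(void)                /* lockless reader     */
        {
            struct dev_entry *d;

            rcu_read_lock();
            list_for_each_entry_rcu(d, &devices, node)
                ;                           /* read-only access is safe here */
            rcu_read_unlock();
        }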
     

11 Jul, 2008

1 commit


28 Jun, 2008

1 commit

  • When an array is degraded, bits in the write-intent bitmap are not
    cleared, so that if the missing device is re-added, it can be synced
    by only updating those parts of the device that have changed since
    it was removed.

    To enable this, an 'events_cleared' value is stored. It is the event
    counter for the array the last time that any bits were cleared.

    Sometimes - if a device disappears from an array while it is 'clean' -
    the events_cleared value gets updated incorrectly (there are subtle
    ordering issues between updating events in the main metadata and the
    bitmap metadata) resulting in the missing device appearing to require
    a full resync when it is re-added.

    With this patch, we update events_cleared precisely when we are about
    to clear a bit in the bitmap. We record events_cleared when we clear
    the bit internally, and copy that to the superblock which is written
    out before the bit on storage. This makes it more "obviously correct".

    We also need to update events_cleared when the event_count is going
    backwards (as happens on a dirty->clean transition of a non-degraded
    array).

    Thanks to Mike Snitzer for identifying this problem and testing early
    "fixes".

    Cc: "Mike Snitzer"
    Signed-off-by: Neil Brown

    Neil Brown
     

25 May, 2008

1 commit


11 Mar, 2008

1 commit

  • The recent patch titled
    "Reduce CPU wastage on idle md array with a write-intent bitmap"
    would sometimes leave the array with dirty bitmap bits that stay dirty. A
    subsequent write would sort things out, so it isn't a big problem, but it
    should be fixed nonetheless.

    We need to make sure that when the bitmap becomes not "allclean", the
    daemon_sleep really does get set to a sensible value.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

05 Mar, 2008

1 commit

  • On an md array with a write-intent bitmap, a thread wakes up every few seconds
    and scans the bitmap looking for work to do. If the array is idle, there will
    be no work to do, but a lot of scanning is done to discover this.

    So cache the fact that the bitmap is completely clean, and avoid scanning the
    whole bitmap when the cache is known to be clean.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

15 Feb, 2008

1 commit

  • d_path() is used on a <dentry, vfsmount> pair. Let's use a struct path to
    reflect this.

    [akpm@linux-foundation.org: fix build in mm/memory.c]
    Signed-off-by: Jan Blunck
    Acked-by: Bryan Wu
    Acked-by: Christoph Hellwig
    Cc: Al Viro
    Cc: "J. Bruce Fields"
    Cc: Neil Brown
    Cc: Michael Halcrow
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
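
    After the change a caller hands d_path() one struct path instead of the
    two pointers; a kernel-style sketch (the bitmap code passes its backing
    file's f_path):

        #include <linux/dcache.h>
        #include <linux/err.h>
        #include <linux/fs.h>
        #include <linux/kernel.h>

        static void report_backing_file(struct file *filp, char *buf, int len)
        {
            /* d_path() now takes the <dentry, vfsmount> pair as one struct
             * path; a file already carries one in ->f_path. */
            char *name = d_path(&filp->f_path, buf, len);

            if (!IS_ERR(name))
                pr_info("bitmap file: %s\n", name);
        }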
     

07 Feb, 2008

2 commits

  • This is more in line with common practice in the kernel. Also swap the
    args around to be more like list_for_each.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Currently an md array with a write-intent bitmap does not update that bitmap
    to reflect successful partial resync. Rather the entire bitmap is updated
    when the resync completes.

    This is because there is no guarantee that resync requests will complete in
    order, and tracking each request individually is unnecessarily burdensome.

    However there is value in regularly updating the bitmap, so add code to
    periodically pause while all pending sync requests complete, then update the
    bitmap. Doing this only every few seconds (the same as the bitmap update
    time) does not noticeably affect resync performance.

    [snitzer@gmail.com: export bitmap_cond_end_sync]
    Signed-off-by: Neil Brown
    Cc: "Mike Snitzer"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

09 Nov, 2007

1 commit


23 Oct, 2007

1 commit


18 Jul, 2007

2 commits

  • bitmap_unplug only ever returns 0, so it may as well be void. Two callers try
    to print a message if it returns non-zero, but that message is already printed
    by bitmap_file_kick.

    write_page returns an error which is not consistently checked. It always
    causes BITMAP_WRITE_ERROR to be set on an error, and that can more
    conveniently be checked.

    When the return of write_page is checked, an error causes bitmap_file_kick to
    be called - so move that call into write_page - and protect against recursive
    calls into bitmap_file_kick.

    bitmap_update_sb returns an error that is never checked.

    So make these 'void' and be consistent about checking the bit.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • We currently trust user-space completely to set up metadata describing a
    consistent array - in particular, that the metadata, data, and bitmap do not
    overlap.

    But userspace can be buggy, and it is better to report an error than corrupt
    data. So put in some appropriate checks.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
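
    The core of such a check is a half-open interval overlap test; a small
    user-space illustration with invented offsets:

        #include <stdio.h>

        /* Two half-open sector ranges overlap iff each starts before the
         * other ends. */
        static int ranges_overlap(unsigned long long a_start,
                                  unsigned long long a_end,
                                  unsigned long long b_start,
                                  unsigned long long b_end)
        {
            return a_start < b_end && b_start < a_end;
        }

        int main(void)
        {
            unsigned long long data_start = 2048, data_end = 1000000;
            unsigned long long bmap_start = 8,    bmap_end = 264;

            printf("data/bitmap overlap: %s\n",
                   ranges_overlap(data_start, data_end, bmap_start, bmap_end)
                   ? "yes, reject" : "no, accept");
            return 0;
        }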
     

24 May, 2007

1 commit


09 May, 2007

1 commit

  • Remove do_sync_file_range() and convert callers to just use
    do_sync_mapping_range().

    Signed-off-by: Mark Fasheh
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Fasheh
     

13 Apr, 2007

1 commit


12 Feb, 2007

1 commit


10 Feb, 2007

1 commit

  • md/bitmap tracks how many active write requests are pending on blocks
    associated with each bit in the bitmap, so that it knows when it can clear
    the bit (when count hits zero).

    The counter has 14 bits of space, so if there are ever more than 16383, we
    cannot cope.

    Currently the code just calls BUG_ON, as "all" drivers have request queue
    limits much smaller than this.

    However it seems that some don't. Apparently some multipath configurations
    can allow more than 16383 concurrent write requests.

    So, in this unlikely situation, instead of calling BUG_ON we now wait
    for the count to drop down a bit. This requires a new wait_queue_head,
    some waiting code, and a wakeup call.

    Tested by limiting the counter to 20 instead of 16383 (writes go a lot slower
    in that case...).

    Signed-off-by: Neil Brown
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
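
    A kernel-style sketch of the new behaviour; the names are stand-ins, not
    the real bitmap fields. The 14-bit counter saturates at 16383, and instead
    of BUG_ON the writer now sleeps until a completion makes room:

        #include <linux/spinlock.h>
        #include <linux/wait.h>

        #define COUNTER_MAX ((1 << 14) - 1)          /* 16383 */

        static DECLARE_WAIT_QUEUE_HEAD(overflow_wait);

        static void inc_pending(unsigned int *counter, spinlock_t *lock)
        {
            spin_lock_irq(lock);
            while (*counter == COUNTER_MAX) {
                /* Too many writes in flight for this bit: drop the lock,
                 * wait for a completion, then recheck under the lock. */
                spin_unlock_irq(lock);
                wait_event(overflow_wait, *counter < COUNTER_MAX);
                spin_lock_irq(lock);
            }
            (*counter)++;
            spin_unlock_irq(lock);
        }

        static void dec_pending(unsigned int *counter, spinlock_t *lock)
        {
            spin_lock_irq(lock);
            if ((*counter)-- == COUNTER_MAX)
                wake_up(&overflow_wait);             /* a waiter may proceed */
            spin_unlock_irq(lock);
        }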
     

27 Jan, 2007

1 commit

  • In most cases we check the size of the bitmap file before reading data from
    it. However when reading the superblock, we always read the first PAGE_SIZE
    bytes, which may extend past the end of the bitmap file. So limit that read
    to the size of the file where appropriate.

    Also, we get the count of available bytes wrong in one place, so that too can
    read past the end of the file.

    Cc: "yang yin"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown