11 Jan, 2012

1 commit

  • We normally try to avoid reading from write-mostly devices, but when
    we do we really have to check for bad blocks and be sure not to
    try reading them.

    With the current code, best_good_sectors might not get set and that
    causes zero-length read requests to be sent down which is very
    confusing.

    This bug was introduced in commit d2eb35acfdccbe2 and so the patch
    is suitable for 3.1.x and 3.2.x

    Reported-and-tested-by: Michał Mirosław
    Reported-and-tested-by: Art -kwaak- van Breemen
    Signed-off-by: NeilBrown
    Cc: stable@vger.kernel.org

    NeilBrown
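
A minimal user-space sketch of the idea, not the kernel's read_balance: every candidate device, including a write-mostly fallback, is checked against its bad blocks, and the number of good sectors is always recorded for whichever device is finally chosen. The names `struct dev`, `good_sectors_at` and `pick_read_dev` are hypothetical, and each device is assumed to have at most one bad range.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified device model; not the kernel's struct md_rdev. */
struct dev {
    int write_mostly;   /* avoid reading from this device if possible */
    long bad_start;     /* first bad sector, or -1 if there are none */
    long bad_len;       /* number of bad sectors */
};

/* How many sectors can be read from 'sector' (at most 'want')
 * before running into this device's bad range. */
static long good_sectors_at(const struct dev *d, long sector, long want)
{
    if (d->bad_start < 0 || d->bad_start >= sector + want)
        return want;                /* bad range starts beyond the request */
    if (d->bad_start + d->bad_len <= sector)
        return want;                /* bad range ends before the request */
    if (d->bad_start <= sector)
        return 0;                   /* request starts inside the bad range */
    return d->bad_start - sector;   /* readable up to the bad range */
}

/* Returns the chosen device index, or -1 if nothing is readable.
 * *max_sectors is always set, also on the write-mostly fallback path. */
static int pick_read_dev(const struct dev *devs, size_t n,
                         long sector, long want, long *max_sectors)
{
    int best = -1, fallback = -1;
    long best_good = 0, fallback_good = 0;

    assert(want > 0);
    for (size_t i = 0; i < n; i++) {
        long good = good_sectors_at(&devs[i], sector, want);

        if (good == 0)
            continue;               /* never read a known-bad block */
        if (devs[i].write_mostly) {
            if (good > fallback_good) {
                fallback = (int)i;
                fallback_good = good;
            }
            continue;
        }
        if (good > best_good) {
            best = (int)i;
            best_good = good;
        }
    }
    if (best < 0) {                 /* only write-mostly devices can serve it */
        best = fallback;
        best_good = fallback_good;
    }
    *max_sectors = best_good;
    return best;
}
```

Because a device is only remembered when its good-sector count is non-zero, a caller of this sketch can never be handed a device together with a zero-length read.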
     

23 Dec, 2011

8 commits


07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

05 Nov, 2011

1 commit

  • * 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
    block: don't call blk_drain_queue() if elevator is not up
    blk-throttle: use queue_is_locked() instead of lockdep_is_held()
    blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
    blk-throttle: Free up policy node associated with deleted rule
    block: warn if tag is greater than real_max_depth.
    block: make gendisk hold a reference to its queue
    blk-flush: move the queue kick into
    blk-flush: fix invalid BUG_ON in blk_insert_flush
    block: Remove the control of complete cpu from bio.
    block: fix a typo in the blk-cgroup.h file
    block: initialize the bounce pool if high memory may be added later
    block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
    block: drop @tsk from attempt_plug_merge() and explain sync rules
    block: make get_request[_wait]() fail if queue is dead
    block: reorganize throtl_get_tg() and blk_throtl_bio()
    block: reorganize queue draining
    block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
    block: pass around REQ_* flags instead of broken down booleans during request alloc/free
    block: move blk_throtl prototypes to block/blk.h
    block: fix genhd refcounting in blkio_policy_parse_and_set()
    ...

    Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
    and making the request functions be of type "void" instead of "int" in
    - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
    - drivers/staging/zram/zram_drv.c

    Linus Torvalds
     

01 Nov, 2011

1 commit


26 Oct, 2011

1 commit

  • In 3.0 we changed the way recovery_disabled was handled so that instead
    of testing against zero, we test an mddev-> value against a conf->
    value.
    Two problems:
    1/ one place in raid1 was missed and still sets to '1'.
    2/ We didn't explicitly set the conf-> value at array creation
    time.
    It defaulted to '0' just like the mddev value does so they
    could appear equal and thus disable recovery.
    This did not affect normal 'md' as it calls bind_rdev_to_array
    which changes the mddev value. However the dmraid interface
    doesn't call this and so doesn't change ->recovery_disabled; so at
    array start all recovery is incorrectly disabled.

    So initialise the 'conf' value to one less than the mddev value, so
    they will only be the same when explicitly set that way.

    Reported-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    NeilBrown
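
The comparison scheme can be sketched in plain C as follows. This is an illustrative model only: `struct mddev_model`, `struct conf_model` and the helper names are hypothetical and stand in for the kernel structures, which carry many more fields.

```c
#include <assert.h>

/* Hypothetical stand-ins for the array-wide and per-personality state. */
struct mddev_model { int recovery_disabled; };
struct conf_model  { int recovery_disabled; };

/* Recovery counts as disabled only while the two values match. */
static int recovery_is_disabled(const struct mddev_model *m,
                                const struct conf_model *c)
{
    return m->recovery_disabled == c->recovery_disabled;
}

/* At array creation, start one below the mddev value so that the
 * two values cannot be equal by accident (for example both 0). */
static void conf_init(struct conf_model *c, const struct mddev_model *m)
{
    c->recovery_disabled = m->recovery_disabled - 1;
}

/* Disabling recovery copies the mddev value instead of storing '1'. */
static void disable_recovery(struct conf_model *c,
                             const struct mddev_model *m)
{
    c->recovery_disabled = m->recovery_disabled;
}
```

In this model, any later change to the mddev value makes the two values differ again, which re-enables recovery without touching the conf value.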
     

24 Oct, 2011

1 commit

  • bio originally has the functionality to set the complete cpu, but
    it is broken.

    Christoph said that "This code is unused, and from the all the
    discussions lately pretty obviously broken. The only thing keeping
    it serves is creating more confusion and possibly more bugs."

    And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
    with leaving cpu control to the request based drivers, they are the
    only ones that can toggle the setting anyway".

    So this patch tries to remove all the work of controlling the
    completion cpu from a bio.

    Cc: Shaohua Li
    Cc: Christoph Hellwig
    Signed-off-by: Tao Ma
    Signed-off-by: Jens Axboe

    Tao Ma
     

19 Oct, 2011

1 commit


11 Oct, 2011

7 commits


07 Oct, 2011

3 commits


21 Sep, 2011

1 commit

  • Two related problems:

    1/ some error paths call "md_unregister_thread(mddev->thread)"
    without subsequently clearing ->thread. A subsequent call
    to mddev_unlock will try to wake the thread, and crash.

    2/ Most calls to md_wakeup_thread are protected against the thread
    disappearing either by:
    - holding the ->mutex
    - having an active request, so something else must be keeping
    the array active.
    However mddev_unlock calls md_wakeup_thread after dropping the
    mutex and without any certainty of an active request, so the
    ->thread could theoretically disappear.
    So we need a spinlock to provide some protections.

    So change md_unregister_thread to take a pointer to the thread
    pointer, and ensure that it always does the required locking, and
    clears the pointer properly.

    Reported-by: "Moshe Melnikov"
    Signed-off-by: NeilBrown
    cc: stable@kernel.org

    NeilBrown
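
The pattern of passing a pointer to the thread pointer can be sketched in user space. This is an assumption-laden model, not the md code: `struct worker`, `register_worker`, `unregister_worker` and `wakeup_worker` are hypothetical names, and a C11 `atomic_flag` spinlock stands in for the kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Tiny spinlock standing in for the kernel spinlock. */
static atomic_flag thread_lock = ATOMIC_FLAG_INIT;

static void lock(void)
{
    while (atomic_flag_test_and_set_explicit(&thread_lock,
                                             memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

static void unlock(void)
{
    atomic_flag_clear_explicit(&thread_lock, memory_order_release);
}

/* Hypothetical worker object; a real thread would live behind it. */
struct worker { int wakeups; };

static struct worker *register_worker(void)
{
    return calloc(1, sizeof(struct worker));
}

/* Takes the address of the owner's pointer, so the pointer is always
 * cleared, and cleared under the lock, before the worker goes away. */
static void unregister_worker(struct worker **slot)
{
    struct worker *w;

    lock();
    w = *slot;
    *slot = NULL;
    unlock();
    free(w);    /* free(NULL) is a no-op, so a second call is harmless */
}

/* Safe against a concurrent unregister: the pointer is only
 * dereferenced while the lock is held. Returns 1 if a worker was woken. */
static int wakeup_worker(struct worker **slot)
{
    int woke = 0;

    lock();
    if (*slot) {
        (*slot)->wakeups++;
        woke = 1;
    }
    unlock();
    return woke;
}
```

Since the callee clears the pointer itself, an error path cannot forget to do so, and a later wakeup sees NULL rather than a stale pointer.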
     

12 Sep, 2011

1 commit

  • There is very little benefit in letting a ->make_request
    instance update the bio's device and sector and loop around it in
    __generic_make_request when we can achieve the same by calling
    generic_make_request from the driver and letting the loop in
    generic_make_request handle it.

    Note that various drivers got the return value from ->make_request and
    returned non-zero values for errors.

    Signed-off-by: Christoph Hellwig
    Acked-by: NeilBrown
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

10 Sep, 2011

1 commit

  • A single request to RAID1 or RAID10 might result in multiple
    requests if there are known bad blocks that need to be avoided.

    To detect if we need to submit another write request we test:
    if (sectors_handled < (bio->bi_size >> 9)) {

    However this is after we call **_write_done() so the 'bio' no longer
    belongs to us - the writes could have completed and the bio freed.

    So move the **_write_done call until after the test against
    bio->bi_size.

    This addresses https://bugzilla.kernel.org/show_bug.cgi?id=41862

    Reported-by: Bruno Wolff III
    Tested-by: Bruno Wolff III
    Signed-off-by: NeilBrown

    NeilBrown
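
The ordering issue can be illustrated with a small reference-counted model. It is only a sketch: `struct req`, `req_put` and `finish_chunk` are hypothetical, and taking an extra reference for the next chunk is an assumption of this model rather than a description of the raid1 code.

```c
#include <assert.h>
#include <stdlib.h>

static int frees;   /* counts freed requests so the sketch can be checked */

/* Hypothetical request: total size plus a count of outstanding users. */
struct req {
    unsigned int size_bytes;
    int pending;
};

static struct req *req_new(unsigned int size_bytes)
{
    struct req *r = malloc(sizeof(*r));

    assert(r != NULL);
    r->size_bytes = size_bytes;
    r->pending = 1;
    return r;
}

/* Drops one reference; after this call 'r' may already be freed. */
static void req_put(struct req *r)
{
    if (--r->pending == 0) {
        frees++;
        free(r);
    }
}

/* Returns 1 if another chunk must be submitted. The size is read
 * BEFORE the reference is dropped; doing it the other way round
 * would touch memory that may no longer belong to us. */
static int finish_chunk(struct req *r, unsigned int sectors_handled)
{
    int more = sectors_handled < (r->size_bytes >> 9);

    if (more)
        r->pending++;   /* assumed: the next chunk holds its own reference */
    req_put(r);
    return more;
}
```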
     

28 Jul, 2011

11 commits

  • raid1d is too big with several deep branches.
    So separate them out into their own functions.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     
  • If we cannot read a block from anywhere during recovery, there is
    now a better approach than just giving up.
    We can record a bad block on each device and keep going - being
    careful not to clear the bad block when a write succeeds as it might -
    it will be a write of incorrect data.

    We have now reached the state where - for raid1 - we only call
    md_error if md_set_badblocks has failed.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     
  • If we find a bad block while writing as part of resync/recovery we
    need to report that back to raid1d which must record the bad block,
    or fail the device.

    Similarly when fixing a read error, a further error should just
    record a bad block if possible rather than failing the device.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     
  • When we get a write error (in the data area, not in metadata),
    update the badblock log rather than failing the whole device.

    As the write may well be many blocks, we try writing each
    block individually and only log the ones which fail.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
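
The retry phase described here can be sketched as a loop over block-sized pieces. Everything in this model is hypothetical: `struct disk` simulates a device whose writes fail inside one range, `logged_bad` stands in for the bad-block log, and `narrow_write` is not the kernel function of a similar name.

```c
#include <assert.h>

#define MAX_SECTORS 64

/* Hypothetical disk: writes touching [fail_start, fail_start + fail_len)
 * fail; logged_bad[] stands in for the bad-block log. */
struct disk {
    long fail_start, fail_len;
    unsigned char logged_bad[MAX_SECTORS];
};

/* Simulated write: 0 on success, -1 if it overlaps the failing range. */
static int disk_write(const struct disk *d, long sector, long n)
{
    if (d->fail_len > 0 &&
        sector < d->fail_start + d->fail_len &&
        d->fail_start < sector + n)
        return -1;
    return 0;
}

/* Retry a failed write one block at a time, logging only the blocks
 * that still fail. Returns the number of sectors logged as bad. */
static long narrow_write(struct disk *d, long sector, long n, long block)
{
    long failed = 0;

    assert(block > 0 && sector >= 0 && sector + n <= MAX_SECTORS);
    while (n > 0) {
        /* Stop each piece at the next block boundary. */
        long len = block - (sector % block);

        if (len > n)
            len = n;
        if (disk_write(d, sector, len) != 0) {
            for (long s = sector; s < sector + len; s++)
                d->logged_bad[s] = 1;
            failed += len;
        }
        sector += len;
        n -= len;
    }
    return failed;
}
```

With a failing range of two sectors inside one 8-sector block, a 32-sector write in this model logs exactly that one block and leaves the other three untouched.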
     
  • When performing write-behind we allocate pages to store the data
    during write.
    Previously we just kept a list of pages. Now we keep a list of
    bi_vec which includes offset and size.
    This means that the r1bio has complete information to create a new
    bio which will be needed for retrying after write errors.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     
  • If we succeed in writing to a block that was recorded as
    being bad, we clear the bad-block record.

    This requires some delayed handling as the bad-block-list update has
    to happen in process-context.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     
  • If we have seen any write error on a drive, then don't write to
    any known-bad blocks on that drive.
    If necessary, we divide the write request up into pieces just
    like we do for reads, so each piece is either all written or
    all not written to any given drive.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
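
One way to picture the splitting is a helper that computes the next piece of a write together with the set of drives that receive it. This is a simplified sketch under stated assumptions: `struct range` and `next_piece` are hypothetical, each drive has at most one bad range, and the "only after a write error has been seen" condition is left out.

```c
#include <assert.h>
#include <stddef.h>

/* One known-bad range per drive; len == 0 means no bad blocks. */
struct range { long start, len; };

/* Returns the length of the next piece starting at 'sector' (at most
 * 'want') and sets bit i of *write_mask for each drive that should
 * receive the piece. Every drive gets all of the piece or none of it. */
static long next_piece(const struct range *bad, size_t ndrives,
                       long sector, long want, unsigned int *write_mask)
{
    long max = want;

    assert(ndrives < 32);
    *write_mask = 0;
    for (size_t i = 0; i < ndrives; i++) {
        long bs = bad[i].start, be = bs + bad[i].len;

        if (bad[i].len == 0 || be <= sector || bs >= sector + max) {
            *write_mask |= 1u << i;     /* no overlap: write it all */
        } else if (bs <= sector) {
            /* Piece starts inside the bad range: skip this drive and
             * end the piece where the bad range ends. */
            if (be - sector < max)
                max = be - sector;
        } else {
            /* Bad range starts later: write, but stop before it. */
            max = bs - sector;
            *write_mask |= 1u << i;
        }
    }
    return max;
}
```

Calling the helper repeatedly, advancing `sector` by the returned length each time, walks the whole request. The piece only ever shrinks inside the loop, so a decision made for an earlier drive stays valid for the final length.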
     
  • It is only safe to choose not to write to a bad block if that bad
    block is safely recorded in metadata - i.e. if it has been
    'acknowledged'.

    If it hasn't we need to wait for the acknowledgement.

    We support that using rdev->blocked_wait and
    md_wait_for_blocked_rdev by introducing a new device flag
    'BlockedBadBlocks'.

    This flag is only advisory.
    It is cleared whenever we acknowledge a bad block, so that a waiter
    can re-check the particular bad blocks that it is interested in.

    It should be set by a caller when they find they need to wait.
    This (set after test) is inherently racy, but as
    md_wait_for_blocked_rdev already has a timeout, losing the race will
    have minimal impact.

    When we clear "Blocked" we also clear "BlockedBadBlocks" in case it
    was set incorrectly (see above race).

    We also modify the way we manage 'Blocked' to fit better with the new
    handling of 'BlockedBadBlocks' and to make it consistent between
    externally managed and internally managed metadata. This requires
    that each raidXd loop checks if the metadata needs to be written and
    triggers a write (md_check_recovery) if needed. Otherwise a queued
    write request might cause raidXd to wait for the metadata to write,
    and only that thread can write it.

    Before writing metadata, we set FaultRecorded for all devices that
    are Faulty, then after writing the metadata we clear Blocked for any
    device for which the Fault was certainly Recorded.

    The 'faulty' device flag now appears in sysfs if the device is faulty
    *or* it has unacknowledged bad blocks. So user-space which does not
    understand bad blocks can continue to function correctly.
    User space which does, should not assume a device is faulty until it
    sees the 'faulty' flag, and then sees the list of unacknowledged bad
    blocks is empty.

    Signed-off-by: NeilBrown

    NeilBrown
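
The two-step handling of FaultRecorded around a metadata write can be modelled with plain bit flags. This is a sketch only: the flag values and the function names are hypothetical, and clearing FAULT_RECORDED together with BLOCKED in the second step is an assumption of this model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device flag bits modelling the md device flags. */
enum {
    FAULTY         = 1u << 0,
    BLOCKED        = 1u << 1,
    FAULT_RECORDED = 1u << 2
};

/* Step 1, before the metadata write: note which faults this
 * write is going to record. */
static void before_metadata_write(unsigned int *flags, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (flags[i] & FAULTY)
            flags[i] |= FAULT_RECORDED;
}

/* Step 2, after the metadata write: unblock only the devices whose
 * fault was certainly part of that write. A device that failed while
 * the write was in flight has no FAULT_RECORDED bit and stays blocked
 * until the next metadata update. */
static void after_metadata_write(unsigned int *flags, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (flags[i] & FAULT_RECORDED)
            flags[i] &= ~(unsigned int)(FAULT_RECORDED | BLOCKED);
}
```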
     
  • When performing resync/etc, keep the size of the request
    small enough that it doesn't overlap any known bad blocks.
    Devices with badblocks at the start of the request are completely
    excluded.
    If there is nowhere to read from due to bad blocks, record
    a bad block on each target device.

    Now that we never read from known-bad-blocks we can allow devices with
    known-bad-blocks into a RAID1.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Now that we have a bad block list, we should not read from those
    blocks.
    There are several main parts to this:
    1/ read_balance needs to check for bad blocks, and return not only
    the chosen device, but also how many good blocks are available
    there.
    2/ fix_read_error needs to avoid trying to read from bad blocks.
    3/ read submission must be ready to issue multiple reads to
    different devices as different bad blocks on different devices
    could mean that a single large read cannot be served by any one
    device, but can still be served by the array.
    This requires keeping count of the number of outstanding requests
    per bio. This count is stored in 'bi_phys_segments'
    4/ retrying a read needs to also be ready to submit a smaller read
    and queue another request for the rest.

    This does not yet handle bad blocks when reading to perform resync,
    recovery, or check.

    'md_trim_bio' will also be used for RAID10, so put it in md.c and
    export it.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • As no personality understands bad block lists yet, we must
    reject any device that is known to contain bad blocks.
    As the personalities get taught, these tests can be removed.

    This only applies to raid1/raid5/raid10.
    For linear/raid0/multipath/faulty the whole concept of bad blocks
    doesn't mean anything so there is no point adding the checks.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     

27 Jul, 2011

1 commit