13 Sep, 2013

1 commit

  • Pull vfs pile 4 from Al Viro:
    "list_lru pile, mostly"

    This came out of Andrew's pile; Al ended up doing the merge work so
    that Andrew didn't have to.

    Additionally, a few fixes.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
    super: fix for destroy lrus
    list_lru: dynamically adjust node arrays
    shrinker: Kill old ->shrink API.
    shrinker: convert remaining shrinkers to count/scan API
    staging/lustre/libcfs: cleanup linux-mem.h
    staging/lustre/ptlrpc: convert to new shrinker API
    staging/lustre/obdclass: convert lu_object shrinker to count/scan API
    staging/lustre/ldlm: convert to shrinkers to count/scan API
    hugepage: convert huge zero page shrinker to new shrinker API
    i915: bail out earlier when shrinker cannot acquire mutex
    drivers: convert shrinkers to new count/scan API
    fs: convert fs shrinkers to new scan/count API
    xfs: fix dquot isolation hang
    xfs-convert-dquot-cache-lru-to-list_lru-fix
    xfs: convert dquot cache lru to list_lru
    xfs: rework buffer dispose list tracking
    xfs-convert-buftarg-lru-to-generic-code-fix
    xfs: convert buftarg LRU to generic code
    fs: convert inode and dentry shrinking to be node aware
    vmscan: per-node deferred work
    ...

    Linus Torvalds
     

11 Sep, 2013

3 commits

  • Convert the driver shrinkers to the new API. Most changes are compile
    tested only because I either don't have the hardware or it's staging
    stuff.

    FWIW, the md and android code is pretty good, but the rest of it
    makes me want to claw my eyes out. The amount of broken code I just
    encountered is mind-boggling. I've added comments explaining what is
    broken, but I fear that some of the code would be best dealt with by
    being dragged behind the bike shed, buried in mud up to its neck, and
    then run over repeatedly with a blunt lawn mower.

    Special mention goes to the zcache/zcache2 drivers. They can't
    co-exist in the build at the same time, they are under different menu
    options in menuconfig, and they only show up when you've got the
    right set of mm subsystem options configured, so even compile testing
    is an exercise in pulling teeth. And that doesn't even take into
    account the horrible, broken code...
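
    For reference, a minimal sketch of the count/scan shrinker API these
    conversions move to (shapes as merged for 3.12; the my_* helpers are
    hypothetical stand-ins for a driver's object cache):

        static unsigned long my_count(struct shrinker *s,
                                      struct shrink_control *sc)
        {
                /* cheap estimate of freeable objects; 0 means nothing to do */
                return my_cache_object_count();        /* hypothetical */
        }

        static unsigned long my_scan(struct shrinker *s,
                                     struct shrink_control *sc)
        {
                /* free up to sc->nr_to_scan objects and return the number
                 * freed, or SHRINK_STOP if no progress can be made */
                return my_cache_trim(sc->nr_to_scan);  /* hypothetical */
        }

        static struct shrinker my_shrinker = {
                .count_objects = my_count,
                .scan_objects  = my_scan,
                .seeks         = DEFAULT_SEEKS,
        };

    Registration is unchanged: register_shrinker(&my_shrinker).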

    [glommer@openvz.org: fixes for i915, android lowmem, zcache, bcache]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Acked-by: Mel Gorman
    Cc: Daniel Vetter
    Cc: Kent Overstreet
    Cc: John Stultz
    Cc: David Rientjes
    Cc: Jerome Glisse
    Cc: Thomas Hellstrom
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton

    Signed-off-by: Al Viro

    Dave Chinner
     
  • Pull device-mapper updates from Mike Snitzer:
    "Add the ability to collect I/O statistics on user-defined regions of a
    device-mapper device. This dm-stats code required the reintroduction
    of a div64_u64_rem() helper, but as a separate method that doesn't
    slow down div64_u64() -- especially on 32-bit systems.

    Allow the error target to replace request-based DM devices (e.g.
    multipath) in addition to bio-based DM devices.

    Various other small code fixes and improvements to thin-provisioning,
    DM cache and the DM ioctl interface"
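
    On the div64_u64_rem() helper mentioned above: a hedged sketch of how
    a remainder-returning variant can avoid a second slow 64-by-64
    division on 32-bit systems (my_div64_u64_rem is an illustrative name,
    not the kernel's actual implementation):

        #include <linux/math64.h>

        static inline u64 my_div64_u64_rem(u64 dividend, u64 divisor,
                                           u64 *remainder)
        {
                u64 quot = div64_u64(dividend, divisor); /* one slow divide */

                *remainder = dividend - quot * divisor;  /* multiply-subtract */
                return quot;
        }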

    * tag 'dm-3.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dm stripe: silence a couple sparse warnings
    dm: add statistics support
    dm thin: always return -ENOSPC if no_free_space is set
    dm ioctl: cleanup error handling in table_load
    dm ioctl: increase granularity of type_lock when loading table
    dm ioctl: prevent rename to empty name or uuid
    dm thin: set pool read-only if breaking_sharing fails block allocation
    dm thin: prefix pool error messages with pool device name
    dm: allow error target to replace bio-based and request-based targets
    math64: New separate div64_u64_rem helper
    dm space map: optimise sm_ll_dec and sm_ll_inc
    dm btree: prefetch child nodes when walking tree for a dm_btree_del
    dm btree: use pop_frame in dm_btree_del to cleanup code
    dm cache: eliminate holes in cache structure
    dm cache: fix stacking of geometry limits
    dm thin: fix stacking of geometry limits
    dm thin: add data block size limits to Documentation
    dm cache: add data block size limits to code and Documentation
    dm cache: document metadata device is exclussive to a cache
    dm: stop using WQ_NON_REENTRANT

    Linus Torvalds
     
  • Pull md update from Neil Brown:
    "Headline item is multithreading for RAID5 so that more IO/sec can be
    supported on fast (SSD) devices. Also TILE-Gx SIMD support for RAID6
    calculations and an assortment of bug fixes"

    * tag 'md/3.12' of git://neil.brown.name/md:
    raid5: only wakeup necessary threads
    md/raid5: flush out all pending requests before proceeding with reshape.
    md/raid5: use seqcount to protect access to shape in make_request.
    raid5: sysfs entry to control worker thread number
    raid5: offload stripe handle to workqueue
    raid5: fix stripe release order
    raid5: make release_stripe lockless
    md: avoid deadlock when dirty buffers during md_stop.
    md: Don't test all of mddev->flags at once.
    md: Fix apparent cut-and-paste error in super_90_validate
    raid6/test: replace echo -e with printf
    RAID: add tilegx SIMD implementation of raid6
    md: fix safe_mode buglet.
    md: don't call md_allow_write in get_bitmap_file.

    Linus Torvalds
     

06 Sep, 2013

9 commits

  • Eliminate the following sparse warnings:
    drivers/md/dm-stripe.c:443:12: warning: symbol 'dm_stripe_init' was not declared. Should it be static?
    drivers/md/dm-stripe.c:456:6: warning: symbol 'dm_stripe_exit' was not declared. Should it be static?

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Support the collection of I/O statistics on user-defined regions of
    a DM device. If no regions are defined, no statistics are collected,
    so there is no performance impact. Only bio-based DM devices are
    currently supported.

    Each user-defined region specifies a starting sector, length and step.
    Individual statistics will be collected for each step-sized area within
    the range specified.

    The I/O statistics counters for each step-sized area of a region are
    in the same format as /sys/block/*/stat or /proc/diskstats but extra
    counters (12 and 13) are provided: total time spent reading and
    writing in milliseconds. All these counters may be accessed by sending
    the @stats_print message to the appropriate DM device via dmsetup.

    The creation of DM statistics will allocate memory via kmalloc or
    fall back to using vmalloc space. At most 1/4 of the overall system
    memory may be allocated by DM statistics. The admin can see how much
    memory is used by reading
    /sys/module/dm_mod/parameters/stats_current_allocated_bytes

    See Documentation/device-mapper/statistics.txt for more details.
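
    A hedged sketch of the 1/4-of-memory cap described above (the helper
    name and accounting variable are illustrative, not dm-stats' actual
    internals):

        static bool stats_alloc_allowed(size_t alloc_size, size_t allocated)
        {
                /* compare in pages to avoid 32-bit overflow: allow at most
                 * 1/4 of total system memory for statistics data */
                return ((allocated + alloc_size) >> PAGE_SHIFT) <=
                        totalram_pages / 4;
        }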

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • If pool has 'no_free_space' set it means a previous allocation already
    determined the pool has no free space (and failed that allocation with
    -ENOSPC). By always returning -ENOSPC if 'no_free_space' is set, we do
    not allow the pool to oscillate between allocating blocks and then not.
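
    Roughly the shape of that check, as a hedged sketch (the normal
    allocation path is elided):

        static int alloc_data_block(struct thin_c *tc, dm_block_t *result)
        {
                struct pool *pool = tc->pool;

                /* a previous allocation already found the pool full: keep
                 * returning -ENOSPC until a table reload clears the flag */
                if (pool->no_free_space)
                        return -ENOSPC;

                /* ... normal free-block allocation path ... */
                return 0;
        }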

    But a side-effect of this determinism is that if a user wants to be able
    to allocate new blocks they'll need to reload the pool's table (to clear
    the 'no_free_space' flag). This reload will happen automatically if the
    pool's data volume is resized. But if the user takes action to free a
    lot of space by deleting snapshot volumes, etc., the pool will no
    longer allow data allocations to continue without an intervening
    table reload.

    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Make use of common cleanup code.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Hold the mapped device's type_lock before calling populate_table() since
    it is where the table's type is determined based on the specified
    targets. There is no need to allow concurrent table loads to race to
    establish the table's targets or type.

    This eliminates the need to grab the lock in dm_table_set_type().

    Also verify that the type_lock is held in both dm_set_md_type() and
    dm_get_md_type().
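
    A hedged sketch of the resulting shape of table_load() (error
    unwinding elided):

        dm_lock_md_type(md);            /* take md->type_lock up front */
        r = populate_table(t, param, param_flags); /* type decided in here */
        if (!r) {
                if (dm_get_md_type(md) == DM_TYPE_NONE)
                        dm_set_md_type(md, dm_table_get_type(t)); /* 1st load */
                else if (dm_get_md_type(md) != dm_table_get_type(t))
                        r = -EINVAL;    /* type can't change on reload */
        }
        dm_unlock_md_type(md);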

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • A device-mapper device must always have a name consisting of a non-empty
    string. If the device also has a uuid, this similarly must not be an
    empty string.

    The DM_DEV_CREATE ioctl enforces these rules when the device is created,
    but this patch is needed to enforce them when DM_DEV_RENAME is used to
    change the name or uuid.
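
    As a hedged sketch, the rename path simply has to reject an empty
    string (new_name here is illustrative):

        if (!*new_name) {
                DMWARN("Invalid new mapped device name or uuid string supplied.");
                return -EINVAL;
        }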

    Reported-by: Zdenek Kabelac
    Signed-off-by: Alasdair G Kergon
    Signed-off-by: Mike Snitzer
    Acked-by: Mikulas Patocka

    Alasdair Kergon
     
  • break_sharing() now handles an arbitrary alloc_data_block() error
    the same way as provision_block(): it marks the pool read-only and
    errors the cell.

    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Useful to know which pool is experiencing the error.

    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • It may be useful to switch a request-based table to the "error" target.
    Enhance the DM core to allow a hybrid target_type which is capable of
    handling either bios (via .map) or requests (via .map_rq).

    Add a request-based map function (.map_rq) to the "error" target_type;
    making it DM's first hybrid target. Train dm_table_set_type() to prefer
    the mapped device's established type (request-based or bio-based). If
    the mapped device doesn't have an established type default to making the
    table with the hybrid target(s) bio-based.
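
    A hedged sketch of what a hybrid target_type looks like, modelled on
    the "error" target (ctr/dtr and module boilerplate elided):

        static int io_err_map(struct dm_target *ti, struct bio *bio)
        {
                return -EIO;            /* bio-based path: fail every bio */
        }

        static int io_err_map_rq(struct dm_target *ti, struct request *clone,
                                 union map_info *map_context)
        {
                return -EIO;            /* request-based path: fail every clone */
        }

        static struct target_type error_target = {
                .name   = "error",
                .map    = io_err_map,   /* used by bio-based tables */
                .map_rq = io_err_map_rq /* used by request-based tables */
        };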

    Tested that 'dmsetup wipe_table' works on both bio-based and
    request-based devices.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Joe Jin
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     

04 Sep, 2013

1 commit

  • Pull first round of SCSI updates from James Bottomley:
    "This patch set is a set of driver updates (ufs, zfcp, lpfc, mpt2/3sas,
    qla4xxx, qla2xxx [adding support for ISP8044 + other things]).

    We also have a new driver: esas2r which has a number of static checker
    problems, but which I expect to resolve over the -rc course of 3.12
    under the new driver exception.

    We also have the error returns that were discussed at LSF"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (118 commits)
    [SCSI] sg: push file descriptor list locking down to per-device locking
    [SCSI] sg: checking sdp->detached isn't protected when open
    [SCSI] sg: no need sg_open_exclusive_lock
    [SCSI] sg: use rwsem to solve race during exclusive open
    [SCSI] scsi_debug: fix logical block provisioning support when unmap_alignment != 0
    [SCSI] scsi_debug: fix endianness bug in sdebug_build_parts()
    [SCSI] qla2xxx: Update the driver version to 8.06.00.08-k.
    [SCSI] qla2xxx: print MAC via %pMR.
    [SCSI] qla2xxx: Correction to message ids.
    [SCSI] qla2xxx: Correctly print out/in mailbox registers.
    [SCSI] qla2xxx: Add a new interface to update versions.
    [SCSI] qla2xxx: Move queue depth ramp down message to i/o debug level.
    [SCSI] qla2xxx: Select link initialization option bits from current operating mode.
    [SCSI] qla2xxx: Add loopback IDC-TIME-EXTEND aen handling support.
    [SCSI] qla2xxx: Set default critical temperature value in cases when ISPFX00 firmware doesn't provide it
    [SCSI] qla2xxx: QLAFX00 make over temperature AEN handling informational, add log for normal temperature AEN
    [SCSI] qla2xxx: Correct Interrupt Register offset for ISPFX00
    [SCSI] qla2xxx: Remove handling of Shutdown Requested AEN from qlafx00_process_aen().
    [SCSI] qla2xxx: Send all AENs for ISPFx00 to above layers.
    [SCSI] qla2xxx: Add changes in initialization for ISPFX00 cards with BIOS
    ...

    Linus Torvalds
     

02 Sep, 2013

1 commit

  • If there are not enough stripes to handle, we'd better not always
    queue all available work_structs. If one worker can only handle a few
    or even no stripes, it will impact request merging and create lock
    contention.

    With this patch, the number of running work_structs depends on the
    number of pending stripes. Note: some statistics info used in the
    patch is accessed without locking protection. This shouldn't matter;
    we just try our best to avoid queueing unnecessary work_structs.
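
    A hedged sketch of the idea (the pending-stripe counter and group
    fields are illustrative, not raid5's exact ones):

        /* queue at most as many work_structs as there are pending stripes */
        int i, thread_cnt = min(pending_stripes, conf->worker_cnt_per_group);

        for (i = 0; i < thread_cnt; i++)
                queue_work_on(sh->cpu, raid5_wq, &group->workers[i].work);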

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     

28 Aug, 2013

6 commits

  • Some requests - particularly 'discard' and 'read' - are handled
    differently depending on whether a reshape is active or not.

    It is harmless to assume reshape is active if it isn't, but wrong
    to act as though reshape is not active when it is.

    So when we start reshape - after making clear to all requests that
    reshape has started - use mddev_suspend/mddev_resume to flush out all
    requests. This will ensure that no requests will be assuming the
    absence of reshape once it really starts.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • make_request() accesses various shape parameters (raid_disks,
    chunk_size, etc.) which might be changed by raid5_start_reshape().

    If the latter is called at an awkward time during the former, the
    wrong stripe_head might be used.

    So introduce a 'seqcount' and, after finding a stripe_head, make sure
    there is no reason to suspect that we got the wrong one.
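
    The seqcount pattern, as a hedged sketch (raid5's counter is
    conf->gen_lock; the real retry path also releases the stripe before
    looping):

        /* writer, raid5_start_reshape(): */
        write_seqcount_begin(&conf->gen_lock);
        conf->raid_disks = new_disks;   /* ... update shape parameters ... */
        write_seqcount_end(&conf->gen_lock);

        /* reader, make_request(): */
        do {
                seq = read_seqcount_begin(&conf->gen_lock);
                sh = get_active_stripe(conf, new_sector, previous, 0, 0);
        } while (read_seqcount_retry(&conf->gen_lock, seq));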

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Add a sysfs entry to control the number of running workqueue
    threads. If group_thread_cnt is set to 0, workqueue offload handling
    of stripes is disabled.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • This is another attempt to create multiple threads to handle raid5
    stripes. This time I use a workqueue.

    raid5 handles requests (especially writes) in stripe units. A stripe
    is page-size aligned and spans all disks. For a write to any disk
    sector, raid5 runs a state machine for the corresponding stripe,
    which includes reading some disks of the stripe, calculating parity,
    and writing some disks of the stripe. The state machine currently
    runs in the raid5d thread. Since there is only one thread, it doesn't
    scale well for high-speed storage. An obvious solution is
    multi-threading.

    To get better performance, we have some requirements:
    a. locality: a stripe corresponding to a request submitted from one
    cpu is better handled by a thread on the local cpu or local node. The
    local cpu is preferred, but can sometimes be a bottleneck - for
    example, when parity calculation is too heavy; running on the local
    node has wider adaptability.
    b. configurability: different raid5 array setups might need different
    configurations, especially the thread number. More threads don't
    always mean better performance, because of lock contention.

    My original implementation created some kernel threads. There were
    interfaces to control which cpus' stripes each thread should handle,
    and userspace could set the affinity of the threads. This provided
    the biggest flexibility and configurability, but it was hard to use,
    and apparently a new thread pool implementation was disfavoured.

    Recent workqueue improvements are quite promising. An unbound
    workqueue will be bound to a NUMA node, and if WQ_SYSFS is set on the
    workqueue, there are sysfs options to do affinity setting - for
    example, we can include only one HT sibling in the affinity mask.
    Since work is non-reentrant by default, we can control the number of
    running threads by limiting the number of dispatched work_structs.
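
    For reference, this is the kind of workqueue being described (flags
    as in the raid5 series; WQ_SYSFS exposes cpumask and numa attributes
    under /sys/devices/virtual/workqueue/):

        raid5_wq = alloc_workqueue("raid5wq",
                                   WQ_UNBOUND | WQ_MEM_RECLAIM |
                                   WQ_CPU_INTENSIVE | WQ_SYSFS, 0);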

    In this patch, I created several stripe worker groups. A group
    corresponds to a NUMA node: stripes from the cpus of one node are
    added to that group's list, and workqueue threads of one node only
    handle stripes of that node's worker group. In this way, stripe
    handling has NUMA node locality, and, as said above, we can control
    the thread number by limiting the number of dispatched work_structs.

    The work_struct callback function handles several stripes in one run.
    Typical workqueue usage is to run one unit per work_struct; in the
    raid5 case the unit would be a stripe. But we can't do that:
    a. Though handling a stripe doesn't need a lock (because of reference
    accounting, and the stripe isn't on any list), queuing a work_struct
    for each stripe would make the workqueue lock very heavily contended.
    b. blk_start_plug()/blk_finish_plug() should surround stripe
    handling, as we might dispatch requests. If each work_struct only
    handled one stripe, such block plugging would be meaningless.

    This implementation can't do very fine-grained configuration, but
    NUMA binding is the most popular usage model and should be enough for
    most workloads.

    Note: since we have only one stripe queue, switching to
    multi-threading might decrease the request size dispatched down to
    the low-level layer. The impact depends on thread number, raid
    configuration and workload, so multi-threaded raid5 might not be
    appropriate for all setups.

    Changes V1 -> V2:
    1. removed WQ_NON_REENTRANT
    2. disabled multi-threading by default
    3. added more description to the changelog

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • patch "make release_stripe lockless" changes the order stripes are released.
    Originally I thought block layer can take care of request merge, but it appears
    there are still some requests not merged. It's easy to fix the order.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • release_stripe still has big lock contention. With this patch we
    just add the stripe to an llist without taking device_lock, and let
    the raid5d thread do the real stripe release, which must hold
    device_lock anyway. In this way, release_stripe doesn't hold any
    locks.
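
    A hedged sketch of the two halves (field names follow the raid5
    change):

        /* release_stripe(): lock-free push; wake raid5d on first insertion */
        if (llist_add(&sh->release_list, &conf->released_stripes))
                md_wakeup_thread(conf->mddev->thread);

        /* raid5d: detach the whole list at once, then do the real release
         * of each stripe under device_lock as before */
        head = llist_del_all(&conf->released_stripes);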

    The side effect is that the order of released stripes is changed. But
    that doesn't sound like a big deal: stripes are never handled in
    order, and I thought the block layer could already do nice request
    merging anyway, which means order isn't that important.

    I kept the unplug release batch, which is unnecessary with this patch
    from a lock-contention-avoidance point of view (and if we deleted it,
    the stripe_head release_list and lru could actually share storage),
    but the unplug release batch is also helpful for request merging. We
    could probably delay waking raid5d until unplug, but I'm still wary
    of the case where raid5d is already running.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     

27 Aug, 2013

5 commits

  • When the last process closes /dev/mdX, sync_blockdev will be called
    so that all buffers get flushed.
    So if it is then opened for the STOP_ARRAY ioctl to be sent, there
    will be nothing to flush.

    However if we open /dev/mdX in order to send the STOP_ARRAY ioctl just
    moments before some other process which was writing closes their file
    descriptor, then there won't be a 'last close' and the buffers might
    not get flushed.

    So do_md_stop() calls sync_blockdev(). However at this point it is
    holding ->reconfig_mutex. So if the array is currently 'clean' then
    the writes from sync_blockdev() will not complete until the array
    can be marked dirty and that won't happen until some other thread
    can get ->reconfig_mutex. So we deadlock.

    We need to move the sync_blockdev() call to before we take
    ->reconfig_mutex.
    However then some other thread could open /dev/mdX and write to it
    after we call sync_blockdev() and before we actually stop the array.
    This can leave dirty data in the page cache which is awkward.

    So introduce new flag MD_STILL_CLOSED. Set it before calling
    sync_blockdev(), clear it if anyone does open the file, and abort the
    STOP_ARRAY attempt if it gets set before we lock against further
    opens.
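
    The protocol, as a hedged sketch:

        /* before taking ->reconfig_mutex in the STOP_ARRAY path: */
        set_bit(MD_STILL_CLOSED, &mddev->flags);
        sync_blockdev(bdev);

        /* md_open(): any new opener invalidates the flushed state */
        clear_bit(MD_STILL_CLOSED, &mddev->flags);

        /* do_md_stop(), once further opens are locked out: */
        if (!test_bit(MD_STILL_CLOSED, &mddev->flags))
                return -EBUSY;  /* someone opened the device; abort the stop */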

    It is still possible to get problems if you open /dev/mdX, write to
    it, then issue the STOP_ARRAY ioctl. Just don't do that.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • mddev->flags is mostly used to record if an update of the
    metadata is needed. Sometimes the whole field is tested
    instead of just the important bits. This makes it difficult
    to introduce more state bits.

    So replace all bare tests of mddev->flags with tests for the bits
    that actually need testing.
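
    For example (the mask name follows the patch; treat the exact bits as
    a sketch):

        /* was:  if (mddev->flags) ... */
        if (mddev->flags & MD_UPDATE_SB_FLAGS)  /* only the sb-change bits */
                md_update_sb(mddev, 0);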

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Setting a variable to itself probably wasn't the intention here.

    Signed-off-by: Dave Jones
    Signed-off-by: NeilBrown

    Dave Jones
     
  • When we set the safe_mode_timeout to a smaller value we trigger a
    timeout immediately - otherwise the smaller value might not be
    honoured. However if the previous timeout was 0, meaning "no
    timeout", we didn't. This would mean that no timeout happens until
    the next write completes, which could be a long time.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • There is no real need, as GFP_NOIO is very likely sufficient,
    and failure is not catastrophic.

    Calling md_allow_write here will convert a read-auto array to
    read/write which could be confusing when you are just performing
    a read operation.

    Signed-off-by: NeilBrown

    NeilBrown
     

23 Aug, 2013

8 commits

  • Prior to this patch these methods did a lookup followed by an insert.
    Instead they now call a common mutate function that adjusts the value
    according to a callback function. This avoids traversing the data
    structures twice and hence improves performance.

    Also factor out sm_ll_lookup_big_ref_count() for use by both
    sm_ll_lookup() and sm_ll_mutate().
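
    A hedged sketch of the resulting structure (the mutator callback
    signature is illustrative):

        static int sm_ll_inc(struct ll_disk *ll, dm_block_t b,
                             enum allocation_event *ev)
        {
                return sm_ll_mutate(ll, b, add_one, ev); /* add_one: ref + 1 */
        }

        static int sm_ll_dec(struct ll_disk *ll, dm_block_t b,
                             enum allocation_event *ev)
        {
                return sm_ll_mutate(ll, b, sub_one, ev); /* sub_one: ref - 1 */
        }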

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • dm-btree now takes advantage of dm-bufio's ability to prefetch data via
    dm_bm_prefetch(). Prior to this change many btree node visits were
    causing a synchronous read.
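
    A hedged sketch of the walk (value64() reads a child block number out
    of an internal btree node):

        /* kick off non-blocking reads for all children before descending,
         * so the later per-child visits hit dm-bufio's cache */
        for (i = 0; i < le32_to_cpu(n->header.nr_entries); i++)
                dm_bm_prefetch(bm, value64(n, i));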

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • Remove a visited leaf straight away from the stack, rather than
    marking all its children as visited and letting it get removed on
    the next iteration. This may also offer a micro-optimisation in
    dm_btree_del.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • Reorder members in the cache structure to eliminate 6 out of 7 holes
    (reclaiming 24 bytes). Also, the 'worker' and 'waker' members no longer
    straddle cachelines.
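
    Illustrative only (not dm-cache's actual layout), this is the kind of
    reordering involved - grouping members by descending size/alignment
    removes the padding holes:

        struct before {         /* sizeof == 32 on 64-bit: two 7-byte holes */
                u8  a;          /* 1 byte + 7-byte hole */
                u64 b;
                u8  c;          /* 1 byte + 7-byte hole */
                u64 d;
        };

        struct after {          /* sizeof == 24 on 64-bit: 6-byte tail pad */
                u64 b;
                u64 d;
                u8  a;
                u8  c;
        };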

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Acked-by: Joe Thornber

    Mike Snitzer
     
  • Do not blindly override the queue limits (specifically io_min and
    io_opt). Allow traditional stacking of these limits if io_opt is a
    factor of the cache's data block size.

    Without this patch mkfs.xfs does not recognize the cache device's
    provided limits as a useful geometry (e.g. raid) so these hints are
    ignored. This was due to setting io_min to a useless value.
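
    A hedged sketch of the resulting io_hints hook (close to, but not
    verbatim, the patch):

        static void cache_io_hints(struct dm_target *ti,
                                   struct queue_limits *limits)
        {
                struct cache *cache = ti->private;
                uint64_t io_opt_sectors = limits->io_opt >> SECTOR_SHIFT;

                /* keep the stacked hints if io_opt already fits our block
                 * size; otherwise advertise the cache's own geometry */
                if (io_opt_sectors < cache->sectors_per_block ||
                    do_div(io_opt_sectors, cache->sectors_per_block)) {
                        blk_limits_io_min(limits, 0);
                        blk_limits_io_opt(limits,
                                cache->sectors_per_block << SECTOR_SHIFT);
                }
        }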

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Acked-by: Joe Thornber

    Mike Snitzer
     
  • Do not blindly override the queue limits (specifically io_min and
    io_opt). Allow traditional stacking of these limits if io_opt is a
    factor of the thin-pool's data block size.

    Without this patch mkfs.xfs does not recognize the thin device's
    provided limits as a useful geometry (e.g. raid) so these hints are
    ignored. This was due to setting io_min to a useless value.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Acked-by: Joe Thornber

    Mike Snitzer
     
  • Place an upper bound on the cache's data block size (1GB).

    Inform users that the data block size can't be any arbitrary number,
    i.e. its value must be between 32KB and 1GB. Also, it should be a
    multiple of 32KB.
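
    In 512-byte sectors the bounds work out as below; a hedged sketch of
    the validation (macro names are illustrative):

        #define BLOCK_SIZE_MIN_SECTORS (32 * 1024 >> SECTOR_SHIFT)   /* 64 */
        #define BLOCK_SIZE_MAX_SECTORS (1024 * 1024 * 1024 >> SECTOR_SHIFT)

        /* the multiple-of-32KB test works because 64 is a power of two */
        if (block_size < BLOCK_SIZE_MIN_SECTORS ||
            block_size > BLOCK_SIZE_MAX_SECTORS ||
            block_size & (BLOCK_SIZE_MIN_SECTORS - 1)) {
                *error = "Invalid data block size";
                return -EINVAL;
        }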

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Acked-by: Joe Thornber

    Mike Snitzer
     
  • dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
    WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages.

    This patch doesn't introduce any behavior changes.
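
    Illustrative before/after (the queue name is hypothetical):

        /* before: */
        wq = alloc_workqueue("dm-example", WQ_NON_REENTRANT | WQ_MEM_RECLAIM, 0);
        /* after - the flag had already become a no-op: */
        wq = alloc_workqueue("dm-example", WQ_MEM_RECLAIM, 0);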

    Signed-off-by: Tejun Heo
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Acked-by: Joe Thornber

    Tejun Heo
     

17 Aug, 2013

1 commit

  • On sparc32, which includes <linux/swap.h> from <asm/pgtable_32.h>:

    drivers/md/dm-cache-policy-mq.c:962:13: error: conflicting types for 'remove_mapping'
    include/linux/swap.h:285:12: note: previous declaration of 'remove_mapping' was here

    As mq_remove_mapping() already exists, and the local remove_mapping() is
    used only once, inline it manually to avoid the conflict.

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair Kergon
    Acked-by: Joe Thornber

    Geert Uytterhoeven
     

25 Jul, 2013

2 commits

  • If a device in a RAID4/5/6 is being replaced while another is being
    recovered, then the writes to the replacement device currently don't
    happen, resulting in corruption when the replacement completes and the
    new drive takes over.

    This is because the replacement writes are only triggered when
    's.replacing' is set and not when the similar 's.sync' is set (which
    is the case during resync and recovery - it means all devices need to
    be read).

    So schedule those writes when s.replacing is set as well.

    In this case we cannot use "STRIPE_INSYNC" to record that the
    replacement has happened as that is needed for recording that any
    parity calculation is complete. So introduce STRIPE_REPLACED to
    record if the replacement has happened.

    For safety we should also check that STRIPE_COMPUTE_RUN is not set.
    This has a similar effect to the "s.locked == 0" test: the latter
    ensures that no IO has been flagged but not yet started, while the
    former checks that no parity calculation has been flagged but not
    yet started. We must wait for both of these to complete before
    triggering the 'replace'.
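
    A hedged sketch of that gating, handle_stripe-style (the
    write-scheduling helper is hypothetical):

        if (s.replacing && s.locked == 0 &&
            !test_bit(STRIPE_COMPUTE_RUN, &sh->state) && /* no parity calc */
            !test_bit(STRIPE_REPLACED, &sh->state)) {
                set_bit(STRIPE_REPLACED, &sh->state);
                schedule_replacement_writes(sh);  /* hypothetical helper */
        }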

    Add a similar test to the subsequent check for "are we finished yet".
    This possibly isn't needed (is subsumed in the STRIPE_INSYNC test),
    but it makes it more obvious that the REPLACE will happen before we
    think we are finished.

    Finally if a NeedReplace device is not UPTODATE then that is an
    error. We really must trigger a warning.

    This bug was introduced in commit 9a3e1101b827a59ac9036a672f5fa8d5279d0fe2
    (md/raid5: detect and handle replacements during recovery.)
    which introduced replacement for raid5.
    That was in 3.3-rc3, so any stable kernel since then would benefit
    from this fix.

    Cc: stable@vger.kernel.org (3.3+)
    Reported-by: qindehua
    Tested-by: qindehua
    Signed-off-by: NeilBrown

    NeilBrown
     
  • We always need to be careful when calling generic_make_request, as it
    can start a chain of events which might free something that we are
    using.

    Here is one place I wasn't careful enough. If the wbio2 is not in
    use, then it might get freed at the first generic_make_request call.
    So perform all necessary tests first.
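
    The safe ordering, as a hedged sketch:

        /* decide about wbio2 before anything is submitted: submitting wbio
         * may complete the r10_bio and free an unused wbio2 */
        if (wbio2 && !wbio2->bi_end_io)
                wbio2 = NULL;

        if (wbio->bi_end_io)
                generic_make_request(wbio);
        if (wbio2)
                generic_make_request(wbio2);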

    This bug was introduced in 3.3-rc3 (24afd80d99) and can cause an
    oops, so fix is suitable for any -stable since then.

    Cc: stable@vger.kernel.org (3.3+)
    Signed-off-by: NeilBrown

    NeilBrown
     

23 Jul, 2013

1 commit

  • Pull block IO driver bits from Jens Axboe:
    "As I mentioned in the core block pull request, due to real life
    circumstances the driver pull request would be late. Now it looks
    like -rc2 late... On the plus side, apart from the rsxx update, these
    are all things that I could argue could go in later in the cycle as
    they are fixes and not features. So even though things are late, it's
    not ALL bad.

    The pull request contains:

    - Updates to bcache, all bug fixes, from Kent.

    - A pile of drbd bug fixes (no big features this time!).

    - xen blk front/back fixes.

    - rsxx driver updates, some of them deferred from 3.10. So should be
    well cooked by now"

    * 'for-3.11/drivers' of git://git.kernel.dk/linux-block: (63 commits)
    bcache: Allocation kthread fixes
    bcache: Fix GC_SECTORS_USED() calculation
    bcache: Journal replay fix
    bcache: Shutdown fix
    bcache: Fix a sysfs splat on shutdown
    bcache: Advertise that flushes are supported
    bcache: check for allocation failures
    bcache: Fix a dumb race
    bcache: Use standard utility code
    bcache: Update email address
    bcache: Delete fuzz tester
    bcache: Document shrinker reserve better
    bcache: FUA fixes
    drbd: Allow online change of al-stripes and al-stripe-size
    drbd: Constants should be UPPERCASE
    drbd: Ignore the exit code of a fence-peer handler if it returns too late
    drbd: Fix rcu_read_lock balance on error path
    drbd: fix error return code in drbd_init()
    drbd: Do not sleep inside rcu
    bcache: Refresh usage docs
    ...

    Linus Torvalds