28 Nov, 2013

3 commits

  • commit 566c09c53455d7c4f1 ("raid5: relieve lock contention in get_active_stripe()")

    modified the locking in get_active_stripe(), reducing the range
    protected by the (highly contended) device_lock.
    Unfortunately it reduced the range too much, opening up some races.

    One race can occur if get_priority_stripe runs between the
    test on sh->count and device_lock being taken.
    This will mean that sh->lru is not empty while get_active_stripe
    thinks ->count is zero, resulting in a 'BUG' firing.

    Another race happens if __release_stripe is called immediately
    after sh->count is tested and found to be non-zero. If STRIPE_HANDLE
    is not set, get_active_stripe should increment ->active_stripes
    when it increments ->count from 0, but as it didn't think it was 0,
    it doesn't.

    Extending device_lock to cover the test on sh->count closes these
    races (a simplified sketch of the pattern appears below).

    While we are here, fix the two BUG tests:
    -If count is zero, then lru really must not be empty, or we've
    lost the stripe_head somehow - no other tests are relevant.
    -STRIPE_ON_RELEASE_LIST is completely independent of ->lru, so
    testing it is pointless.

    Reported-and-tested-by: Brassow Jonathan
    Reviewed-by: Shaohua Li
    Fixes: 566c09c53455d7c4f1
    Signed-off-by: NeilBrown

    NeilBrown
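
    The essence of the fix is the classic check-then-act rule: a test on
    sh->count only means something if it happens under the same lock that
    protects the lists it implies. Below is a minimal userspace sketch of
    that pattern, with a pthread mutex standing in for device_lock and
    invented variables; it is not the raid5 code itself.

    #include <pthread.h>
    #include <stdio.h>

    /* Toy stand-ins for device_lock, sh->count and sh->lru. */
    static pthread_mutex_t device_lock = PTHREAD_MUTEX_INITIALIZER;
    static int count;               /* like sh->count             */
    static int on_lru = 1;          /* like !list_empty(&sh->lru) */

    /* Racy shape: the test on 'count' happens outside the lock, so another
     * thread can change 'count' and the lru between the test and the locked
     * region -- the window the commit above closes. */
    static void get_stripe_racy(void)
    {
            if (count == 0) {                       /* unprotected test */
                    pthread_mutex_lock(&device_lock);
                    on_lru = 0;                     /* detach from lru  */
                    pthread_mutex_unlock(&device_lock);
            }
            count++;
    }

    /* Fixed shape: extend the lock to cover the test as well, so the test
     * and the list manipulation form one critical section. */
    static void get_stripe_fixed(void)
    {
            pthread_mutex_lock(&device_lock);
            if (count == 0)
                    on_lru = 0;             /* safe: same critical section */
            count++;
            pthread_mutex_unlock(&device_lock);
    }

    int main(void)
    {
            get_stripe_racy();      /* both compile; only the second is safe */
            get_stripe_fixed();
            printf("count=%d on_lru=%d\n", count, on_lru);
            return 0;
    }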
     
  • commit 7a0a5355cbc71efa ("md: Don't test all of mddev->flags at once.")
    made most tests on mddev->flags safer, but missed one.

    When commit 260fa034ef7a4ff8b7306 ("md: avoid deadlock when dirty
    buffers during md_stop.") added MD_STILL_CLOSED, this caused
    md_check_recovery to misbehave: it can think there is something to do
    but find nothing. This can lead to the md thread spinning during
    array shutdown (see the flag-masking sketch below).

    https://bugzilla.kernel.org/show_bug.cgi?id=65721

    Reported-and-tested-by: Richard W.M. Jones
    Fixes: 260fa034ef7a4ff8b7306
    Cc: stable@vger.kernel.org (3.12)
    Signed-off-by: NeilBrown

    NeilBrown
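
    The underlying idiom is to mask the flags word down to the bits that
    actually mean "work to do" instead of testing the whole word. Below is
    a small standalone sketch of that idiom with invented flag names; it is
    not the md code itself.

    #include <stdio.h>

    /* Invented flag bits, for illustration only. */
    enum {
            FLAG_CHANGE_DEVS  = 1 << 0,     /* superblock work pending */
            FLAG_CHANGE_CLEAN = 1 << 1,     /* superblock work pending */
            FLAG_STILL_CLOSED = 1 << 2,     /* pure state, not "work"  */
    };

    /* Only these bits mean "the superblock needs updating". */
    #define UPDATE_SB_FLAGS (FLAG_CHANGE_DEVS | FLAG_CHANGE_CLEAN)

    static int work_pending_buggy(unsigned long flags)
    {
            /* Testing the whole word: FLAG_STILL_CLOSED alone makes this
             * true, so the caller thinks there is work, finds none, and
             * spins -- the misbehaviour described above. */
            return flags != 0;
    }

    static int work_pending_fixed(unsigned long flags)
    {
            /* Mask down to the bits that really mean work. */
            return (flags & UPDATE_SB_FLAGS) != 0;
    }

    int main(void)
    {
            unsigned long flags = FLAG_STILL_CLOSED;

            printf("buggy=%d fixed=%d\n",
                   work_pending_buggy(flags), work_pending_fixed(flags));
            return 0;
    }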
     
  • In alloc_thread_groups, worker_groups is a pointer to an array,
    not an array of pointers.
    So
    worker_groups[i]
    is wrong. It should be
    &(*worker_groups)[i] (see the sketch below).

    Found-by: coverity
    Fixes: 60aaf9338545
    Reported-by: Ben Hutchings
    Cc: majianpeng
    Signed-off-by: NeilBrown

    NeilBrown
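
    A standalone sketch of the pointer-to-array distinction, with a
    hypothetical element type; it is not the raid5 code itself.

    #include <stdio.h>

    struct group { int id; };       /* hypothetical element type */

    int main(void)
    {
            struct group storage[4] = { {0}, {1}, {2}, {3} };

            /* A pointer to an array of 4 groups -- the situation in
             * alloc_thread_groups -- not an array of 4 pointers. */
            struct group (*worker_groups)[4] = &storage;

            /* Wrong: worker_groups[1] is the *second whole array*, i.e.
             * storage + 4 elements, which points past the end here.    */
            /* struct group *bad = worker_groups[1]; */

            /* Right: dereference the pointer first, then index into the
             * array it points to. */
            struct group *ok = &(*worker_groups)[1];

            printf("id=%d\n", ok->id);              /* prints 1 */
            return 0;
    }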
     

21 Nov, 2013

1 commit

  • Pull md update from Neil Brown:
    "Mostly optimisations and obscure bug fixes.
    - raid5 gets less lock contention
    - raid1 gets less contention between normal-io and resync-io during
    resync"

    * tag 'md/3.13' of git://neil.brown.name/md:
    md/raid5: Use conf->device_lock protect changing of multi-thread resources.
    md/raid5: Before freeing old multi-thread worker, it should flush them.
    md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.
    UAPI: include in linux/raid/md_p.h
    raid1: Rewrite the implementation of iobarrier.
    raid1: Add some macros to make code clearly.
    raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
    raid1: Add a field array_frozen to indicate whether raid in freeze state.
    md: Convert use of typedef ctl_table to struct ctl_table
    md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
    md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.
    md: fix some places where mddev_lock return value is not checked.
    raid5: Retry R5_ReadNoMerge flag when hit a read error.
    raid5: relieve lock contention in get_active_stripe()
    raid5: relieve lock contention in get_active_stripe()
    wait: add wait_event_cmd()
    md/raid5.c: add proper locking to error path of raid5_start_reshape.
    md: fix calculation of stacking limits on level change.
    raid5: Use slow_path to release stripe when mddev->thread is null

    Linus Torvalds
     

19 Nov, 2013

13 commits

  • When we change group_thread_cnt from its sysfs entry, the kernel can oops.

    The kernel messages are:
    [ 135.299021] BUG: unable to handle kernel NULL pointer dereference at (null)
    [ 135.299073] IP: [] handle_active_stripes+0x32b/0x440
    [ 135.299107] PGD 0
    [ 135.299122] Oops: 0000 [#1] SMP
    [ 135.299144] Modules linked in: netconsole e1000e ptp pps_core
    [ 135.299188] CPU: 3 PID: 2225 Comm: md0_raid5 Not tainted 3.12.0+ #24
    [ 135.299214] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011
    [ 135.299255] task: ffff8800b9638f80 ti: ffff8800b77a4000 task.ti: ffff8800b77a4000
    [ 135.299283] RIP: 0010:[] [] handle_active_stripes+0x32b/0x440
    [ 135.299323] RSP: 0018:ffff8800b77a5c48 EFLAGS: 00010002
    [ 135.299344] RAX: ffff880037bb5c70 RBX: 0000000000000000 RCX: 0000000000000008
    [ 135.299371] RDX: ffff880037bb5cb8 RSI: 0000000000000001 RDI: ffff880037bb5c00
    [ 135.299398] RBP: ffff8800b77a5d08 R08: 0000000000000001 R09: 0000000000000000
    [ 135.299425] R10: ffff8800b77a5c98 R11: 00000000ffffffff R12: ffff880037bb5c00
    [ 135.299452] R13: 0000000000000000 R14: 0000000000000000 R15: ffff880037bb5c70
    [ 135.299479] FS: 0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
    [ 135.299510] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 135.299532] CR2: 0000000000000000 CR3: 0000000001c0b000 CR4: 00000000000407e0
    [ 135.299559] Stack:
    [ 135.299570] ffff8800b77a5c88 ffffffff8107383e ffff8800b77a5c88 ffff880037a64300
    [ 135.299611] 000000000000ec08 ffff880037bb5cb8 ffff8800b77a5c98 ffffffffffffffd8
    [ 135.299654] 000000000000ec08 ffff880037bb5c60 ffff8800b77a5c98 ffff8800b77a5c98
    [ 135.299696] Call Trace:
    [ 135.299711] [] ? __wake_up+0x4e/0x70
    [ 135.299733] [] raid5d+0x4c8/0x680
    [ 135.299756] [] ? schedule_timeout+0x15d/0x1f0
    [ 135.299781] [] md_thread+0x11f/0x170
    [ 135.299804] [] ? wake_up_bit+0x40/0x40
    [ 135.299826] [] ? md_rdev_init+0x110/0x110
    [ 135.299850] [] kthread+0xc6/0xd0
    [ 135.299871] [] ? kthread_freezable_should_stop+0x70/0x70
    [ 135.299899] [] ret_from_fork+0x7c/0xb0
    [ 135.299923] [] ? kthread_freezable_should_stop+0x70/0x70
    [ 135.299951] Code: ff ff ff 0f 84 d7 fe ff ff e9 5c fe ff ff 66 90 41 8b b4 24 d8 01 00 00 45 31 ed 85 f6 0f 8e 7b fd ff ff 49 8b 9c 24 d0 01 00 00 3b 1b 49 89 dd 0f 85 67 fd ff ff 48 8d 43 28 31 d2 eb 17 90
    [ 135.300005] RIP [] handle_active_stripes+0x32b/0x440
    [ 135.300005] RSP
    [ 135.300005] CR2: 0000000000000000
    [ 135.300005] ---[ end trace 504854e5bb7562ed ]---
    [ 135.300005] Kernel panic - not syncing: Fatal exception

    This is because raid5d() can be running when the multi-thread
    resources are changed via sysfs, so we need to provide locking.

    mddev->device_lock is suitable, but we cannot simply call
    alloc_thread_groups under this lock as we cannot allocate memory
    while holding a spinlock.
    So change alloc_thread_groups() to allocate and return the data
    structures, then raid5_store_group_thread_cnt() can take the lock
    while updating the pointers to the data structures (see the sketch
    below).

    This fixes a bug introduced in 3.12 and so is suitable for the 3.12.x
    stable series.

    Fixes: b721420e8719131896b009b11edbbd27
    Cc: stable@vger.kernel.org (3.12)
    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown
    Reviewed-by: Shaohua Li

    majianpeng
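
    The shape of the fix is a common pattern: allocate the new resources
    outside the lock (allocation may sleep), take the lock only for the
    pointer swap, and free the old resources after dropping it. Below is a
    userspace sketch of that pattern, with a pthread mutex standing in for
    conf->device_lock and invented names; it is not the raid5 code itself.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct worker_group { int cnt; };       /* invented stand-in */

    static pthread_mutex_t device_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct worker_group *groups;     /* protected by device_lock */
    static int group_cnt;

    static int store_group_thread_cnt(int new_cnt)
    {
            struct worker_group *new_groups, *old_groups;

            /* 1. Allocate outside the lock: allocation may block, and
             *    blocking while holding a spinlock is not allowed.     */
            new_groups = calloc(new_cnt, sizeof(*new_groups));
            if (!new_groups)
                    return -1;

            /* 2. Take the lock only for the pointer swap, which readers
             *    (raid5d in the real code) also serialise against.     */
            pthread_mutex_lock(&device_lock);
            old_groups = groups;
            groups = new_groups;
            group_cnt = new_cnt;
            pthread_mutex_unlock(&device_lock);

            /* 3. Free the old resources after dropping the lock. */
            free(old_groups);
            return 0;
    }

    int main(void)
    {
            if (store_group_thread_cnt(4) == 0)
                    printf("group_cnt=%d\n", group_cnt);
            free(groups);
            return 0;
    }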
     
  • When changing group_thread_cnt from sysfs entry, the kernel can oops.

    The kernel messages are:
    [ 740.961389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    [ 740.961444] IP: [] process_one_work+0x30/0x500
    [ 740.961476] PGD b9013067 PUD b651e067 PMD 0
    [ 740.961503] Oops: 0000 [#1] SMP
    [ 740.961525] Modules linked in: netconsole e1000e ptp pps_core
    [ 740.961577] CPU: 0 PID: 3683 Comm: kworker/u8:5 Not tainted 3.12.0+ #23
    [ 740.961602] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011
    [ 740.961646] task: ffff88013abe0000 ti: ffff88013a246000 task.ti: ffff88013a246000
    [ 740.961673] RIP: 0010:[] [] process_one_work+0x30/0x500
    [ 740.961708] RSP: 0018:ffff88013a247e08 EFLAGS: 00010086
    [ 740.961730] RAX: ffff8800b912b400 RBX: ffff88013a61e680 RCX: ffff8800b912b400
    [ 740.961757] RDX: ffff8800b912b600 RSI: ffff8800b912b600 RDI: ffff88013a61e680
    [ 740.961782] RBP: ffff88013a247e48 R08: ffff88013a246000 R09: 000000000002c09d
    [ 740.961808] R10: 000000000000010f R11: 0000000000000000 R12: ffff88013b00cc00
    [ 740.961833] R13: 0000000000000000 R14: ffff88013b00cf80 R15: ffff88013a61e6b0
    [ 740.961861] FS: 0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
    [ 740.961893] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 740.962001] CR2: 00000000000000b8 CR3: 00000000b24fe000 CR4: 00000000000407f0
    [ 740.962001] Stack:
    [ 740.962001] 0000000000000008 ffff8800b912b600 ffff88013b00cc00 ffff88013a61e680
    [ 740.962001] ffff88013b00cc00 ffff88013b00cc18 ffff88013b00cf80 ffff88013a61e6b0
    [ 740.962001] ffff88013a247eb8 ffffffff810639c6 0000000000012a80 ffff88013a247fd8
    [ 740.962001] Call Trace:
    [ 740.962001] [] worker_thread+0x206/0x3f0
    [ 740.962001] [] ? manage_workers+0x2c0/0x2c0
    [ 740.962001] [] kthread+0xc6/0xd0
    [ 740.962001] [] ? kthread_freezable_should_stop+0x70/0x70
    [ 740.962001] [] ret_from_fork+0x7c/0xb0
    [ 740.962001] [] ? kthread_freezable_should_stop+0x70/0x70
    [ 740.962001] Code: 89 e5 41 57 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 18 48 8b 06 4c 8b 67 48 48 89 c1 30 c9 a8 04 4c 0f 45 e9 80 7f 58 00 8b 45 08 44 8b b0 00 01 00 00 78 0c 41 f6 44 24 10 04 0f 84
    [ 740.962001] RIP [] process_one_work+0x30/0x500
    [ 740.962001] RSP
    [ 740.962001] CR2: 0000000000000008
    [ 740.962001] ---[ end trace 39181460000748de ]---
    [ 740.962001] Kernel panic - not syncing: Fatal exception

    This can happen if there are some stripes left, fewer than MAX_STRIPE_BATCH.
    A worker is queued to handle them.
    But before raid5_do_work() is called, raid5d handles those
    stripes, making conf->active_stripes zero.
    So mddev_suspend() can return.
    We might then free the old worker resources before the queued
    raid5_do_work() has handled them. When it runs, it crashes.

    raid5d()                      raid5_store_group_thread_cnt()
    queue_work                    mddev_suspend()
    handle_stripes
    active_stripes = 0
                                  free(old worker resources)
    process_one_work
    raid5_do_work

    To avoid this, we simply need to flush the queued work before freeing
    the old worker resources (see the sketch below).

    This fixes a bug introduced in 3.12 so is suitable for the 3.12.x
    stable series.

    Cc: stable@vger.kernel.org (3.12)
    Fixes: b721420e8719131896b009b11edbbd27
    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown
    Reviewed-by: Shaohua Li

    majianpeng
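
    The rule the fix enforces is that queued work which may still reference
    a resource has to be flushed before the resource is freed. Below is a
    userspace sketch of that ordering, with a plain pthread standing in for
    the queued raid5_do_work() and invented names; it is not the raid5 code.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct worker_res { int value; };       /* invented shared resource */

    /* Stands in for the queued raid5_do_work(): it dereferences the
     * resource, so it must not run after the resource is freed. */
    static void *do_work_stub(void *arg)
    {
            struct worker_res *res = arg;
            printf("worker saw value=%d\n", res->value);
            return NULL;
    }

    int main(void)
    {
            struct worker_res *res = malloc(sizeof(*res));
            pthread_t worker;

            if (!res)
                    return 1;
            res->value = 42;
            pthread_create(&worker, NULL, do_work_stub, res);

            /* Buggy ordering: calling free(res) here, before the queued
             * work has run, is the use-after-free behind the oops above. */

            /* Fixed ordering: "flush" the outstanding work first (the real
             * code flushes the workqueue), and only then free. */
            pthread_join(worker, NULL);
            free(res);
            return 0;
    }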
     
  • For R5_ReadNoMerge, it means this bio can't be merged with other bios or
    requests. It used REQ_FLUSH to achieve this, but REQ_NOMERGE can do the
    same job.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    majianpeng
     
  • There is an iobarrier in raid1 because of contention between normal IO and
    resync IO. It suspends all normal IO when resync/recovery happens.

    However, if normal IO is outside the resync window, there is no contention.
    So this patch changes the barrier mechanism to only block IO that
    could contend with the resync that is currently happening.

    We partition the whole space into five parts.
    |---------|-----------|------------|----------------|-------|
            start      next_resync start_next_window  end_window

    start + RESYNC_WINDOW = next_resync
    next_resync + NEXT_NORMALIO_DISTANCE = start_next_window
    start_next_window + NEXT_NORMALIO_DISTANCE = end_window

    Firstly we introduce some concepts:

    1 - RESYNC_WINDOW: For resync, there are 32 resync requests at most at the
    same time. A sync request is RESYNC_BLOCK_SIZE(64*1024).
    So the RESYNC_WINDOW is 32 * RESYNC_BLOCK_SIZE, that is 2MB.
    2 - NEXT_NORMALIO_DISTANCE: the distance between next_resync
    and start_next_window. It also indicates the distance between
    start_next_window and end_window.
    It is currently 3 * RESYNC_WINDOW_SIZE but could be tuned if
    this turned out not to be optimal.
    3 - next_resync: the next sector at which we will do sync IO.
    4 - start: a position which is at most RESYNC_WINDOW before
    next_resync.
    5 - start_next_window: a position which is NEXT_NORMALIO_DISTANCE
    beyond next_resync. Normal-io after this position doesn't need to
    wait for resync-io to complete.
    6 - end_window: a position which is 2 * NEXT_NORMALIO_DISTANCE beyond
    next_resync. This also doesn't need to wait, but is counted
    differently.
    7 - current_window_requests: the count of normalIO between
    start_next_window and end_window.
    8 - next_window_requests: the count of normalIO after end_window.

    NormalIO will be partitioned into four types:

    NormIO1: the end sector of the bio is smaller than or equal to start
    NormIO2: the start sector of the bio is larger than or equal to end_window
    NormIO3: the start sector of the bio is larger than or equal to
    start_next_window (but before end_window)
    NormIO4: the bio falls between start and start_next_window

    |--------|-----------|--------------------|----------------|-------------|
           start    next_resync       start_next_window   end_window
     NormIO1    NormIO4         NormIO4            NormIO3        NormIO2

    For NormIO1, we don't need any io barrier.
    For NormIO4, we use a similar approach to the original iobarrier
    mechanism: the normalIO and resyncIO must be kept separate.
    For NormIO2/3, we add two fields to struct r1conf: "current_window_requests"
    and "next_window_requests". They indicate the count of active
    requests in the two windows.
    For these, we don't wait for resync io to complete.
    (The classification is also sketched in code below.)

    For the resync action, if there are NormIO4s, we must wait for them.
    If not, we can proceed.
    But if the resync action reaches start_next_window and
    current_window_requests > 0 (that is, there are NormIO3s), we must
    wait until current_window_requests becomes zero.
    When current_window_requests becomes zero, start_next_window also
    moves forward. Then current_window_requests will be replaced by
    next_window_requests.

    There is a problem of when and how to change from NormIO2 to
    NormIO3. Only then can the sync action progress.

    We add a field in struct r1conf "start_next_window".

    A: if start_next_window == MaxSector, it means there are no NormIO2/3.
    So start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
    B: if current_window_requests == 0 && next_window_requests != 0, it
    means start_next_window moves to end_window

    There is another problem of how to differentiate between
    old NormIO2 (which has now become NormIO3) and new NormIO2.
    For example, there are many bios which are NormIO2 and one bio which is
    NormIO3. The NormIO3 bio completes first, so the NormIO2 bios become NormIO3.

    We add a field in struct r1bio "start_next_window".
    This is used to record the position conf->start_next_window when the call
    to wait_barrier() is made in make_request().

    In allow_barrier(), we check the conf->start_next_window.
    If r1bio->start_next_window == conf->start_next_window, it means
    there was no transition between NormIO2 and NormIO3.
    If r1bio->start_next_window != conf->start_next_window, it means
    there was a transition between NormIO2 and NormIO3. There can only
    have been one transition, so it simply means the bio is an old NormIO2.

    For one bio, there may be many r1bios. So we make sure
    all the r1bio->start_next_window values are the same.
    If we meet a blocked_dev in make_request(), it must call allow_barrier
    and wait_barrier again, so the earlier and later values of
    conf->start_next_window may differ.
    If there are many r1bios with different start_next_window values,
    the accounting for the bio would depend on the last r1bio's value,
    which would cause errors. To avoid this, we must wait for the previous
    r1bios to complete.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    majianpeng
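
    As a reading aid, the classification above can be written as a small
    pure function: anything overlapping the resync window is NormIO4 and
    must wait, everything else falls into one of the other three classes.
    This is only a sketch of the description above, not the raid1 code;
    the helper names and toy geometry are invented.

    #include <stdio.h>

    typedef unsigned long long sector_t;

    enum normio_class { NORMIO1 = 1, NORMIO2, NORMIO3, NORMIO4 };

    /* Classify a normal-io request [bio_start, bio_end) against the
     * current resync geometry, following the description above. */
    static enum normio_class classify(sector_t bio_start, sector_t bio_end,
                                      sector_t start,
                                      sector_t start_next_window,
                                      sector_t end_window)
    {
            if (bio_end <= start)
                    return NORMIO1;     /* entirely before the resync window */
            if (bio_start >= end_window)
                    return NORMIO2;     /* far beyond the resync window */
            if (bio_start >= start_next_window)
                    return NORMIO3;     /* in the next-window region */
            return NORMIO4;             /* overlaps the resync window: must
                                         * wait as in the old iobarrier */
    }

    int main(void)
    {
            /* Toy geometry: start=0, start_next_window=2048, end_window=4096. */
            printf("class=%d\n", classify(3000, 3008, 0, 2048, 4096)); /* 3 */
            printf("class=%d\n", classify(100, 108, 0, 2048, 4096));   /* 4 */
            return 0;
    }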
     
  • In a subsequent patch, we'll use some const parameters.
    Using macros will make the code clearer.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    majianpeng
     
  • … reconfiguring the array.

    We used to use raise_barrier to suspend normal IO while we reconfigure
    the array. However raise_barrier will soon only suspend some normal
    IO, not all. So we need something else.
    Change it to use freeze_array.
    But freeze_array not only suspends normal io, it also suspends
    resync io.
    For the places that call raise_barrier to reconfigure the array, this
    isn't a problem.

    Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
    Signed-off-by: NeilBrown <neilb@suse.de>

    majianpeng
     
  • The following patch will rewrite the iobarrier between normal IO
    and resync IO, so we add a field to indicate whether the raid array
    is in the frozen state.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    majianpeng
     
  • This typedef is unnecessary and should just be removed.

    Signed-off-by: Joe Perches
    Signed-off-by: NeilBrown

    Joe Perches
     
  • When raid5 recovery hits a fresh badblock, this badblock will be flagged
    as an unacknowledged badblock until md_update_sb() is called.
    But md_stop will take the reconfig lock, which means raid5d can't call
    md_update_sb() in md_check_recovery(); the badblock will always
    remain unacknowledged, so the raid5d thread enters an infinite loop and
    md_stop_writes() can never stop sync_thread. This causes a deadlock.

    To solve this, when the STOP_ARRAY ioctl is issued and sync_thread is
    running, we need to set the FROZEN and INTR flags in mddev->recovery and
    wait for sync_thread to stop before we (re)take the reconfig lock.

    This requires that raid5 reshape_request notices MD_RECOVERY_INTR
    (which it probably should have noticed anyway) and stops waiting for a
    metadata update in that case.

    Reported-by: Jianpeng Ma
    Reported-by: Bian Yu
    Signed-off-by: NeilBrown

    NeilBrown
     
  • We currently use kthread_should_stop() in various places in the
    sync/reshape code to abort early.
    However some places set MD_RECOVERY_INTR but don't immediately call
    md_reap_sync_thread() (and we will shortly get another one).
    When this happens we are relying on md_check_recovery() to reap the
    thread, and that only happens when the thread finishes normally.
    So MD_RECOVERY_INTR must lead to a normal finish without the
    kthread_should_stop() test.

    So replace all relevant tests, and be more careful when the thread is
    interrupted not to acknowledge the latest step in a reshape, as it may
    not be fully committed yet.

    Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop
    so we don't have to wait for the speed to drop before we can abort.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Sometimes we need to lock an mddev and cannot cope with
    failure due to an interrupt.
    In these cases we should use mutex_lock, not mutex_lock_interruptible.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Because of block-layer merging, when one bio fails it causes the other
    bios belonging to the same request to fail as well, so
    raid5_end_read_request will record all these bios as badblocks.
    If we retry the request with the R5_ReadNoMerge flag to avoid bio
    merging, badblocks can record only the sectors that are actually bad.

    test:
    hdparm --yes-i-know-what-i-am-doing --make-bad-sector 300000 /dev/sdb
    mdadm -C /dev/md0 -l5 -n3 /dev/sd[bcd] --assume-clean
    mdadm /dev/md0 -f /dev/sdd
    mdadm /dev/md0 -r /dev/sdd
    mdadm --zero-superblock /dev/sdd
    mdadm /dev/md0 -a /dev/sdd

    1. Without this patch:
    cat /sys/block/md0/md/rd*/bad_blocks
    299776 256
    299776 256

    2. With this patch:
    cat /sys/block/md0/md/rd*/bad_blocks
    300000 8
    300000 8

    Signed-off-by: Bian Yu
    Signed-off-by: NeilBrown

    Bian Yu
     
  • Track the empty inactive list count, so md_raid5_congested() can use it
    to make its decision.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     

16 Nov, 2013

2 commits

  • Pull trivial tree updates from Jiri Kosina:
    "Usual earth-shaking, news-breaking, rocket science pile from
    trivial.git"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
    doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
    doc: add missing files to timers/00-INDEX
    timekeeping: Fix some trivial typos in comments
    mm: Fix some trivial typos in comments
    irq: Fix some trivial typos in comments
    NUMA: fix typos in Kconfig help text
    mm: update 00-INDEX
    doc: Documentation/DMA-attributes.txt fix typo
    DRM: comment: `halve' -> `half'
    Docs: Kconfig: `devlopers' -> `developers'
    doc: typo on word accounting in kprobes.c in mutliple architectures
    treewide: fix "usefull" typo
    treewide: fix "distingush" typo
    mm/Kconfig: Grammar s/an/a/
    kexec: Typo s/the/then/
    Documentation/kvm: Update cpuid documentation for steal time and pv eoi
    treewide: Fix common typo in "identify"
    __page_to_pfn: Fix typo in comment
    Correct some typos for word frequency
    clk: fixed-factor: Fix a trivial typo
    ...

    Linus Torvalds
     
  • Pull second round of block driver updates from Jens Axboe:
    "As mentioned in the original pull request, the bcache bits were pulled
    because of their dependency on the immutable bio vecs. Kent re-did
    this part and resubmitted it, so here's the 2nd round of (mostly)
    driver updates for 3.13. It contains:

    - The bcache work from Kent.

    - Conversion of virtio-blk to blk-mq. This removes the bio and request
    path, and substitutes the blk-mq path instead. The end result is
    almost 200 deleted lines. Patch is acked by Asias and Christoph, who
    both did a bunch of testing.

    - A removal of bootmem.h include from Grygorii Strashko, part of a
    larger series of his killing the dependency on that header file.

    - Removal of __cpuinit from blk-mq from Paul Gortmaker"

    * 'for-linus' of git://git.kernel.dk/linux-block: (56 commits)
    virtio_blk: blk-mq support
    blk-mq: remove newly added instances of __cpuinit
    bcache: defensively handle format strings
    bcache: Bypass torture test
    bcache: Delete some slower inline asm
    bcache: Use ida for bcache block dev minor
    bcache: Fix sysfs splat on shutdown with flash only devs
    bcache: Better full stripe scanning
    bcache: Have btree_split() insert into parent directly
    bcache: Move spinlock into struct time_stats
    bcache: Kill sequential_merge option
    bcache: Kill bch_next_recurse_key()
    bcache: Avoid deadlocking in garbage collection
    bcache: Incremental gc
    bcache: Add make_btree_freeing_key()
    bcache: Add btree_node_write_sync()
    bcache: PRECEDING_KEY()
    bcache: bch_(btree|extent)_ptr_invalid()
    bcache: Don't bother with bucket refcount for btree node allocations
    bcache: Debug code improvements
    ...

    Linus Torvalds
     

15 Nov, 2013

2 commits


14 Nov, 2013

6 commits

  • get_active_stripe() is the last place we have lock contention. It has two
    paths. One is when the stripe isn't found and a new stripe is allocated;
    the other is when the stripe is found.

    The first path basically calls __find_stripe and init_stripe. It accesses
    conf->generation, conf->previous_raid_disks, conf->raid_disks,
    conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
    conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
    for stripe_hashtbl and inactive_list, the other fields are changed very rarely.

    With this patch, we split inactive_list and add new hash locks. Each free
    stripe belongs to a specific inactive list. Which inactive list is determined
    by the stripe's lock_hash. Note that even if a stripe doesn't have a sector
    assigned, it has a lock_hash assigned. A stripe's inactive list is protected
    by a hash lock, which is determined by its lock_hash too. The lock_hash is
    derived from the current stripe_hashtbl hash, which guarantees any
    stripe_hashtbl list will be assigned to a specific lock_hash, so we can use
    the new hash locks to protect the stripe_hashtbl lists too. The goal of the
    new hash locks is that we only need the new locks in the first path of
    get_active_stripe(). Since we have several hash locks, lock contention is
    relieved significantly (the striping idea is sketched below).

    The first path of get_active_stripe() accesses other fields too; since they
    are changed rarely, changing them now needs to take conf->device_lock and
    all hash locks. For a slow path, this isn't a problem.

    If we need to take both device_lock and a hash lock, we always take the hash
    lock first. The tricky part is release_stripe and friends, which need to
    take device_lock first. Neil's suggestion is to put inactive stripes on a
    temporary list and re-add them to inactive_list after device_lock is
    released. In this way, we add stripes to the temporary list with device_lock
    held and remove stripes from the list with the hash lock held. So we don't
    allow concurrent access to the temporary list, which means we need to
    allocate a temporary list for every participant in release_stripe.

    One downside is that free stripes are maintained in their own inactive list
    and can't move across lists. By default, we have 256 stripes in total and 8
    lists, so each list will have 32 stripes. It's possible one list has a free
    stripe while another hasn't. The chance should be rare because stripe
    allocation is evenly distributed. And we can always allocate more stripes
    for the cache; several megabytes of memory isn't a big deal.

    This completely removes the lock contention of the first path of
    get_active_stripe(). It slows down the second code path a little bit though,
    because we now need to take two locks, but since the hash lock isn't
    contended, the overhead should be quite small (several atomic instructions).
    The second path of get_active_stripe() (basically sequential write or
    big-request-size random write) still has lock contention.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
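
    The scheme is classic lock striping: one lock per hash bucket, chosen by
    the same hash that selects the stripe's bucket, with hash locks always
    taken before device_lock. Below is a compact userspace sketch with
    pthread mutexes; the sizes and helper names are invented, not the raid5
    code.

    #include <pthread.h>
    #include <stdio.h>

    #define NR_HASH_LOCKS 8                 /* invented size, not the kernel's */

    typedef unsigned long long sector_t;

    static pthread_mutex_t hash_locks[NR_HASH_LOCKS];
    static pthread_mutex_t device_lock = PTHREAD_MUTEX_INITIALIZER;
    static int inactive_cnt[NR_HASH_LOCKS]; /* one inactive list per bucket */

    /* Every hash bucket maps to exactly one lock, mirroring how lock_hash
     * is derived from the stripe_hashtbl hash. */
    static unsigned stripe_lock_hash(sector_t sector)
    {
            return (unsigned)(sector / 8) % NR_HASH_LOCKS;
    }

    /* Fast path: only the per-bucket lock is taken, so different buckets
     * no longer contend with each other. */
    static void get_free_stripe(sector_t sector)
    {
            unsigned h = stripe_lock_hash(sector);

            pthread_mutex_lock(&hash_locks[h]);
            if (inactive_cnt[h] > 0)
                    inactive_cnt[h]--;      /* take a stripe off this list */
            pthread_mutex_unlock(&hash_locks[h]);
    }

    /* Slow path: rarely-changed shared fields need device_lock plus all
     * hash locks, taking the hash locks first per the ordering rule. */
    static void change_shared_config(void)
    {
            for (int i = 0; i < NR_HASH_LOCKS; i++)
                    pthread_mutex_lock(&hash_locks[i]);
            pthread_mutex_lock(&device_lock);
            /* ... update the rarely-changed fields here ... */
            pthread_mutex_unlock(&device_lock);
            for (int i = NR_HASH_LOCKS - 1; i >= 0; i--)
                    pthread_mutex_unlock(&hash_locks[i]);
    }

    int main(void)
    {
            for (int i = 0; i < NR_HASH_LOCKS; i++) {
                    pthread_mutex_init(&hash_locks[i], NULL);
                    inactive_cnt[i] = 32;   /* 256 stripes over 8 lists */
            }
            get_free_stripe(12345);
            change_shared_config();
            printf("bucket %u has %d free stripes left\n",
                   stripe_lock_hash(12345), inactive_cnt[stripe_lock_hash(12345)]);
            return 0;
    }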
     
  • If raid5_start_reshape errors out, we need to reset all the fields
    that were updated (not just some), and need to use the seq_counter
    to ensure make_request() doesn't use an inconsistent state.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • The various ->run routines of md personalities assume that the 'queue'
    has been initialised by the blk_set_stacking_limits() call in
    md_alloc().

    However when the level is changed (by level_store()) the ->run routine
    for the new level is called for an array which has already had the
    stacking limits modified. This can result in incorrect final
    settings.

    So call blk_set_stacking_limits() before ->run in level_store().

    A specific consequence of this bug is that it causes
    discard_granularity to be set incorrectly when reshaping a RAID4 to a
    RAID0.

    This is suitable for any -stable kernel since 3.3 in which
    blk_set_stacking_limits() was introduced.

    Cc: stable@vger.kernel.org (3.3+)
    Reported-and-tested-by: "Baldysiak, Pawel"
    Signed-off-by: NeilBrown

    NeilBrown
     
  • When release_stripe() is called in grow_one_stripe(), mddev->thread
    is NULL, so the wakeup of that thread to release the stripe is
    omitted.
    For this condition, use the slow_path to release the stripe.

    Bug was introduced in 3.12

    Cc: stable@vger.kernel.org (3.12+)
    Fixes: 773ca82fa1ee58dd1bf88b
    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    majianpeng
     
  • Pull device mapper changes from Mike Snitzer:
    "A set of device-mapper changes for 3.13.

    Improve reliability of buffer allocations for dm messages with a small
    number of arguments, a couple path group initialization fixes for dm
    multipath, a fix for resizing a dm array, various fixes and
    optimizations for dm cache, a fix for device mapper's Kconfig menu
    indentation.

    Features added include:
    - dm crypt support for activating legacy CBC TrueCrypt containers
    (useful for forensics of these old TCRYPT containers)
    - reduced dm-cache memory requirements for each block in the cache
    - basic support for shrinking a dm-cache's cache (fast) device
    - most notably, dm-cache support for managing cache coherency when
    deploying dm-cache with sophisticated origin volumes (that support
    hardware snapshots and/or clustering): these changes come in the
    form of a new passthrough operation mode and a cache block
    invalidation interface"

    * tag 'dm-3.13-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (32 commits)
    dm cache: resolve small nits and improve Documentation
    dm cache: add cache block invalidation support
    dm cache: add remove_cblock method to policy interface
    dm cache policy mq: reduce memory requirements
    dm cache metadata: check the metadata version when reading the superblock
    dm cache: add passthrough mode
    dm cache: cache shrinking support
    dm cache: promotion optimisation for writes
    dm cache: be much more aggressive about promoting writes to discarded blocks
    dm cache policy mq: implement writeback_work() and mq_{set,clear}_dirty()
    dm cache: optimize commit_if_needed
    dm space map disk: optimise sm_disk_dec_block
    MAINTAINERS: add reference to device-mapper's linux-dm.git tree
    dm: fix Kconfig menu indentation
    dm: allow remove to be deferred
    dm table: print error on preresume failure
    dm crypt: add TCW IV mode for old CBC TCRYPT containers
    dm crypt: properly handle extra key string in initialization
    dm cache: log error message if dm_kcopyd_copy() fails
    dm cache: use cell_defer() boolean argument consistently
    ...

    Linus Torvalds
     
  • Pull block IO core updates from Jens Axboe:
    "This is the pull request for the core changes in the block layer for
    3.13. It contains:

    - The new blk-mq request interface.

    This is a new and more scalable queueing model that marries the
    best part of the request based interface we currently have (which
    is fully featured, but scales poorly) and the bio based "interface"
    which the new drivers for high IOPS devices end up using because
    it's much faster than the request based one.

    The bio interface has no block layer support, since it taps into
    the stack much earlier. This means that drivers end up having to
    implement a lot of functionality on their own, like tagging,
    timeout handling, requeue, etc. The blk-mq interface provides all
    these. Some drivers even provide a switch to select bio or rq and
    has code to handle both, since things like merging only works in
    the rq model and hence is faster for some workloads. This is a
    huge mess. Conversion of these drivers nets us a substantial code
    reduction. Initial results on converting SCSI to this model even
    shows an 8x improvement on single queue devices. So while the
    model was intended to work on the newer multiqueue devices, it has
    substantial improvements for "classic" hardware as well. This code
    has gone through extensive testing and development, it's now ready
    to go. A pull request to convert virtio-blk to this
    model will be coming as well, with more drivers scheduled
    for 3.14 conversion.

    - Two blktrace fixes from Jan and Chen Gang.

    - A plug merge fix from Alireza Haghdoost.

    - Conversion of __get_cpu_var() from Christoph Lameter.

    - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.

    - A fix for a race between request completion and the timeout
    handling from Jeff Moyer. This is what caused the merge conflict
    with blk-mq/core, in case you are looking at that.

    - A dm stacking fix from Mike Snitzer.

    - A code consolidation fix and duplicated code removal from Kent
    Overstreet.

    - A handful of block bug fixes from Mikulas Patocka, fixing a loop
    crash and memory corruption on blk cg.

    - Elevator switch bug fix from Tomoki Sekiyama.

    A heads-up that I had to rebase this branch. Initially the immutable
    bio_vecs had been queued up for inclusion, but a week later, it became
    clear that it wasn't fully cooked yet. So the decision was made to
    pull this out and postpone it until 3.14. It was a straight forward
    rebase, just pruning out the immutable series and the later fixes of
    problems with it. The rest of the patches applied directly and no
    further changes were made"

    * 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
    block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
    block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
    block: Do not call sector_div() with a 64-bit divisor
    kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
    block: Consolidate duplicated bio_trim() implementations
    block: Use rw_copy_check_uvector()
    block: Enable sysfs nomerge control for I/O requests in the plug list
    block: properly stack underlying max_segment_size to DM device
    elevator: acquire q->sysfs_lock in elevator_change()
    elevator: Fix a race in elevator switching and md device initialization
    block: Replace __get_cpu_var uses
    bdi: test bdi_init failure
    block: fix a probe argument to blk_register_region
    loop: fix crash if blk_alloc_queue fails
    blk-core: Fix memory corruption if blkcg_init_queue fails
    block: fix race between request completion and timeout handling
    blktrace: Send BLK_TN_PROCESS events to all running traces
    blk-mq: don't disallow request merges for req->special being set
    blk-mq: mq plug list breakage
    blk-mq: fix for flush deadlock
    ...

    Linus Torvalds
     

13 Nov, 2013

1 commit


12 Nov, 2013

6 commits

  • Cache block invalidation is removing an entry from the cache without
    writing it back. Cache blocks can be invalidated via the
    'invalidate_cblocks' message, which takes an arbitrary number of cblock
    ranges:
    invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*

    E.g.
    dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Joe Thornber
     
  • Implement policy_remove_cblock() and add remove_cblock method to the mq
    policy. These methods will be used by the following cache block
    invalidation patch which adds the 'invalidate_cblocks' message to the
    cache core.

    Also, update some comments in dm-cache-policy.h

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Joe Thornber
     
  • Rather than storing the cblock in each cache entry, we allocate all
    entries in an array and infer the cblock from the entry position.

    This saves 4 bytes of memory per cache block. In addition, it gives us an
    easy way of looking up cache entries by cblock (see the sketch below).

    We no longer need to keep an explicit bitset to track which cblocks
    have been allocated. And no searching is needed to find free cblocks.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Joe Thornber
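
    The saving comes from plain pointer arithmetic: when all entries live in
    one array, the cblock is just the entry's offset from the array base.
    Below is a small sketch of that idiom with a hypothetical entry type; it
    is not the dm-cache code itself.

    #include <stdio.h>
    #include <stdlib.h>

    struct entry { int hit_count; };        /* hypothetical: no cblock field */

    struct entry_pool {
            struct entry *entries;          /* all entries allocated at once */
            unsigned nr;
    };

    /* Infer the cblock from the entry's position in the array. */
    static unsigned infer_cblock(const struct entry_pool *ep,
                                 const struct entry *e)
    {
            return (unsigned)(e - ep->entries);
    }

    /* And the reverse: look an entry up by cblock with plain indexing. */
    static struct entry *entry_at(struct entry_pool *ep, unsigned cblock)
    {
            return &ep->entries[cblock];
    }

    int main(void)
    {
            struct entry_pool ep;

            ep.nr = 16;
            ep.entries = calloc(ep.nr, sizeof(*ep.entries));
            if (!ep.entries)
                    return 1;

            struct entry *e = entry_at(&ep, 7);
            printf("cblock=%u\n", infer_cblock(&ep, e));    /* prints 7 */

            free(ep.entries);
            return 0;
    }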
     
  • Need to check the version to verify on-disk metadata is supported.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Joe Thornber
     
  • "Passthrough" is a dm-cache operating mode (like writethrough or
    writeback) which is intended to be used when the cache contents are not
    known to be coherent with the origin device. It behaves as follows:

    * All reads are served from the origin device (all reads miss the cache)
    * All writes are forwarded to the origin device; additionally, write
    hits cause cache block invalidates

    This mode decouples cache coherency checks from cache device creation,
    largely to avoid having to perform coherency checks while booting. Boot
    scripts can create cache devices in passthrough mode and put them into
    service (mount cached filesystems, for example) without having to worry
    about coherency. Coherency that exists is maintained, although the
    cache will gradually cool as writes take place (the routing rule is
    sketched below).

    Later, applications can perform coherency checks, the nature of which
    will depend on the type of the underlying storage. If coherency can be
    verified, the cache device can be transitioned to writethrough or
    writeback mode while still warm; otherwise, the cache contents can be
    discarded prior to transitioning to the desired operating mode.

    Signed-off-by: Joe Thornber
    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Morgan Mears
    Signed-off-by: Mike Snitzer

    Joe Thornber
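
    The behaviour reduces to a simple per-bio routing rule. Below is a
    sketch of that decision logic as described above; the enum and function
    names are invented, not the dm-cache code.

    #include <stdio.h>
    #include <stdbool.h>

    enum cache_mode { MODE_WRITEBACK, MODE_WRITETHROUGH, MODE_PASSTHROUGH };
    enum target { SERVE_FROM_CACHE, SERVE_FROM_ORIGIN };

    /* Decide where a bio goes and whether a cache block must be
     * invalidated, following the passthrough rules above. */
    static enum target route_bio(enum cache_mode mode, bool is_write,
                                 bool cache_hit, bool *invalidate)
    {
            *invalidate = false;

            if (mode != MODE_PASSTHROUGH)
                    return cache_hit ? SERVE_FROM_CACHE : SERVE_FROM_ORIGIN;

            /* Passthrough: reads always miss, writes go to the origin,
             * and a write hit drops the (possibly stale) cache block. */
            if (is_write && cache_hit)
                    *invalidate = true;
            return SERVE_FROM_ORIGIN;
    }

    int main(void)
    {
            bool inval;
            enum target t = route_bio(MODE_PASSTHROUGH, true, true, &inval);

            printf("target=%s invalidate=%d\n",
                   t == SERVE_FROM_ORIGIN ? "origin" : "cache", inval);
            return 0;
    }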
     
  • Allow a cache to shrink if the blocks being removed from the cache are
    not dirty.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Joe Thornber
     

11 Nov, 2013

6 commits