09 Jan, 2017

1 commit

  • commit e8d7c33232e5fdfa761c3416539bc5b4acd12db5 upstream.

    The current implementation employs a 16-bit counter of active stripes in
    the lower bits of bio->bi_phys_segments. If a request is big enough to
    overflow this counter, the bio will be completed and freed too early.

    Fortunately this does not happen in the default configuration, because
    several other limits prevent it: stripe_cache_size * nr_disks effectively
    bounds the number of active stripes, and the small max_sectors_kb of the
    underlying disks prevents it during normal read/write operations.

    The overflow easily happens during discard if it is enabled via the module
    parameter "devices_handle_discard_safely" and stripe_cache_size is set
    large enough.

    This patch limits the request size to 256MB - 8KB to prevent overflows
    (see the arithmetic sketch after this entry).

    Signed-off-by: Konstantin Khlebnikov
    Cc: Shaohua Li
    Cc: Neil Brown
    Signed-off-by: Shaohua Li
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
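
    For reference, here is a minimal userspace sketch of the arithmetic behind
    the chosen limit, assuming the usual 4 KiB stripe unit (STRIPE_SIZE on
    common configurations). With a 16-bit active-stripe counter, capping a
    request at 0xfffe stripes works out to exactly 256 MiB - 8 KiB; the real
    kernel call sites are not reproduced here.

      #include <stdio.h>

      int main(void)
      {
          const unsigned long stripe_bytes = 4096;    /* 4 KiB handled per active stripe */
          const unsigned long counter_max  = 0xffff;  /* 16-bit counter in bi_phys_segments */

          /* Stay one stripe below the counter limit so it can never wrap. */
          unsigned long max_bytes = (counter_max - 1) * stripe_bytes;

          printf("max request size: %lu bytes (256 MiB - %lu KiB)\n",
                 max_bytes, ((256UL << 20) - max_bytes) >> 10);
          return 0;
      }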
     

08 Oct, 2016

1 commit

  • Pull MD updates from Shaohua Li:
    "This update includes:

    - new AVX512 instruction based raid6 gen/recovery algorithm

    - a couple of md-cluster related bug fixes

    - fix a potential deadlock

    - set nonrotational bit for raid array with SSD

    - set correct max_hw_sectors for raid5/6, which hopefully can improve
    performance a little bit

    - other minor fixes"

    * tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
    md: set rotational bit
    raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to the char arrays
    raid5: handle register_shrinker failure
    raid5: fix to detect failure of register_shrinker
    md: fix a potential deadlock
    md/bitmap: fix wrong cleanup
    raid5: allow arbitrary max_hw_sectors
    lib/raid6: Add AVX512 optimized xor_syndrome functions
    lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions
    lib/raid6: Add AVX512 optimized recovery functions
    lib/raid6: Add AVX512 optimized gen_syndrome functions
    md-cluster: make resync lock also could be interruptted
    md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang
    md-cluster: convert the completion to wait queue
    md-cluster: protect md_find_rdev_nr_rcu with rcu lock
    md-cluster: clean related infos of cluster
    md: changes for MD_STILL_CLOSED flag
    md-cluster: remove some unnecessary dlm_unlock_sync
    md-cluster: use FORCEUNLOCK in lockres_free
    md-cluster: call md_kick_rdev_from_array once ack failed

    Linus Torvalds
     

04 Oct, 2016

1 commit

  • Pull CPU hotplug updates from Thomas Gleixner:
    "Yet another batch of cpu hotplug core updates and conversions:

    - Provide core infrastructure for multi instance drivers so the
    drivers do not have to keep custom lists.

    - Convert custom lists to the new infrastructure. The block-mq custom
    list conversion comes through the block tree and makes the diffstat
    tip over to more lines removed than added.

    - Handle unbalanced hotplug enable/disable calls more gracefully.

    - Remove the obsolete CPU_STARTING/DYING notifier support.

    - Convert another batch of notifier users.

    The relayfs changes which conflicted with the conversion have been
    shipped to me by Andrew.

    The remaining lot is targeted for 4.10 so that we finally can remove
    the rest of the notifiers"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
    cpufreq: Fix up conversion to hotplug state machine
    blk/mq: Reserve hotplug states for block multiqueue
    x86/apic/uv: Convert to hotplug state machine
    s390/mm/pfault: Convert to hotplug state machine
    mips/loongson/smp: Convert to hotplug state machine
    mips/octeon/smp: Convert to hotplug state machine
    fault-injection/cpu: Convert to hotplug state machine
    padata: Convert to hotplug state machine
    cpufreq: Convert to hotplug state machine
    ACPI/processor: Convert to hotplug state machine
    virtio scsi: Convert to hotplug state machine
    oprofile/timer: Convert to hotplug state machine
    block/softirq: Convert to hotplug state machine
    lib/irq_poll: Convert to hotplug state machine
    x86/microcode: Convert to hotplug state machine
    sh/SH-X3 SMP: Convert to hotplug state machine
    ia64/mca: Convert to hotplug state machine
    ARM/OMAP/wakeupgen: Convert to hotplug state machine
    ARM/shmobile: Convert to hotplug state machine
    arm64/FP/SIMD: Convert to hotplug state machine
    ...

    Linus Torvalds
     

22 Sep, 2016

3 commits


10 Sep, 2016

1 commit

  • commit 5f9d1fde7d54a5 ("raid5: fix memory leak of bio integrity data")
    moves bio_reset to bio_endio, but it introduces a small race condition:
    it does bio_reset after raid5_release_stripe, which could make the stripe
    reusable and hence let the bio be reused just before bio_reset. Moving
    bio_reset before raid5_release_stripe is called should fix the race (the
    intended ordering is sketched after this entry).

    Reported-and-tested-by: Stefan Priebe - Profihost AG
    Signed-off-by: Shaohua Li

    Shaohua Li
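
    A toy standalone model of the ordering point, with made-up types and
    helper names (bio_reset_model, release_stripe_model); it only illustrates
    that the per-bio cleanup must happen while the stripe is still owned,
    before it can be handed to another user.

      #include <stdbool.h>
      #include <stdio.h>

      struct bio    { bool dirty; };
      struct stripe { struct bio bio; bool available; };

      static void bio_reset_model(struct bio *b)         { b->dirty = false; }
      static void release_stripe_model(struct stripe *s) { s->available = true; }

      /* Fixed ordering: reset the bio first, release the stripe second. */
      static void endio_fixed(struct stripe *s)
      {
          bio_reset_model(&s->bio);   /* clean up while we still own the stripe */
          release_stripe_model(s);    /* only now may another thread reuse it   */
      }

      int main(void)
      {
          struct stripe s = { .bio = { .dirty = true }, .available = false };
          endio_fixed(&s);
          printf("available=%d dirty=%d\n", s.available, s.bio.dirty);
          return 0;
      }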
     

07 Sep, 2016

1 commit

  • Install the callbacks via the state machine and let the core invoke
    the callbacks on the already online CPUs.

    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Neil Brown
    Cc: linux-raid@vger.kernel.org
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160818125731.27256-10-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
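
    The general shape of such a conversion looks roughly as follows; this is a
    hedged, generic kernel fragment (not the exact raid456 diff), using the
    dynamic CPUHP_AP_ONLINE_DYN state and hypothetical per-CPU alloc/free
    work in the callbacks. cpuhp_setup_state() registers the pair and also
    runs the startup callback on every CPU that is already online, which is
    the behaviour the entry refers to.

      #include <linux/cpuhotplug.h>
      #include <linux/init.h>

      static int example_cpu_online(unsigned int cpu)
      {
          /* hypothetical: allocate this CPU's scratch space */
          return 0;
      }

      static int example_cpu_offline(unsigned int cpu)
      {
          /* hypothetical: free this CPU's scratch space */
          return 0;
      }

      static int __init example_init(void)
      {
          /* Registers the callbacks and invokes example_cpu_online() on all
           * CPUs that are already online; returns the allocated state (or a
           * negative error), to be passed to cpuhp_remove_state() later. */
          int ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "example:online",
                                      example_cpu_online, example_cpu_offline);
          return ret < 0 ? ret : 0;
      }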
     

01 Sep, 2016

1 commit


31 Aug, 2016

1 commit

  • Pull MD fixes from Shaohua Li:
    "This includes several bug fixes:

    - Alexey Obitotskiy fixed a hang for faulty raid5 array with external
    management

    - Song Liu fixed two raid5 journal related bugs

    - Tomasz Majchrzak fixed a bad block recording issue and an
    accounting issue for raid10

    - ZhengYuan Liu fixed an accounting issue for raid5

    - I fixed a potential race condition and memory leak with DIF/DIX
    enabled

    - other trivial fixes"

    * tag 'md/4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
    raid5: avoid unnecessary bio data set
    raid5: fix memory leak of bio integrity data
    raid10: record correct address of bad block
    md-cluster: fix error return code in join()
    r5cache: set MD_JOURNAL_CLEAN correctly
    md: don't print the same repeated messages about delayed sync operation
    md: remove obsolete ret in md_start_sync
    md: do not count journal as spare in GET_ARRAY_INFO
    md: Prevent IO hold during accessing to faulty raid5 array
    MD: hold mddev lock to change bitmap location
    raid5: fix incorrectly counter of conf->empty_inactive_list_nr
    raid10: increment write counter after bio is split

    Linus Torvalds
     

25 Aug, 2016

3 commits

  • bio_reset doesn't change bi_io_vec and bi_max_vecs, so we don't need to
    set them every time. bi_private will be set before the bio is
    dispatched.

    Signed-off-by: Shaohua Li

    Shaohua Li
     
  • Yi reported a memory leak in raid5 with DIF/DIX-enabled disks. raid5
    doesn't alloc/free bios; instead it reuses them. There are two issues in
    the current code:
    1. the code calls bio_init (from
    init_stripe->raid5_build_block->bio_init) and then bio_reset (in
    ops_run_io). The bio is reused, so integrity data is likely attached;
    bio_init clears the pointer to the integrity data, so bio_reset can't
    release it.
    2. bio_reset is called before dispatching the bio. After the bio finishes,
    it's possible we never free its integrity data (e.g. if we don't call
    bio_reset again).
    Both issues cause a memory leak. The patch moves bio_init to stripe
    creation and bio_reset to the bio end_io path, fixing both issues (a toy
    model of the resulting lifecycle follows this entry).

    Reported-by: Yi Zhang
    Signed-off-by: Shaohua Li

    Shaohua Li
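
    A toy userspace model of the resulting bio lifecycle, with a made-up type
    (toy_bio) standing in for the kernel object: "init" blindly clears the
    object, while "reset" releases attached integrity data before clearing,
    which is why init belongs at stripe creation and reset at end_io.

      #include <stdlib.h>
      #include <string.h>
      #include <stdio.h>

      struct toy_bio { void *integrity; };

      /* Like bio_init: wipes the object without releasing anything, so calling
       * it on a recycled bio would drop (leak) attached integrity data. */
      static void toy_bio_init(struct toy_bio *b)  { memset(b, 0, sizeof(*b)); }

      /* Like bio_reset: release attached integrity data, then clear for reuse. */
      static void toy_bio_reset(struct toy_bio *b)
      {
          free(b->integrity);
          memset(b, 0, sizeof(*b));
      }

      int main(void)
      {
          struct toy_bio bio;

          toy_bio_init(&bio);                 /* once, at stripe creation */

          for (int io = 0; io < 3; io++) {
              bio.integrity = malloc(16);     /* integrity data attached per IO */
              /* ... IO is dispatched and completes ... */
              toy_bio_reset(&bio);            /* at end_io: freed, ready to reuse */
          }
          puts("three IOs, integrity data freed each time");
          return 0;
      }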
     
  • Currently, the code sets MD_JOURNAL_CLEAN when the array has
    MD_FEATURE_JOURNAL and recovery_cp is MaxSector. The array will be
    marked MD_JOURNAL_CLEAN even if the journal device is missing.

    With this patch, MD_JOURNAL_CLEAN is only set when the journal device
    is present (a sketch of the condition follows this entry).

    Signed-off-by: Song Liu
    Signed-off-by: Shaohua Li

    Song Liu
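
    A heavily hedged sketch of the intended condition; journal_device_present()
    is a hypothetical stand-in for however the code actually checks for a
    configured journal device, and the surrounding call site is not shown.

      /* Sketch only: mark the journal clean only when a journal device is
       * actually there, not merely when the feature bit says one should be. */
      if (mddev->recovery_cp == MaxSector && journal_device_present(mddev))
              set_bit(MD_JOURNAL_CLEAN, &mddev->flags);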
     

08 Aug, 2016

1 commit

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portions. This means that
    old code that relies on manually setting bi_rw is most likely
    going to be broken. Instead of letting that brokenness linger,
    rename the member to force old and out-of-tree code to break
    at compile time instead of at runtime (a toy illustration of the
    packing follows this entry).

    No intended functional changes in this commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
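
    A standalone toy illustration of the packing the entry describes, with the
    op code in the high bits and the request flags in the low bits; the shift
    width and the names here are invented for the example and are not the
    kernel's actual layout or macros.

      #include <stdio.h>

      enum { OP_SHIFT = 28 };                       /* illustrative split point */
      enum { OP_READ = 0, OP_WRITE = 1 };           /* toy op codes             */
      enum { FLAG_SYNC = 1u << 0, FLAG_RAHEAD = 1u << 1 };

      static unsigned int pack(unsigned int op, unsigned int flags)
      {
          return (op << OP_SHIFT) | flags;          /* op high, flags low */
      }

      int main(void)
      {
          unsigned int rw = pack(OP_WRITE, FLAG_SYNC);

          printf("op=%u flags=%#x\n",
                 rw >> OP_SHIFT, rw & ((1u << OP_SHIFT) - 1));
          return 0;
      }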
     

06 Aug, 2016

1 commit

  • After the array enters a faulty state (e.g. the number of failed drives
    exceeds what is acceptable for the raid5 level), it sets error flags
    (one of these flags being MD_CHANGE_PENDING). For internal-metadata
    arrays MD_CHANGE_PENDING is cleared in md_update_sb, but not for
    external-metadata arrays. A set MD_CHANGE_PENDING flag prevents all new
    or unfinished IOs to the array from completing and holds them in a
    pending state. In some cases this can lead to a deadlock.

    For example, we have a faulty array (2 of 4 drives failed), udev handles
    the array state change and blkid is started (or any other userspace
    application that uses the array for read/write) but is unable to finish
    its reads because of the IO hold. At the same time we are unable to get
    exclusive access to the array (to stop the array, in our case) because
    another external application is still using it.

    The fix makes it possible to return IO with errors immediately, so the
    external application can finish working with the array and hand
    exclusive access to other applications to perform the required
    management actions.

    Signed-off-by: Alexey Obitotskiy
    Signed-off-by: Shaohua Li

    Alexey Obitotskiy
     

02 Aug, 2016

1 commit

  • The counter conf->empty_inactive_list_nr is only used to determine
    whether the raid5 array is congested, which is handled in
    raid5_congested(). It is increased in get_free_stripe() when
    conf->inactive_list becomes empty and decreased in
    release_inactive_stripe_list() when temp_inactive_list is spliced back
    onto conf->inactive_list. However, there is a problem when
    raid5_get_active_stripe or stripe_add_to_batch_list is called, because
    these two functions may call list_del_init(&sh->lru) to delete sh from
    "conf->inactive_list + hash", which may leave "conf->inactive_list + hash"
    empty when atomic_inc_not_zero(&sh->count) returns false. So a check
    should be done at these two points and empty_inactive_list_nr increased
    accordingly; otherwise the counter may become negative, which would
    affect async readahead from the VFS (see the fragment after this entry).

    Signed-off-by: ZhengYuan Liu
    Signed-off-by: Shaohua Li

    ZhengYuan Liu
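
    Based on the description above, the shape of the fix at those two call
    sites is roughly the following hedged fragment (not the verbatim kernel
    diff):

      /* After taking sh off its hashed inactive list, account for the list
       * possibly having just become empty, so raid5_congested() and the
       * counter stay consistent. */
      list_del_init(&sh->lru);
      if (list_empty(conf->inactive_list + hash))
              atomic_inc(&conf->empty_inactive_list_nr);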
     

29 Jul, 2016

1 commit


21 Jul, 2016

1 commit

  • These two are confusing leftovers of the old world order, combining
    values of the REQ_OP_ and REQ_ namespaces. For callers that don't
    special-case, we mostly just replace bi_rw with bio_data_dir or
    op_is_write, except for the few cases where a switch over the REQ_OP_
    values makes more sense. Any check for READA is replaced with an
    explicit check for REQ_RAHEAD. Also remove the READA alias for
    REQ_RAHEAD (a before/after fragment follows this entry).

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Mike Christie
    Signed-off-by: Jens Axboe

    Christoph Hellwig
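
    An illustrative before/after fragment for a caller that only needs the
    readahead hint or the data direction; it is not a specific call site from
    this series and is not compilable on its own (READA is exactly what the
    series removes).

      /* Before: poking at the mixed-namespace bi_rw value directly. */
      bool was_readahead = bio->bi_rw & READA;

      /* After: the explicit flag and the helpers. */
      bool is_readahead  = bio->bi_rw & REQ_RAHEAD;
      bool is_write      = op_is_write(bio_op(bio));  /* or: bio_data_dir(bio) == WRITE */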
     

14 Jun, 2016

5 commits


08 Jun, 2016

3 commits


26 May, 2016

1 commit


10 May, 2016

2 commits

  • Some code waits for a metadata update by:

    1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
    2. setting MD_CHANGE_PENDING and waking the management thread
    3. waiting for MD_CHANGE_PENDING to be cleared

    If the first two are done without locking, the code in md_update_sb()
    which checks if it needs to repeat might test if an update is needed
    before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
    in the wait returning early.

    So make sure all places that set MD_CHANGE_PENDING do so atomically, and
    introduce bit_clear_unless (suggested by Neil) for the purpose (its
    semantics are modeled after this entry).

    Cc: Martin Kepplinger
    Cc: Andrew Morton
    Cc: Denys Vlasenko
    Cc: Sasha Levin
    Cc:
    Reviewed-by: NeilBrown
    Signed-off-by: Guoqing Jiang
    Signed-off-by: Shaohua Li

    Guoqing Jiang
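
    The semantics of the new helper can be modeled in userspace roughly as
    below (a sketch with C11 atomics, not the kernel implementation):
    atomically clear the "clear" bits unless any of the "test" bits is set,
    and report whether the clear happened. The flag bit numbers in main() are
    made up for the example.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdio.h>

      /* Model of bit_clear_unless(ptr, clear, test): clear the "clear" bits
       * only if none of the "test" bits are set, in one atomic step, and
       * return true if the clear actually happened. */
      static bool bit_clear_unless_model(_Atomic unsigned long *ptr,
                                         unsigned long clear, unsigned long test)
      {
          unsigned long old = atomic_load(ptr);

          do {
              if (old & test)
                  return false;       /* the guarded bit is set: leave it alone */
          } while (!atomic_compare_exchange_weak(ptr, &old, old & ~clear));

          return true;
      }

      int main(void)
      {
          _Atomic unsigned long flags = (1UL << 0) | (1UL << 2);  /* "devs" | "pending" */

          /* Refuses to clear bit 0 while bit 2 is still set. */
          printf("%d\n", bit_clear_unless_model(&flags, 1UL << 0, 1UL << 2));

          atomic_fetch_and(&flags, ~(1UL << 2));                  /* drop "pending"    */
          printf("%d\n", bit_clear_unless_model(&flags, 1UL << 0, 1UL << 2));
          return 0;
      }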
     
  • When md runs underneath the dm-raid target, the mddev does not have
    a request queue or gendisk, so avoid accessing them.

    This patch adds a missing conditional to the raid5 personality (see the
    guard sketch after this entry).

    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Shaohua Li

    Heinz Mauelshagen
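
    A hedged sketch of the kind of guard the patch adds; tune_request_queue()
    is a hypothetical stand-in for whatever queue access needs protecting, and
    the real call sites are not reproduced.

      if (mddev->queue)                         /* absent when driven by dm-raid */
              tune_request_queue(mddev->queue); /* hypothetical guarded access   */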
     

30 Apr, 2016

1 commit

  • If a device has R5_LOCKED set, it is legitimate for it to also have
    R5_SkipCopy set and page != orig_page. After R5_LOCKED is cleared,
    handle_stripe_clean_event will clear the SkipCopy flag and set page back
    to orig_page, so the warning is unnecessary.

    Reported-by: Joey Liao
    Signed-off-by: Shaohua Li

    Shaohua Li
     

18 Mar, 2016

1 commit

  • The raid456_cpu_notify() hotplug callback lacks handling of the
    CPU_UP_CANCELED case. That means if CPU_UP_PREPARE fails, the scratch
    buffer is leaked.

    Add handling for CPU_UP_CANCELED[_FROZEN] hotplug notifier transitions
    to free the scratch buffer.

    CC: Shaohua Li
    CC: linux-raid@vger.kernel.org
    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Shaohua Li

    Anna-Maria Gleixner
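
    Pre-state-machine hotplug notifiers have the following general shape; the
    fix amounts to handling the cancel case so a failed CPU_UP_PREPARE does
    not leak the per-CPU scratch buffer. This is a generic, hedged sketch
    (alloc_scratch/free_scratch are hypothetical), not the raid456 code.

      static int example_cpu_notify(struct notifier_block *nb,
                                    unsigned long action, void *hcpu)
      {
          unsigned int cpu = (unsigned long)hcpu;

          switch (action & ~CPU_TASKS_FROZEN) {
          case CPU_UP_PREPARE:
              if (alloc_scratch(cpu))             /* hypothetical allocation */
                  return notifier_from_errno(-ENOMEM);
              break;
          case CPU_UP_CANCELED:                   /* UP_PREPARE failed elsewhere... */
          case CPU_DEAD:                          /* ...or the CPU went away        */
              free_scratch(cpu);                  /* hypothetical: release buffer   */
              break;
          }
          return NOTIFY_OK;
      }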
     

10 Mar, 2016

2 commits

  • Neil recently fixed an obscure race in break_stripe_batch_list.
    Debugging would be much easier if we knew the stripe state, which is
    what this patch adds.

    Signed-off-by: Shaohua Li

    Shaohua Li
     
  • break_stripe_batch_list breaks up a batch and copies some flags from
    the batch head to the members, preserving others.

    It doesn't preserve or copy STRIPE_PREREAD_ACTIVE. This is not
    normally a problem as STRIPE_PREREAD_ACTIVE is cleared when a
    stripe_head is added to a batch, and is not set on stripe_heads
    already in a batch.

    However there is no locking to ensure one thread doesn't set the flag
    after it has just been cleared in another. This does occasionally happen.

    md/raid5 maintains a count of the number of stripe_heads with
    STRIPE_PREREAD_ACTIVE set: conf->preread_active_stripes. When
    break_stripe_batch_list clears STRIPE_PREREAD_ACTIVE inadvertently,
    this count becomes incorrect and will never again return to zero.

    md/raid5 delays the handling of some stripe_heads until
    preread_active_stripes becomes zero. So when the above-mentioned race
    happens, those stripe_heads become blocked and never progress,
    resulting in writes to the array hanging.

    So: change break_stripe_batch_list to preserve STRIPE_PREREAD_ACTIVE
    in the members of a batch.

    URL: https://bugzilla.kernel.org/show_bug.cgi?id=108741
    URL: https://bugzilla.redhat.com/show_bug.cgi?id=1258153
    URL: http://thread.gmane.org/5649C0E9.2030204@zoner.cz
    Reported-by: Martin Svec (and others)
    Tested-by: Tom Weber
    Fixes: 1b956f7a8f9a ("md/raid5: be more selective about distributing flags across batch.")
    Cc: stable@vger.kernel.org (v4.1 and later)
    Signed-off-by: NeilBrown
    Signed-off-by: Shaohua Li

    NeilBrown
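
    Schematically, the change keeps each member's own STRIPE_PREREAD_ACTIVE
    bit out of the set of flags that break_stripe_batch_list overwrites;
    COPY_MASK below is a hypothetical stand-in for the flags actually copied
    from the batch head, so this shows only the shape of the fix, not the
    diff itself.

      /* Preserve the member's STRIPE_PREREAD_ACTIVE so that
       * conf->preread_active_stripes stays balanced. */
      unsigned long keep = sh->state & (1UL << STRIPE_PREREAD_ACTIVE);

      sh->state = (head_sh->state & COPY_MASK) | keep;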
     

27 Feb, 2016

2 commits

  • Revert commit
    e9e4c377e2f563 ("md/raid5: per hash value and exclusive wait_for_stripe")

    The problem is that raid5_get_active_stripe waits on
    conf->wait_for_stripe[hash]. Assume hash is 0. My test releases stripes
    in this order:
    - release all stripes with hash 0
    - raid5_get_active_stripe still sleeps since active_stripes >
    max_nr_stripes * 3 / 4
    - release all stripes with hash other than 0. active_stripes becomes 0
    - raid5_get_active_stripe still sleeps, since nobody wakes up
    wait_for_stripe[0]
    The system livelocks. The problem is that active_stripes isn't a per-hash
    count. Reverting the patch makes the livelock go away.

    Cc: stable@vger.kernel.org (v4.2+)
    Cc: Yuanhan Liu
    Cc: NeilBrown
    Signed-off-by: Shaohua Li

    Shaohua Li
     
  • check_reshape() is called from the raid5d thread. The raid5d thread
    shouldn't call mddev_suspend(), because mddev_suspend() waits for all IO
    to finish, but IO is handled by the raid5d thread itself, so we could
    easily deadlock here.

    This issue was introduced by
    738a273 ("md/raid5: fix allocation of 'scribble' array.")

    Cc: stable@vger.kernel.org (v4.1+)
    Reported-and-tested-by: Artur Paszkiewicz
    Reviewed-by: NeilBrown
    Signed-off-by: Shaohua Li

    Shaohua Li
     

26 Feb, 2016

1 commit

  • 'max_discard_sectors' is in sectors, while 'stripe' is in bytes.

    This fixes the problem where DISCARD would get disabled on some larger
    RAID5 configurations (6 or more drives in my testing), while it worked
    as expected with smaller configurations.

    Fixes: 620125f2bf8 ("MD: raid5 trim support")
    Cc: stable@vger.kernel.org v3.7+
    Signed-off-by: Jes Sorensen
    Signed-off-by: Shaohua Li

    Jes Sorensen
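
    This is a classic unit mismatch: the queue discard limits are expressed in
    512-byte sectors while the computed full-stripe size is in bytes. A hedged
    sketch of the corrected comparison (not necessarily the verbatim kernel
    line; the code that actually enables discard on the queue is omitted):

      /* 'stripe' is the full-stripe size in bytes; max_discard_sectors is in
       * 512-byte sectors, so convert before comparing. */
      bool discard_ok =
              mddev->queue->limits.max_discard_sectors >= (stripe >> 9) &&
              mddev->queue->limits.discard_granularity >= stripe;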
     

21 Jan, 2016

1 commit


06 Jan, 2016

2 commits

  • Add support for journal disk hot add/remove. The md part is mostly
    trivial checks; the raid5 part is a little tricky. For hot-remove, we
    can't wait for pending writes since this is called from raid5d; the wait
    would cause a deadlock, so we simply fail the hot-remove. A hot-remove
    retry can eventually succeed, since if the journal disk is faulty all
    pending writes will fail and finish. For hot-add, since an array that
    supports a journal but has no journal disk is marked read-only, we are
    safe to hot-add the journal without stopping IO (it should be read IO
    only, while the journal only handles write IO).

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown

    Shaohua Li
     
  • The stripe_add_to_batch_list() function is called only if
    stripe_can_batch() returned true, so there is no need to double-check.

    Signed-off-by: Roman Gushchin
    Cc: Neil Brown
    Cc: linux-raid@vger.kernel.org
    Signed-off-by: NeilBrown

    Roman Gushchin
     

05 Nov, 2015

1 commit

  • Pull md updates from Neil Brown:
    "Two major components to this update.

    1) The clustered-raid1 support from SUSE is nearly complete. There
    are a few outstanding issues being worked on. Maybe half a dozen
    patches will bring this to a usable state.

    2) The first stage of journalled-raid5 support from Facebook makes an
    appearance. With a journal device configured (typically NVRAM or
    SSD), the "RAID5 write hole" should be closed - a crash during
    degraded operations cannot result in data corruption.

    The next stage will be to use the journal as a write-behind cache
    so that latency can be reduced and in some cases throughput
    increased by performing more full-stripe writes.

    * tag 'md/4.4' of git://neil.brown.name/md: (66 commits)
    MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW
    MD: set journal disk ->raid_disk
    MD: kick out journal disk if it's not fresh
    raid5-cache: start raid5 readonly if journal is missing
    MD: add new bit to indicate raid array with journal
    raid5-cache: IO error handling
    raid5: journal disk can't be removed
    raid5-cache: add trim support for log
    MD: fix info output for journal disk
    raid5-cache: use bio chaining
    raid5-cache: small log->seq cleanup
    raid5-cache: new helper: r5_reserve_log_entry
    raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta
    raid5-cache: take rdev->data_offset into account early on
    raid5-cache: refactor bio allocation
    raid5-cache: clean up r5l_get_meta
    raid5-cache: simplify state machine when caches flushes are not needed
    raid5-cache: factor out a helper to run all stripes for an I/O unit
    raid5-cache: rename flushed_ios to finished_ios
    raid5-cache: free I/O units earlier
    ...

    Linus Torvalds