13 Mar, 2019

1 commit

  • When the Partial Parity Log is enabled, a circular buffer is used to store
    PPL data. Each write to the RAID device overwrites data in this buffer,
    so a write_hint can be set on those requests to help drives handle
    garbage collection. This patch adds a new sysfs attribute which can be used
    to specify which write_hint should be assigned to PPL.

    Acked-by: Guoqing Jiang
    Signed-off-by: Mariusz Dabrowski
    Signed-off-by: Song Liu

    Mariusz Dabrowski
     

01 Sep, 2018

1 commit

  • We don't support reshape yet if an array supports a log device. Previously we
    determined this by checking ->log. However, ->log could be NULL after a log
    device is removed while the array is still marked as supporting a log device.
    Don't allow reshape in this case either. The user can disable log device
    support by setting 'consistency_policy' to 'resync' and then do the reshape.

    Reported-by: Xiao Ni
    Tested-by: Xiao Ni
    Signed-off-by: Shaohua Li

    Shaohua Li
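
The difference between the two checks described in the commit above can be sketched in plain C. This is a minimal userspace illustration, not the kernel code: `toy_conf`, `toy_log`, `journal_flag`, the helper names, and the choice of `-EINVAL` as the error are all assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; every name here is hypothetical. */
struct toy_log { int unused; };

struct toy_conf {
    struct toy_log *log;   /* NULL once the log device has been removed */
    bool journal_flag;     /* array is still marked as supporting a log */
};

/* Old style of check: looks only at the pointer, so it misses an array
 * whose log device was removed but which is still marked as log-capable. */
static bool has_log_by_pointer(const struct toy_conf *conf)
{
    assert(conf != NULL);
    return conf->log != NULL;
}

/* New style of check: looks at the marking on the array instead. */
static bool has_log_by_flag(const struct toy_conf *conf)
{
    assert(conf != NULL);
    return conf->journal_flag;
}

/* Refuse reshape while the array is marked as supporting a log device.
 * The error code is an assumption for this sketch. */
static int toy_check_reshape(const struct toy_conf *conf)
{
    return has_log_by_flag(conf) ? -EINVAL : 0;
}
```

With `log == NULL` and `journal_flag == true`, the pointer-based check reports no log while the flag-based check still blocks the reshape, which is the case the commit closes.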
     

22 Feb, 2018

1 commit


16 Jan, 2018

1 commit

  • In order to provide data consistency with PPL for disks with write-back
    cache enabled, all data has to be flushed to disks before the next PPL
    entry. The disks to be flushed are marked in the bitmap. It's modified
    under a mutex and it's only read after the PPL io unit is submitted.

    A limit of 64 disks in the array has been introduced to keep data
    structures and implementation simple. RAID5 arrays with that many disks
    are unlikely due to the high risk of multiple disk failures. Such a
    restriction should not be a real-life limitation.

    With write-back cache disabled, the next PPL entry is submitted when the
    data write for the current one completes. Data flush defers the next log
    submission, so trigger it when no stripes are found for handling.

    As PPL assures all data is flushed to disk at request completion, just
    acknowledge flush requests when PPL is enabled.

    Signed-off-by: Tomasz Majchrzak
    Signed-off-by: Shaohua Li

    Tomasz Majchrzak
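
The 64-disk limit mentioned in the commit above follows naturally if the set of disks to flush is kept in a single 64-bit word, one bit per member disk. The sketch below shows that idea in userspace C; `flush_set` and its helpers are hypothetical names, and the mutex the commit mentions is left out to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_FLUSH_DISKS 64  /* one bit per member disk in a 64-bit word */

/* Hypothetical stand-in for the per-io-unit state; not the kernel struct. */
struct flush_set {
    uint64_t disk_bitmap;   /* bit i set => disk i must be flushed */
};

/* Mark a disk as needing a flush before the next PPL entry. */
static void flush_set_mark(struct flush_set *fs, unsigned int disk)
{
    assert(disk < MAX_FLUSH_DISKS);
    fs->disk_bitmap |= UINT64_C(1) << disk;
}

/* Check whether a disk is marked. */
static bool flush_set_test(const struct flush_set *fs, unsigned int disk)
{
    assert(disk < MAX_FLUSH_DISKS);
    return (fs->disk_bitmap >> disk) & 1;
}

/* Count the marked disks by clearing the lowest set bit each round. */
static unsigned int flush_set_count(const struct flush_set *fs)
{
    unsigned int n = 0;
    uint64_t v = fs->disk_bitmap;

    while (v) {
        v &= v - 1;
        n++;
    }
    return n;
}
```

Marking the same disk twice is harmless, so callers do not need to track whether a disk was already recorded.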
     

12 Dec, 2017

1 commit

  • In do_md_run(), md threads should not wake up until the array is fully
    initialized in md_run(). However, in raid5_run(), raid5-cache may wake
    up mddev->thread to flush stripes that need to be written back. This
    design doesn't break badly right now, but it could lead to a bad bug in
    the future.

    This patch tries to resolve this problem by splitting start-up work
    into two personality functions, run() and start(). Tasks that do not
    require the md threads should go into run(), while tasks that require
    the md threads go into start().

    r5l_load_log() is moved to raid5_start(), so it is not called until
    the md threads are started in do_md_run().

    Signed-off-by: Song Liu
    Signed-off-by: Shaohua Li

    Song Liu
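
The run()/start() split described in the commit above can be sketched as a two-phase bring-up. This is a simplified userspace model under stated assumptions: `toy_array`, `toy_personality`, and `toy_do_run()` are hypothetical names, and "starting the threads" is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of an array being brought up; all names are hypothetical. */
struct toy_array {
    bool initialized;      /* set by run() */
    bool threads_running;  /* set by the core between run() and start() */
    bool log_loaded;       /* set by start() */
};

struct toy_personality {
    int (*run)(struct toy_array *a);    /* must not need the threads */
    int (*start)(struct toy_array *a);  /* optional, may be NULL */
};

static int toy_raid5_run(struct toy_array *a)
{
    a->initialized = true;
    return 0;
}

static int toy_raid5_start(struct toy_array *a)
{
    /* Loading the log may wake the array thread, so it has to be up. */
    assert(a->threads_running);
    a->log_loaded = true;
    return 0;
}

static const struct toy_personality toy_raid5 = {
    .run = toy_raid5_run,
    .start = toy_raid5_start,
};

/* Core bring-up: run(), then enable threads, then the optional start(). */
static int toy_do_run(struct toy_array *a, const struct toy_personality *p)
{
    int err = p->run(a);

    if (err)
        return err;
    a->threads_running = true;
    return p->start ? p->start(a) : 0;
}
```

Keeping start() optional means personalities with no thread-dependent start-up work need no changes.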
     

15 Nov, 2017

1 commit

  • Pull MD update from Shaohua Li:
    "This update mostly includes bug fixes:

    - md-cluster now supports raid10 from Guoqing

    - raid5 PPL fixes from Artur

    - badblock regression fix from Bo

    - suspend hang related fixes from Neil

    - raid5 reshape fixes from Neil

    - raid1 freeze deadlock fix from Nate

    - memleak fixes from Zdenek

    - bitmap related fixes from Me and Tao

    - other fixes and cleanups"

    * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (33 commits)
    md: free unused memory after bitmap resize
    md: release allocated bitset sync_set
    md/bitmap: clear BITMAP_WRITE_ERROR bit before writing it to sb
    md: be cautious about using ->curr_resync_completed for ->recovery_offset
    badblocks: fix wrong return value in badblocks_set if badblocks are disabled
    md: don't check MD_SB_CHANGE_CLEAN in md_allow_write
    md-cluster: update document for raid10
    md: remove redundant variable q
    raid1: remove obsolete code in raid1_write_request
    md-cluster: Use a small window for raid10 resync
    md-cluster: Suspend writes in RAID10 if within range
    md-cluster/raid10: set "do_balance = 0" if area is resyncing
    md: use lockdep_assert_held
    raid1: prevent freeze_array/wait_all_barriers deadlock
    md: use TASK_IDLE instead of blocking signals
    md: remove special meaning of ->quiesce(.., 2)
    md: allow metadata update while suspending.
    md: use mddev_suspend/resume instead of ->quiesce()
    md: move suspend_hi/lo handling into core md code
    md: don't call bitmap_create() while array is quiesced.
    ...

    Linus Torvalds
     

02 Nov, 2017

2 commits

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     
  • The '2' argument means "wake up anything that is waiting".
    This is an inelegant part of the design and was added
    to help support management of suspend_lo/suspend_hi setting.
    Now that suspend_lo/hi is managed in mddev_suspend/resume,
    that need is gone.
    There are still a couple of places where we call 'quiesce'
    with an argument of '2', but they can safely be changed to
    call ->quiesce(.., 1); ->quiesce(.., 0) which
    achieve the same result at the small cost of pausing IO
    briefly.

    This removes a small "optimization" from suspend_{hi,lo}_store,
    but it isn't clear that optimization served a useful purpose.
    The code now is a lot clearer.

    Suggested-by: Shaohua Li
    Signed-off-by: NeilBrown
    Signed-off-by: Shaohua Li

    NeilBrown
     

12 May, 2017

1 commit

  • For raid456 with writeback cache, when the journal device fails during
    normal operation, it is still possible to persist all data, as all
    pending data is still in the stripe cache. However, it is necessary to
    handle journal failure gracefully.

    During journal failures, the following logic handles the graceful shutdown
    of journal:
    1. raid5_error() marks the device as Faulty and schedules async work
    log->disable_writeback_work;
    2. In disable_writeback_work (r5c_disable_writeback_async), the mddev is
    suspended, set to write through, and then resumed. mddev_suspend()
    flushes all cached stripes;
    3. All cached stripes need to be flushed carefully to the RAID array.

    This patch fixes issues within the process above:
    1. In r5c_update_on_rdev_error() schedule disable_writeback_work for
    journal failures;
    2. In r5c_disable_writeback_async(), wait for MD_SB_CHANGE_PENDING,
    since raid5_error() updates superblock.
    3. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
    to make progress during log_failed;
    4. In delay_towrite(), if log failed only process data in the cache (skip
    new writes in dev->towrite);
    5. In __get_priority_stripe(), process loprio_list during journal device
    failures.
    6. In raid5_remove_disk(), wait until all cached stripes are flushed before
    calling log_exit().

    Signed-off-by: Song Liu
    Signed-off-by: Shaohua Li

    Song Liu
     

11 Apr, 2017

1 commit

  • Use resize_stripes() instead of raid5_reset_stripe_cache() to allocate
    or free sh->ppl_page at runtime for all stripes in the stripe cache.
    raid5_reset_stripe_cache() required suspending the mddev and could
    deadlock because of GFP_KERNEL allocations.

    Move the 'newsize' check to check_reshape() to allow reallocating the
    stripes with the same number of disks. Allocate sh->ppl_page in
    alloc_stripe() instead of grow_buffers(). Pass 'struct r5conf *conf' as
    a parameter to alloc_stripe() because it is needed to check whether to
    allocate ppl_page. Add free_stripe() and use it to free stripes rather
    than directly call kmem_cache_free(). Also free sh->ppl_page in
    free_stripe().

    Set MD_HAS_PPL at the end of ppl_init_log() instead of explicitly
    setting it in advance and add another parameter to log_init() to allow
    calling ppl_init_log() without the bit set. Don't try to calculate
    partial parity or add a stripe to log if it does not have ppl_page set.

    Enabling ppl can now be performed without suspending the mddev, because
    the log won't be used until new stripes are allocated with ppl_page.
    Calling mddev_suspend/resume is still necessary when disabling ppl,
    because we want all stripes to finish before stopping the log, but
    resize_stripes() can be called after mddev_resume() when ppl is no
    longer active.

    Suggested-by: NeilBrown
    Signed-off-by: Artur Paszkiewicz
    Signed-off-by: Shaohua Li

    Artur Paszkiewicz
     

23 Mar, 2017

1 commit

  • We currently gather bios that need to be returned into a bio_list
    and call bio_endio() on them all together.
    The original reason for this was to avoid making the calls while
    holding a spinlock.
    Locking has changed a lot since then, and that reason is no longer
    valid.

    So discard return_io() and various return_bi lists, and just call
    bio_endio() directly as needed.

    Signed-off-by: NeilBrown
    Signed-off-by: Shaohua Li

    NeilBrown
     

17 Mar, 2017

3 commits

  • Add a function to modify the log by removing an rdev when a drive fails
    or adding when a spare/replacement is activated as a raid member.

    Removing a disk just clears the child log rdev pointer. No new stripes
    will be accepted for this child log in ppl_write_stripe() and running io
    units will be processed without writing PPL to the device.

    Adding a disk sets the child log rdev pointer and writes an empty PPL
    header.

    Signed-off-by: Artur Paszkiewicz
    Signed-off-by: Shaohua Li

    Artur Paszkiewicz
     
  • Implement the calculation of partial parity for a stripe and PPL write
    logging functionality. The description of PPL is added to the
    documentation. More details can be found in the comments in raid5-ppl.c.

    Attach a page for holding the partial parity data to stripe_head.
    Allocate it only if mddev has the MD_HAS_PPL flag set.

    Partial parity is the xor of the unmodified data chunks of a stripe and is
    calculated as follows:

    - reconstruct-write case:
    xor data from all not updated disks in a stripe

    - read-modify-write case:
    xor old data and parity from all updated disks in a stripe

    Implement it using the async_tx API and integrate into raid_run_ops().
    It must be called when we still have access to old data, so do it when
    STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
    stored into sh->ppl_page.

    Partial parity is not meaningful for full stripe write and is not stored
    in the log or used for recovery, so don't attempt to calculate it when
    stripe has STRIPE_FULL_WRITE.

    Put the PPL metadata structures in md_p.h because userspace tools
    (mdadm) will also need to read/write PPL.

    Warn about using PPL when the disks' volatile write-back cache is
    enabled for now. The warning can be removed once disk cache flushing
    before writing PPL is implemented.

    Signed-off-by: Artur Paszkiewicz
    Signed-off-by: Shaohua Li

    Artur Paszkiewicz
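
The two partial parity cases in the commit above give the same result, which a small userspace sketch can show. It assumes a toy 8-byte chunk and plain byte-wise xor rather than the async_tx API; `partial_parity_rcw()` and `partial_parity_rmw()` are hypothetical names for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 8  /* toy chunk size in bytes (assumption for the example) */

/* dst ^= src over one chunk */
static void xor_chunk(uint8_t *dst, const uint8_t *src)
{
    for (size_t i = 0; i < CHUNK; i++)
        dst[i] ^= src[i];
}

/*
 * Reconstruct-write: xor the data of every disk that is NOT updated.
 * 'data' holds ndisks chunks back to back; updated[d] is nonzero if
 * data disk d is being written.
 */
static void partial_parity_rcw(uint8_t *pp, const uint8_t *data,
                               const int *updated, size_t ndisks)
{
    assert(pp && data && updated);
    memset(pp, 0, CHUNK);
    for (size_t d = 0; d < ndisks; d++)
        if (!updated[d])
            xor_chunk(pp, data + d * CHUNK);
}

/*
 * Read-modify-write: start from the old parity and xor in the old data
 * of every disk that IS updated, which cancels those disks out.
 */
static void partial_parity_rmw(uint8_t *pp, const uint8_t *old_data,
                               const uint8_t *old_parity,
                               const int *updated, size_t ndisks)
{
    assert(pp && old_data && old_parity && updated);
    memcpy(pp, old_parity, CHUNK);
    for (size_t d = 0; d < ndisks; d++)
        if (updated[d])
            xor_chunk(pp, old_data + d * CHUNK);
}
```

Because the old parity is the xor of all old data chunks, xoring the updated chunks back into it leaves exactly the xor of the chunks that were not updated.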
     
  • Move raid5-cache declarations from raid5.h to raid5-log.h, add inline
    wrappers for functions which will be shared with ppl and use them in
    raid5 core instead of direct calls to raid5-cache.

    Remove unused parameter from r5c_cache_data(), move two duplicated
    pr_debug() calls to r5l_init_log().

    Signed-off-by: Artur Paszkiewicz
    Signed-off-by: Shaohua Li

    Artur Paszkiewicz
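
The inline wrappers described in the commit above can be pictured as small dispatch functions that pick a log backend so the raid5 core does not call raid5-cache directly. The sketch below is a userspace illustration only: `toy_conf`, `toy_log_stripe()`, the backend stubs, and the `-EAGAIN` fallback are assumptions made for the example, not the contents of raid5-log.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_backend { TOY_NONE, TOY_JOURNAL, TOY_PPL };

/* Hypothetical configuration: at most one log backend is active. */
struct toy_conf {
    void *log;                   /* non-NULL when a journal is attached */
    bool has_ppl;                /* true when PPL is enabled */
    enum toy_backend last_used;  /* records which backend handled a stripe */
};

/* Stub backends standing in for the journal and PPL write paths. */
static int toy_journal_write_stripe(struct toy_conf *conf)
{
    conf->last_used = TOY_JOURNAL;
    return 0;
}

static int toy_ppl_write_stripe(struct toy_conf *conf)
{
    conf->last_used = TOY_PPL;
    return 0;
}

/* Wrapper used by the core: dispatch to whichever backend is active. */
static inline int toy_log_stripe(struct toy_conf *conf)
{
    assert(conf != NULL);
    if (conf->log)
        return toy_journal_write_stripe(conf);
    if (conf->has_ppl)
        return toy_ppl_write_stripe(conf);
    return -EAGAIN;  /* no log active; caller writes the stripe directly */
}
```

Callers then need only the wrapper, so adding a second backend does not spread backend checks through the core code.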