02 Jul, 2013

1 commit

  • When adjusting the enospc rules for relocation I ran into a deadlock because we
    were relocating the only system chunk and that forced us to try and allocate a
    new system chunk while holding locks in the chunk tree, which caused us to
    deadlock. To fix this I've moved all of the dev extent addition and chunk
    addition out to the delayed chunk completion stuff. We still keep the in-memory
    stuff which makes sure everything is consistent.

    One change I had to make was to search the commit root of the device tree to
    find a free dev extent, and hold onto any chunk em's that we allocated in that
    transaction so we do not allocate the same dev extent twice. This has the side
    effect of fixing a bug with balance that has been there ever since balance
    existed. Basically you can free a block group and its dev extent and then
    immediately allocate that dev extent for a new block group and write stuff to
    that dev extent, all within the same transaction. So if you happen to crash
    during a balance you could come back to a completely broken file system. This
    patch should keep this sort of thing from happening in the future since we
    won't be able to allocate free'd dev extents until after the transaction
    commits. This has passed all of the xfstests and my super annoying stress test
    followed by a balance. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
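The deferred-reuse rule described in this commit can be modelled in a few lines. This is only a toy sketch, not the kernel code: the class `DevExtentAllocator` and its fields are hypothetical names, dev extents are reduced to integer slots, and "searching the commit root" is approximated by a set that only changes at commit time.

```python
class DevExtentAllocator:
    """Toy model: extents freed in the running transaction are not
    handed out again until that transaction has committed."""

    def __init__(self, num_extents):
        # Free extents as of the last commit (stand-in for the commit root).
        self.committed_free = set(range(num_extents))
        # Extents allocated in the running transaction (stand-in for the
        # chunk mappings held until commit).
        self.pending_alloc = set()
        # Extents freed in the running transaction.
        self.pending_free = set()

    def allocate(self):
        # Only consider extents that were free at the last commit and
        # have not already been handed out in this transaction.
        candidates = sorted(self.committed_free - self.pending_alloc)
        if not candidates:
            raise RuntimeError("no free dev extent")
        extent = candidates[0]
        self.pending_alloc.add(extent)
        return extent

    def free(self, extent):
        # The extent becomes reusable only after commit().
        self.pending_free.add(extent)

    def commit(self):
        self.committed_free -= self.pending_alloc
        self.committed_free |= self.pending_free
        self.pending_alloc.clear()
        self.pending_free.clear()
```

In this model a crash before `commit()` cannot leave new data in an extent that the last committed state still assigns to the old block group, because `allocate()` never returns such an extent.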
     

14 Jun, 2013

1 commit

  • There are several functions whose code is similar, such as
    btrfs_find_last_root()
    btrfs_read_fs_root_no_radix()

    Besides that, some functions are invoked twice, which is unnecessary.
    For example, we are sure that all roots found in
    btrfs_find_orphan_roots()
    have their orphan items, so there is no need to check the orphan
    item again.

    So clean it up.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     

18 May, 2013

1 commit

  • Btrfs has been pointer tagging bi_private and using bi_bdev
    to store the stripe index and mirror number of failed IOs.

    As bios bubble back up through the call chain, we use these
    to decide if and how to retry our IOs. They are also used
    to count IO failures on a per device basis.

    A recently added bio tracepoint led to crashes because
    we were abusing bi_bdev.

    This commit adds a btrfs bioset, and creates explicit fields
    for the mirror number and stripe index. The plan is to
    extend this structure for all of the fields currently in
    struct btrfs_bio, which will mean one less kmalloc in
    our IO path.

    Signed-off-by: Chris Mason
    Reported-by: Tejun Heo

    Chris Mason
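For readers unfamiliar with the trick being retired here, a minimal sketch of pointer tagging next to the explicit-field alternative. Addresses are modelled as plain integers, and `tag_pointer`, `untag_pointer` and `IoRequest` are hypothetical names chosen for illustration, not btrfs or block-layer structures.

```python
from dataclasses import dataclass

TAG_BITS = 2                      # assumes addresses are 4-byte aligned
TAG_MASK = (1 << TAG_BITS) - 1


def tag_pointer(addr, value):
    """Hide a small integer in the unused low bits of an aligned address."""
    if addr & TAG_MASK:
        raise ValueError("address is not aligned")
    if not 0 <= value <= TAG_MASK:
        raise ValueError("value does not fit in the tag bits")
    return addr | value


def untag_pointer(tagged):
    """Recover (address, value) from a tagged address."""
    return tagged & ~TAG_MASK, tagged & TAG_MASK


@dataclass
class IoRequest:
    """The explicit alternative: dedicated fields, nothing overloaded."""
    private: int
    mirror_num: int = 0
    stripe_index: int = 0
```

The tagged form saves space but breaks any code that reads the field as an ordinary pointer, which is the kind of failure the commit message describes; the explicit fields avoid that at the cost of a slightly larger structure.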
     

07 May, 2013

1 commit

  • Big patch, but all it does is add statics to functions which
    are in fact static, then remove the associated dead-code fallout.

    removed functions:

    btrfs_iref_to_path()
    __btrfs_lookup_delayed_deletion_item()
    __btrfs_search_delayed_insertion_item()
    __btrfs_search_delayed_deletion_item()
    find_eb_for_page()
    btrfs_find_block_group()
    range_straddles_pages()
    extent_range_uptodate()
    btrfs_file_extent_length()
    btrfs_scrub_cancel_devid()
    btrfs_start_transaction_lflush()

    btrfs_print_tree() is left because it is used for debugging.
    btrfs_start_transaction_lflush() and btrfs_reada_detach() are
    left for symmetry.

    ulist.c functions are left, another patch will take care of those.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Josef Bacik

    Eric Sandeen
     

21 Feb, 2013

1 commit


20 Feb, 2013

1 commit


02 Feb, 2013

1 commit

  • This builds on David Woodhouse's original Btrfs raid5/6 implementation.
    The code has changed quite a bit, blame Chris Mason for any bugs.

    Read/modify/write is done after the higher levels of the filesystem have
    prepared a given bio. This means the higher layers are not responsible
    for building full stripes, and they don't need to query for the topology
    of the extents that may get allocated during delayed allocation runs.
    It also means different files can easily share the same stripe.

    But, it does expose us to incorrect parity if we crash or lose power
    while doing a read/modify/write cycle. This will be addressed in a
    later commit.

    Scrub is unable to repair crc errors on raid5/6 chunks.

    Discard does not work on raid5/6 (yet)

    The stripe size is fixed at 64KiB per disk. This will be tunable
    in a later commit.

    Signed-off-by: Chris Mason

    David Woodhouse
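The read/modify/write cycle mentioned in this commit rests on the usual RAID5 XOR parity identity. The sketch below shows only that generic identity, not the btrfs implementation; `xor_blocks` and `rmw_parity` are hypothetical helper names, and raid6's second syndrome is not covered.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(out):
            raise ValueError("blocks must have the same length")
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def rmw_parity(old_parity, old_data, new_data):
    """New parity after overwriting one data block of a stripe.

    XOR-ing the old data out of the parity and the new data in gives
    the same result as recomputing parity over the full stripe.
    """
    return xor_blocks(old_parity, old_data, new_data)
```

The crash exposure the message warns about follows from this: the data block and the parity block are written separately, so an interruption between the two writes leaves parity that matches neither the old nor the new stripe contents.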
     

17 Dec, 2012

1 commit


13 Dec, 2012

9 commits

  • This commit contains all the essential changes to the core code
    of Btrfs for support of the device replace procedure.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • This adds a new file to the sources together with the header file
    and the changes to ioctl.h and ctree.h that are required by the
    new C source file. Additionally, 4 new functions are added to
    volumes.c that deal with device creation and destruction.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • This patch adds some code to disallow operations on the device that
    is used as the target for the device replace operation.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • A small number of functions that are used in a device replace
    procedure when the operation is resumed at mount time are unable
    to pass the same root pointer that would be used in the regular
    (ioctl) context. And since the root pointer is not required, only
    the fs_info is, the root pointer argument is replaced with the
    fs_info pointer argument.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • This new function is used by the device replace procedure in
    a later patch.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • This is required for the device replace procedure in a later step.
    Two calling functions also had to be changed to have the fs_info
    pointer: repair_io_failure() and scrub_setup_recheck_block().

    Signed-off-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • This is required for the device replace procedure in a later step.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • The new function btrfs_find_device_missing_or_by_path() will be
    used for the device replace procedure. This function itself calls
    the second new function btrfs_find_device_by_path().
    Unfortunately, it is currently not possible to make the rest of the
    code use these functions as well, since the functions that look
    similar at first glance all differ slightly in what they
    do. But in the future, new code could benefit from these
    two new functions, and currently, device replace uses them.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • The device replace procedure makes use of the scrub code. The scrub
    code is the most efficient code to read the allocated data of a disk,
    i.e. it reads sequentially in order to avoid disk head movements, it
    skips unallocated blocks, it uses read ahead mechanisms, and it
    contains all the code to detect and repair defects.
    This commit is a first preparation step to adapt the scrub code to
    be shareable for the device replace procedure.
    The block device will be removed from the scrub context state
    structure in a later step. It used to be the source block device.
    The scrub code as it is used for the device replace procedure reads
    the source data from wherever it is optimal. The source device might
    even be gone (disconnected, for instance due to a hardware failure).
    Or the drive can be so faulty that the device replace procedure
    tries to avoid access to the faulty source drive as much as possible,
    and only if all other mirrors are damaged, as a last resort, the
    source disk is accessed.
    The modified scrub code operates as if it would handle the source
    drive and thereby generates an exact copy of the source disk on the
    target disk, even if the source disk is not present at all. Therefore
    the block device pointer to the source disk is removed in a later
    patch, and therefore the context structure is renamed (this is the
    goal of the current patch) to reflect that no source block device
    scope is there anymore.

    Summary:
    This first preparation step consists of a textual substitution of the
    term "dev" with the term "ctx" wherever the scrub context is used.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Stefan Behrens
     

29 Aug, 2012

1 commit

  • Commit 442a4f6308e694e0fa6025708bd5e4e424bbf51c added btrfs device
    statistic counters for detected IO and checksum errors to Linux 3.5.
    The statistic part that counts checksum errors in
    end_bio_extent_readpage() can cause a BUG() in a subfunction:
    "kernel BUG at fs/btrfs/volumes.c:3762!"
    That part is reverted with the current patch.
    However, the counting of checksum errors in the scrub context remains
    active, and the counting of detected IO errors (read, write or flush
    errors) in all contexts remains active.

    Cc: stable # 3.5
    Signed-off-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Stefan Behrens
     

24 Jul, 2012

2 commits

    This will be used in conjunction with btrfs device ready. This is
    needed for initrd's to have a nice and lightweight way to tell if all of the
    devices needed for a file system are in the cache currently. This keeps
    them from having to do mount+sleep loops waiting for devices to show up.
    Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Commit c11d2c236cc260b36 (Btrfs: add ioctl to get and reset the device
    stats) introduced two ioctls doing almost the same thing distinguished
    by just the ioctl number which encodes "do reset after read". I have
    suggested

    http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg16604.html

    to implement it via the ioctl args. This hasn't happened, and I think we
    should use a cleaner way to pass flags and should not waste ioctl
    numbers.

    CC: Stefan Behrens
    Signed-off-by: David Sterba

    David Sterba
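The design point argued in this commit, one request with a flags argument instead of two request numbers, can be sketched as follows. The flag name `STAT_FLAG_RESET`, its value, and the function `get_dev_stats` are hypothetical and do not reflect the actual ioctl structure layout.

```python
STAT_FLAG_RESET = 1 << 0   # hypothetical flag bit: reset counters after reading


def get_dev_stats(counters, flags=0):
    """Return a snapshot of the counters; optionally reset them afterwards."""
    snapshot = dict(counters)
    if flags & STAT_FLAG_RESET:
        for name in counters:
            counters[name] = 0
    return snapshot
```

A further behaviour can later be added as another bit in `flags` without allocating a new request number.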
     

03 Jul, 2012

2 commits

  • This introduces btrfs_resume_balance_async(), which, given that
    restriper state was recovered earlier by btrfs_recover_balance(),
    resumes balance in btrfs-balance kthread.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Fix a bug that triggered asserts in btrfs_balance() in both normal and
    resume modes -- restriper state was not properly restored on read-only
    mounts. This factors out resuming code from btrfs_restore_balance(),
    which is now also called earlier in the mount sequence to avoid the
    problem of some early writes getting the old profile.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

15 Jun, 2012

1 commit

  • Al pointed out that we can just toss out the old name on a device and add a
    new one arbitrarily, so anybody who uses device->name in printk could
    possibly use free'd memory. Instead of adding locking around all of this he
    suggested doing it with RCU, so I've introduced a struct rcu_string that
    does just that and have gone through and protected all accesses to
    device->name that aren't under the uuid_mutex with rcu_read_lock(). This
    protects us and I will use it for dealing with removing the device that we
    used to mount the file system in a later patch. Thanks,

    Reviewed-by: David Sterba
    Signed-off-by: Josef Bacik

    Josef Bacik
     

30 May, 2012

3 commits


22 Mar, 2012

1 commit


17 Jan, 2012

13 commits

  • Conflicts:
    fs/btrfs/volumes.c

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Implement an ioctl for canceling restriper. Currently we wait until
    relocation of the current block group is finished, in future this can be
    done by triggering a commit. Balance item is deleted and no memory
    about the interrupted balance is kept.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Implement an ioctl for pausing restriper. This pauses the relocation,
    but balance is still considered to be "in progress": balance item is
    not deleted, other volume operations cannot be started, etc. If paused
    in the middle of profile changing operation we will continue making
    allocations with the target profile.

    Add a hook to close_ctree() to pause restriper and free its data
    structures on unmount. (It's safe to unmount when restriper is in
    "paused" state, we will resume with the same parameters on the next
    mount)

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • On mount, if balance item is found, resume balance in a separate
    kernel thread.

    Try to be smart to continue roughly where previous balance (or convert)
    was interrupted. For chunk types that were being converted to some
    profile we turn on soft convert, in case of a simple balance we turn on
    usage filter and relocate only less-than-90%-full chunks of that type.
    These are just heuristics but they help quite a bit, and can be improved
    in future.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
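The resume heuristic in this commit can be sketched roughly as below. The dictionary keys and the function name `adjust_args_for_resume` are hypothetical, and "simple balance" is assumed here to mean "no convert target and no usage filter already set"; the message itself does not spell that out.

```python
def adjust_args_for_resume(args):
    """Return a copy of one chunk type's balance arguments, adjusted so a
    resumed balance skips work that was most likely already done."""
    adjusted = dict(args)
    if adjusted.get("convert_target") is not None:
        # A conversion was in progress: skip chunks already converted.
        adjusted["soft"] = True
    elif "usage" not in adjusted:
        # Plain balance: only relocate chunks that are less than 90% full.
        adjusted["usage"] = 90
    return adjusted
```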
     
    When converting from one profile to another with soft mode on,
    restriper won't touch chunks that already have the profile we are
    converting to. This is useful if e.g. half of the FS was converted
    earlier.

    The soft mode switch is (like every other filter) per-type. This means
    that we can convert for example meta chunks the "hard" way while
    converting data chunks selectively with soft switch.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
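A minimal sketch of the soft-mode decision, assuming profiles can be compared for equality; `should_convert_chunk` is a hypothetical name and profiles are plain strings here.

```python
def should_convert_chunk(chunk_profile, target_profile, soft):
    """Decide whether a chunk is relocated during a convert."""
    if soft and chunk_profile == target_profile:
        return False   # already has the target profile, leave it alone
    return True
```

Because the switch is per chunk type, a caller would evaluate this with a different `soft` value for data, metadata and system chunks.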
     
  • Profile changing is done by launching a balance with
    BTRFS_BALANCE_CONVERT bits set and target fields of respective
    btrfs_balance_args structs initialized. Profile reducing code in this
    case will pick restriper's target profile if it's available instead of
    doing a blind reduce. If target profile is not yet available it goes
    back to a plain reduce.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Select chunks which have at least one byte located inside a given
    [vstart, vend) virtual address space range.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
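"At least one byte inside a half-open range" is an interval overlap test. A minimal sketch, with the hypothetical name `chunk_in_vrange` and the chunk described by its start offset and length:

```python
def chunk_in_vrange(chunk_start, chunk_len, vstart, vend):
    """True if [chunk_start, chunk_start + chunk_len) overlaps [vstart, vend)."""
    chunk_end = chunk_start + chunk_len
    return chunk_start < vend and vstart < chunk_end
```

A chunk that ends exactly at `vstart`, or starts exactly at `vend`, shares no byte with the range and is not selected.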
     
  • Select chunks which have at least one byte of at least one stripe
    located on a device with devid X in a given [pstart,pend) physical
    address range.

    This filter only works when devid filter is turned on.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
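The physical-range filter applies the same overlap test per stripe, restricted to one device. In this sketch `Stripe` and `chunk_in_drange` are hypothetical names, and each stripe is assumed to carry its own length, which simplifies how stripe extents are actually derived.

```python
from typing import NamedTuple


class Stripe(NamedTuple):
    devid: int      # device the stripe lives on
    physical: int   # start offset on that device
    length: int     # bytes occupied on that device (assumed known)


def chunk_in_drange(stripes, devid, pstart, pend):
    """True if any stripe on `devid` overlaps [pstart, pend)."""
    return any(
        s.devid == devid
        and s.physical < pend
        and pstart < s.physical + s.length
        for s in stripes
    )
```

The required `devid` argument reflects the restriction stated in the commit: a physical range only means something relative to a specific device.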
     
  • Relocate chunks which have at least one stripe located on a device with
    devid X.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Select chunks that are less than X percent full.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
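A sketch of the usage test using integer arithmetic only. The name `chunk_below_usage` is hypothetical, and the strict comparison follows the "less than X percent" wording of the commit message.

```python
def chunk_below_usage(used_bytes, chunk_size, usage_percent):
    """True if the chunk is less than `usage_percent` percent full."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Cross-multiplied form of used_bytes / chunk_size < usage_percent / 100.
    return used_bytes * 100 < usage_percent * chunk_size
```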
     
  • Select chunks based on a given profile mask.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
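Selecting by profile mask amounts to a bitwise AND. The bit values below are made up for illustration and are not the on-disk flag values; `chunk_matches_profiles` is a hypothetical name.

```python
# Hypothetical bit assignments, for illustration only.
PROFILE_RAID0 = 1 << 0
PROFILE_RAID1 = 1 << 1
PROFILE_SINGLE = 1 << 2


def chunk_matches_profiles(chunk_profile, profiles_mask):
    """True if the chunk's profile bit is set in the requested mask."""
    return bool(chunk_profile & profiles_mask)
```

Note that this sketch gives "single" its own bit so that it can appear in a mask like any other profile.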
     
    This allows a separate set of filters for each chunk type
    (data,meta,sys). The code however is generic and the switch on chunk type
    is only done once.

    This commit also adds a type filter: it allows balancing, for example,
    meta and system chunks w/o touching data ones.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
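One way to picture per-type filter sets with a single dispatch on chunk type is sketched below. The dictionary layout and the name `select_chunks` are hypothetical; filters are modelled as plain predicates over a chunk.

```python
def select_chunks(chunks, filters_by_type):
    """Pick the chunks to balance.

    `filters_by_type` maps a chunk type ("data", "meta", "sys") to a list
    of predicates. A type that is absent from the mapping is skipped
    entirely, which acts as the type filter.
    """
    selected = []
    for chunk in chunks:
        # The only place where the chunk type is examined.
        filters = filters_by_type.get(chunk["type"])
        if filters is None:
            continue
        # Generic part: every filter for this type must accept the chunk.
        if all(accept(chunk) for accept in filters):
            selected.append(chunk)
    return selected
```

Passing `{"meta": [], "sys": []}` would balance every metadata and system chunk while leaving data chunks untouched.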