04 Jun, 2009

1 commit


27 Apr, 2009

1 commit

  • Previously, we updated a device's size prior to attempting a shrink
    operation. This patch moves the device resizing logic to only happen if
    the shrink completes successfully. In the process, it introduces a new
    field to btrfs_device -- disk_total_bytes -- to track the on-disk size.
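
    A minimal sketch of the ordering this patch establishes; the helper name
    below is hypothetical, only disk_total_bytes comes from the patch itself:

    /* shrink first; only update the in-memory sizes on success */
    ret = relocate_shrunk_extents(device, new_size);   /* hypothetical helper */
    if (ret)
            return ret;   /* sizes stay untouched if the shrink failed */

    device->total_bytes = new_size;
    device->disk_total_bytes = new_size;   /* new field: size recorded on disk */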

    Signed-off-by: Chris Ball
    Signed-off-by: Chris Mason

    Chris Ball
     

21 Apr, 2009

1 commit

  • Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for
    writes we plan on waiting on in the near future. This patch
    mirrors recent changes in other filesystems and the generic code to
    use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for
    other latency critical writes.

    Btrfs uses async worker threads for checksumming before the write is done,
    and then again to actually submit the bios. The bio submission code just
    runs a per-device list of bios that need to be sent down the pipe.

    This list is split into low priority and high priority lists so the
    WRITE_SYNC IO happens first.
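
    A minimal sketch of the rule described above, assuming the caller has a
    struct writeback_control; this is illustrative rather than the exact
    btrfs submission code:

    int rw = WRITE;

    /* WB_SYNC_ALL means someone will wait on this IO shortly; tag it
     * sync so the IO scheduler treats it as latency critical */
    if (wbc->sync_mode == WB_SYNC_ALL)
            rw = WRITE_SYNC;

    submit_bio(rw, bio);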

    Signed-off-by: Chris Mason

    Chris Mason
     

03 Apr, 2009

2 commits

  • Btrfs pages being written get set to writeback, and then may go through
    a number of steps before they hit the block layer. This includes compression,
    checksumming and async bio submission.

    The end result is that someone who writes a page and then does
    wait_on_page_writeback is likely to unplug the queue before the bio they
    cared about got there.

    We could fix this by marking bios sync, or by doing more frequent unplugs,
    but this commit just changes the async bio submission code to unplug
    after it has processed all the bios for a device. The async bio submission
    does a fair job of collecting bios, so this shouldn't hurt merging at
    the elevator much.
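
    A rough sketch of the unplug-once-per-pass idea using the block layer
    interfaces of that era; the loop and the next_pending_bio() helper are
    illustrative only:

    struct backing_dev_info *bdi = blk_get_backing_dev_info(device->bdev);
    struct bio *cur;

    while ((cur = next_pending_bio(device)) != NULL)   /* hypothetical helper */
            submit_bio(cur->bi_rw, cur);

    /* every queued bio for this device has been handed to the block
     * layer; unplug once here instead of relying on the waiters */
    blk_run_backing_dev(bdi, NULL);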

    For streaming O_DIRECT writes on a 5 drive array, it boosts performance
    from 386MB/s to 460MB/s.

    Thanks to Hisashi Hifumi for helping with this work.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Btrfs uses async helper threads to submit write bios so the checksumming
    helper threads don't block on the disk.

    The submit bio threads may process bios for more than one block device,
    so when they find one device congested they try to move on to other
    devices instead of blocking in get_request_wait for one device.

    This does a pretty good job of keeping multiple devices busy, but the
    congested flag has a number of problems. A congested device may still
    give you a request, and other procs that aren't backing off the congested
    device may starve you out.

    This commit uses the io_context stored in current to decide if our process
    has been made a batching process by the block layer. If so, it keeps
    sending IO down for at least one batch. This helps make sure we do
    a good amount of work each time we visit a bdev, and avoids large IO
    stalls in multi-device workloads.
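
    A sketch of the batching test described above; nr_batch_requests and
    last_waited are fields of struct io_context in kernels of that era, but
    the surrounding loop structure is illustrative:

    /* inside the loop walking a device's pending bios */
    if (pending && bdi_write_congested(bdi)) {
            struct io_context *ioc = current->io_context;

            if (ioc && ioc->nr_batch_requests > 0 &&
                time_before(jiffies, ioc->last_waited + HZ)) {
                    /* the block layer made us a batching process; keep
                     * submitting to this device for the rest of the
                     * batch instead of bouncing to another device */
                    continue;
            }
            /* otherwise requeue the work and move on to other devices */
    }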

    It's also very ugly. A better solution is in the works with Jens Axboe.

    Signed-off-by: Chris Mason

    Chris Mason
     

11 Mar, 2009

2 commits

  • The full flag on the space info structs tells the allocator not to try
    and allocate more chunks because the devices in the FS are fully allocated.

    When more devices are added, we need to clear the full flag so the allocator
    knows it has more space available.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Storage allocated to different raid levels in btrfs is tracked by
    a btrfs_space_info structure, and all of the current space_infos are
    collected into a list_head.

    Most filesystems have 3 or 4 of these structs total, and the list is
    only changed when new raid levels are added or at unmount time.

    This commit adds rcu locking on the list head, and properly frees
    things at unmount time. It also clears the space_info->full flag
    whenever new space is added to the FS.

    The locking for the space info list goes like this:

    reads: protected by rcu_read_lock()
    writes: protected by the chunk_mutex

    At unmount time we don't need special locking because all the readers
    are gone.
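
    A minimal sketch of a reader under this scheme (the function name is
    illustrative; the list and field names follow the btrfs code of the time):

    static struct btrfs_space_info *find_space_info(struct btrfs_fs_info *info,
                                                    u64 flags)
    {
            struct btrfs_space_info *found;

            rcu_read_lock();
            list_for_each_entry_rcu(found, &info->space_info, list) {
                    if (found->flags == flags) {
                            rcu_read_unlock();
                            return found;
                    }
            }
            rcu_read_unlock();
            return NULL;
    }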

    Signed-off-by: Chris Mason

    Chris Mason
     

13 Feb, 2009

1 commit

  • Btrfs is currently using spin_lock_nested with a nested value based
    on the tree depth of the block. But, this doesn't quite work because
    the max tree depth is bigger than what spin_lock_nested can deal with,
    and because locks are sometimes taken before the level field is filled in.

    The solution here is to use lockdep_set_class_and_name instead, and to
    set the class before unlocking the pages when the block is read from the
    disk and just after init of a freshly allocated tree block.
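
    A sketch of how that looks; lockdep_set_class_and_name is the real API,
    while the class array and helper name here are illustrative:

    static struct lock_class_key btrfs_eb_class[BTRFS_MAX_LEVEL + 1];

    /* called after read_tree_block() and after allocating a fresh block,
     * once the level field is known and before anyone takes eb->lock */
    static void set_eb_lockdep_class(struct extent_buffer *eb, int level)
    {
            lockdep_set_class_and_name(&eb->lock, &btrfs_eb_class[level],
                                       "btrfs-tree");
    }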

    btrfs_clear_path_blocking is also changed to take the locks in the proper
    order, and it also makes sure all the locks currently held are properly
    set to blocking before it tries to retake the spinlocks. Otherwise, lockdep
    gets upset about bad lock ordering.

    The lockdep magic came from Peter Zijlstra.

    Signed-off-by: Chris Mason

    Chris Mason
     

12 Feb, 2009

1 commit

  • The call to kzalloc is followed by a kmalloc whose result is stored in the
    same variable.

    The semantic match that finds the problem is as follows:
    (http://www.emn.fr/x-info/coccinelle/)

    // <smpl>
    @r exists@
    local idexpression x;
    statement S;
    expression E;
    identifier f,l;
    position p1,p2;
    expression *ptr != NULL;
    @@

    (
    if ((x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...)) == NULL) S
    |
    x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...);
    ...
    if (x == NULL) S
    )
    <... when != x
         when != if (...) { <+...x...+> }
    x->f = E
    ...>
    (
    return \(0\|<+...x...+>\|ptr\);
    |
    return@p2 ...;
    )

    @script:python@
    p1 << r.p1;
    p2 << r.p2;
    @@

    print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line)
    // </smpl>

    Signed-off-by: Julia Lawall
    Signed-off-by: Chris Mason

    Julia Lawall
     

04 Feb, 2009

1 commit


21 Jan, 2009

3 commits


17 Jan, 2009

1 commit

  • Btrfs maintains a queue of async bio submissions so the checksumming
    threads don't have to wait on get_request_wait. In order to avoid
    extra wakeups, this code has a running_pending flag that is used
    to tell new submissions they don't need to wake the thread.

    When the threads notice congestion on a single device, they
    may decide to requeue the job and move on to other devices. This
    makes sure the running_pending flag is cleared before the
    job is requeued.

    It should help avoid IO stalls by making sure the task is woken up
    when new submissions come in.

    Signed-off-by: Chris Mason

    Chris Mason
     

06 Jan, 2009

1 commit


12 Dec, 2008

1 commit

  • This patch makes it possible for a seed device to be shared by
    multiple mounted file systems. The sharing is achieved
    by cloning the seed device's btrfs_fs_devices structure.
    Thank you,

    Signed-off-by: Yan Zheng

    Yan Zheng
     

09 Dec, 2008

4 commits

  • This adds a sequence number to the btrfs inode that is increased on
    every update. NFS will be able to use that to detect when an inode has
    changed, without relying on inaccurate time fields.

    While we're here, this also:

    Puts reserved space into the super block and inode

    Adds a log root transid to the super so we can pick the newest super
    based on the fsync log as well as the main transaction ID. For now
    the log root transid is always zero, but that'll get fixed.

    Adds a starting offset to the dev_item. This will let us do better
    alignment calculations if we know the start of a partition on the disk.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • It is possible that generic_bin_search will be called on a tree block
    that has not been locked. This happens because cache_block_group skips
    locking on the tree blocks.

    Since the tree block isn't locked, we aren't allowed to change
    the extent_buffer->map_token field. Using map_private_extent_buffer
    avoids any changes to the internal extent buffer fields.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This patch implements superblock duplication. Superblocks
    are stored at offsets 16K, 64M and 256G on every device.
    The space used by superblocks is preserved by the allocator,
    which uses a reverse mapping function to find the logical
    addresses that correspond to superblocks. Thank you,
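
    A sketch of the offset scheme: each additional copy shifts the 16K base
    left by another 12 bits, giving 16K, 64M and 256G (the constant names
    follow ctree.h of that era and are treated as assumptions here):

    #define BTRFS_SUPER_INFO_OFFSET  (16 * 1024)
    #define BTRFS_SUPER_MIRROR_SHIFT 12

    static u64 btrfs_sb_offset(int mirror)
    {
            u64 start = BTRFS_SUPER_INFO_OFFSET;

            if (mirror)
                    return start << (BTRFS_SUPER_MIRROR_SHIFT * mirror);
            return start;   /* 16K, then 64M, then 256G */
    }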

    Signed-off-by: Yan Zheng

    Yan Zheng
     
  • Btrfs stores checksums for each data block. Until now, they have
    been stored in the subvolume trees, indexed by the inode that is
    referencing the data block. This means that when we read the inode,
    we've probably read in at least some checksums as well.

    But, this has a few problems:

    * The checksums are indexed by logical offset in the file. When
    compression is on, this means we have to do the expensive checksumming
    on the uncompressed data. It would be faster if we could checksum
    the compressed data instead.

    * If we implement encryption, we'll be checksumming the plain text and
    storing that on disk. This is significantly less secure.

    * For either compression or encryption, we have to get the plain text
    back before we can verify the checksum as correct. This makes the raid
    layer balancing and extent moving much more expensive.

    * It makes the front end caching code more complex, as we have to touch
    the subvolume and inodes as we cache extents.

    * There is potentially one copy of the checksum in each subvolume
    referencing an extent.

    The solution used here is to store the extent checksums in a dedicated
    tree. This allows us to index the checksums by physical extent
    start and length. It means:

    * The checksum is against the data stored on disk, after any compression
    or encryption is done.

    * The checksum is stored in a central location, and can be verified without
    following back references, or reading inodes.

    This makes compression significantly faster by reducing the amount of
    data that needs to be checksummed. It will also allow much faster
    raid management code in general.

    The checksums are indexed by a key with a fixed objectid (a magic value
    in ctree.h) and offset set to the starting byte of the extent. This
    allows us to copy the checksum items into the fsync log tree directly (or
    any other tree), without having to invent a second format for them.
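
    A sketch of building such a key; the constant names are the ones used in
    ctree.h for this purpose, the surrounding code is illustrative:

    struct btrfs_key key;

    key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;   /* fixed magic objectid */
    key.type = BTRFS_EXTENT_CSUM_KEY;
    key.offset = disk_bytenr;                    /* start byte of the on-disk extent */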

    Signed-off-by: Chris Mason

    Chris Mason
     

02 Dec, 2008

2 commits


20 Nov, 2008

2 commits


18 Nov, 2008

1 commit

  • A seed device is a special btrfs with the SEEDING super flag
    set and can only be mounted in read-only mode. Seed
    devices allow people to create a new btrfs on top of them.

    The new FS contains the same contents as the seed device,
    but it can be mounted in read-write mode.

    This patch does the following:

    1) split the code in btrfs_alloc_chunk into two parts. The first part makes
    the newly allocated chunk usable, but does not do any operation that modifies
    the chunk tree. The second part does the chunk tree modifications. This
    division is for the bootstrap step of adding storage to the seed device.

    2) Update the device management code to handle seed devices.
    The basic idea is: for an FS grown from seed devices, its
    seed devices are put into a list. Seed devices are
    opened on demand at mount time. If any seed device is
    missing or has been changed, the btrfs kernel module will
    refuse to mount the FS.

    3) make btrfs_find_block_group not return NULL when all
    block groups are read-only.
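
    A sketch of the read-only enforcement in 2); the super flag name comes
    from ctree.h, the surrounding error handling is illustrative:

    if ((btrfs_super_flags(disk_super) & BTRFS_SUPER_FLAG_SEEDING) &&
        !(sb->s_flags & MS_RDONLY)) {
            printk(KERN_ERR "btrfs: seed devices can only be mounted read-only\n");
            err = -EINVAL;
            goto fail;
    }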

    Signed-off-by: Yan Zheng

    Yan Zheng
     

13 Nov, 2008

1 commit


08 Nov, 2008

1 commit

  • While doing a commit, btrfs makes sure all the metadata blocks
    were properly written to disk, calling wait_on_page_writeback for
    each page. This writeback happens after allowing another transaction
    to start, so it competes for the disk with other processes in the FS.

    If the page writeback bit is still set, each wait_on_page_writeback might
    trigger an unplug, even though the page might be waiting for checksumming
    to finish or might be waiting for the async work queue to submit the
    bio.

    This trades wait_on_page_writeback for waiting on the extent writeback
    bits. It won't trigger any unplugs and substantially improves performance
    in a number of workloads.

    This also changes the async bio submission to avoid requeueing if there
    is only one device. The requeue just wastes CPU time because there are
    no other devices to service.
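
    A sketch of the single-device shortcut; the exact condition in the patch
    may differ, but the idea is that requeueing only pays off when another
    device could make progress in the meantime:

    if (pending && bdi_write_congested(bdi) &&
        fs_info->fs_devices->open_devices > 1) {
            /* another device can be serviced while this one is congested */
            btrfs_requeue_work(&device->work);
            goto done;
    }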

    Signed-off-by: Chris Mason

    Chris Mason
     

30 Oct, 2008

2 commits

  • This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
    of little locks.

    There is now a pinned_mutex, which is used when messing with the pinned_extents
    extent io tree, and the extent_ins_mutex which is used with the pending_del and
    extent_ins extent io trees.

    The locking for the extent tree stuff was inspired by a patch that Yan Zheng
    wrote to fix a race condition. I cleaned it up some and changed the locking
    around a little bit, but the idea remains the same. Basically instead of
    holding the extent_ins_mutex throughout the processing of an extent on the
    extent_ins or pending_del trees, we just hold it while we're searching and when
    we clear the bits on those trees, and lock the extent for the duration of the
    operations on the extent.

    Also, to keep from getting hung up waiting to lock an extent, I've added a
    try_lock_extent so if we cannot lock an extent, we move on to the next one in the
    tree and come back to it later. I have tested this heavily and it does
    not appear to break anything. This has to be applied on top of my
    find_free_extent redo patch.
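
    A sketch of the skip-locked-extents pattern described above; the loop
    around the new try_lock_extent helper is illustrative:

    if (!try_lock_extent(&info->extent_ins, start, end, GFP_NOFS)) {
            /* someone else holds this extent locked; move on and come
             * back to it on a later pass */
            search_start = end + 1;
            continue;
    }
    /* ... process the pending insert/delete for [start, end] ... */
    unlock_extent(&info->extent_ins, start, end, GFP_NOFS);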

    I tested this patch on top of Yan's space rebalancing code and it worked fine.
    The only thing that has changed since the last version is that I pulled out all my
    debugging stuff; apparently I forgot to run guilt refresh before I sent the
    last patch out. Thank you,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • This is a large change for adding compression on reading and writing,
    both for inline and regular extents. It does some fairly large
    surgery to the writeback paths.

    Compression is off by default and enabled by mount -o compress. Even
    when the -o compress mount option is not used, it is possible to read
    compressed extents off the disk.

    If compression for a given set of pages fails to make them smaller, the
    file is flagged to avoid future compression attempts.
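
    A sketch of that flagging; the inode flag matches the one used for this
    purpose, while the comparison around it is illustrative:

    if (total_compressed >= total_in) {
            /* compression didn't shrink this range; remember that so
             * future writes to the inode skip the attempt */
            BTRFS_I(inode)->flags |= BTRFS_INODE_NOCOMPRESS;
    }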

    * While finding delalloc extents, the pages are locked before being sent down
    to the delalloc handler. This allows the delalloc handler to do complex things
    such as cleaning the pages, marking them writeback and starting IO on their
    behalf.

    * Inline extents are inserted at delalloc time now. This allows us to compress
    the data before inserting the inline extent, and it allows us to insert
    an inline extent that spans multiple pages.

    * All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
    are changed to record both an in-memory size and an on disk size, as well
    as a flag for compression.

    From a disk format point of view, the extent pointers in the file are changed
    to record the on disk size of a given extent and some encoding flags.
    Space in the disk format is allocated for compression encoding, as well
    as encryption and a generic 'other' field. Neither the encryption nor the
    'other' field is currently used.

    In order to limit the amount of data read for a single random read in the
    file, the size of a compressed extent is limited to 128k. This is a
    software only limit; the disk format supports u64 sized compressed extents.

    In order to limit the ram consumed while processing extents, the uncompressed
    size of a compressed extent is limited to 256k. This is a software only limit
    and will be subject to tuning later.

    Checksumming is still done on compressed extents, and it is done on the
    uncompressed version of the data. This way additional encodings can be
    layered on without having to figure out which encoding to checksum.

    Compression happens at delalloc time, which is basically single threaded because
    it is usually done by a single pdflush thread. This makes it tricky to
    spread the compression load across all the cpus on the box. We'll have to
    look at parallel pdflush walks of dirty inodes at a later time.

    Decompression is hooked into readpages and it does spread across CPUs nicely.

    Signed-off-by: Chris Mason

    Chris Mason
     

04 Oct, 2008

1 commit


29 Sep, 2008

1 commit

  • btrfs-vol -a /dev/xxx will zero the first and last two MB of the device.
    The kernel code needs to wait for this IO to finish before it adds
    the device.

    btrfs metadata IO does not happen through the block device inode. A
    separate address space is used, allowing the zero filled buffer heads in
    the block device inode to be written to disk after FS metadata starts
    going down to the disk via the btrfs metadata inode.

    The end result is zero filled metadata blocks after adding new devices
    into the filesystem.

    The fix is a simple filemap_write_and_wait on the block device inode
    before actually inserting it into the pool of available devices.
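
    The fix boils down to something like the sketch below, run before the
    device joins the list of available devices (filemap_write_and_wait is the
    real interface; the context is illustrative):

    /* flush the zeroed regions written by btrfs-vol through the block
     * device inode before btrfs starts writing its own metadata */
    filemap_write_and_wait(bdev->bd_inode->i_mapping);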

    Signed-off-by: Chris Mason

    Chris Mason
     

26 Sep, 2008

2 commits

  • This patch updates the space balancing code to utilize the new
    backref format. Before, btrfs-vol -b would break any COW links
    on data blocks or metadata. This was slow and caused the amount
    of space used to explode if a large number of snapshots were present.

    The new code keeps the sharing of all data extents and
    most of the tree blocks.

    To maintain the sharing of data extents, the space balance code uses
    a separate inode to hold data extent pointers, then updates the references
    to point to the new location.

    To maintain the sharing of tree blocks, the space balance code uses
    reloc trees to relocate tree blocks in reference counted roots.
    There is one reloc tree for each subvol, and all reloc trees share the
    same root key objectid. Reloc trees are snapshots of the latest
    committed roots of subvols (root->commit_root).

    To relocate a tree block referenced by a subvol, there are two steps:
    COW the block through the subvol's reloc tree, then update the block pointer in
    the subvol to point to the new block. Since all reloc trees share the
    same root key objectid, doing special handling for tree blocks
    owned by them is easy. Once a tree block has been COWed in one
    reloc tree, we can use the resulting new block directly when the
    same block is required to COW again through other reloc trees.
    In this way, relocated tree blocks are shared between reloc trees,
    so they are also shared between subvols.

    Signed-off-by: Chris Mason

    Zheng Yan
     
  • Btrfs had compatibility code for kernels back to 2.6.18. That code has
    been removed, and will be maintained in a separate backport
    git tree from now on.

    Signed-off-by: Chris Mason

    Chris Mason
     

25 Sep, 2008

7 commits

  • 1) replace the per fs_info extent_io_tree that tracked free space with two
    rb-trees per block group to track free space areas via offset and size. The
    reason to do this is because most allocations come with a hint byte where to
    start, so we can usually find a chunk of free space at that hint byte to satisfy
    the allocation and get good space packing. If we cannot find free space at or
    after the given offset we fall back on looking for a chunk of the given size as
    close to that given offset as possible. When we fall back on the size search we
    also try to find a slot as close to the size we want as possible, to avoid
    breaking small chunks off of huge areas if possible (the resulting free
    space entry is sketched after this list).

    2) remove the extent_io_tree that tracked the block group cache from fs_info and
    replace it with an rb-tree that tracks the block group cache via offset. Also
    add a per space_info list that tracks the block group cache for the particular
    space so we can look up related block groups easily.

    3) cleaned up the allocation code to make it a little easier to read and a
    little less complicated. Basically there are 3 steps, first look from our
    provided hint. If we couldn't find space from that given hint, start back at our
    original search start and look for space from there. If that fails try to
    allocate space if we can and start looking again. If not we're screwed and need
    to start over again.

    4) small fixes. There were some issues in volumes.c where we wouldn't allocate
    the rest of the disk. Fixed cow_file_range to actually pass the alloc_hint,
    which has helped a good bit in making the fs_mark test I run have semi-normal
    results as we run out of space. Generally with data allocations we don't track
    where we last allocated from, so every time we did a data allocation we'd search
    through every block group that we have looking for free space. While searching a
    block group with no free space isn't terribly time consuming, it was causing a
    slight degradation as we got more data block groups. The alloc_hint has fixed
    this slight degradation and made things semi-normal.
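
    A sketch of the per-block-group free space entry implied by 1); the field
    names are illustrative, modeled on the btrfs free space cache:

    struct btrfs_free_space {
            struct rb_node offset_index;   /* keyed by where the free area starts */
            struct rb_node bytes_index;    /* keyed by how large the free area is */
            u64 offset;
            u64 bytes;
    };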

    There is still one nagging problem I'm working on where we will get ENOSPC when
    there is definitely plenty of space. This only happens with metadata
    allocations, and only when we are almost full. So you generally hit the 85%
    mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
    still tracking it down, but until then this seems to be pretty stable and provides a
    significant performance gain.

    Signed-off-by: Chris Mason

    Josef Bacik
     
  • ---

    Signed-off-by: Chris Mason

    Zheng Yan
     
  • Signed-off-by: Chris Mason

    Chris Mason
     
  • The current code waits for the count of async bio submits to get below
    a given threshold if it is too high right after adding the latest bio
    to the work queue. This isn't optimal because the caller may have
    sequential adjacent bios pending that they are waiting to send down the pipe.

    This changeset requires the caller to wait on the async bio count,
    and changes the async checksumming submits to wait for async bios any
    time they self throttle.

    The end result is much higher sequential throughput.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Before, the btrfs bdi congestion function was used to test for too many
    async bios. This keeps that check to throttle pdflush, but also
    adds a check while queuing bios.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Signed-off-by: Chris Mason

    Chris Mason
     
  • The multi-bio code is responsible for duplicating blocks in raid1 and
    single spindle duplication. It has counters to make sure all of
    the locations for a given extent are properly written before io completion
    is returned to the higher layers.

    But, it didn't always complete the same bio it was given, sometimes a
    clone was completed instead. This led to problems with the async
    work queues because they saved a pointer to the bio in a struct off
    bi_private.

    The fix is to remember the original bio and only complete that one.

    Signed-off-by: Chris Mason

    Chris Mason