25 Apr, 2011

4 commits

  • This is similar to block group caching.

    We dedicate a special inode in the fs tree to save the free ino cache.

    The first time we create/delete a file after mount, the free ino
    cache will be loaded from disk into memory. When the fs tree is committed,
    the cache will be written back to disk.

    To keep compatibility, we check the root generation against the generation
    of the special inode when loading the cache, so the load will fail if the
    filesystem was previously mounted by an older kernel (a minimal sketch of
    the check follows this entry).

    Signed-off-by: Li Zefan

    Li Zefan
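
    A minimal sketch of the compatibility check described above, using
    illustrative names rather than the actual btrfs symbols: the cache is
    trusted only if its stored generation matches the current generation
    of the fs tree.

    #include <stdbool.h>
    #include <stdint.h>

    struct ino_cache_header {
            uint64_t generation;    /* stamped when the cache was last written */
    };

    /*
     * Reject the cache when the fs tree changed without the cache being
     * rewritten, e.g. after the filesystem was mounted by an older kernel
     * that knows nothing about the free ino cache.
     */
    static bool ino_cache_valid(uint64_t root_generation,
                                const struct ino_cache_header *hdr)
    {
            return hdr->generation == root_generation;
    }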
     
  • Currently btrfs stores the highest objectid of the fs tree, and it always
    returns (highest+1) as the inode number when we create a file, so inode
    numbers are never reclaimed when we delete files. That means we'll run out
    of inode numbers if we keep creating and deleting files on 32-bit machines.

    This fixes it, and it works similarly to how we cache free space in block
    groups.

    We start a kernel thread to read the file tree. By scanning inode items,
    we know which chunks of inode numbers are free, and we cache them in
    an rb-tree.

    Because we are searching the commit root, we have to carefully handle the
    cross-transaction case.

    The rb-tree is a hybrid extent+bitmap tree: if we accumulate too many small
    chunks of inode numbers, we switch to bitmaps. Initially we allow 16K of RAM
    for extent entries, and a bitmap will be used once we exceed this threshold.
    The extents threshold is adjusted at runtime (an illustrative sketch of the
    decision follows this entry).

    Signed-off-by: Li Zefan

    Li Zefan
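
    An illustrative userspace sketch of the extent-vs-bitmap decision; the
    16K budget comes from the description above, while the struct layout is
    an assumption, not the kernel's:

    #include <stddef.h>
    #include <stdint.h>

    #define EXTENT_RAM_BUDGET (16 * 1024)   /* initial RAM allowed for extents */

    struct free_ino_extent {
            uint64_t start;         /* first free inode number in the run */
            uint64_t count;         /* length of the run */
    };

    /* How many extent entries fit in the budget; the kernel additionally
     * re-adjusts this threshold at runtime. */
    static size_t extents_threshold(void)
    {
            return EXTENT_RAM_BUDGET / sizeof(struct free_ino_extent);
    }

    /* Past the threshold, new small runs of free inode numbers are folded
     * into bitmaps instead of getting extent entries of their own. */
    static int should_use_bitmap(size_t nr_extents)
    {
            return nr_extents >= extents_threshold();
    }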
     
  • This makes the free space cache code generic, so we can re-use it to
    cache free inode numbers.

    The change is quite straightforward. Two new structures are introduced.

    - struct btrfs_free_space_ctl

    We move those variables that are used for caching free space from
    struct btrfs_block_group_cache to this new struct.

    - struct btrfs_free_space_op

    We do block group specific work (e.g. calculation of extents threshold)
    through functions registered in this struct.

    And then we can remove the references to struct btrfs_block_group_cache
    (both new structs are sketched after this entry).

    Signed-off-by: Li Zefan

    Li Zefan
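
    A simplified sketch of the two structures, with the field set trimmed
    and approximated from kernels of that era, so treat the details as
    indicative rather than exact:

    #include <linux/spinlock.h>
    #include <linux/rbtree.h>
    #include <linux/types.h>

    struct btrfs_free_space;

    /* Generic caching state, shared by block groups and the free ino cache. */
    struct btrfs_free_space_ctl {
            spinlock_t tree_lock;
            struct rb_root free_space_offset;  /* extents + bitmaps, offset indexed */
            u64 free_space;
            int extents_thresh;     /* when to start using bitmaps */
            int free_extents;
            int total_bitmaps;
            int unit;               /* sectorsize, or 1 for inode numbers */
            u64 start;
            struct btrfs_free_space_op *op;    /* caller-specific hooks */
            void *private;
    };

    /* Hooks for the work that differs between users, e.g. threshold
     * calculation for block groups. */
    struct btrfs_free_space_op {
            void (*recalc_thresholds)(struct btrfs_free_space_ctl *ctl);
            bool (*use_bitmap)(struct btrfs_free_space_ctl *ctl,
                               struct btrfs_free_space *info);
    };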
     
  • We've already recorded the value in block_group->free_space.

    Signed-off-by: Li Zefan

    Li Zefan
     

28 Mar, 2011

1 commit

  • We take a free extent out of the allocator, trim it, then put it back.
    Before we trim a block group, we need to make sure the block group is
    cached, so this also includes a small change to let cache_block_group()
    run without a transaction (the flow is modeled after this entry).

    Signed-off-by: Li Dongyang
    Signed-off-by: Chris Mason

    Li Dongyang
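
    A userspace model of that flow, with made-up names; the point is only
    the ordering: an extent is out of the allocator while it is being
    discarded, then returned.

    #include <stdint.h>
    #include <stdio.h>

    struct extent { uint64_t start, len; };

    static void discard_range(uint64_t start, uint64_t len)
    {
            /* stands in for issuing a discard to the underlying device */
            printf("discard [%llu, +%llu)\n",
                   (unsigned long long)start, (unsigned long long)len);
    }

    static uint64_t trim_free_extents(struct extent *free_list, int nr)
    {
            uint64_t trimmed = 0;

            for (int i = 0; i < nr; i++) {
                    /* 1. take the extent out of the free space cache,
                     * 2. trim it, 3. put it back for the allocator */
                    discard_range(free_list[i].start, free_list[i].len);
                    trimmed += free_list[i].len;
            }
            return trimmed;
    }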
     

29 Oct, 2010

3 commits

  • This patch actually loads the free space cache if it exists. The only thing
    that really changes here is that we need to cache the block group if we're
    going to remove an extent from it. Previously we did not do this, since the
    caching kthread would pick it up. With the on-disk cache we don't have this
    luxury, so we need to make sure we read the on-disk cache in first and then
    remove the extent; that way, when the extent is unpinned, the free space is
    added back to the block group. This has been tested with all sorts of things.

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • This is a simple bit: just dump the free space cache out to our preallocated
    inode when we're writing out dirty block groups. There are a bunch of changes
    in inode.c in order to account for special cases. Mostly, when we're doing the
    writeout we're holding trans_mutex, so we need to use the nolock transaction
    functions. Also, we can't do asynchronous completions, since the async thread
    could be blocked on already-completed IO waiting for the transaction lock. This
    has been tested with xfstests and btrfs filesystem balance, as well as my
    ENOSPC tests. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • In order to save the free space cache, we need an inode to hold the data, and
    we need a special item to point at the right inode for the right block group.
    So first, create a special item that points to the right inode and records
    the number of extent entries and the number of bitmaps we will have. We
    truncate and pre-allocate space every time to make sure it's up to date.

    This feature is turned on as soon as you mount with -o space_cache; however,
    it is safe to boot into old kernels, which will just generate the cache the
    old-fashioned way. When you boot back into a newer kernel, we will notice that
    the filesystem was modified but the cache was not, and automatically discard
    the cache (the on-disk item is sketched after this entry).

    Signed-off-by: Josef Bacik

    Josef Bacik
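
    The on-disk item ends up looking roughly like this (simplified from the
    kernel headers of that era): a key locating the cache inode, the
    generation used for the staleness check, and the entry/bitmap counts
    mentioned above.

    #include <linux/types.h>

    struct btrfs_disk_key {
            __le64 objectid;
            __u8 type;
            __le64 offset;
    } __attribute__ ((__packed__));

    /* One item per block group; location points at the cache inode. */
    struct btrfs_free_space_header {
            struct btrfs_disk_key location;
            __le64 generation;      /* compared against the fs generation */
            __le64 num_entries;     /* extent entries stored in the cache */
            __le64 num_bitmaps;     /* bitmaps stored in the cache */
    } __attribute__ ((__packed__));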
     

24 Jul, 2009

1 commit

  • Currently btrfs has a problem where it can use a ridiculous amount of RAM
    simply tracking free space. As free space gets fragmented, we end up with
    thousands of entries on an rb-tree per block group, which usually spans 1
    gigabyte of area. Since we currently don't ever flush the free space cache
    back to disk, this gets to be a bit unwieldy on large filesystems with lots
    of fragmentation.

    This patch solves the problem by using PAGE_SIZE bitmaps for parts of the
    free space cache. Initially we calculate a threshold of extent entries we
    can handle, which is however many extent entries we can cram into 16k of
    RAM. The maximum amount of RAM that should ever be used to track 1 gigabyte
    of diskspace will be 32k of RAM, which scales much better than what we did
    before (the arithmetic is sketched after this entry).

    Once we pass the extent threshold, we start adding bitmaps and using those
    instead for tracking the free space. This patch also makes it so that any
    free space that's less than 4 * sectorsize goes straight into a bitmap. This
    is nice since we try to allocate out of the front of a block group, so if
    the front of a block group is heavily fragmented and then has a huge chunk
    of free space at the end, we add the fragmented areas to bitmaps and use a
    normal extent entry to track the big chunk at the back of the block group.

    I've also taken the opportunity to revamp how we search for free space.
    Previously we indexed free space via an offset-indexed rb-tree and a
    bytes-indexed rb-tree. I've dropped the bytes-indexed rb-tree and use only
    the offset-indexed one. This cuts the number of tree operations we were
    doing previously in half, and gives us a slightly better allocation
    pattern, since we will always start from a specific offset and search
    forward from there, instead of searching for the size we need and trying
    to get it as close as possible to the offset we want.

    I've given this a healthy amount of testing, both with the pre-new-format
    code and the post-new-format code. I've booted up my fedora box, which is
    installed on btrfs with this patch, and ran with it for a few days without
    issues. I've not seen any performance regressions in any of my tests.

    Since the last version of this patch, Yan Zheng fixed a problem where we
    could have overlapping entries, so updating their offset in place would
    cause problems. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
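
    The arithmetic behind the 32k figure, as a standalone sketch assuming
    4K pages and a 4K sectorsize: one PAGE_SIZE bitmap carries one bit per
    sector, so it covers 128MB, and a 1 gigabyte block group needs at most
    eight bitmap pages.

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SIZE_BYTES 4096ULL         /* assumed */
    #define SECTORSIZE      4096ULL         /* assumed */

    int main(void)
    {
            /* one bit per sector */
            uint64_t bytes_per_bitmap = PAGE_SIZE_BYTES * 8 * SECTORSIZE;
            uint64_t block_group = 1024ULL * 1024 * 1024;   /* 1 gigabyte */
            uint64_t nr_bitmaps =
                    (block_group + bytes_per_bitmap - 1) / bytes_per_bitmap;

            printf("one bitmap covers %llu MB\n",
                   (unsigned long long)(bytes_per_bitmap >> 20));
            printf("1GB needs %llu bitmap pages = %lluk of RAM\n",
                   (unsigned long long)nr_bitmaps,
                   (unsigned long long)((nr_bitmaps * PAGE_SIZE_BYTES) >> 10));
            return 0;
    }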
     

10 Jun, 2009

1 commit

  • Some SSDs perform best when reusing block numbers often, while
    others perform much better when clustering strictly allocates
    big chunks of unused space.

    The default mount -o ssd will find rough groupings of blocks
    where there are a bunch of free blocks that might have some
    allocated blocks mixed in.

    mount -o ssd_spread will make sure there are no allocated blocks
    mixed in. It should perform better on lower-end SSDs (a toy predicate
    illustrating the difference follows this entry).

    Signed-off-by: Chris Mason

    Chris Mason
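
    A toy predicate capturing the difference; the "mostly free" ratio for
    plain -o ssd is purely an assumption for illustration, and the real
    clustering logic is considerably more involved.

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Does a candidate window of a block group qualify for a cluster?
     * -o ssd        : a rough grouping, allocated blocks may be mixed in
     * -o ssd_spread : the window must be entirely free
     */
    static bool window_ok(uint64_t free_bytes, uint64_t window_bytes,
                          bool ssd_spread)
    {
            if (ssd_spread)
                    return free_bytes == window_bytes;
            return free_bytes * 2 >= window_bytes;  /* assumed ratio */
    }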
     

03 Apr, 2009

1 commit

  • Because btrfs is copy-on-write, we end up picking new locations for
    blocks very often. This makes it fairly difficult to maintain perfect
    read patterns over time, but we can at least do some optimizations
    for writes.

    This is done today by remembering the last place we allocated and
    trying to find a free space hole big enough to hold more than just one
    allocation. The end result is that we tend to write sequentially to
    the drive.

    This happens all the time for metadata, and it happens for data when
    mounted -o ssd. But the way we record it is fairly racy, and it tends to
    fragment the free space over time because we are trying to allocate
    fairly large areas at once.

    This commit gets rid of the races by adding a free space cluster object
    with dedicated locking to make sure that only one process at a time
    is out replacing the cluster (a simplified view of the structure follows
    this entry).

    The free space fragmentation is somewhat solved by allowing a cluster
    to be composed of smaller free space extents. This part definitely adds
    some CPU time to the cluster allocations, but it allows the allocator
    to consume the small holes left behind by COW.

    Signed-off-by: Chris Mason

    Chris Mason
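
    A simplified view of the cluster object, with the field set approximated
    from kernels of that era: the refill lock is what guarantees only one
    process replaces the cluster at a time, while the inner lock protects
    allocations out of it.

    #include <linux/spinlock.h>
    #include <linux/rbtree.h>
    #include <linux/list.h>
    #include <linux/types.h>

    struct btrfs_block_group_cache;

    struct btrfs_free_cluster {
            spinlock_t lock;        /* protects allocating out of the cluster */
            spinlock_t refill_lock; /* only one task may (re)fill the cluster */
            struct rb_root root;    /* the smaller free extents it is built from */
            u64 max_size;           /* largest extent in the cluster */
            u64 window_start;       /* start of the current allocation window */
            struct btrfs_block_group_cache *block_group;
            struct list_head block_group_list;
    };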