27 Sep, 2016

1 commit

  • For many printks, we want to know which file system issued the message.

    This patch converts most pr_* calls to use the btrfs_* versions instead.
    In some cases, this means adding plumbing to allow call sites access to
    an fs_info pointer.

    fs/btrfs/check-integrity.c is left alone for another day.

    Signed-off-by: Jeff Mahoney
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Jeff Mahoney
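As a rough illustration of why the message helpers need an fs_info pointer, here is a minimal userspace sketch that prefixes each message with the device it came from. Everything in it is an assumption made for illustration: `struct fs_info_sketch`, `fs_format_msg()`, and the prefix layout are hypothetical and are not the kernel's btrfs_* helpers.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the per-filesystem context. */
struct fs_info_sketch {
    const char *device;
};

/*
 * Format "FS <level> (device <dev>): <message>" into buf.
 * Returns the number of characters the full message needs,
 * or a negative value on a formatting error.
 */
static int fs_format_msg(char *buf, size_t len,
                         const struct fs_info_sketch *fs,
                         const char *level, const char *fmt, ...)
{
    va_list ap;
    int n, m;

    assert(buf != NULL && len > 0 && fmt != NULL);

    /* The prefix is what a plain pr_*-style call cannot provide. */
    n = snprintf(buf, len, "FS %s (device %s): ", level,
                 fs ? fs->device : "<unknown>");
    if (n < 0 || (size_t)n >= len)
        return n;

    va_start(ap, fmt);
    m = vsnprintf(buf + n, len - (size_t)n, fmt, ap);
    va_end(ap);

    return m < 0 ? m : n + m;
}
```

A call site that has no context pointer can only pass NULL and gets `<unknown>`, which is the situation the added plumbing is meant to avoid.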
     

26 May, 2016

1 commit


07 Jan, 2016

1 commit


22 Oct, 2015

1 commit

  • We can waste a lot of time searching through bitmaps when we are heavily
    fragmented trying to find large contiguous areas that don't exist in the bitmap.
    So keep track of the max extent size when we do a full search of a bitmap so
    that next time around we can just skip the expensive searching if our max size
    is less than what we are looking for. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
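The idea can be sketched in a few lines of C. This is a simplified userspace model, not the kernel code: the 64-bit `bitmap_entry`, the `max_extent` hint field, and the `scans_done` counter are all hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_MAP 64u /* assumption: one 64-bit word per bitmap */

/* Hypothetical bitmap entry; a set bit means "this unit is free". */
struct bitmap_entry {
    uint64_t bits;
    unsigned max_extent; /* longest free run found by the last full scan; 0 = unknown */
};

static unsigned scans_done; /* counts expensive scans, for demonstration only */

/* Find `want` contiguous free units; on success store the first bit in *start. */
static bool bitmap_search(struct bitmap_entry *e, unsigned want, unsigned *start)
{
    unsigned run = 0, longest = 0;

    assert(start != NULL && want > 0 && want <= BITS_PER_MAP);

    /* Cheap rejection: a previous full scan proved no run this long exists. */
    if (e->max_extent && e->max_extent < want)
        return false;

    scans_done++;
    for (unsigned i = 0; i < BITS_PER_MAP; i++) {
        if (e->bits & (UINT64_C(1) << i)) {
            if (++run == want) {
                *start = i + 1 - want;
                return true;
            }
        } else {
            if (run > longest)
                longest = run;
            run = 0;
        }
    }
    if (run > longest)
        longest = run;

    /* The whole bitmap was scanned, so the longest run is now known. */
    e->max_extent = longest;
    return false;
}

/* Freeing a unit can create a longer run, so the hint must be dropped. */
static void bitmap_set_free(struct bitmap_entry *e, unsigned bit)
{
    assert(bit < BITS_PER_MAP);
    e->bits |= UINT64_C(1) << bit;
    e->max_extent = 0;
}
```

In this sketch the hint is only recorded after a scan that covered the whole bitmap, and it is reset whenever space is added back, since that is the only change that could make a longer run appear.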
     

11 Apr, 2015

2 commits

  • We loop through all of the dirty block groups during commit and write
    the free space cache. In order to make sure the cache is correct, we do
    this while no other writers are allowed in the commit.

    If a large number of block groups are dirty, this can introduce long
    stalls during the final stages of the commit, which can block new procs
    trying to change the filesystem.

    This commit changes the block group cache writeout to take appropriate
    locks and allow it to run earlier in the commit. We'll still have to
    redo some of the block groups, but it means we can get most of the work
    out of the way without blocking the entire FS.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Block group cache writeout is currently waiting on the pages for each
    block group cache before moving on to writing the next one. This commit
    switches things around to send down all the caches and then wait on them
    in batches.

    The end result is much faster, since we're keeping the disk pipeline
    full.

    Signed-off-by: Chris Mason

    Chris Mason
     

03 Dec, 2014

1 commit

  • Trimming is completely transactionless: it hides free space entries from
    a block group, performs the trim/discard, and then makes the free space
    entries visible again.
    Therefore, while a free space entry is being trimmed, free space cache
    writeout can run in parallel (as part of a transaction commit) and miss
    that entry. If the filesystem is unmounted (or crashes/reboots) after that
    transaction commit but before another transaction commits after the
    discard finishes, then on the next mount we will have some free space
    that won't be used again unless the free space cache is rebuilt. After the
    unmount, fsck (btrfsck, btrfs check) reports the issue, as in the following
    example:

    *** fsck.btrfs output ***
    checking extents
    checking free space cache
    There is no free space entry for 521764864-521781248
    There is no free space entry for 521764864-1103101952
    cache appears valid but isnt 29360128
    Checking filesystem on /dev/sdc
    UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5
    found 1235681286 bytes used err is -22
    (...)

    Another issue caused by this race is a crash while writing bitmap entries
    to the cache, because while the cache writeout task accesses the bitmaps,
    the trim task can be concurrently modifying the bitmap or worse might
    be freeing the bitmap. The latter case results in the following crash:

    [55650.804460] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
    [55650.804835] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop parport_pc parport i2c_piix4 psmouse evdev pcspkr microcode processor i2ccore serio_raw thermal_sys button ext4 crc16 jbd2 mbcache sg sd_mod crc_t10dif sr_mod cdrom crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 [last unloaded: btrfs]
    [55650.806169] CPU: 1 PID: 31002 Comm: btrfs-transacti Tainted: G W 3.17.0-rc5-btrfs-next-1+ #1
    [55650.806493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    [55650.806867] task: ffff8800b12f6410 ti: ffff880071538000 task.ti: ffff880071538000
    [55650.807166] RIP: 0010:[] [] write_bitmap_entries+0x65/0xbb [btrfs]
    [55650.807514] RSP: 0018:ffff88007153bc30 EFLAGS: 00010246
    [55650.807687] RAX: 000000005d1ec000 RBX: ffff8800a665df08 RCX: 0000000000000400
    [55650.807885] RDX: ffff88005d1ec000 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88005d1ec000
    [55650.808017] RBP: ffff88007153bc58 R08: 00000000ddd51536 R09: 00000000000001e0
    [55650.808017] R10: 0000000000000000 R11: 0000000000000037 R12: 6b6b6b6b6b6b6b6b
    [55650.808017] R13: ffff88007153bca8 R14: 6b6b6b6b6b6b6b6b R15: ffff88007153bc98
    [55650.808017] FS: 0000000000000000(0000) GS:ffff88023ec80000(0000) knlGS:0000000000000000
    [55650.808017] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [55650.808017] CR2: 0000000002273b88 CR3: 00000000b18f6000 CR4: 00000000000006e0
    [55650.808017] Stack:
    [55650.808017] ffff88020e834e00 ffff880172d68db0 0000000000000000 ffff88019257c800
    [55650.808017] ffff8801d42ea720 ffff88007153bd10 ffffffffa037d2fa ffff880224e99180
    [55650.808017] ffff8801469a6188 ffff880224e99140 ffff880172d68c50 00000003000000b7
    [55650.808017] Call Trace:
    [55650.808017] [] __btrfs_write_out_cache+0x1ea/0x37f [btrfs]
    [55650.808017] [] btrfs_write_out_cache+0xa1/0xd8 [btrfs]
    [55650.808017] [] btrfs_write_dirty_block_groups+0x4b5/0x505 [btrfs]
    [55650.808017] [] commit_cowonly_roots+0x15e/0x1f7 [btrfs]
    [55650.808017] [] ? _raw_spin_lock+0xe/0x10
    [55650.808017] [] btrfs_commit_transaction+0x411/0x882 [btrfs]
    [55650.808017] [] transaction_kthread+0xf2/0x1a4 [btrfs]
    [55650.808017] [] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
    [55650.808017] [] kthread+0xb7/0xbf
    [55650.808017] [] ? __kthread_parkme+0x67/0x67
    [55650.808017] [] ret_from_fork+0x7c/0xb0
    [55650.808017] [] ? __kthread_parkme+0x67/0x67
    [55650.808017] Code: 4c 89 ef 8d 70 ff e8 d4 fc ff ff 41 8b 45 34 41 39 45 30 7d 5c 31 f6 4c 89 ef e8 80 f6 ff ff 49 8b 7d 00 4c 89 f6 b9 00 04 00 00 a5 4c 89 ef 41 8b 45 30 8d 70 ff e8 a3 fc ff ff 41 8b 45 34
    [55650.808017] RIP [] write_bitmap_entries+0x65/0xbb [btrfs]
    [55650.808017] RSP
    [55650.815725] ---[ end trace 1c032e96b149ff86 ]---

    Fix this by serializing both tasks in such a way that cache writeout
    doesn't wait for the trim/discard of free space entries to finish and
    doesn't miss any free space entry.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

12 Nov, 2013

2 commits


21 Sep, 2013

1 commit

  • With the current code, if the requested size is very large and all the
    extents in the free space cache are small, we waste a lot of CPU time
    cutting the requested size in half and searching the cache again and again
    until it gets down to a size the allocator can return. In fact, we know the
    max extent size in the cache after the first search, so we needn't halve
    the size repeatedly; we can just use the max extent size directly. This
    saves a lot of CPU time and improves performance when there are only
    fragments in the free space cache.

    According to my test, if there are only 4KB free space extents in the fs,
    and the total size of those extents is 256MB, we can reduce the execution
    time of the following test from 5.4s to 1.4s.
    dd if=/dev/zero of= bs=1MB count=1 oflag=sync

    Changelog v2 -> v3:
    - fix the problem that we skip the block group with the space which is
    less than we need.

    Changelog v1 -> v2:
    - address the problem that we return a wrong start position when searching
    the free space in a bitmap.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Miao Xie
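The difference between the two retry strategies can be sketched as follows. This is a toy model under stated assumptions, not the btrfs allocator: the flat array of `free_extent` records, the `search_cache()` helper, and the `searches` counter are hypothetical and exist only to count how many cache walks each strategy needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-space record. */
struct free_extent {
    uint64_t offset;
    uint64_t bytes;
};

static unsigned searches; /* counts cache walks, for demonstration only */

/*
 * Walk the cache for an extent of at least `bytes`. Returns its index,
 * or -1 if none fits; in that case *max_extent holds the largest size seen.
 */
static int search_cache(const struct free_extent *c, size_t n,
                        uint64_t bytes, uint64_t *max_extent)
{
    assert(max_extent != NULL);
    searches++;
    *max_extent = 0;
    for (size_t i = 0; i < n; i++) {
        if (c[i].bytes >= bytes)
            return (int)i;
        if (c[i].bytes > *max_extent)
            *max_extent = c[i].bytes;
    }
    return -1;
}

/* Old strategy: halve the request until something fits. */
static uint64_t alloc_halving(const struct free_extent *c, size_t n,
                              uint64_t bytes, uint64_t min_bytes)
{
    uint64_t unused;

    while (bytes >= min_bytes) {
        if (search_cache(c, n, bytes, &unused) >= 0)
            return bytes;
        bytes /= 2;
    }
    return 0;
}

/* New strategy: the first failed walk already reports the largest extent. */
static uint64_t alloc_max_hint(const struct free_extent *c, size_t n,
                               uint64_t bytes, uint64_t min_bytes)
{
    uint64_t max_seen, unused;

    if (search_cache(c, n, bytes, &max_seen) >= 0)
        return bytes;
    if (max_seen < min_bytes)
        return 0;
    return search_cache(c, n, max_seen, &unused) >= 0 ? max_seen : 0;
}
```

With 64 extents of 4096 bytes each and a 1 MiB request, the halving loop walks the cache nine times, while the hinted version walks it twice.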
     

01 Sep, 2013

2 commits

  • The plan is to have a bunch of unit tests that run when btrfs is loaded when you
    build with the appropriate config option. My ultimate goal is to have a test
    for every non-static function we have, but at first I'm going to focus on the
    things that cause us the most problems. To start out with this just adds a
    tests/ directory and moves the existing free space cache tests into that
    directory and sets up all of the infrastructure. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • I noticed while looking at a deadlock that we are always starting a transaction
    in cow_file_range(). This isn't really needed since we only need a transaction
    if we are doing an inline extent, or if the allocator needs to allocate a chunk.
    So push down all the transaction start stuff to be closer to where we actually
    need a transaction in all of these cases. This will hopefully reduce our write
    latency when we are committing often. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

14 Jun, 2013

1 commit


18 May, 2013

1 commit

  • It is very likely that there are lots of subvolumes/snapshots in the filesystem,
    so if we use global block reservation to do inode cache truncation, we may hog
    all the free space that is reserved in global rsv. So it is better that we do
    the free space reservation for inode cache truncation by ourselves.

    Cc: Tsutomu Itoh
    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     

07 May, 2013

1 commit

  • We keep hitting bugs in the tree log replay because btrfs_remove_free_space
    doesn't account for some corner case. So add a bunch of tests to try and fully
    test btrfs_remove_free_space since the only time it is called is during tree log
    replay. These tests all finish successfully, so as we find more of these bugs
    we need to add to these tests to make sure we don't regress in fixing things.
    I've hidden the tests behind a Kconfig option, but they take no time to run so
    all btrfs developers should have this turned on all the time. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

25 Apr, 2011

4 commits

  • This is similar to block group caching.

    We dedicate a special inode in fs tree to save free ino cache.

    The very first time we create/delete a file after mount, the free ino
    cache will be loaded from disk into memory. When the fs tree is committed,
    the cache will be written back to disk.

    To keep compatibility, we check the root generation against the generation
    of the special inode when loading the cache, so the loading will fail
    if the btrfs filesystem was previously mounted by an older kernel.

    Signed-off-by: Li Zefan

    Li Zefan
     
  • Currently btrfs stores the highest objectid of the fs tree, and it always
    returns (highest+1) as the inode number when we create a file. Inode
    numbers are therefore not reclaimed when we delete files, so we'll run out
    of inode numbers as we keep creating and deleting files on 32-bit machines.

    This fixes it, and it works similarly to how we cache free space in block
    groups.

    We start a kernel thread to read the file tree. By scanning inode items,
    we know which chunks of inode numbers are free, and we cache them in
    an rb-tree.

    Because we are searching the commit root, we have to carefully handle the
    cross-transaction case.

    The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
    chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
    of extents, and a bitmap will be used if we exceed this threshold. The
    extents threshold is adjusted at runtime.

    Signed-off-by: Li Zefan

    Li Zefan
     
  • So we can re-use the code to cache free inode numbers.

    The change is quite straightforward. Two new structures are introduced.

    - struct btrfs_free_space_ctl

    We move those variables that are used for caching free space from
    struct btrfs_block_group_cache to this new struct.

    - struct btrfs_free_space_op

    We do block group specific work (e.g. calculation of extents threshold)
    through functions registered in this struct.

    And then we can remove references to struct btrfs_block_group_cache.

    Signed-off-by: Li Zefan

    Li Zefan
     
  • We've already recorded the value in block_group->free_space.

    Signed-off-by: Li Zefan

    Li Zefan
     

28 Mar, 2011

1 commit

  • We take a free extent out of the allocator, trim it, then put it back.
    Before we trim the block group, we should make sure the block group is
    cached, so this also includes a small change to let cache_block_group()
    run without a transaction.

    Signed-off-by: Li Dongyang
    Signed-off-by: Chris Mason

    Li Dongyang
     

29 Oct, 2010

3 commits

  • This patch actually loads the free space cache if it exists. The only thing
    that really changes here is that we need to cache the block group if we're going
    to remove an extent from it. Previously we did not do this since the caching
    kthread would pick it up. With the on disk cache we don't have this luxury so
    we need to make sure we read the on disk cache in first, and then remove the
    extent, that way when the extent is unpinned the free space is added to the
    block group. This has been tested with all sorts of things.

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • This is a simple bit, just dump the free space cache out to our preallocated
    inode when we're writing out dirty block groups. There are a bunch of changes
    in inode.c in order to account for special cases. Mostly when we're doing the
    writeout we're holding trans_mutex, so we need to use the nolock transaction
    functions. Also we can't do asynchronous completions since the async thread
    could be blocked on already completed IO waiting for the transaction lock. This
    has been tested with xfstests and btrfs filesystem balance, as well as my ENOSPC
    tests. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • In order to save free space cache, we need an inode to hold the data, and we
    need a special item to point at the right inode for the right block group. So
    first, create a special item that will point to the right inode, and the number
    of extent entries we will have and the number of bitmaps we will have. We
    truncate and pre-allocate space every time to make sure it's up to date.

    This feature will be turned on as soon as you mount with -o space_cache; however,
    it is safe to boot into old kernels, which will just generate the cache the
    old-fashioned way. When you boot back into a newer kernel, we will notice that
    the filesystem was modified and the cache was not, and automatically discard
    the cache.

    Signed-off-by: Josef Bacik

    Josef Bacik
     

24 Jul, 2009

1 commit

  • Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
    tracking free space. As free space gets fragmented, we end up with thousands of
    entries on an rb-tree per block group, which usually spans 1 gig of area. Since
    we currently don't ever flush free space cache back to disk this gets to be a
    bit unwieldy on large fs's with lots of fragmentation.

    This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
    space cache. Initially we calculate a threshold of extent entries we can
    handle, which is however many extent entries we can cram into 16k of ram. The
    maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
    will be 32k of RAM, which scales much better than we did before.

    Once we pass the extent threshold, we start adding bitmaps and using those
    instead for tracking the free space. This patch also puts any free space
    that's smaller than 4 * sectorsize into a bitmap. This is
    nice since we try and allocate out of the front of a block group, so if the
    front of a block group is heavily fragmented and then has a huge chunk of free
    space at the end, we go ahead and add the fragmented areas to bitmaps and use a
    normal extent entry to track the big chunk at the back of the block group.

    I've also taken the opportunity to revamp how we search for free space.
    Previously we indexed free space via an offset indexed rb tree and a bytes
    indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
    indexed rb tree. This cuts the number of tree operations we were doing
    previously down by half, and gives us a little bit of a better allocation
    pattern since we will always start from a specific offset and search forward
    from there, instead of searching for the size we need and try and get it as
    close as possible to the offset we want.

    I've given this a healthy amount of testing pre-new format stuff, as well as
    post-new format stuff. I've booted up my fedora box which is installed on btrfs
    with this patch and ran with it for a few days without issues. I've not seen
    any performance regressions in any of my tests.

    Since the last patch Yan Zheng fixed a problem where we could have overlapping
    entries, so updating their offset inline would cause problems. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
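The 32k figure can be checked with a little arithmetic. The sketch below assumes a 4096-byte page, a 4096-byte sectorsize, and one bitmap bit per sector; those values and all of the helper names are assumptions made for illustration and are not taken from the patch. The `use_bitmap()` function models only the two rules stated in the message: small free ranges go to bitmaps, and so does everything else once the extent threshold is reached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BYTES   4096u /* assumed page size */
#define SECTOR_BYTES 4096u /* assumed sectorsize */
#define GIB          (UINT64_C(1) << 30)

/* One page-sized bitmap holds this many bits. */
static uint64_t bits_per_bitmap(void)
{
    return (uint64_t)PAGE_BYTES * 8;
}

/* With one bit per sector, one bitmap covers this much disk space. */
static uint64_t bytes_per_bitmap(void)
{
    return bits_per_bitmap() * SECTOR_BYTES;
}

/* RAM needed if 1 GiB of disk space were tracked by bitmaps alone. */
static uint64_t bitmap_ram_per_gib(void)
{
    return GIB / bytes_per_bitmap() * PAGE_BYTES;
}

/* Decide whether a newly freed range should be tracked in a bitmap. */
static bool use_bitmap(uint64_t free_bytes, unsigned extents_in_use,
                       unsigned extents_thresh)
{
    assert(extents_thresh > 0);

    /* Small fragments always go to a bitmap. */
    if (free_bytes < 4 * (uint64_t)SECTOR_BYTES)
        return true;

    /* Larger ranges stay as extent entries until the threshold is hit. */
    return extents_in_use >= extents_thresh;
}
```

Under these assumptions each bitmap covers 128 MiB, so eight bitmaps cover 1 GiB and occupy 8 * 4096 = 32768 bytes, which is consistent with the 32k bound quoted above.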
     

10 Jun, 2009

1 commit

  • Some SSDs perform best when reusing block numbers often, while
    others perform much better when clustering strictly allocates
    big chunks of unused space.

    The default mount -o ssd will find rough groupings of blocks
    where there are a bunch of free blocks that might have some
    allocated blocks mixed in.

    mount -o ssd_spread will make sure there are no allocated blocks
    mixed in. It should perform better on lower end SSDs.

    Signed-off-by: Chris Mason

    Chris Mason
     

03 Apr, 2009

1 commit

  • Because btrfs is copy-on-write, we end up picking new locations for
    blocks very often. This makes it fairly difficult to maintain perfect
    read patterns over time, but we can at least do some optimizations
    for writes.

    This is done today by remembering the last place we allocated and
    trying to find a free space hole big enough to hold more than just one
    allocation. The end result is that we tend to write sequentially to
    the drive.

    This happens all the time for metadata and it happens for data
    when mounted -o ssd. But, the way we record it is fairly racy
    and it tends to fragment the free space over time because we are trying
    to allocate fairly large areas at once.

    This commit gets rid of the races by adding a free space cluster object
    with dedicated locking to make sure that only one process at a time
    is out replacing the cluster.

    The free space fragmentation is somewhat solved by allowing a cluster
    to be comprised of smaller free space extents. This part definitely
    adds some CPU time to the cluster allocations, but it allows the allocator
    to consume the small holes left behind by cow.

    Signed-off-by: Chris Mason

    Chris Mason