16 Dec, 2011

1 commit

  • Running xfstests 269 with some tracing my scripts kept spitting out errors about
    releasing bytes that we didn't actually have reserved. This took me down a huge
    rabbit hole and it turns out the way we deal with reserved_extents is wrong,
    we need to only be setting it if the reservation succeeds, otherwise the free()
    method will come in and unreserve space that isn't actually reserved yet, which
    can lead to other warnings and such. The math was all working out right in the
    end, but it caused all sorts of other issues in addition to making my scripts
    yell and scream and generally make it impossible for me to track down the
    original issue I was looking for. The other problem is with our error handling
    in the reservation code. There are two cases that we need to deal with

    1) We raced with free. In this case free won't free anything because csum_bytes
    is modified before we drop the lock in our reservation path, so free rightly
    doesn't release any space because the reservation code may be depending on that
    reservation. However if we fail, we need the reservation side to do the free at
    that point since that space is no longer in use. So as it stands the code was
    doing this fine and it worked out, except in case #2

    2) We don't race with free. Nobody comes in and changes anything, and our
    reservation fails. In this case we didn't reserve anything anyway and we just
    need to clean up csum_bytes but not free anything. So we keep track of
    csum_bytes before we drop the lock and if it hasn't changed we know we can just
    decrement csum_bytes and carry on.

    Because of the case where we can race with free()'s since we have to drop our
    spin_lock to do the reservation, I'm going to serialize all reservations with
    the i_mutex. We already get this for free in the heavy use paths, truncate and
    file write all hold the i_mutex, just needed to add it to page_mkwrite and
    various ioctl/balance things. With this patch my space leak scripts no longer
    scream bloody murder. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
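
The two failure cases described above can be sketched in user space as follows. This is a toy model under stated assumptions, not the btrfs code: `toy_inode_rsv` and every `toy_*` function are hypothetical names, one reserved byte is assumed per byte of `csum_bytes`, and the lock is implied rather than modelled. In this linear model the "release excess" step would also yield 0 in the no-race case; the snapshot test only makes the two cases explicit.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the per-inode accounting; not the real btrfs structures. */
struct toy_inode_rsv {
    uint64_t csum_bytes; /* bytes accounted for, including in-flight requests */
    uint64_t reserved;   /* bytes that were actually reserved */
};

/* Release whatever is reserved beyond what csum_bytes still accounts for. */
static uint64_t toy_release_excess(struct toy_inode_rsv *ino)
{
    uint64_t excess = 0;

    if (ino->reserved > ino->csum_bytes) {
        excess = ino->reserved - ino->csum_bytes;
        ino->reserved -= excess;
    }
    return excess;
}

/* Under the lock: account the new bytes, then snapshot csum_bytes. */
static uint64_t toy_begin_reserve(struct toy_inode_rsv *ino, uint64_t nbytes)
{
    ino->csum_bytes += nbytes;
    return ino->csum_bytes;
}

/* free() side; it may run while the reserving thread has dropped the lock. */
static uint64_t toy_free(struct toy_inode_rsv *ino, uint64_t nbytes)
{
    assert(ino->csum_bytes >= nbytes);
    ino->csum_bytes -= nbytes;
    return toy_release_excess(ino);
}

/* Lock re-taken after the reservation failed; returns the bytes to release. */
static uint64_t toy_fail_reserve(struct toy_inode_rsv *ino, uint64_t nbytes,
                                 uint64_t snapshot)
{
    int raced = (ino->csum_bytes != snapshot);

    assert(ino->csum_bytes >= nbytes);
    ino->csum_bytes -= nbytes;
    if (!raced)
        return 0;                   /* case 2: nothing was ever reserved */
    return toy_release_excess(ino); /* case 1: do the free that was deferred */
}
```

With 100 bytes accounted and reserved, a failed 50-byte reservation with no racing free releases nothing. If a 40-byte free slips in while the lock is dropped, that free releases nothing itself and the failure path releases the 40 bytes instead.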
     

11 Nov, 2011

1 commit

  • If the root node of a fs/file tree is in the block group that is
    being relocated, but its other nodes are not, and we create a
    snapshot for this tree between the end of relocation tree creation
    and the point where ->create_reloc_tree is set to 0, Btrfs will create
    some backref nodes that are the lowest nodes of the backref cache.
    But we forget to add them to the ->leaves list of the backref cache
    and deal with them, and in the end they trigger BUG_ON().

    kernel BUG at fs/btrfs/relocation.c:239!

    This patch fixes it by adding them into ->leaves list of backref cache.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     

21 Oct, 2011

1 commit


20 Oct, 2011

6 commits

  • Currently btrfs_block_rsv_check does 2 things, it will either refill a block
    reserve like in the truncate or refill case, or it will check to see if there is
    enough space in the global reserve and possibly refill it. However because of
    overcommit we could be well overcommitting ourselves just to try and refill the
    global reserve, when really we should just be committing the transaction. So
    break this out into btrfs_block_rsv_refill and btrfs_block_rsv_check. Refill
    will try to reserve more metadata if it can and btrfs_block_rsv_check will not,
    it will only tell you if the factor of the total space is still reserved.
    Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
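
The split between "check" and "refill" described above might look roughly like this in a user-space toy. All names here (`toy_rsv`, `toy_space`, `toy_rsv_check`, `toy_rsv_refill`) are hypothetical, and expressing the factor in tenths of the reserve size is an assumption of this sketch, not something the commit message states.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy block reserve: how big it should be and how much is reserved now. */
struct toy_rsv {
    uint64_t size;
    uint64_t reserved;
};

/* Toy pool that new reservations are taken from. */
struct toy_space {
    uint64_t free_bytes;
};

/* Bytes that must be reserved to satisfy `tenths`/10 of the reserve size. */
static uint64_t toy_needed(const struct toy_rsv *rsv, unsigned int tenths)
{
    assert(tenths <= 10);
    return rsv->size * tenths / 10;
}

/* check: only reports; it never tries to reserve anything. */
static int toy_rsv_check(const struct toy_rsv *rsv, unsigned int tenths)
{
    return rsv->reserved >= toy_needed(rsv, tenths) ? 0 : -ENOSPC;
}

/* refill: tries to reserve the missing amount from the pool. */
static int toy_rsv_refill(struct toy_rsv *rsv, struct toy_space *space,
                          unsigned int tenths)
{
    uint64_t needed = toy_needed(rsv, tenths);
    uint64_t missing;

    if (rsv->reserved >= needed)
        return 0;
    missing = needed - rsv->reserved;
    if (space->free_bytes < missing)
        return -ENOSPC;
    space->free_bytes -= missing;
    rsv->reserved += missing;
    return 0;
}
```

A caller that gets -ENOSPC from the check variant can then decide to commit the transaction rather than reserve more.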
     
  • Johannes pointed out we were allocating only kernel pages for doing writes,
    which is kind of a big deal if you are on 32bit and have more than a gig of ram.
    So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we
    don't re-enter. Thanks,

    Reported-by: Johannes Weiner
    Signed-off-by: Josef Bacik

    Josef Bacik
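
The mask manipulation described above amounts to one bit operation. The flag values below are made up for illustration (the kernel's real bit values differ), and `toy_write_gfp` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

/* Hypothetical bit values for illustration only; the kernel's differ. */
#define TOY_GFP_IO      0x1u
#define TOY_GFP_FS      0x2u
#define TOY_GFP_HIGHMEM 0x4u

/*
 * Keep everything the mapping allows (including high memory), but never
 * let the allocator call back into the filesystem.
 */
static unsigned int toy_write_gfp(unsigned int mapping_gfp)
{
    unsigned int gfp = mapping_gfp & ~TOY_GFP_FS;

    assert((gfp & TOY_GFP_FS) == 0);
    return gfp;
}
```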
     
  • The only thing that we need to have a trans handle for is in
    reserve_metadata_bytes and that's to know how much flushing we can do. So
    instead of passing it around, just check current->journal_info for a
    trans_handle so we know if we can commit a transaction to try and free up space
    or not. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • If you run xfstests 224 you will get lots of messages about not being able to
    delete inodes and that they will be cleaned up next mount. This is because
    btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to
    flush, so if there was not enough space, it simply failed. But in truncate and
    evict case we could easily flush space to try and get enough space to do our
    work, so make btrfs_block_rsv_check take a flush argument to pass down to
    reserve_metadata_bytes. Now xfstests 224 runs fine without all those
    complaints. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • The priority and refill_used flags are not used anymore, and neither is the
    usage counter, so just remove them from btrfs_block_rsv.

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • This is confusing code and isn't used by anything anymore, so delete it.

    Signed-off-by: Josef Bacik

    Josef Bacik
     

28 Jul, 2011

2 commits


20 Jun, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: avoid delayed metadata items during commits
    btrfs: fix uninitialized return value
    btrfs: fix wrong reservation when doing delayed inode operations
    btrfs: Remove unused sysfs code
    btrfs: fix dereference of ERR_PTR value
    Btrfs: fix relocation races
    Btrfs: set no_trans_join after trying to expand the transaction
    Btrfs: protect the pending_snapshots list with trans_lock
    Btrfs: fix path leakage on subvol deletion
    Btrfs: drop the delalloc_bytes check in shrink_delalloc
    Btrfs: check the return value from set_anon_super

    Linus Torvalds
     

18 Jun, 2011

1 commit

  • The recent commit to get rid of our trans_mutex introduced
    some races with block group relocation. The problem is that relocation
    needs to do some record keeping about each root, and it was relying
    on the transaction mutex to coordinate things in subtle ways.

    This fix adds a mutex just for the relocation code and makes sure
    it doesn't have a big impact on normal operations. The race is
    really fixed in btrfs_record_root_in_trans, which is where we
    step back and wait for the relocation code to finish accounting
    setup.

    Signed-off-by: Chris Mason

    Chris Mason
     

05 Jun, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
    btrfs: fix uninitialized variable warning
    btrfs: add helper for fs_info->closing
    Btrfs: add mount -o inode_cache
    btrfs: scrub: add explicit plugging
    btrfs: use btrfs_ino to access inode number
    Btrfs: don't save the inode cache if we are deleting this root
    btrfs: false BUG_ON when degraded
    Btrfs: don't save the inode cache in non-FS roots
    Btrfs: make sure we don't overflow the free space cache crc page
    Btrfs: fix uninit variable in the delayed inode code
    btrfs: scrub: don't reuse bios and pages
    Btrfs: leave spinning on lookup and map the leaf
    Btrfs: check for duplicate entries in the free space cache
    Btrfs: don't try to allocate from a block group that doesn't have enough space
    Btrfs: don't always do readahead
    Btrfs: try not to sleep as much when doing slow caching
    Btrfs: kill BTRFS_I(inode)->block_group
    Btrfs: don't look at the extent buffer level 3 times in a row
    Btrfs: map the node block when looking for readahead targets
    Btrfs: set range_start to the right start in count_range_bits
    ...

    Linus Torvalds
     

28 May, 2011

2 commits

  • git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into for-linus

    Conflicts:
    fs/btrfs/disk-io.c
    fs/btrfs/extent-tree.c
    fs/btrfs/free-space-cache.c
    fs/btrfs/inode.c
    fs/btrfs/transaction.c

    Signed-off-by: Chris Mason

    Chris Mason
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (58 commits)
    Btrfs: use the device_list_mutex during write_dev_supers
    Btrfs: setup free ino caching in a more asynchronous way
    btrfs scrub: don't coalesce pages that are logically discontiguous
    Btrfs: return -ENOMEM in clear_extent_bit
    Btrfs: add mount -o auto_defrag
    Btrfs: using rcu lock in the reader side of devices list
    Btrfs: drop unnecessary device lock
    Btrfs: fix the race between remove dev and alloc chunk
    Btrfs: fix the race between reading and updating devices
    Btrfs: fix bh leak on __btrfs_open_devices path
    Btrfs: fix unsafe usage of merge_state
    Btrfs: allocate extent state and check the result properly
    fs/btrfs: Add missing btrfs_free_path
    Btrfs: check return value of btrfs_inc_extent_ref()
    Btrfs: return error to caller if read_one_inode() fails
    Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item
    Btrfs: return error code to caller when btrfs_del_item fails
    Btrfs: return error code to caller when btrfs_previous_item fails
    btrfs: fix typo 'testeing' -> 'testing'
    btrfs: typo: 'btrfS' -> 'btrfs'
    ...

    Linus Torvalds
     

24 May, 2011

3 commits

  • Our readahead is sort of sloppy, and really isn't always needed. For example if
    ls is doing a stating ls (which is the default) it's going to stat in non-disk
    order, so if say you have a directory with a stupid amount of files, readahead
    is going to do nothing but waste time in the case of doing the stat. Taking the
    unconditional readahead out made my test go from 57 minutes to 36 minutes. This
    means that everywhere we do loop through the tree we want to make sure we do set
    path->reada properly, so I went through and found all of the places where we
    loop through the path and set reada to 1. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We use trans_mutex for lots of things, here's a basic list

    1) To serialize trans_handles joining the currently running transaction
    2) To make sure that no new trans handles are started while we are committing
    3) To protect the dead_roots list and the transaction lists

    Serializing trans_handles joining is not too hard, but with a mutex it can get
    bogged down in acquiring a reference to the transaction. So replace the
    trans_mutex with a trans_lock spinlock and use it to do the following:

    1) Protect fs_info->running_transaction. All trans handles have to do is check
    this, and then take a reference of the transaction and keep on going.
    2) Protect the fs_info->trans_list. This doesn't get used too much, basically
    it just holds the current transactions, which will usually just be the currently
    committing transaction and the currently running transaction at most.
    3) Protect the dead roots list. This is only ever processed by splicing the
    list so this is relatively simple.
    4) Protect the fs_info->reloc_ctl stuff. This is very lightweight and was using
    the trans_mutex before, so this is a pretty straightforward change.
    5) Protect fs_info->no_trans_join. Because we don't hold the trans_lock over
    the entirety of the commit we need to have a way to block new people from
    creating a new transaction while we're doing our work. So we set no_trans_join
    and in join_transaction we test to see if that is set, and if it is we do a
    wait_on_commit.
    6) Make the transaction use count atomic so we don't need to take locks to
    modify it when we're dropping references.
    7) Add a commit_lock to the transaction to make sure multiple people trying to
    commit the same transaction don't race and commit at the same time.
    8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl
    trans.

    I have tested this with xfstests, but obviously it is a pretty hairy change so
    lots of testing is greatly appreciated. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
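
Items 1, 5 and 6 of the list above can be sketched with C11 atomics. This is a user-space toy under stated assumptions: `toy_fs_info`, `toy_transaction` and the `toy_*` functions are hypothetical, an `atomic_flag` stands in for the spinlock, and a join that is refused simply returns NULL where the message describes waiting on the commit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_transaction {
    atomic_int use_count; /* item 6: dropped without taking any lock */
};

struct toy_fs_info {
    atomic_flag trans_lock;          /* stand-in for the trans_lock spinlock */
    struct toy_transaction *running; /* item 1: protected by trans_lock */
    bool no_trans_join;              /* item 5: protected by trans_lock */
};

static void toy_fs_init(struct toy_fs_info *fs, struct toy_transaction *t)
{
    atomic_flag_clear(&fs->trans_lock);
    atomic_init(&t->use_count, 1); /* the reference held by the fs itself */
    fs->running = t;
    fs->no_trans_join = false;
}

static void toy_lock(struct toy_fs_info *fs)
{
    while (atomic_flag_test_and_set_explicit(&fs->trans_lock,
                                             memory_order_acquire))
        ; /* spin */
}

static void toy_unlock(struct toy_fs_info *fs)
{
    atomic_flag_clear_explicit(&fs->trans_lock, memory_order_release);
}

/* Returns the running transaction with a reference taken, or NULL. */
static struct toy_transaction *toy_join(struct toy_fs_info *fs)
{
    struct toy_transaction *t = NULL;

    toy_lock(fs);
    if (!fs->no_trans_join && fs->running) {
        t = fs->running;
        atomic_fetch_add(&t->use_count, 1);
    }
    toy_unlock(fs);
    return t;
}

/* Dropping a reference needs no lock because the counter is atomic. */
static void toy_put(struct toy_transaction *t)
{
    int old = atomic_fetch_sub(&t->use_count, 1);

    assert(old > 0);
    (void)old;
}

/* The committer sets or clears no_trans_join under the lock. */
static void toy_block_joins(struct toy_fs_info *fs, bool block)
{
    toy_lock(fs);
    fs->no_trans_join = block;
    toy_unlock(fs);
}
```

The lock is held only long enough to read two fields and bump a counter, which is the reason a spinlock is sufficient in this model.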
     
  • I keep forgetting that btrfs_join_transaction() just ignores the num_items
    argument, which leads me to sending pointless patches and looking stupid :). So
    just kill the num_items argument from btrfs_join_transaction and
    btrfs_start_ioctl_transaction, since neither of them use it. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

23 May, 2011

2 commits


21 May, 2011

1 commit


12 May, 2011

1 commit

  • This adds an initial implementation for scrub. It works in a quite
    straightforward way. User mode issues an ioctl for each device in the
    fs. For each device, it enumerates the allocated device chunks. For
    each chunk, the contained extents are enumerated and the data checksums
    fetched. The extents are read sequentially and the checksums verified.
    If an error occurs (checksum or EIO), a good copy is searched for. If
    one is found, the bad copy will be rewritten.
    All enumerations happen from the commit roots. During a transaction
    commit, the scrubs get paused and afterwards continue from the new
    roots.

    This commit is based on the series originally posted to linux-btrfs
    with some improvements that resulted from comments from David Sterba,
    Ilya Dryomov and Jan Schmidt.

    Signed-off-by: Arne Jansen

    Arne Jansen
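
The verify-and-repair step described above can be sketched for a single block that exists in several copies. This is an illustration only: the checksum is a toy polynomial sum rather than the filesystem's real checksum, and `toy_csum`, `toy_scrub_copy` and `TOY_BLOCK` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCK 8 /* toy block size in bytes */

/* Toy checksum; stands in for the real data checksum. */
static uint32_t toy_csum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < len; i++)
        sum = sum * 31 + buf[i];
    return sum;
}

/*
 * Scrub copy number `which` of a block. Returns 0 if the copy is clean,
 * 1 if it was bad and has been rewritten from a good copy, and -1 if no
 * good copy could be found.
 */
static int toy_scrub_copy(uint8_t copies[][TOY_BLOCK], size_t ncopies,
                          size_t which, uint32_t expected)
{
    assert(which < ncopies);

    if (toy_csum(copies[which], TOY_BLOCK) == expected)
        return 0;

    for (size_t i = 0; i < ncopies; i++) {
        if (i == which)
            continue;
        if (toy_csum(copies[i], TOY_BLOCK) == expected) {
            memcpy(copies[which], copies[i], TOY_BLOCK); /* rewrite bad copy */
            return 1;
        }
    }
    return -1;
}
```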
     

10 May, 2011

1 commit


06 May, 2011

1 commit

  • Remove static and global declarations and/or definitions. Reduces size
    of btrfs.ko by ~3.4kB.

      text    data     bss     dec     hex filename
    402081    7464     200  409745   64091 btrfs.ko.base
    398620    7144     200  405964   631cc btrfs.ko.remove-all

    Signed-off-by: David Sterba

    David Sterba
     

02 May, 2011

4 commits


25 Apr, 2011

2 commits

  • There's a potential problem on 32-bit systems when we exhaust 32-bit inode
    numbers and start to allocate big inode numbers, because btrfs uses
    inode->i_ino in many places.

    So here we always use BTRFS_I(inode)->location.objectid, which is a
    u64 variable.

    There are 2 exceptions that BTRFS_I(inode)->location.objectid !=
    inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
    and inode->i_ino will be used in those cases.

    Another reason to make this change is I'm going to use a special inode
    to save free ino cache, and the inode number must be > (u64)-256.

    Signed-off-by: Li Zefan

    Li Zefan
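
The truncation problem and the accessor described above can be sketched as follows. This is a toy model: `toy_vfs_inode`, `toy_inode_init` and `toy_ino` are hypothetical, `i_ino` is declared as 32 bits to mimic a 32-bit system, and treating every objectid at or below 256 as "special" is an assumption of this sketch that happens to cover the two exceptions listed in the message.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cut-off for the special cases; chosen for this sketch only. */
#define TOY_SPECIAL_MAX 256u

struct toy_vfs_inode {
    uint32_t i_ino;    /* models a 32-bit inode->i_ino */
    uint64_t objectid; /* models BTRFS_I(inode)->location.objectid */
};

/* Fill in an ordinary inode; i_ino silently loses the upper 32 bits. */
static void toy_inode_init(struct toy_vfs_inode *inode, uint64_t objectid)
{
    inode->objectid = objectid;
    inode->i_ino = (uint32_t)objectid;
}

/* Accessor that returns the full 64-bit number except in the special cases. */
static uint64_t toy_ino(const struct toy_vfs_inode *inode)
{
    if (inode->objectid <= TOY_SPECIAL_MAX)
        return inode->i_ino;
    return inode->objectid;
}
```

For an objectid of 2^32 + 5 the 32-bit `i_ino` reads 5, while `toy_ino()` still returns the full value.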
     
  • Currently btrfs stores the highest objectid of the fs tree, and it always
    returns (highest+1) inode number when we create a file, so inode numbers
    won't be reclaimed when we delete files, so we'll run out of inode numbers
    as we keep creating and deleting files on 32-bit machines.

    This fixes it, and it works similarly to how we cache free space in block
    groups.

    We start a kernel thread to read the file tree. By scanning inode items,
    we know which chunks of inode numbers are free, and we cache them in
    an rb-tree.

    Because we are searching the commit root, we have to carefully handle the
    cross-transaction case.

    The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
    chunks of inode numbers, we'll use bitmaps. Initially we allow 16K of RAM
    for extents, and a bitmap will be used if we exceed this threshold. The
    extents threshold is adjusted at runtime.

    Signed-off-by: Li Zefan

    Li Zefan
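
The scanning step described above, which turns the inode numbers in use into chunks of free numbers, might look like this in isolation. It is a simplified sketch: the input is assumed to be a sorted array rather than tree items, the result is a flat array rather than an rb-tree with bitmaps, and `toy_extent` and `toy_find_free_inos` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A run of `count` free inode numbers starting at `start`. */
struct toy_extent {
    uint64_t start;
    uint64_t count;
};

/*
 * Walk the sorted list of used inode numbers and record the gaps inside
 * [first, last] as free extents. Returns the number of extents written;
 * extents beyond max_out are silently dropped in this toy.
 */
static size_t toy_find_free_inos(const uint64_t *used, size_t nused,
                                 uint64_t first, uint64_t last,
                                 struct toy_extent *out, size_t max_out)
{
    size_t n = 0;
    uint64_t next = first; /* lowest number not yet classified */

    assert(first <= last && last < UINT64_MAX);

    for (size_t i = 0; i < nused && next <= last; i++) {
        if (used[i] < next)
            continue; /* below the range, or a duplicate */
        if (used[i] > last)
            break;
        if (used[i] > next && n < max_out)
            out[n++] = (struct toy_extent){ next, used[i] - next };
        next = used[i] + 1;
    }
    if (next <= last && n < max_out)
        out[n++] = (struct toy_extent){ next, last - next + 1 };
    return n;
}
```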
     

31 Mar, 2011

1 commit


28 Mar, 2011

1 commit


18 Mar, 2011

1 commit

  • If we cannot truncate an inode for some reason we will never delete the orphan
    item associated with that inode, which means that we will loop forever in
    btrfs_orphan_cleanup. Instead of doing this just return error so we fail to
    mount. It sucks, but hey it's better than hanging. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
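
Why a failed truncate used to hang, and how returning the error avoids it, can be shown with a toy cleanup loop. `toy_orphan` and `toy_orphan_cleanup` are hypothetical, and -EIO is just a placeholder error code chosen for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_orphan {
    bool present;        /* orphan item still exists */
    bool truncate_fails; /* simulate an inode that cannot be truncated */
};

/*
 * Process orphan items until none are left. The search always restarts
 * from the first remaining item, so an item that is never deleted would
 * be found again on every pass.
 */
static int toy_orphan_cleanup(struct toy_orphan *items, size_t n)
{
    assert(items != NULL || n == 0);

    for (;;) {
        size_t i = 0;

        while (i < n && !items[i].present)
            i++;
        if (i == n)
            return 0; /* no orphan items left */

        if (items[i].truncate_fails)
            return -EIO; /* give up instead of finding this item forever */

        items[i].present = false; /* truncate worked, delete the orphan item */
    }
}
```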
     

17 Feb, 2011

1 commit

  • Btrfs device shrinking and balancing ends up reallocating all the blocks
    in order to allow COW to move them to new destinations. It is somewhat
    awkward in terms of ENOSPC because most of the enospc code is built
    around the idea that some operation on a reference counted tree triggers
    allocations in the non-reference counted trees.

    This commit changes the balancing code to deal with enospc by trying to
    allocate a new chunk. If that allocation succeeds, we go ahead and
    retry whatever failed due to enospc.

    Signed-off-by: Chris Mason

    Chris Mason
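
The retry pattern described above is sketched below in a user-space toy. The names (`toy_alloc_space`, `toy_cow_block`, `toy_alloc_chunk`, `toy_relocate_block`) and the chunk size are hypothetical; only the control flow (fail with ENOSPC, allocate a chunk, retry once) follows the message.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_CHUNK 64u /* toy chunk size */

struct toy_alloc_space {
    uint64_t free_bytes;  /* space in already allocated chunks */
    uint64_t unallocated; /* raw space not yet turned into chunks */
};

/* The operation that may run out of space. */
static int toy_cow_block(struct toy_alloc_space *s, uint64_t need)
{
    if (s->free_bytes < need)
        return -ENOSPC;
    s->free_bytes -= need;
    return 0;
}

/* Turn one chunk of raw space into usable free space. */
static int toy_alloc_chunk(struct toy_alloc_space *s)
{
    if (s->unallocated < TOY_CHUNK)
        return -ENOSPC;
    s->unallocated -= TOY_CHUNK;
    s->free_bytes += TOY_CHUNK;
    return 0;
}

/* On ENOSPC, try to allocate a new chunk and retry the operation once. */
static int toy_relocate_block(struct toy_alloc_space *s, uint64_t need)
{
    int ret = toy_cow_block(s, need);

    if (ret != -ENOSPC)
        return ret;
    ret = toy_alloc_chunk(s);
    if (ret)
        return ret;
    return toy_cow_block(s, need);
}
```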
     

15 Feb, 2011

1 commit


01 Feb, 2011

1 commit


29 Jan, 2011

1 commit

  • Add error checks for btrfs_join_transaction()/btrfs_join_transaction_nolock()
    and correct the mistaken error checks in several places.

    To make Btrfs more stable, I think that we should reduce the use of BUG_ON().
    But I think that this will take a long time.
    So I propose this patch as a short-term solution.

    With this patch:
    - The parts that should be corrected to make Btrfs more stable are made clear.
    - NULL pointer dereferences etc. no longer cause a panic (even if the number
    of BUG_ON() calls increases temporarily).
    - The error code is returned in the places where an error can easily be
    returned.

    As a long-term plan:
    - BUG_ON() is reduced by using the forced-readonly framework, etc.

    Signed-off-by: Tsutomu Itoh
    Signed-off-by: Chris Mason

    Tsutomu Itoh
     

30 Oct, 2010

1 commit

  • These are all the cases where a variable is set but not
    read that are really bugs.

    - A couple of cases of incorrect error handling fixed.
    - One incorrect use of an allocation policy
    - Some other things

    Still needs more review.

    Found by gcc 4.6's new warnings.

    [akpm@linux-foundation.org: fix build. Might have been bitrot]
    Signed-off-by: Andi Kleen
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Chris Mason

    Andi Kleen
     

29 Oct, 2010

1 commit