19 Apr, 2011

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
    Btrfs: fix free space cache leak
    Btrfs: avoid taking the chunk_mutex in do_chunk_alloc
    Btrfs end_bio_extent_readpage should look for locked bits
    Btrfs: don't force chunk allocation in find_free_extent
    Btrfs: Check validity before setting an acl
    Btrfs: Fix incorrect inode nlink in btrfs_link()
    Btrfs: Check if btrfs_next_leaf() returns error in btrfs_real_readdir()
    Btrfs: Check if btrfs_next_leaf() returns error in btrfs_listxattr()
    Btrfs: make uncache_state unconditional
    btrfs: using cached extent_state in set/unlock combinations
    Btrfs: avoid taking the trans_mutex in btrfs_end_transaction
    Btrfs: fix subvolume mount by name problem when default mount subvolume is set
    fix user annotation in ioctl.c
    Btrfs: check for duplicate iov_base's when doing dio reads
    btrfs: properly handle overlapping areas in memmove_extent_buffer
    Btrfs: fix memory leaks in btrfs_new_inode()
    Btrfs: check for duplicate iov_base's when doing dio reads
    Btrfs: reuse the extent_map we found when calling btrfs_get_extent
    Btrfs: do not use async submit for small DIO io's
    Btrfs: don't split dio bios if we don't have to
    ...

    Linus Torvalds
     

13 Apr, 2011

2 commits


12 Apr, 2011

4 commits


09 Apr, 2011

6 commits

  • Apparently it is ok to submit a read to an IDE device with the same target page
    for different offsets. This is what Windows does under qemu. The problem is
    under DIO we expect them to be different buffers for checksumming reasons, and
    so this sort of thing will result in checksum errors, when in reality the file
    is fine. So when reading, check to make sure that all iov bases are different,
    and if they aren't fall back to buffered mode, since that will work out right.
    Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
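
The duplicate-buffer check described above can be sketched in userspace C. This is an illustration only, not the kernel patch: `iov_bases_unique` is a hypothetical helper, and the quadratic scan assumes the iovec array is short.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical helper: returns 1 if every iov_base is distinct, 0 otherwise.
 * A caller would fall back to buffered I/O when this returns 0. */
static int iov_bases_unique(const struct iovec *iov, size_t nr_segs)
{
    for (size_t i = 0; i < nr_segs; i++)
        for (size_t j = i + 1; j < nr_segs; j++)
            if (iov[i].iov_base == iov[j].iov_base)
                return 0;
    return 1;
}
```

Only the base pointers are compared here, which matches the wording of the commit message; overlapping but non-identical buffers are not detected by this sketch.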
     
  • In btrfs_get_block_direct we call btrfs_get_extent to look up the extent for
    the range that we are looking for. If we don't find an extent, btrfs_get_extent
    will insert an extent_map for that area and mark it as a hole. So it does the
    job of allocating a new extent map and inserting it into the io tree. But if
    we're creating a new extent we free it up and redo all of that work. So instead
    pass the em to btrfs_new_extent_direct(), and if it will work just allocate the
    disk space and set it up properly and bypass the freeing/allocating of a new
    extent map and the expensive operation of inserting the thing into the io_tree.
    Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • When looking at our DIO performance Chris said that for small IO's doing the
    async submit stuff tends to be more overhead than it's worth. With this on top
    of my other fixes I get about a 17-20% speedup doing a sequential dd with 4k
    IO's. Basically if we don't have to split the bio for the map length it's small
    enough to be directly submitted, otherwise go back to the async submit. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We have been unconditionally allocating a new bio and re-adding all pages from
    our original bio to the new bio. This is needed if our original bio is larger
    than our stripe size, but if it is smaller than the stripe size then there is no
    need to do this. So check the map length and if we are under that then go ahead
    and submit the original bio. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
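
The map-length check described above amounts to a single comparison. The sketch below assumes hypothetical names (`choose_submit`, `enum submit_mode`) and plain byte counts in place of the kernel's bio structures.

```c
#include <stdint.h>

enum submit_mode { SUBMIT_ORIGINAL, SUBMIT_SPLIT };

/* Hypothetical decision helper.
 * bio_len: size of the request in bytes.
 * map_len: number of contiguous bytes the mapping covers from the start
 *          offset (assumed to be at most one stripe). */
static enum submit_mode choose_submit(uint64_t bio_len, uint64_t map_len)
{
    /* A request that fits inside the mapped length needs no new bio. */
    return bio_len <= map_len ? SUBMIT_ORIGINAL : SUBMIT_SPLIT;
}
```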
     
  • In the DIO code we often don't update the i_disk_size because the i_size isn't
    updated until after the DIO is completed, so basically we are allocating a path,
    doing a search, and updating the inode item for no reason since nothing changed.
    btrfs_ordered_update_i_size will return 1 if it didn't update i_disk_size, so
    only run btrfs_update_inode if btrfs_ordered_update_i_size returns 0. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
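
A toy model of the return-value convention described above is sketched below. The structure and the `toy_*` functions are hypothetical stand-ins; the real btrfs_ordered_update_i_size does considerably more work than this simplified version.

```c
#include <stdint.h>

struct toy_inode {
    uint64_t i_size;        /* in-memory size */
    uint64_t disk_i_size;   /* size recorded on disk */
    int inode_updates;      /* counts the expensive update calls */
};

/* Stand-in for btrfs_ordered_update_i_size: returns 0 if disk_i_size
 * was changed, 1 if it was left alone. */
static int toy_ordered_update_i_size(struct toy_inode *inode, uint64_t end)
{
    uint64_t new_size = end < inode->i_size ? end : inode->i_size;

    if (new_size <= inode->disk_i_size)
        return 1;
    inode->disk_i_size = new_size;
    return 0;
}

/* Only pay for the inode update when something actually changed. */
static void toy_finish_ordered_io(struct toy_inode *inode, uint64_t end)
{
    if (toy_ordered_update_i_size(inode, end) == 0)
        inode->inode_updates++;   /* stands in for btrfs_update_inode() */
}
```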
     
  • Instead of calling kmap_atomic for everything we set in the inode item, map the
    entire inode item at the start and unmap it at the end. This makes a sequential
    dd of 400mb O_DIRECT something like 1% faster. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

08 Apr, 2011

1 commit


06 Apr, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: don't warn in btrfs_add_orphan
    Btrfs: fix free space cache when there are pinned extents and clusters V2
    Btrfs: Fix uninitialized root flags for subvolumes
    btrfs: clear __GFP_FS flag in the space cache inode
    Btrfs: fix memory leak in start_transaction()
    Btrfs: fix memory leak in btrfs_ioctl_start_sync()
    Btrfs: fix subvol_sem leak in btrfs_rename()
    Btrfs: Fix oops for defrag with compression turned on
    Btrfs: fix /proc/mounts info.
    Btrfs: fix compiler warning in file.c

    Linus Torvalds
     

05 Apr, 2011

4 commits

  • When I moved the orphan adding to btrfs_truncate I missed the fact that during
    orphan cleanup we just add the orphan items to the orphan list without going
    through btrfs_orphan_add, which results in lots of warnings on mount if you have
    any orphan items that need to be truncated. Just remove this warning, since it's
    ok; this will allow all of the normal space accounting to take place. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • The object id of the space cache inode's key is allocated from the relative
    root, just like that of a regular file. So we can't identify the space cache
    inode by checking the object id of the inode's key, and we have to clear the
    __GFP_FS flag at the time we look up the space cache inode.

    Signed-off-by: Miao Xie
    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Miao Xie
     
  • btrfs_rename() does not release the subvol_sem if the transaction failed to start.

    Signed-off-by: Johann Lombardi
    Signed-off-by: Chris Mason

    Johann Lombardi
     
  • When we defrag a file whose size fits into an inline extent, with
    compression enabled, the compress type is set to
    fs_info->compress_type, which is 0 if the btrfs filesystem is mounted
    without a compress option. This leads to an oops.

    Reported-by: Daniel Blueman
    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
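
One way to guard against a compress type of 0 reaching the compression path is a fallback of the following shape. Whether the actual patch does this is an assumption; the enum values, the helper name, and the choice of zlib as the fallback are all hypothetical.

```c
/* Hypothetical values for illustration; 0 means "no compression selected". */
enum toy_compress_type {
    TOY_COMPRESS_NONE = 0,
    TOY_COMPRESS_ZLIB = 1,
    TOY_COMPRESS_LZO  = 2,
};

/* Returns the type to use for a compressing defrag: the mount-time type
 * if one was selected, otherwise an assumed default (zlib here). */
static enum toy_compress_type
defrag_compress_type(enum toy_compress_type mount_type)
{
    return mount_type == TOY_COMPRESS_NONE ? TOY_COMPRESS_ZLIB : mount_type;
}
```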
     

31 Mar, 2011

1 commit


29 Mar, 2011

1 commit

  • …it/mason/btrfs-unstable

    * 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (45 commits)
    Btrfs: fix __btrfs_map_block on 32 bit machines
    btrfs: fix possible deadlock by clearing __GFP_FS flag
    btrfs: check link counter overflow in link(2)
    btrfs: don't mess with i_nlink of unlocked inode in rename()
    Btrfs: check return value of btrfs_alloc_path()
    Btrfs: fix OOPS of empty filesystem after balance
    Btrfs: fix memory leak of empty filesystem after balance
    Btrfs: fix return value of setflags ioctl
    Btrfs: fix uncheck memory allocations
    btrfs: make inode ref log recovery faster
    Btrfs: add btrfs_trim_fs() to handle FITRIM
    Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes
    Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP
    Btrfs: make update_reserved_bytes() public
    btrfs: return EXDEV when linking from different subvolumes
    Btrfs: Per file/directory controls for COW and compression
    Btrfs: add datacow flag in inode flag
    btrfs: use GFP_NOFS instead of GFP_KERNEL
    Btrfs: check return value of read_tree_block()
    btrfs: properly access unaligned checksum buffer
    ...

    Fix up trivial conflicts in fs/btrfs/volumes.c due to plug removal in
    the block layer.

    Linus Torvalds
     

28 Mar, 2011

8 commits

  • Using the GFP_HIGHUSER_MOVABLE flag to allocate the metadata's page may cause
    deadlock.
    Task1
    open()
    ...
    btrfs_search_slot()
    ...
    btrfs_cow_block()
    ...
    alloc_page()
    wait for reclaiming
    shrink_slab()
    ...
    shrink_icache_memory()
    ...
    btrfs_evict_inode()
    ...
    btrfs_search_slot()

    If the path is locked by task1, the deadlock happens.

    So the btree's page cache differs from a file's page cache: it cannot
    allocate pages with the GFP_HIGHUSER_MOVABLE flag as is, and we must clear
    the __GFP_FS flag from GFP_HIGHUSER_MOVABLE.

    Reported-by: Itaru Kitayama
    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
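
Clearing one flag from a composite allocation mask is a plain bit operation, sketched below. The `TOY_GFP_*` bit values are invented for illustration and do not match the kernel's real constants; only the masking technique is the point.

```c
/* Hypothetical bit values, for illustration only. */
#define TOY_GFP_WAIT     0x01u
#define TOY_GFP_IO       0x02u
#define TOY_GFP_FS       0x04u   /* allocation may call back into the fs */
#define TOY_GFP_HIGHMEM  0x08u
#define TOY_GFP_MOVABLE  0x10u

#define TOY_GFP_HIGHUSER_MOVABLE \
    (TOY_GFP_WAIT | TOY_GFP_IO | TOY_GFP_FS | TOY_GFP_HIGHMEM | TOY_GFP_MOVABLE)

/* Returns the mask with the FS bit removed, so that reclaim triggered by
 * this allocation would not re-enter filesystem code. */
static unsigned int toy_btree_gfp_mask(unsigned int mask)
{
    return mask & ~TOY_GFP_FS;
}
```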
     
  • Signed-off-by: Al Viro
    Signed-off-by: Chris Mason

    Al Viro
     
  • old_inode is not locked; it's not safe to play with its link
    count. Instead of bumping it and calling btrfs_unlink_inode(),
    add a variant of the latter that does not do btrfs_drop_nlink()/
    btrfs_update_inode(), call it instead of btrfs_inc_nlink()/
    btrfs_unlink_inode() and do btrfs_update_inode() ourselves.

    Signed-off-by: Al Viro
    Signed-off-by: Chris Mason

    Al Viro
     
  • Add a check on the return value of btrfs_alloc_path() in several places.
    Some of the callers are modified by this change as well.

    Signed-off-by: Tsutomu Itoh
    Signed-off-by: Chris Mason

    Tsutomu Itoh
     
  • To make Btrfs code more robust, several return value checks are introduced
    where memory allocation can fail. I use BUG_ON where I don't know how
    to handle the error properly, though this increases the number of uses of
    the notorious BUG_ON.

    Signed-off-by: Yoshinori Sano
    Signed-off-by: Chris Mason

    Yoshinori Sano
     
  • btrfs_link returns EPERM if a cross-subvolume link is attempted.

    However, in this case I believe EXDEV to be the more appropriate value.
    From the link(2) man page:

    EXDEV oldpath and newpath are not on the same mounted file system. (Linux
    permits a file system to be mounted at multiple points, but link()
    does not work across different mount points, even if the same file
    system is mounted on both.)

    This matters because an application may have different behaviors based on
    return codes.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Chris Mason

    Mark Fasheh
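
An application that branches on the error code might look like the sketch below. `try_link` and `enum link_result` are hypothetical names; only link(2) and the EXDEV errno value come from the text above.

```c
#include <errno.h>
#include <unistd.h>

enum link_result { LINK_OK, LINK_CROSS_DEVICE, LINK_FAILED };

/* Attempts a hard link and classifies the outcome. */
static enum link_result try_link(const char *oldpath, const char *newpath)
{
    if (link(oldpath, newpath) == 0)
        return LINK_OK;
    /* EXDEV signals that a hard link is impossible here by design. */
    return errno == EXDEV ? LINK_CROSS_DEVICE : LINK_FAILED;
}
```

A caller could, for example, fall back to copying the file on LINK_CROSS_DEVICE and report an error on LINK_FAILED, which is the kind of distinction an EPERM return would hide.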
     
  • Data compression and data cow are controlled across the entire FS by mount
    options right now. ioctls are needed to set this on a per file or per
    directory basis. This has been proposed previously, but VFS developers
    wanted us to use generic ioctls rather than btrfs-specific ones.

    According to Chris's comment, there should be just one true compression
    method (probably LZO) stored in the super. However, before that can happen,
    we have to wait until that method is stable enough to be adopted into the
    super. So I list it as a long-term goal, and just store it in RAM today.

    After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to
    control the datacow and compression attributes of files and directories.

    NOTE:
    - The compression type is selected by these rules:
    If we mount btrfs with a compress option, i.e. zlib/lzo, that type is used.
    Otherwise, we'll use the default compress type (zlib today).

    v1->v2:
    - rebase to the latest btrfs.
    v2->v3:
    - fix a problem, i.e. when a file is set NOCOW via mount option, then this NOCOW
    will be screwed by inheritance from parent directory.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • Tracepoints can provide insight into why btrfs hits bugs and be greatly
    helpful for debugging, e.g
    dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
    dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
    btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
    btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
    btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
    flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
    flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
    flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
    flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
    btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
    btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)

    Here is what I have added:

    1) ordered_extent:
    btrfs_ordered_extent_add
    btrfs_ordered_extent_remove
    btrfs_ordered_extent_start
    btrfs_ordered_extent_put

    These provide critical information to understand how ordered_extents are
    updated.

    2) extent_map:
    btrfs_get_extent

    extent_map is used in both read and write cases, and it is useful for tracking
    how btrfs specific IO is running.

    3) writepage:
    __extent_writepage
    btrfs_writepage_end_io_hook

    Pages are critical resources and produce a lot of corner cases during writeback,
    so it is valuable to know how a page is written to disk.

    4) inode:
    btrfs_inode_new
    btrfs_inode_request
    btrfs_inode_evict

    These can show where and when an inode is created and when an inode is evicted.

    5) sync:
    btrfs_sync_file
    btrfs_sync_fs

    These show sync arguments.

    6) transaction:
    btrfs_transaction_commit

    In a transaction-based filesystem, it is useful to know the generation and
    who does the commit.

    7) back reference and cow:
    btrfs_delayed_tree_ref
    btrfs_delayed_data_ref
    btrfs_delayed_ref_head
    btrfs_cow_block

    Btrfs natively supports back references; these tracepoints are helpful for
    understanding btrfs's COW mechanism.

    8) chunk:
    btrfs_chunk_alloc
    btrfs_chunk_free

    A chunk is a link between physical offset and logical offset, and stands for
    space information in btrfs; these are helpful for tracing space usage.

    9) reserved_extent:
    btrfs_reserved_extent_alloc
    btrfs_reserved_extent_free

    These can show how btrfs uses its space.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    liubo
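
Trace lines in the format shown above are regular enough to parse with sscanf. The sketch below handles only btrfs_cow_block events; the struct and function names are hypothetical, and the format string is taken from the sample output rather than from the tracepoint definitions.

```c
#include <stdio.h>
#include <string.h>

struct cow_event {
    unsigned long long root;       /* root id, e.g. 2 */
    char root_name[32];            /* e.g. "EXTENT_TREE" */
    int refs;
    unsigned long long orig_buf;   /* value of the orig_buf field */
    int orig_level;
    unsigned long long cow_buf;    /* value of the cow_buf field */
    int cow_level;
};

/* Returns 0 and fills *ev if the line holds a btrfs_cow_block event in the
 * sample format, -1 otherwise. */
static int parse_cow_block(const char *line, struct cow_event *ev)
{
    const char *p = strstr(line, "btrfs_cow_block:");
    int n;

    if (!p)
        return -1;
    n = sscanf(p,
               "btrfs_cow_block: root = %llu(%31[^)]), refs = %d, "
               "orig_buf = %llu (orig_level = %d), "
               "cow_buf = %llu (cow_level = %d)",
               &ev->root, ev->root_name, &ev->refs,
               &ev->orig_buf, &ev->orig_level,
               &ev->cow_buf, &ev->cow_level);
    return n == 7 ? 0 : -1;
}
```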
     

26 Mar, 2011

2 commits

  • I noticed that dio_end_io calls the appropriate endio function with an error,
    but the endio functions don't actually do anything with that error, they assume
    that if there was an error then the bio will not be uptodate. So if we had
    checksum failures we would never pass back EIO. So if there is an error in our
    endio functions make sure to clear the uptodate flag on the bio. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
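
The error propagation described above can be modelled with a single status bit. Everything in this sketch is a hypothetical stand-in (`toy_bio`, `TOY_BIO_UPTODATE`, the two functions); it only illustrates why the completion handler has to clear the flag for the error to reach the caller.

```c
#include <errno.h>

#define TOY_BIO_UPTODATE 0x1u

struct toy_bio {
    unsigned int flags;
};

/* Completion handler: on any error, clear the uptodate bit so that later
 * stages, which only look at the bit, see the failure. */
static void toy_endio(struct toy_bio *bio, int err)
{
    if (err)
        bio->flags &= ~TOY_BIO_UPTODATE;
}

/* Later stage: derives the result purely from the uptodate bit. */
static int toy_dio_result(const struct toy_bio *bio)
{
    return (bio->flags & TOY_BIO_UPTODATE) ? 0 : -EIO;
}
```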
     
  • When doing direct writes we store the checksums in the ordered sum stuff in the
    ordered extent for writing them when the write completes, so we don't even use
    the dip->csums array. So if we're writing, don't bother allocating dip->csums
    since we won't use it anyway. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

18 Mar, 2011

8 commits