04 Jan, 2012

2 commits


16 Dec, 2011

2 commits

  • …/btrfs-work into integration

    Conflicts:
    fs/btrfs/inode.c

    Signed-off-by: Chris Mason <chris.mason@oracle.com>

    Chris Mason
     
  • Running xfstests 269 with some tracing my scripts kept spitting out errors about
    releasing bytes that we didn't actually have reserved. This took me down a huge
    rabbit hole and it turns out the way we deal with reserved_extents is wrong,
    we need to only be setting it if the reservation succeeds, otherwise the free()
    method will come in and unreserve space that isn't actually reserved yet, which
    can lead to other warnings and such. The math was all working out right in the
    end, but it caused all sorts of other issues in addition to making my scripts
    yell and scream and generally make it impossible for me to track down the
    original issue I was looking for. The other problem is with our error handling
    in the reservation code. There are two cases that we need to deal with

    1) We raced with free. In this case free won't free anything because csum_bytes
    is modified before we drop the lock in our reservation path, so free rightly
    doesn't release any space because the reservation code may be depending on that
    reservation. However if we fail, we need the reservation side to do the free at
    that point since that space is no longer in use. So as it stands the code was
    doing this fine and it worked out, except in case #2

    2) We don't race with free. Nobody comes in and changes anything, and our
    reservation fails. In this case we didn't reserve anything anyway and we just
    need to clean up csum_bytes but not free anything. So we keep track of
    csum_bytes before we drop the lock and if it hasn't changed we know we can just
    decrement csum_bytes and carry on.
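
    The two cases above can be sketched roughly as follows. This is a
    simplified userspace model, not the kernel code: struct inode_acct and
    undo_failed_reservation() are hypothetical names, locking is left to the
    caller, and decrementing csum_bytes in the raced case as well is an
    assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-inode accounting. */
struct inode_acct {
    uint64_t csum_bytes;
};

/*
 * Undo the bookkeeping of a failed reservation. The caller is assumed to
 * have re-taken the inode lock. `snapshot` is the csum_bytes value recorded
 * just before the lock was dropped to attempt the reservation.
 *
 * Returns true when csum_bytes changed in the meantime (case 1, we raced
 * with free), meaning the reservation side has to release the space itself.
 * Returns false when nothing changed (case 2), where only csum_bytes needs
 * to be cleaned up.
 */
bool undo_failed_reservation(struct inode_acct *ia, uint64_t snapshot,
                             uint64_t num_bytes)
{
    bool raced = ia->csum_bytes != snapshot;

    /* Assumption: our own contribution is removed in both cases. */
    ia->csum_bytes -= num_bytes;
    return raced;
}
```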

    Because of the case where we can race with free()'s since we have to drop our
    spin_lock to do the reservation, I'm going to serialize all reservations with
    the i_mutex. We already get this for free in the heavy use paths, truncate and
    file write all hold the i_mutex, just needed to add it to page_mkwrite and
    various ioctl/balance things. With this patch my space leak scripts no longer
    scream bloody murder. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

15 Dec, 2011

1 commit

  • To reproduce the bug:

    # touch /mnt/tmp
    # stat /mnt/tmp | grep Change
    Change: 2011-12-09 09:32:23.412105981 +0800
    # chattr +i /mnt/tmp
    # stat /mnt/tmp | grep Change
    Change: 2011-12-09 09:32:43.198105295 +0800
    # umount /mnt
    # mount /dev/loop1 /mnt
    # stat /mnt/tmp | grep Change
    Change: 2011-12-09 09:32:23.412105981 +0800

    We should update ctime of in-memory inode before calling
    btrfs_update_inode().

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     

01 Dec, 2011

1 commit


20 Nov, 2011

2 commits

  • For the user it is confusing to find something like:
    [10197.627710] new size for /dev/mapper/vg0-usr_share is 3221225472
    in kernel log, because it doesn't point directly to btrfs.

    This patch prefixes those messages with "btrfs:" like other btrfs
    related printks.

    Signed-off-by: Arnd Hannemann
    Signed-off-by: Chris Mason

    Arnd Hannemann
     
  • This patch casts to unsigned long before casting to a pointer and fixes
    the following warnings:
    fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
    fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
    fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
    fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
    fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
    fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
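
    A minimal sketch of the cast pattern, assuming (as in the Linux kernel)
    that unsigned long has the same width as a pointer. The helper names
    ptr_to_u64() and u64_to_ptr() are made up for illustration and are not
    taken from the patch.

```c
#include <stdint.h>

/*
 * Widen a pointer to a 64-bit integer. Going through unsigned long first
 * keeps the pointer-to-integer cast between same-sized types, so a 32-bit
 * build does not warn about a cast of different size.
 */
uint64_t ptr_to_u64(const void *p)
{
    return (uint64_t)(unsigned long)p;
}

/* The reverse direction: narrow to unsigned long, then cast to a pointer. */
void *u64_to_ptr(uint64_t v)
{
    return (void *)(unsigned long)v;
}
```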

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Chris Mason

    Jeff Mahoney
     

06 Nov, 2011

3 commits


24 Oct, 2011

2 commits


21 Oct, 2011

6 commits

  • We should return EINVAL if the start is beyond the end of the file
    system in the btrfs_ioctl_fitrim(). Fix that by adding the appropriate
    check for it.

    Also in the btrfs_trim_fs() it is possible that len+start might overflow
    if big values are passed. Fix it by decrementing the len so that start+len
    is equal to the file system size in the worst case.
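
    The two checks can be sketched as below. clamp_trim_range() is a
    hypothetical helper written for illustration; the actual patch places the
    checks in btrfs_ioctl_fitrim() and btrfs_trim_fs().

```c
#include <errno.h>
#include <stdint.h>

/*
 * Validate a trim request against the file system size.
 * Returns -EINVAL when start lies beyond the end of the file system.
 * Otherwise shrinks *len so that start + *len never exceeds fs_size,
 * which also rules out overflow of start + *len.
 */
int clamp_trim_range(uint64_t start, uint64_t *len, uint64_t fs_size)
{
    if (start > fs_size)
        return -EINVAL;

    /* Compare against the remaining space instead of computing start + len. */
    if (*len > fs_size - start)
        *len = fs_size - start;

    return 0;
}
```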

    Signed-off-by: Lukas Czerner

    Lukas Czerner
     
  • We won't defrag an extent, if it's bigger than the threshold we
    specified and there's no small extent before it, but actually
    the code doesn't work this way.

    There are three bugs:

    - When should_defrag_range() decides we should keep on defragmenting
    an extent, last_len is not incremented. (old bug)

    - The length passed to should_defrag_range() is not the length
    we're going to defrag. (new bug)

    - We always defrag 256K bytes data, and a big extent can be part of
    this range. (new bug)

    For a file with 4 extents:

    | 4K | 4K | 256K | 256K |

    The result of defrag with (the default) 256K extent thresh should be:

    | 264K | 256K |

    but with those bugs, we'll get:

    | 520K |
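
    The intended behaviour can be modelled in a few lines. This is a
    simplified userspace sketch of the rule stated above, not the kernel's
    should_defrag_range(); model_defrag() and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Model of the intended rule: an extent at or above the threshold is left
 * alone unless a run of small extents precedes it. `last_len` tracks the
 * bytes collected in the current run and grows with every extent we decide
 * to defragment.
 *
 * ext: input extent lengths, out: resulting extent lengths (room for n
 * entries). Returns the number of resulting extents.
 */
size_t model_defrag(const uint64_t *ext, size_t n, uint64_t thresh,
                    uint64_t *out)
{
    size_t nout = 0;
    uint64_t last_len = 0;

    for (size_t i = 0; i < n; i++) {
        int skip = ext[i] >= thresh &&
                   (last_len == 0 || last_len >= thresh);

        if (skip) {
            /* Close the current run, then keep the big extent untouched. */
            if (last_len)
                out[nout++] = last_len;
            out[nout++] = ext[i];
            last_len = 0;
        } else {
            last_len += ext[i];
        }
    }
    if (last_len)
        out[nout++] = last_len;
    return nout;
}
```

    With the four extents above and a 256K threshold, the model merges the
    first three into one 264K extent and leaves the last 256K extent alone.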

    Signed-off-by: Li Zefan

    Li Zefan
     
  • It's off-by-one, and thus we may skip the last page while defragmenting.

    An example case:

    # create /mnt/file with 2 4K file extents
    # btrfs fi defrag /mnt/file
    # sync
    # filefrag /mnt/file
    /mnt/file: 2 extents found

    So it's not defragmented.

    Signed-off-by: Li Zefan

    Li Zefan
     
  • Don't use inode->i_size directly, since we're not holding i_mutex.

    This also fixes another bug, that i_size can change after it's checked
    against 0 and then (i_size - 1) can be negative.
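
    The idea is to read the size once and use only that local copy. A
    userspace sketch follows; last_page_index() and the fixed 4K page shift
    are assumptions for illustration, and in the kernel the single read would
    go through i_size_read().

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumes 4K pages */

/*
 * Compute the index of the last page of a file whose size may be changed
 * concurrently. The size is read exactly once, so the zero check and the
 * subtraction are guaranteed to see the same value.
 *
 * Returns 0 and stores the index in *last_index, or -1 for an empty file.
 */
int last_page_index(const volatile uint64_t *i_size, uint64_t *last_index)
{
    uint64_t isize = *i_size; /* single snapshot of the shared value */

    if (isize == 0)
        return -1;

    *last_index = (isize - 1) >> SKETCH_PAGE_SHIFT;
    return 0;
}
```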

    Signed-off-by: Li Zefan

    Li Zefan
     
  • There's an off-by-one bug:

    # create a file with lots of 4K file extents
    # btrfs fi defrag /mnt/file
    # sync
    # filefrag -v /mnt/file
    Filesystem type is: 9123683e
    File size of /mnt/file is 1228800 (300 blocks, blocksize 4096)
    ext logical physical expected length flags
      0       0     3372              64
      1      64     3136     3435      1
      2      65     3436     3136     64
      3     129     3201     3499      1
      4     130     3500     3201     64
      5     194     3266     3563      1
      6     195     3564     3266     64
      7     259     3331     3627      1
      8     260     3628     3331     40 eof

    After this patch:

    ...
    # filefrag -v /mnt/file
    Filesystem type is: 9123683e
    File size of /mnt/file is 1228800 (300 blocks, blocksize 4096)
    ext logical physical expected length flags
      0       0     3372             300 eof
    /mnt/file: 1 extent found

    Signed-off-by: Li Zefan

    Li Zefan
     
  • kmemleak found this:
    unreferenced object 0xffff8801b64af968 (size 512):
    comm "btrfs-cleaner", pid 3317, jiffies 4306810886 (age 903.272s)
    hex dump (first 32 bytes):
    00 82 01 07 00 ea ff ff c0 83 01 07 00 ea ff ff ................
    80 82 01 07 00 ea ff ff c0 87 01 07 00 ea ff ff ................
    backtrace:
    [] kmemleak_alloc+0x5c/0xc0
    [] kmem_cache_alloc_trace+0x163/0x240
    [] btrfs_defrag_file+0xf0/0xb20
    [] btrfs_run_defrag_inodes+0x165/0x210
    [] cleaner_kthread+0x177/0x190
    [] kthread+0x8d/0xa0
    [] kernel_thread_helper+0x4/0x10
    [] 0xffffffffffffffff

    "pages" is not always freed. Fix it by removing the unnecessary additional return.

    Signed-off-by: Diego Calleja

    Diego Calleja
     

20 Oct, 2011

2 commits

  • Xfstests 79 was failing because we were inheriting the S_APPEND flag when we
    weren't supposed to. There isn't any specific documentation on this so I'm
    taking the test as the standard of how things work, and having S_APPEND set on a
    directory doesn't mean that S_APPEND gets inherited by its children according to
    this test. So only inherit btrfs specific things. This will let us set
    compress/nocompress on specific directories and everything in the directories
    will inherit this flag, same with nodatacow. With this patch test 79 passes.
    Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Johannes pointed out we were allocating only kernel pages for doing writes,
    which is kind of a big deal if you are on 32bit and have more than a gig of ram.
    So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we
    don't re-enter. Thanks,

    Reported-by: Johannes Weiner
    Signed-off-by: Josef Bacik

    Josef Bacik
     

13 Oct, 2011

1 commit


11 Oct, 2011

2 commits

  • The btrfs file defrag code will loop through the extents and
    force COW on them. But if there is a concurrent truncate in the middle of
    the defrag, it might end up defragging the same range over and over
    again.

    The problem is that writepage won't go through and do anything on pages
    past i_size, so the cow won't happen, so the file will appear to still
    be fragmented. defrag will end up hitting the same extents again and
    again.

    In the worst case, the truncate can actually live lock with the defrag
    because the defrag keeps creating new ordered extents which the truncate
    code keeps waiting on.

    The fix here is to make defrag check for i_size inside the main loop,
    instead of just once before the looping starts.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Follow those steps:

    # mount -o autodefrag /dev/sda7 /mnt
    # dd if=/dev/urandom of=/mnt/tmp bs=200K count=1
    # sync
    # dd if=/dev/urandom of=/mnt/tmp bs=8K count=1 conv=notrunc

    and then it'll go into a loop: writeback -> defrag -> writeback ...

    It's because writeback writes [8K, 200K] and then writes [0, 8K].

    I tried to make writeback know if the pages are dirtied by defrag,
    but the patch was a bit intrusive. Here I simply set writeback_index
    when we defrag a file.

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     

29 Sep, 2011

1 commit


21 Sep, 2011

2 commits


18 Sep, 2011

4 commits

  • Chris Mason
     
  • The dst file will have the same inode flags as the src file after
    file clone, and I think it's unexpected.

    For example, the dst file will suddenly become immutable after
    getting some share of data with src file, if the src is immutable.

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     
  • To reproduce the bug:

    # mount /dev/sda7 /mnt
    # dd if=/dev/zero of=/mnt/src bs=4K count=1
    # umount /mnt

    # mount -o nodatasum /dev/sda7 /mnt
    # dd if=/dev/zero of=/mnt/dst bs=4K count=1
    # clone_range -s 4K -l 4K /mnt/src /mnt/dst

    # echo 3 > /proc/sys/vm/drop_caches
    # cat /mnt/dst
    # dmesg
    ...
    btrfs no csum found for inode 258 start 0
    btrfs csum failed ino 258 off 0 csum 2566472073 private 0

    It's because part of the file is checksummed and the other part is not,
    and then btrfs will complain checksum is not found when we read the file.

    Disallow file clone if src and dst file have different checksum flag,
    so we ensure a file is completely checksummed or unchecksummed.
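
    The check amounts to comparing one flag bit on both inodes. A sketch
    under stated assumptions: SKETCH_INODE_NODATASUM stands in for the btrfs
    inode flag, check_clone_csum_flags() is a hypothetical helper, and
    returning -EINVAL is an assumption about the error code.

```c
#include <errno.h>
#include <stdint.h>

/* Stand-in for the inode flag that marks a file as not checksummed. */
#define SKETCH_INODE_NODATASUM (1u << 0)

/*
 * Refuse a clone when exactly one of the two files is checksummed, so a
 * file never ends up partly checksummed and partly not.
 * Returns 0 when the flags agree, -EINVAL (assumed) otherwise.
 */
int check_clone_csum_flags(uint64_t src_flags, uint64_t dst_flags)
{
    if ((src_flags & SKETCH_INODE_NODATASUM) !=
        (dst_flags & SKETCH_INODE_NODATASUM))
        return -EINVAL;
    return 0;
}
```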

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     
  • It's a bug in commit f81c9cdc567cd3160ff9e64868d9a1a7ee226480
    (Btrfs: truncate pages from clone ioctl target range)

    We should pass the dest range to the truncate function, but not the
    src range.

    Also move the function before locking extent state.

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     

13 Sep, 2011

1 commit

  • * 'for-linus' of git://github.com/chrismason/linux:
    Btrfs: add dummy extent if dst offset excceeds file end in
    Btrfs: calc file extent num_bytes correctly in file clone
    btrfs: xattr: fix attribute removal
    Btrfs: fix wrong nbytes information of the inode
    Btrfs: fix the file extent gap when doing direct IO
    Btrfs: fix unclosed transaction handle in btrfs_cont_expand
    Btrfs: fix misuse of trans block rsv
    Btrfs: reset to appropriate block rsv after orphan operations
    Btrfs: skip locking if searching the commit root in csum lookup
    btrfs: fix warning in iput for bad-inode
    Btrfs: fix an oops when deleting snapshots

    Linus Torvalds
     

11 Sep, 2011

2 commits


18 Aug, 2011

1 commit


17 Aug, 2011

1 commit

  • We need to truncate page cache pages for the clone ioctl target range or
    else we'll confuse ourselves to no end. If the old data was cached, we
    used to still see it (until remount). If the page was partially updated
    we used to get a mix of old and new data.

    Signed-off-by: Sage Weil
    Signed-off-by: Chris Mason

    Sage Weil
     

03 Aug, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (31 commits)
    Btrfs: don't call writepages from within write_full_page
    Btrfs: Remove unused variable 'last_index' in file.c
    Btrfs: clean up for find_first_extent_bit()
    Btrfs: clean up for wait_extent_bit()
    Btrfs: clean up for insert_state()
    Btrfs: remove unused members from struct extent_state
    Btrfs: clean up code for merging extent maps
    Btrfs: clean up code for extent_map lookup
    Btrfs: clean up search_extent_mapping()
    Btrfs: remove redundant code for dir item lookup
    Btrfs: make acl functions really no-op if acl is not enabled
    Btrfs: remove remaining ref-cache code
    Btrfs: remove a BUG_ON() in btrfs_commit_transaction()
    Btrfs: use wait_event()
    Btrfs: check the nodatasum flag when writing compressed files
    Btrfs: copy string correctly in INO_LOOKUP ioctl
    Btrfs: don't print the leaf if we had an error
    btrfs: make btrfs_set_root_node void
    Btrfs: fix oops while writing data to SSD partitions
    Btrfs: Protect the readonly flag of block group
    ...

    Fix up trivial conflicts (due to acl and writeback cleanups) in
    - fs/btrfs/acl.c
    - fs/btrfs/ctree.h
    - fs/btrfs/extent_io.c

    Linus Torvalds
     

02 Aug, 2011

1 commit


28 Jul, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors
    Btrfs: use the commit_root for reading free_space_inode crcs
    Btrfs: reduce extent_state lock contention for metadata
    Btrfs: remove lockdep magic from btrfs_next_leaf
    Btrfs: make a lockdep class for each root
    Btrfs: switch the btrfs tree locks to reader/writer
    Btrfs: fix deadlock when throttling transactions
    Btrfs: stop using highmem for extent_buffers
    Btrfs: fix BUG_ON() caused by ENOSPC when relocating space
    Btrfs: tag pages for writeback in sync
    Btrfs: fix enospc problems with delalloc
    Btrfs: don't flush delalloc arbitrarily
    Btrfs: use find_or_create_page instead of grab_cache_page
    Btrfs: use a worker thread to do caching
    Btrfs: fix how we merge extent states and deal with cached states
    Btrfs: use the normal checksumming infrastructure for free space cache
    Btrfs: serialize flushers in reserve_metadata_bytes
    Btrfs: do transaction space reservation before joining the transaction
    Btrfs: try to only do one btrfs_search_slot in do_setxattr

    Linus Torvalds
     
  • So I had this brilliant idea to use atomic counters for outstanding and reserved
    extents, but this turned out to be a bad idea. Consider this where we have 1
    outstanding extent and 1 reserved extent

    Reserver                                   Releaser
                                               atomic_dec(outstanding) now 0
    atomic_read(outstanding)+1 get 1
    atomic_read(reserved) get 1
    don't actually reserve anything because
    they are the same
                                               atomic_cmpxchg(reserved, 1, 0)
    atomic_inc(outstanding)
    atomic_add(0, reserved)
                                               free reserved space for 1 extent

    Then the reserver now has no actual space reserved for it, and when it goes to
    finish the ordered IO it won't have enough space to do its allocation and you
    get those lovely warnings.
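
    The interleaving can be replayed deterministically in userspace. The
    sketch below runs the steps in one thread with C11 atomics, in the order
    listed above; which side performs each step is my reading of the
    sequence, and replay_race() and struct extent_counts are hypothetical
    names.

```c
#include <stdatomic.h>

struct extent_counts {
    int outstanding;   /* extents the inode still has in flight */
    int reserved;      /* extents that actually have space reserved */
    int extents_freed; /* extents whose reservation the releaser gave back */
};

/* Replay the problematic ordering step by step and report the end state. */
struct extent_counts replay_race(void)
{
    atomic_int outstanding = 1;
    atomic_int reserved = 1;

    /* Releaser: drops its outstanding extent, counter is now 0. */
    atomic_fetch_sub(&outstanding, 1);

    /* Reserver: wants outstanding + 1 = 1, sees reserved = 1, reserves 0. */
    int want = atomic_load(&outstanding) + 1;
    int have = atomic_load(&reserved);
    int to_reserve = want > have ? want - have : 0;

    /* Releaser: swaps reserved from 1 to 0. */
    int expected = 1;
    int dropped = atomic_compare_exchange_strong(&reserved, &expected, 0);

    /* Reserver: records its extent, adds the 0 it reserved. */
    atomic_fetch_add(&outstanding, 1);
    atomic_fetch_add(&reserved, to_reserve);

    /* Releaser: frees the reserved space for 1 extent. */
    int freed = dropped ? 1 : 0;

    struct extent_counts result = {
        atomic_load(&outstanding), atomic_load(&reserved), freed
    };
    return result;
}
```

    The replay ends with one outstanding extent and no reserved extent,
    which is the state described in the paragraph above.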

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik