13 Aug, 2012

1 commit


09 Aug, 2012

1 commit

  • We got a recursive lock in mksubvol because the caller already held
    a lock. I think we got into this due to a merge error. Commit a874a63
    removed the mnt_want_write call from btrfs_mksubvol and added a
    replacement call to mnt_want_write_file in btrfs_ioctl_snap_create_transid.
    Commit e7848683 however tried to move all calls to mnt_want_write above
    i_mutex. So somewhere while merging this, it got mixed up. The
    solution is to remove the mnt_want_write call completely from
    mksubvol.

    Reported-by: David Sterba
    Signed-off-by: Alexander Block
    Signed-off-by: Chris Mason

    Alexander Block
     

04 Aug, 2012

2 commits


02 Aug, 2012

1 commit

  • Pull second vfs pile from Al Viro:
    "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
    deadlock reproduced by xfstests 068), symlink and hardlink restriction
    patches, plus assorted cleanups and fixes.

    Note that another fsfreeze deadlock (emergency thaw one) is *not*
    dealt with - the series by Fernando conflicts a lot with Jan's, breaks
    userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
    for massive vfsmount leak; this is going to be handled next cycle.
    There probably will be another pull request, but that stuff won't be
    in it."

    Fix up trivial conflicts due to unrelated changes next to each other in
    drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
    delousing target_core_file a bit
    Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
    fs: Remove old freezing mechanism
    ext2: Implement freezing
    btrfs: Convert to new freezing mechanism
    nilfs2: Convert to new freezing mechanism
    ntfs: Convert to new freezing mechanism
    fuse: Convert to new freezing mechanism
    gfs2: Convert to new freezing mechanism
    ocfs2: Convert to new freezing mechanism
    xfs: Convert to new freezing code
    ext4: Convert to new freezing mechanism
    fs: Protect write paths by sb_start_write - sb_end_write
    fs: Skip atime update on frozen filesystem
    fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
    fs: Improve filesystem freezing handling
    switch the protection of percpu_counter list to spinlock
    nfsd: Push mnt_want_write() outside of i_mutex
    btrfs: Push mnt_want_write() outside of i_mutex
    fat: Push mnt_want_write() outside of i_mutex
    ...

    Linus Torvalds
     

31 Jul, 2012

3 commits

  • We convert btrfs_file_aio_write() to use new freeze check. We also add proper
    freeze protection to btrfs_page_mkwrite(). We also add freeze protection to
    the transaction mechanism to avoid starting transactions on frozen filesystem.
    At minimum this is necessary to stop iput() of an unlinked file from changing
    a frozen filesystem during truncation.

    Checks in cleaner_kthread() and transaction_kthread() can be safely removed
    since btrfs_freeze() will lock the mutexes and thus block the threads (and they
    shouldn't have anything to do anyway).

    CC: linux-btrfs@vger.kernel.org
    CC: Chris Mason
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • Use the generic printk_get_level() to search a message for a kern_level.

    Add __printf to verify format and arguments. Fix a few messages that
    had mismatches in format and arguments. Add #ifdef CONFIG_PRINTK blocks
    to shrink the object size a bit when not using printk.

    [akpm@linux-foundation.org: whitespace tweak]
    Signed-off-by: Joe Perches
    Cc: Kay Sievers
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • When mnt_want_write() starts to handle freezing, it will get full lock
    semantics, requiring proper lock ordering. So push the mnt_want_write() call
    consistently outside of i_mutex.

    CC: Chris Mason
    CC: linux-btrfs@vger.kernel.org
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

27 Jul, 2012

2 commits

  • On powerpc, we don't get the implicit vmalloc.h include, and as a result
    the build fails noisily:

    fs/btrfs/send.c: In function 'fs_path_free':
    fs/btrfs/send.c:185:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
    fs/btrfs/send.c: In function 'fs_path_ensure_buf':
    fs/btrfs/send.c:215:4: error: implicit declaration of function 'vmalloc' [-Werror=implicit-function-declaration]
    fs/btrfs/send.c:215:12: warning: assignment makes pointer from integer without a cast [enabled by default]
    fs/btrfs/send.c:225:12: warning: assignment makes pointer from integer without a cast [enabled by default]
    fs/btrfs/send.c:233:13: warning: assignment makes pointer from integer without a cast [enabled by default]
    fs/btrfs/send.c: In function 'iterate_dir_item':
    fs/btrfs/send.c:900:10: warning: assignment makes pointer from integer without a cast [enabled by default]
    fs/btrfs/send.c:909:11: warning: assignment makes pointer from integer without a cast [enabled by default]
    fs/btrfs/send.c: In function 'btrfs_ioctl_send':
    fs/btrfs/send.c:4463:17: warning: assignment makes pointer from integer without a cast [enabled by default]
    fs/btrfs/send.c:4469:17: warning: assignment makes pointer from integer without a cast [enabled by default]
    fs/btrfs/send.c:4475:2: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
    fs/btrfs/send.c:4475:20: warning: assignment makes pointer from integer without a cast [enabled by default]
    fs/btrfs/send.c:4483:21: warning: assignment makes pointer from integer without a cast [enabled by default]

    Signed-off-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • Pull large btrfs update from Chris Mason:
    "This pull request is very large, and the two main features in here
    have been under testing/devel for quite a while.

    We have subvolume quotas from the strato developers. This enables
    full tracking of how many blocks are allocated to each subvolume (and
    all snapshots) and you can set limits on a per-subvolume basis. You
    can also create quota groups and toss multiple subvolumes into a big
    group. It's everything you need to be a web hosting company and give
    each user their own subvolume.

    The userland side of the quotas is being refreshed, they'll send out
    details on where to grab it soon.

    Next is the kernel side of btrfs send/receive from Alexander Block.
    This leverages the same infrastructure as the quota code to figure out
    relationships between blocks and their owners. It can then compute
    the difference between two snapshots and sends the diffs in a neutral
    format into userland.

    The basic model:

    create a snapshot
    send that snapshot as the initial backup
    make changes
    create a second snapshot
    send the incremental as a backup
    delete the first snapshot
    (use the second snapshot for the next incremental)

    The receive portion is all in userland, and in the 'next' branch of my
    btrfs-progs repo.

    There's still some work to do in terms of optimizing the send side
    from kernel to userland. The really important part is figuring out
    how two snapshots are different, and this is where we are
    concentrating right now. The initial send of a dataset is a little
    slower than tar, but the incremental sends are dramatically faster
    than what rsync can do.

    On top of all of that, we have a nice queue of fixes, cleanups and
    optimizations."

    Fix up trivial modify/del conflict in fs/btrfs/ioctl.c

    Also fix up semantic conflict in fs/btrfs/send.c: the interface to
    dentry_open() changed in commit 765927b2d508 ("switch dentry_open() to
    struct path, make it grab references itself"), and since it now grabs
    whatever references it needs, we should no longer do the mntget() on the
    mnt (and we need to dput() the dentry reference we took).

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (65 commits)
    Btrfs: uninit variable fixes in send/receive
    Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive
    Btrfs: add btrfs_compare_trees function
    Btrfs: introduce subvol uuids and times
    Btrfs: make iref_to_path non static
    Btrfs: add a barrier before a waitqueue_active check
    Btrfs: call the ordered free operation without any locks held
    Btrfs: Check INCOMPAT flags on remount and add helper function
    Btrfs: add helper for tree enumeration
    btrfs: allow cross-subvolume file clone
    Btrfs: improve multi-thread buffer read
    Btrfs: make btrfs's allocation smoothly with preallocation
    Btrfs: lock the transition from dirty to writeback for an eb
    Btrfs: fix potential race in extent buffer freeing
    Btrfs: don't return true in releasepage unless we actually freed the eb
    Btrfs: suppress printk() if all device I/O stats are zero
    Btrfs: remove unwanted printk() for btrfs device I/O stats
    Btrfs: rewrite BTRFS_SETGET_FUNCS
    Btrfs: zero unused bytes in inode item
    Btrfs: kill free_space pointer from inode structure
    ...

    Conflicts:
    fs/btrfs/ioctl.c

    Linus Torvalds
     

26 Jul, 2012

10 commits


25 Jul, 2012

3 commits

  • Often no exact match is wanted but just the next lower or
    higher item. There's a lot of duplicated code throughout
    btrfs to deal with the corner cases. This patch adds a
    helper function that can facilitate searching.

    Signed-off-by: Arne Jansen

    Arne Jansen
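
The "next lower or higher item" lookup described above can be sketched over a plain sorted array instead of a btree. This is only an illustration under that assumption: `find_near` and its `find_higher` flag are hypothetical names chosen here and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical illustration: search a sorted array for `target`.
 * Returns the index of an exact match if present; otherwise the index of
 * the next higher key (find_higher != 0) or the next lower key
 * (find_higher == 0). Returns -1 when no such neighbour exists.
 */
long find_near(const unsigned long long *keys, size_t n,
               unsigned long long target, int find_higher)
{
    size_t lo = 0, hi = n;

    assert(keys != NULL || n == 0);

    /* Lower-bound binary search: first index with keys[idx] >= target. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (keys[mid] < target)
            lo = mid + 1;
        else
            hi = mid;
    }

    if (lo < n && keys[lo] == target)
        return (long)lo;                 /* exact match */
    if (find_higher)
        return lo < n ? (long)lo : -1;   /* next higher key, if any */
    return lo > 0 ? (long)lo - 1 : -1;   /* next lower key, if any */
}
```

Keeping the corner cases (empty input, target below the first key, target above the last key) in one helper is what lets callers stop duplicating them.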
     
  • Lift the EXDEV condition and allow different root trees for files being
    cloned, then pass source inode's root when searching for extents.
    Cloning is not allowed to cross vfsmounts, ie. when two subvolumes from
    one filesystem are mounted separately.

    Signed-off-by: David Sterba

    David Sterba
     
  • Pull trivial tree from Jiri Kosina:
    "Trivial updates all over the place as usual."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits)
    Fix typo in include/linux/clk.h .
    pci: hotplug: Fix typo in pci
    iommu: Fix typo in iommu
    video: Fix typo in drivers/video
    Documentation: Add newline at end-of-file to files lacking one
    arm,unicore32: Remove obsolete "select MISC_DEVICES"
    module.c: spelling s/postition/position/g
    cpufreq: Fix typo in cpufreq driver
    trivial: typo in comment in mksysmap
    mach-omap2: Fix typo in debug message and comment
    scsi: aha152x: Fix sparse warning and make printing pointer address more portable.
    Change email address for Steve Glendinning
    Btrfs: fix typo in convert_extent_bit
    via: Remove bogus if check
    netprio_cgroup.c: fix comment typo
    backlight: fix memory leak on obscure error path
    Documentation: asus-laptop.txt references an obsolete Kconfig item
    Documentation: ManagementStyle: fixed typo
    mm/vmscan: cleanup comment error in balance_pgdat
    mm: cleanup on the comments of zone_reclaim_stat
    ...

    Linus Torvalds
     

24 Jul, 2012

17 commits

  • While testing with my buffered-read fio jobs[1], I found that btrfs does not
    perform well enough.

    Here is a scenario in fio jobs:

    We have 4 threads, "t1 t2 t3 t4", that start buffered reads of the same file.
    They all race on add_to_page_cache_lru(), and whichever thread
    successfully puts its page into the page cache takes responsibility
    for reading that page's data.

    What's more, reading a page takes time to finish, during which
    other threads can slide in and process the remaining pages:

      t1          t2          t3          t4
      add Page1
      read Page1  add Page2
      |           read Page2  add Page3
      |           |           read Page3  add Page4
      |           |           |           read Page4
    --|-----------|-----------|-----------|--------
      v           v           v           v
      bio         bio         bio         bio

    Now we have four bios, each of which holds only one page since we need to
    maintain consecutive pages in bio. Thus, we can end up with far more bios
    than we need.

    Here we're going to
    a) delay the real read-page section and
    b) try to put more pages into page cache.

    With that said, we can make each bio hold more pages and reduce the number
    of bios we need.

    Here are some numbers taken from fio results:
                   w/o patch          w patch
    -------------  ---------          -------
    READ:          745MB/s    +25%    934MB/s

    [1]:
    [global]
    group_reporting
    thread
    numjobs=4
    bs=32k
    rw=read
    ioengine=sync
    directory=/mnt/btrfs/

    [READ]
    filename=foobar
    size=2000M
    invalidate=1

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo
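
The effect of batching can be modelled with a small sketch. It assumes, as the message states, that a bio may only carry consecutive pages; `count_bios` is a hypothetical helper for illustration and not code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: a "bio" may only carry pages with consecutive
 * indexes. Given a batch of page indexes in submission order, return how
 * many bios are needed to read the whole batch.
 */
size_t count_bios(const unsigned long *pages, size_t n)
{
    size_t bios = 0;

    assert(pages != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* Start a new bio at the first page or whenever the run breaks. */
        if (i == 0 || pages[i] != pages[i - 1] + 1)
            bios++;
    }
    return bios;
}
```

In this model, four threads that each submit a one-page batch need four bios in total, while a single batch holding pages 1 to 4 needs only one.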
     
  • For backref walking, we've introduced the delayed ref's sequence. However,
    it changes our preallocation behavior.

    The story is that when we preallocate an extent and then mark it written
    piece by piece, the ideal case should be that we don't need to COW the
    extent, which is why we use 'preallocate'.

    But we may not make use of preallocation, since when we check for cross refs on
    the extent, we may have two ref entries which have the same content except
    the sequence value, and we recognize them as cross refs and do COW to allocate
    another extent.

    So we end up with several pieces of space instead of a whole extent.

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo
     
  • There is a small window where an eb can have no IO bits set on it, which
    could potentially result in extent_buffer_under_io() returning false when we
    want it to return true, which could result in not fun things happening. So
    in order to protect this case we need to hold the refs_lock when we make
    this transition to make sure we get reliable results out of
    extent_buffer_under_io(). Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • This sounds sort of impossible but it is the only thing I can think of and
    at the very least it is theoretically possible so here it goes.

    If we are in try_release_extent_buffer we will check that the ref count on
    the extent buffer is 1 and not under IO, and then go down and clear the tree
    ref. If between this check and clearing the tree ref somebody else comes in
    and grabs a ref on the eb and then marks it dirty before
    try_release_extent_buffer() does its tree ref clear, we can end up with a
    dirty eb that will be freed while it is still dirty, which will result in a
    panic. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • I noticed while looking at an extent_buffer race that we will
    unconditionally return 1 if we get down to release_extent_buffer after
    clearing the tree ref. However we can easily race in here and get a ref on
    the eb and not actually free the eb. So make release_extent_buffer return 1
    if it freed the eb and 0 if not so we can be a little kinder to the vm.
    Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
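
The return-value convention described above (1 only when the buffer was really freed) can be sketched with a plain reference count. This is a simplified assumption-laden illustration: `struct demo_buf` and `demo_release` are hypothetical, and the real extent buffer code involves additional state such as the tree ref and refs_lock that is not modelled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted buffer. */
struct demo_buf {
    atomic_int refs;
};

/*
 * Drop one reference. Returns 1 if this call dropped the last reference
 * and freed the buffer, 0 if someone else still holds a reference.
 */
int demo_release(struct demo_buf *b)
{
    assert(b != NULL);

    /* atomic_fetch_sub returns the value held before the decrement. */
    if (atomic_fetch_sub(&b->refs, 1) == 1) {
        free(b);
        return 1;
    }
    return 0;
}
```

A caller that reports success to the memory manager only when `demo_release` returns 1 never claims to have freed memory that a racing reference holder kept alive.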
     
  • Code is added to suppress the I/O stats printing at mount time if all
    statistic values are zero.

    Signed-off-by: Stefan Behrens

    Stefan Behrens
     
  • People complained about the annoying kernel log message
    "btrfs: no dev_stats entry found ... (OK on first mount after mkfs)"
    every time a filesystem is mounted for the first time after running
    mkfs. Since the distribution of the btrfs-progs is not synchronized
    to the kernel version, mkfs like it is now will be used also in the
    future. Then this message is not useful to find errors, it is just
    annoying. This commit removes the printk().

    Signed-off-by: Stefan Behrens

    Stefan Behrens
     
  • BTRFS_SETGET_FUNCS macro is used to generate btrfs_set_foo() and
    btrfs_foo() functions, which read and write specific fields in the
    extent buffer.

    The total number of set/get functions is ~200, but in fact we only
    need 8 functions: 2 for u8 field, 2 for u16, 2 for u32 and 2 for u64.

    It results in a reduction of ~37K bytes.

       text    data     bss     dec     hex filename
     629661   12489     216  642366   9cd3e fs/btrfs/btrfs.o.orig
     592637   12489     216  605342   93c9e fs/btrfs/btrfs.o

    Signed-off-by: Li Zefan

    Li Zefan
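
The size saving comes from moving the real work into one shared function per field width and leaving the macro to emit thin wrappers. A minimal sketch of that pattern follows; every `demo_` name and `struct demo_item` are hypothetical, only the 32-bit width is shown, and the endianness conversion an on-disk format would need is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical structure used only for this illustration. */
struct demo_item {
    uint32_t flags;
    uint32_t size;
};

/* One shared reader and writer per field width: the real work lives here. */
uint32_t demo_get_32(const void *buf, size_t off)
{
    uint32_t v;

    assert(buf != NULL);
    memcpy(&v, (const char *)buf + off, sizeof(v));
    return v;
}

void demo_set_32(void *buf, size_t off, uint32_t v)
{
    assert(buf != NULL);
    memcpy((char *)buf + off, &v, sizeof(v));
}

/* The macro only emits thin wrappers that pass the field offset along. */
#define DEMO_SETGET_32(name, type, member)                      \
static inline uint32_t demo_##name(const void *buf)             \
{                                                               \
    return demo_get_32(buf, offsetof(type, member));            \
}                                                               \
static inline void demo_set_##name(void *buf, uint32_t v)       \
{                                                               \
    demo_set_32(buf, offsetof(type, member), v);                \
}

DEMO_SETGET_32(item_flags, struct demo_item, flags)
DEMO_SETGET_32(item_size, struct demo_item, size)
```

Each additional field then costs two one-line wrappers rather than two full copies of the access logic.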
     
  • The otime field is not zeroed, so users will see random otime in an old
    filesystem with a new kernel which has otime support in the future.

    The reserved bytes are also not zeroed, and we'll have compatibility
    issue if we make use of those bytes.

    Signed-off-by: Li Zefan

    Li Zefan
     
  • Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type,
    which means every inode has the same BTRFS_I(inode)->free_space pointer.

    This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits).

    Signed-off-by: Li Zefan

    Li Zefan
     
  • Changing printk_in_rcu to printk_ratelimited_in_rcu will suffice

    Signed-off-by: Josef Bacik

    Anand Jain
     
  • When calling btrfs_next_old_leaf, we were leaking an extent buffer in the
    rare case of using the deadlock avoidance code needed for the tree mod log.

    Signed-off-by: Jan Schmidt
    Signed-off-by: Josef Bacik

    Jan Schmidt
     
  • If a block group is ro, do not count its entries in when we dump space info.

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo
     
  • Block group has ro attributes, make dump_space_info show it.

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo
     
  • Here is the whole story:
    1)
    A free space cache consists of two parts:
    o free space cache inode, which is special because it's stored in root tree.
    o free space info, which is stored as the above inode's file data.

    But we only build up another new inode and do not flush its free space info
    onto disk when we _clear and setup_ free space cache, so the block group
    cache's cache_state remains DC_SETUP instead of DC_WRITTEN.

    And holding DC_SETUP means that we will not truncate this free space cache inode,
    which means the disk offset of its file extent will remain _unchanged_ at least
    until next transaction finishes committing itself.

    2)
    We can set a block group readonly when we relocate the block group.

    However,
    if the readonly block group covers the disk offset where our free space cache
    inode is going to write, it will force the free space cache inode into
    cow_file_range() and it'll end up hitting a BUG_ON.

    3)
    Due to the above analysis, we fix this bug by adding the missing dirty flag.

    4)
    However, it's not over, there is still another case, nospace_cache.

    With nospace_cache, we do not want to set dirty flag, instead we just truncate
    free space cache inode and bail out with setting cache state DC_WRITTEN.

    We can benefit from it since it saves us another 'pre-allocation' part which
    usually costs a lot.

    Signed-off-by: Liu Bo
    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Liu Bo
     
  • During disk balance, we prealloc new file extent for file data relocation,
    but we may fail in 'no available space' case, and it leads to flipping btrfs
    into readonly.

    It is not necessary to bail out and abort transaction since we do have several
    ways to rescue ourselves from ENOSPC case.

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo
     
  • Since root can be fetched via BTRFS_I macro directly, we can drop an argument
    from btrfs_is_free_space_inode().

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo