03 Jul, 2013

1 commit

  • For those file systems (btrfs/ext4/ocfs2/tmpfs) that support
    SEEK_DATA/SEEK_HOLE, we end up duplicating the logic of
    lseek_execute() to update the current file offset to the
    desired offset if it is valid. ceph does a similar thing in
    ceph_llseek().

    To reduce the duplication, this patch makes lseek_execute()
    publicly accessible so that we can call it directly from the
    underlying file systems.

    Thanks to Dave Chinner for this suggestion.

    [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]

    v2->v1:
    - Add kernel-doc comments for lseek_execute()
    - Call lseek_execute() in ceph->llseek()

    Signed-off-by: Jie Liu
    Cc: Dave Chinner
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Chris Mason
    Cc: Josef Bacik
    Cc: Ben Myers
    Cc: Ted Tso
    Cc: Hugh Dickins
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Sage Weil
    Signed-off-by: Al Viro

    Jie Liu
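
    The check-then-update step that this commit exports can be modelled in
    userspace roughly as follows. This is only a sketch: struct demo_file and
    demo_setpos() are hypothetical names, the version field stands in for
    whatever per-file state must be reset on a seek, and the special case for
    files that allow "negative" offsets is left out.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of an open file that a seek touches. */
struct demo_file {
    int64_t pos;       /* current file offset */
    uint64_t version;  /* reset whenever the offset actually changes */
};

/*
 * Validate `offset` against [0, maxsize] and, if it is valid, make it the
 * current position. Returns the new offset, or -EINVAL if it is out of range.
 */
int64_t demo_setpos(struct demo_file *file, int64_t offset, int64_t maxsize)
{
    if (offset < 0 || offset > maxsize)
        return -EINVAL;

    if (offset != file->pos) {
        file->pos = offset;
        file->version = 0;
    }
    return offset;
}
```

    A file system's SEEK_DATA/SEEK_HOLE handler would compute the target
    offset itself and then hand it to a helper like this instead of repeating
    the range check.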
     

10 May, 2013

1 commit

  • Pull btrfs update from Chris Mason:
    "These are mostly fixes. The biggest exceptions are Josef's skinny
    extents and Jan Schmidt's code to rebuild our quota indexes if they
    get out of sync (or you enable quotas on an existing filesystem).

    The skinny extents are off by default because they are a new variation
    on the extent allocation tree format. btrfstune -x enables them, and
    the new format makes the extent allocation tree about 30% smaller.

    I rebased this a few days ago to rework Dave Sterba's crc checks on
    the super block, but almost all of these go back to rc6, since I
    thought 3.9 was due any minute.

    The biggest missing fix is the tracepoint bug that was hit late in
    3.9. I ran into problems with that in overnight testing and I'm still
    tracking it down. I'll definitely have that fixed for rc2."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (101 commits)
    Btrfs: allow superblock mismatch from older mkfs
    btrfs: enhance superblock checks
    btrfs: fix misleading variable name for flags
    btrfs: use unsigned long type for extent state bits
    Btrfs: improve the loop of scrub_stripe
    btrfs: read entire device info under lock
    btrfs: remove unused gfp mask parameter from release_extent_buffer callchain
    btrfs: handle errors returned from get_tree_block_key
    btrfs: make static code static & remove dead code
    Btrfs: deal with errors in write_dev_supers
    Btrfs: remove almost all of the BUG()'s from tree-log.c
    Btrfs: deal with free space cache errors while replaying log
    Btrfs: automatic rescan after "quota enable" command
    Btrfs: rescan for qgroups
    Btrfs: split btrfs_qgroup_account_ref into four functions
    Btrfs: allocate new chunks if the space is not enough for global rsv
    Btrfs: separate sequence numbers for delayed ref tracking and tree mod log
    btrfs: move leak debug code to functions
    Btrfs: return free space in cow error path
    Btrfs: set UUID in root_item for created trees
    ...

    Linus Torvalds
     

08 May, 2013

1 commit

  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     

07 May, 2013

4 commits

  • Big patch, but all it does is add statics to functions which
    are in fact static, then remove the associated dead-code fallout.

    removed functions:

    btrfs_iref_to_path()
    __btrfs_lookup_delayed_deletion_item()
    __btrfs_search_delayed_insertion_item()
    __btrfs_search_delayed_deletion_item()
    find_eb_for_page()
    btrfs_find_block_group()
    range_straddles_pages()
    extent_range_uptodate()
    btrfs_file_extent_length()
    btrfs_scrub_cancel_devid()
    btrfs_start_transaction_lflush()

    btrfs_print_tree() is left because it is used for debugging.
    btrfs_start_transaction_lflush() and btrfs_reada_detach() are
    left for symmetry.

    ulist.c functions are left, another patch will take care of those.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Josef Bacik

    Eric Sandeen
     
  • If argument 'trans' is unnecessary in the function where
    fixup_low_keys() is called, 'trans' is deleted.

    Signed-off-by: Tsutomu Itoh
    Signed-off-by: Josef Bacik

    Tsutomu Itoh
     
  • A user sent me a btrfs-image of a file system that was panicking on mount during
    the log recovery. I had originally thought these problems were from a bug in
    the free space cache code, but that was just a symptom of the problem. The
    problem is if your application does something like this

    [prealloc][prealloc][prealloc]

    the internal extent maps will merge those all together into one extent map, even
    though on disk they are 3 separate extents. So if you go to write into one of
    these ranges the extent map will be right since we use the physical extent when
    doing the write, but when we log the extents they will use the wrong sizes for
    the remainder prealloc space. If this doesn't happen to trip up the free space
    cache (which it won't in a lot of cases) then you will get bogus entries in your
    extent tree which will screw stuff up later. The data and such will still work,
    but everything else is broken. This patch fixes this by not allowing extents
    that are on the modified list to be merged. This has the side effect that we
    are no longer adding everything to the modified list all the time, which means
    we now have to call btrfs_drop_extents every time we log an extent into the
    tree. So this allows me to drop all this speciality code I was using to get
    around calling btrfs_drop_extents. With this patch the testcase I've created no
    longer creates a bogus file system after replaying the log. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
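
    The merge rule described in this commit can be sketched as below. This is
    a simplified model, not the btrfs code: struct demo_em and
    demo_mergeable() are hypothetical, and the real merge check looks at more
    than adjacency.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, stripped-down in-memory extent map. */
struct demo_em {
    uint64_t start;         /* file offset where the mapping begins */
    uint64_t len;           /* length of the mapping in bytes */
    bool on_modified_list;  /* still waiting to be logged */
};

/*
 * Two neighbouring maps may be merged only if they are contiguous and
 * neither of them is queued for logging. Keeping queued maps separate
 * preserves the on-disk extent boundaries until they have been logged.
 */
bool demo_mergeable(const struct demo_em *prev, const struct demo_em *next)
{
    if (prev->on_modified_list || next->on_modified_list)
        return false;
    return prev->start + prev->len == next->start;
}
```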
     
  • When logging changed extents I was logging ram_bytes as the current length,
    which isn't correct, it's supposed to be the ram bytes of the original extent.
    This is for compression where even if we split the extent we need to know the
    ram bytes so when we uncompress the extent we know how big it will be. This was
    still working out right with compression for some reason but I think we were
    getting lucky. It was definitely off for prealloc which is why I noticed it,
    btrfsck was complaining about it. With this patch btrfsck no longer complains
    after a log replay. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
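
    As an illustration of the point about ram_bytes, the sketch below splits
    an extent and carries the original ram_bytes over to the new piece. The
    struct and function names are hypothetical and the model ignores
    everything else a real file extent item records.

```c
#include <stdint.h>

/* Hypothetical, simplified view of a logged file extent. */
struct demo_extent {
    uint64_t offset;     /* file offset of this piece */
    uint64_t len;        /* length of this piece */
    uint64_t ram_bytes;  /* uncompressed size of the ORIGINAL extent */
};

/* Return the part of `orig` that starts `split_at` bytes into it. */
struct demo_extent demo_split_tail(const struct demo_extent *orig,
                                   uint64_t split_at)
{
    struct demo_extent tail = {
        .offset = orig->offset + split_at,
        .len = orig->len - split_at,
        /* Keep the original size, not tail.len: decompression needs it. */
        .ram_bytes = orig->ram_bytes,
    };
    return tail;
}
```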
     

02 May, 2013

1 commit

  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     

10 Apr, 2013

1 commit


30 Mar, 2013

1 commit

  • Pull btrfs fixes from Chris Mason:
    "We've had a busy two weeks of bug fixing. The biggest patches in here
    are some long standing early-enospc problems (Josef) and a very old
    race where compression and mmap combine forces to lose writes (me).
    I'm fairly sure the mmap bug goes all the way back to the introduction
    of the compression code, which is proof that fsx doesn't trigger every
    possible mmap corner after all.

    I'm sure you'll notice one of these is from this morning, it's a small
    and isolated use-after-free fix in our scrub error reporting. I
    double checked it here."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: don't drop path when printing out tree errors in scrub
    Btrfs: fix wrong return value of btrfs_lookup_csum()
    Btrfs: fix wrong reservation of csums
    Btrfs: fix double free in the btrfs_qgroup_account_ref()
    Btrfs: limit the global reserve to 512mb
    Btrfs: hold the ordered operations mutex when waiting on ordered extents
    Btrfs: fix space accounting for unlink and rename
    Btrfs: fix space leak when we fail to reserve metadata space
    Btrfs: fix EIO from btrfs send in is_extent_unchanged for punched holes
    Btrfs: fix race between mmap writes and compression
    Btrfs: fix memory leak in btrfs_create_tree()
    Btrfs: fix locking on ROOT_REPLACE operations in tree mod log
    Btrfs: fix missing qgroup reservation before fallocating
    Btrfs: handle a bogus chunk tree nicely
    Btrfs: update to use fs_state bit

    Linus Torvalds
     

22 Mar, 2013

1 commit

  • Steps to reproduce:
    mkfs.btrfs
    mount
    btrfs quota enable
    btrfs sub create /subv
    btrfs qgroup limit 10M /subv
    fallocate --length 20M /subv/data

    In the above example, fallocate returns successfully, which is
    not expected. We fix it by doing the qgroup reservation before
    fallocating.

    Signed-off-by: Wang Shilong
    Reviewed-by: Miao Xie
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Wang Shilong
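
    The ordering this fix introduces (reserve against the quota first,
    allocate only if that succeeds) can be sketched as follows. The names
    demo_qgroup, demo_qgroup_reserve() and demo_fallocate() are hypothetical,
    and the model assumes reserved never exceeds limit.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical quota group: a byte limit and what is already reserved. */
struct demo_qgroup {
    uint64_t limit;
    uint64_t reserved;
};

/* Reserve `bytes` against the group, or fail with -EDQUOT. */
int demo_qgroup_reserve(struct demo_qgroup *qg, uint64_t bytes)
{
    if (bytes > qg->limit - qg->reserved)
        return -EDQUOT;
    qg->reserved += bytes;
    return 0;
}

/* Preallocate `len` bytes, but only after the quota reservation succeeds. */
int demo_fallocate(struct demo_qgroup *qg, uint64_t len)
{
    int ret = demo_qgroup_reserve(qg, len);

    if (ret)
        return ret;  /* over quota: nothing is allocated */
    /* ... the actual preallocation would happen here ... */
    return 0;
}
```

    With a 10M limit, a 20M request is now rejected instead of succeeding.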
     

18 Mar, 2013

1 commit

  • Pull btrfs fixes from Chris Mason:
    "Eric's rcu barrier patch fixes a long standing problem with our
    unmount code hanging on to devices in workqueue helpers. Liu Bo
    nailed down a difficult assertion for in-memory extent mappings."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: fix warning of free_extent_map
    Btrfs: fix warning when creating snapshots
    Btrfs: return as soon as possible when edquot happens
    Btrfs: return EIO if we have extent tree corruption
    btrfs: use rcu_barrier() to wait for bdev puts at unmount
    Btrfs: remove btrfs_try_spin_lock
    Btrfs: get better concurrency for snapshot-aware defrag work

    Linus Torvalds
     

16 Mar, 2013

1 commit

  • Users report that an extent map's list is still linked when it's actually
    going to be freed from cache.

    The story is that

    a) when we're going to drop an extent map and may split this large one into
    smaller ems, and if this large one is flagged as EXTENT_FLAG_LOGGING which means
    that it's on the list to be logged, then the smaller ems split from it will also
    be flagged as EXTENT_FLAG_LOGGING, and this is _not_ expected.

    b) we'll keep ems from unlinking the list and freeing when they are flagged with
    EXTENT_FLAG_LOGGING, because the log code holds one reference.

    The end result is the warning, but the truth is that we set the flag
    EXTENT_FLAG_LOGGING only during fsync.

    So clear flag EXTENT_FLAG_LOGGING for extent maps split from a large one.

    Reported-by: Johannes Hirte
    Reported-by: Darrick J. Wong
    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
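
    A minimal sketch of the fix, assuming flags are kept as bit numbers in an
    unsigned long: the pieces split from a large extent map inherit its flags
    except for the logging bit. The DEMO_ names are hypothetical.

```c
/* Hypothetical bit numbers for extent map flags. */
enum {
    DEMO_FLAG_PINNED = 0,
    DEMO_FLAG_LOGGING = 1,
};

/*
 * Flags for an extent map split from a larger one: everything is copied
 * from the parent except the logging bit, which only belongs on maps that
 * are actually queued by an fsync.
 */
unsigned long demo_split_flags(unsigned long parent_flags)
{
    return parent_flags & ~(1UL << DEMO_FLAG_LOGGING);
}
```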
     

03 Mar, 2013

1 commit

  • Pull btrfs update from Chris Mason:
    "The biggest feature in the pull is the new (and still experimental)
    raid56 code that David Woodhouse started long ago. I'm still working
    on the parity logging setup that will avoid inconsistent parity after
    a crash, so this is only for testing right now. But, I'd really like
    to get it out to a broader audience to hammer out any performance
    issues or other problems.

    scrub does not yet correct errors on raid5/6 either.

    Josef has another pass at fsync performance. The big change here is
    to combine waiting for metadata with waiting for data, which is a big
    latency win. It is also step one toward using atomics from the
    hardware during a commit.

    Mark Fasheh has a new way to use btrfs send/receive to send only the
    metadata changes. SUSE is using this to make snapper more efficient
    at finding changes between snapshots.

    Snapshot-aware defrag is also included.

    Otherwise we have a large number of fixes and cleanups. Eric Sandeen
    wins the award for removing the most lines, and I'm hoping we steal
    this idea from XFS over and over again."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits)
    btrfs: fixup/remove module.h usage as required
    Btrfs: delete inline extents when we find them during logging
    btrfs: try harder to allocate raid56 stripe cache
    Btrfs: cleanup to make the function btrfs_delalloc_reserve_metadata more logic
    Btrfs: don't call btrfs_qgroup_free if just btrfs_qgroup_reserve fails
    Btrfs: remove reduplicate check about root in the function btrfs_clean_quota_tree
    Btrfs: return ENOMEM rather than use BUG_ON when btrfs_alloc_path fails
    Btrfs: fix missing deleted items in btrfs_clean_quota_tree
    btrfs: use only inline_pages from extent buffer
    Btrfs: fix wrong reserved space when deleting a snapshot/subvolume
    Btrfs: fix wrong reserved space in qgroup during snap/subv creation
    Btrfs: remove unnecessary dget_parent/dput when creating the pending snapshot
    btrfs: remove a printk from scan_one_device
    Btrfs: fix NULL pointer after aborting a transaction
    Btrfs: fix memory leak of log roots
    Btrfs: copy everything if we've created an inline extent
    btrfs: cleanup for open-coded alignment
    Btrfs: do not change inode flags in rename
    Btrfs: use reserved space for creating a snapshot
    clear chunk_alloc flag on retryable failure
    ...

    Linus Torvalds
     

27 Feb, 2013

2 commits

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     
  • Though most of the btrfs code uses the ALIGN macro for page alignment,
    some code still uses open-coded alignment like the
    following:
    ------
    u64 mask = ((u64)root->stripesize - 1);
    u64 ret = (val + mask) & ~mask;
    ------
    Or even a hidden one:
    ------
    num_bytes = (end - start + blocksize) & ~(blocksize - 1);
    ------

    Sometimes this open-coded alignment is not so easy to understand for
    a newbie like me.

    This commit changes the open-coded alignment to the ALIGN macro for
    better readability.

    Also there is a previous patch from David Sterba with similar changes,
    but that patch is for the 3.2 kernel and seems not to have been merged.
    http://www.spinics.net/lists/linux-btrfs/msg12747.html

    Cc: David Sterba
    Signed-off-by: Qu Wenruo
    Signed-off-by: Josef Bacik

    Qu Wenruo
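
    To see why the two open-coded forms quoted in this commit are just
    round-ups to a power-of-two boundary, here is a small userspace sketch.
    DEMO_ALIGN is a local macro written in the spirit of the kernel's ALIGN();
    the function names are hypothetical, and the alignment is assumed to be a
    power of two.

```c
#include <stdint.h>

/* Round x up to the next multiple of a; a must be a power of two. */
#define DEMO_ALIGN(x, a) (((x) + ((a) - 1)) & ~((a) - 1))

/* First open-coded form from the commit message. */
uint64_t open_coded_align(uint64_t val, uint64_t stripesize)
{
    uint64_t mask = stripesize - 1;

    return (val + mask) & ~mask;
}

/* "Hidden" form: rounds the length of the inclusive range [start, end]. */
uint64_t hidden_align(uint64_t start, uint64_t end, uint64_t blocksize)
{
    return (end - start + blocksize) & ~(blocksize - 1);
}

/* The same computation with the intent spelled out. */
uint64_t macro_align_len(uint64_t start, uint64_t end, uint64_t blocksize)
{
    return DEMO_ALIGN(end - start + 1, blocksize);
}
```

    The hidden form works because (end - start + 1) + (blocksize - 1) equals
    end - start + blocksize, which is exactly what the macro expands to.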
     

23 Feb, 2013

1 commit


21 Feb, 2013

4 commits

  • If we remount the fs to turn off auto defragmentation or to make the fs
    R/O, we should stop auto defragmentation.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • …fs-next into for-linus-3.9

    Signed-off-by: Chris Mason <chris.mason@fusionio.com>

    Conflicts:
    fs/btrfs/disk-io.c

    Chris Mason
     
  • Miao made the ordered operations stuff run async, which introduced a
    deadlock where we could get somebody (sync) racing in and committing the
    transaction while a commit was already happening. The new committer would
    try and flush ordered operations which would hang waiting for the commit to
    finish because it is done asynchronously and no longer inherits the caller's
    trans handle. To fix this we need to make the ordered operations list a per
    transaction list. We can get new inodes added to the ordered operation list
    by truncating them and then having another process writing to them, so this
    makes it so that anybody trying to add an ordered operation _must_ start a
    transaction in order to add itself to the list, which will keep new inodes
    from getting added to the ordered operations list after we start committing.
    This should fix the deadlock and also keeps us from doing a lot more work
    than we need to during commit. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • There is no lock to protect fs_info->fs_state, which introduces
    problems: the value may be overwritten by another task
    when several tasks modify it. For example:
    Task0 - CPU0            Task1 - CPU1
    mov %fs_state rax
    or $0x1 rax
                            mov %fs_state rax
                            or $0x2 rax
    mov rax %fs_state
                            mov rax %fs_state
    The expected value is 3, but in fact, it is 2.

    Though this problem doesn't happen now (because there is only one
    flag currently), the code is error prone; if we add other flags,
    the above problem is certain to happen.

    Now we use bit operations on it to fix the above problem.
    This makes the code more robust and makes it easy to
    add new flags.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
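
    The difference between a plain read-modify-write and an atomic bit
    operation can be sketched in userspace with C11 atomics. This is an
    analogy to the kernel's set_bit()/test_bit() style of helper, not the
    btrfs code itself; the demo_ names and bit numbers are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state bits, identified by bit number. */
enum {
    DEMO_STATE_ERROR = 0,
    DEMO_STATE_OTHER = 1,
};

/*
 * Set one bit with a single atomic read-modify-write, so a concurrent
 * update of a different bit can never be lost.
 */
void demo_set_bit(unsigned int nr, atomic_ulong *state)
{
    atomic_fetch_or(state, 1UL << nr);
}

/* Report whether bit `nr` is currently set. */
bool demo_test_bit(unsigned int nr, atomic_ulong *state)
{
    return (atomic_load(state) >> nr) & 1UL;
}
```

    If two tasks set bit 0 and bit 1 this way, the final value is 3 whatever
    the interleaving, because each update is indivisible.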
     

20 Feb, 2013

2 commits

  • The header file will then be installed under /usr/include/linux so that
    userspace applications can refer to Btrfs ioctls by name and use the same
    structs used internally in the kernel.

    Signed-off-by: Filipe Brandenburger
    Signed-off-by: Josef Bacik

    Filipe Brandenburger
     
  • Since we don't actually copy the extent information from the source tree in
    the fast case we don't need to wait for ordered io to be completed in order
    to fsync, we just need to wait for the io to be completed. So when we're
    logging our file just attach all of the ordered extents to the log, and then
    when the log syncs just wait for IO_DONE on the ordered extents and then
    write the super. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

08 Feb, 2013

1 commit

  • Pull btrfs fixes from Chris Mason:
    "We've got corner cases for updating i_size that ceph was hitting,
    error handling for quotas when we run out of space, a very subtle
    snapshot deletion race, a crash while removing devices, and one
    deadlock between subvolume creation and the sb_internal code (thanks
    lockdep)."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: move d_instantiate outside the transaction during mksubvol
    Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata
    Btrfs: fix possible stale data exposure
    Btrfs: fix missing i_size update
    Btrfs: fix race between snapshot deletion and getting inode
    Btrfs: fix missing release of the space/qgroup reservation in start_transaction()
    Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write()
    Btrfs: do not merge logged extents if we've removed them from the tree
    btrfs: don't try to notify udev about missing devices

    Linus Torvalds
     

06 Feb, 2013

2 commits

  • While running the snapshot test script created by Mitch and David,
    the race between autodefrag and snapshot deletion can lead to
    corruption of the dead_roots list, so that we can crash in
    btrfs_clean_old_snapshots().

    Besides autodefrag, scrub also does the same thing, i.e. reads the
    root first and then gets the inode.

    Here is the story (taking autodefrag as an example):
    (1) when we delete a snapshot or subvolume, it will set its root's
    refs to zero and do an iput() on its own inode, and if this inode happens
    to be the only active in-memory one in the root's inode rbtree, it will add
    itself to the global dead_roots list for later cleanup.

    (2) after (1), the autodefrag thread may read another inode for defrag
    and the inode is just in the deleted snapshot/subvolume, but all of this
    is done without checking if the root is still valid (refs > 0). So the end
    result is adding the deleted snapshot/subvolume's root to the global
    dead_roots list AGAIN.

    Fortunately, we already have an srcu lock to avoid the race, i.e. subvol_srcu.

    So all we need to do is to take the lock to protect 'read root and get inode',
    since we synchronize to wait for the rcu grace period before adding something
    to the global dead_roots list.

    Reported-by: Mitch Harder
    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo
     
  • If the checks at the beginning of btrfs_file_aio_write() fail, we needn't
    decrease ->sync_writers, because we have not increased it. Fix it.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     

26 Jan, 2013

1 commit

  • Pull btrfs fixes from Chris Mason:
    "It turns out that we had two crc bugs when running fsx-linux in a
    loop. Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it
    all down. Miao also has a new OOM fix in this v2 pull as well.

    Ilya fixed a regression Liu Bo found in the balance ioctls for pausing
    and resuming a running balance across drives.

    Josef's orphan truncate patch fixes an obscure corruption we'd see
    during xfstests.

    Arne's patches address problems with subvolume quotas. If the user
    destroys quota groups incorrectly the FS will refuse to mount.

    The rest are smaller fixes and plugs for memory leaks."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits)
    Btrfs: fix repeated delalloc work allocation
    Btrfs: fix wrong max device number for single profile
    Btrfs: fix missed transaction->aborted check
    Btrfs: Add ACCESS_ONCE() to transaction->abort accesses
    Btrfs: put csums on the right ordered extent
    Btrfs: use right range to find checksum for compressed extents
    Btrfs: fix panic when recovering tree log
    Btrfs: do not allow logged extents to be merged or removed
    Btrfs: fix a regression in balance usage filter
    Btrfs: prevent qgroup destroy when there are still relations
    Btrfs: ignore orphan qgroup relations
    Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag
    Btrfs: fix unlock order in btrfs_ioctl_rm_dev
    Btrfs: fix unlock order in btrfs_ioctl_resize
    Btrfs: fix "mutually exclusive op is running" error code
    Btrfs: bring back balance pause/resume logic
    btrfs: update timestamps on truncate()
    btrfs: fix btrfs_cont_expand() freeing IS_ERR em
    Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents
    Btrfs: fix off-by-one in lseek
    ...

    Linus Torvalds
     

15 Jan, 2013

2 commits


19 Dec, 2012

1 commit

  • Pull btrfs update from Chris Mason:
    "A big set of fixes and features.

    In terms of line count, most of the code comes from Stefan, who added
    the ability to replace a single drive in place. This is different
    from how btrfs normally replaces drives, and is much much much faster.

    Josef is plowing through our synchronous write performance. This pull
    request does not include the DIO_OWN_WAITING patch that was discussed
    on the list, but it has a number of other improvements to cut down our
    latencies and CPU time during fsync/O_DIRECT writes.

    Miao Xie has a big series of fixes and is spreading out ordered
    operations over more CPUs. This improves performance and reduces
    contention.

    I've put in fixes for error handling around hash collisions. These
    are going back to individual stable kernels as I test against them.

    Otherwise we have a lot of fixes and cleanups, thanks everyone!
    raid5/6 is being rebased against the device replacement code. I'll
    have it posted this Friday along with a nice series of benchmarks."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits)
    Btrfs: fix a bug of per-file nocow
    Btrfs: fix hash overflow handling
    Btrfs: don't take inode delalloc mutex if we're a free space inode
    Btrfs: fix autodefrag and umount lockup
    Btrfs: fix permissions of empty files not affected by umask
    Btrfs: put raid properties into global table
    Btrfs: fix BUG() in scrub when first superblock reading gives EIO
    Btrfs: do not call file_update_time in aio_write
    Btrfs: only unlock and relock if we have to
    Btrfs: use tokens where we can in the tree log
    Btrfs: optimize leaf_space_used
    Btrfs: don't memset new tokens
    Btrfs: only clear dirty on the buffer if it is marked as dirty
    Btrfs: move checks in set_page_dirty under DEBUG
    Btrfs: log changed inodes based on the extent map tree
    Btrfs: add path->really_keep_locks
    Btrfs: do not mark ems as prealloc if we are writing to them
    Btrfs: keep track of the extents original block length
    Btrfs: inline csums if we're fsyncing
    Btrfs: don't bother copying if we're only logging the inode
    ...

    Linus Torvalds
     

18 Dec, 2012

1 commit


17 Dec, 2012

9 commits

  • This starts a transaction and dirties the inode every time we call it, which
    is super expensive if you have a write-heavy workload. We will be updating
    the inode when the IO completes and we reserve the space for the inode
    update when we reserve space for the write, so there is no chance of loss of
    information or enospc issues. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • We don't really need to copy extents from the source tree since we have all
    of the information already available to us in the extent_map tree. So
    instead just write the extents straight to the log tree and don't bother to
    copy the extent items from the source tree.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • If we've written to a prealloc extent we need to know the original block len
    for the extent. We can't figure this out currently since ->block_len is
    just set to the extent length. So introduce ->orig_block_len so that we
    know how many bytes were in the original extent for proper extent logging
    that future patches will need. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • The tree logging stuff needs the csums to be on the ordered extents in order
    to log them properly, so mark that we're sync and inline the csum creation
    so we don't have to wait on the csumming to be done when logging extents
    that are still in flight. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Since we can pre-allocate the space past EOF, we should be able to reclaim
    that space if we need to. This patch implements it by removing the EOF check.

    Though the fallocate manual says we can use the truncate command to
    reclaim pre-allocated space past EOF, truncate changes the file size, so
    we would have to run several commands to reclaim the space without
    changing the file size. It is therefore not a good choice.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Steps to reproduce:
    # mkfs.btrfs
    # mount
    # dd if=/dev/zero of=/ bs=512 seek=5 count=8
    # fallocate -p -o 2048 -l 16384 /
    # dd if=/dev/zero of=/ bs=4096 seek=3 count=8 conv=notrunc,nocreat
    # umount
    # dmesg
    WARNING: at fs/btrfs/inode.c:7140 btrfs_destroy_inode+0x2eb/0x330

    The reason is that we passed in a range which is beyond the end of the file. And
    because the end of this range was not page-aligned, we had to truncate the last
    page in this range; this operation is similar to a buffered file write. In other
    words, we reserved enough space and cleared the data which was in the hole range
    on that page. But when we expanded that test file and wrote data into the same
    page, we forgot that we had already reserved enough space for the buffered write
    of that page, because in most cases there is no page beyond the end of
    the file. As a result, we reserved the space twice.

    In fact, we needn't truncate the page if it is beyond the end of the file; we
    just release the allocated space in that range. Fix the above problem this way.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • (start + len) is the start of the adjacent extent, not the end of the current
    extent, so we should not use it to check whether the hole is on the same page.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
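
    A small sketch of the same-page check with the corrected end point. The
    page size of 4096 bytes, the DEMO_PAGE_SHIFT constant and the function
    name are assumptions made for illustration; len is assumed to be non-zero.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12  /* assumes 4096-byte pages */

/*
 * A hole [start, start + len) lies within one page if its first and LAST
 * bytes share a page index. start + len is one past the hole, so using it
 * directly would wrongly report a page-sized, page-aligned hole as
 * spanning two pages.
 */
bool hole_within_one_page(uint64_t start, uint64_t len)
{
    uint64_t last = start + len - 1;  /* last byte actually in the hole */

    return (start >> DEMO_PAGE_SHIFT) == (last >> DEMO_PAGE_SHIFT);
}
```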
     
  • alloc_end is not the real end of the current extent; it is the start of the
    next adjoining extent. So we needn't add 1 when calculating the size of the
    space that is about to be reserved.

    Signed-off-by: Miao Xie
    Reviewed-by: Liu Bo
    Signed-off-by: Chris Mason

    Miao Xie
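
    The off-by-one comes down to whether the end of a range is exclusive or
    inclusive. A minimal illustration, with hypothetical helper names:

```c
#include <stdint.h>

/* Length of [start, end), where `end` is the first byte of the NEXT extent. */
uint64_t range_len_exclusive(uint64_t start, uint64_t end)
{
    return end - start;
}

/* Length of [start, last], where `last` is the final byte of the extent. */
uint64_t range_len_inclusive(uint64_t start, uint64_t last)
{
    return last - start + 1;
}
```

    For an extent covering bytes 0 to 4095 the exclusive end is 4096 and both
    helpers return 4096; adding 1 to the exclusive form would reserve one
    byte too many.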
     
  • The kernel developers have implemented some often-used align macros; we should
    use them instead of the complex code.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie