18 Dec, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    "In this pile:

    - autofs-namespace series
    - dedupe stuff
    - more struct path constification"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
    ocfs2: charge quota for reflinked blocks
    ocfs2: fix bad pointer cast
    ocfs2: always unlock when completing dio writes
    ocfs2: don't eat io errors during _dio_end_io_write
    ocfs2: budget for extent tree splits when adding refcount flag
    ocfs2: prohibit refcounted swapfiles
    ocfs2: add newlines to some error messages
    ocfs2: convert inode refcount test to a helper
    simple_write_end(): don't zero in short copy into uptodate
    exofs: don't mess with simple_write_{begin,end}
    9p: saner ->write_end() on failing copy into non-uptodate page
    fix gfs2_stuffed_write_end() on short copies
    fix ceph_write_end()
    nfs_write_end(): fix handling of short copies
    vfs: refactor clone/dedupe_file_range common functions
    fs: try to clone files first in vfs_copy_file_range
    vfs: misc struct path constification
    namespace.c: constify struct path passed to a bunch of primitives
    quota: constify struct path in quota_on
    ...

    Linus Torvalds
     

14 Dec, 2016

1 commit

  • …dmanana/linux into for-linus-4.10

    Patches queued up by Filipe:

    The most important change is still the fix for the extent tree
    corruption that happens due to balance when qgroups are enabled (a
    regression introduced in 4.7 by a fix for a regression from the last
    qgroups rework). This has been hitting SLE and openSUSE users and QA
    very badly, where transactions keep getting aborted when running
    delayed references leaving the root filesystem in RO mode and nearly
    unusable. There are fixes here that allow us to run xfstests again
    with the integrity checker enabled, which has been impossible since 4.8
    (apparently I'm the only one running xfstests with the integrity
    checker enabled, which is useful to validate dirtied leafs, like
    checking if there are keys out of order, etc). The rest are just some
    trivial fixes, most of them tagged for stable, and two cleanups.

    Signed-off-by: Chris Mason <clm@fb.com>

    Chris Mason
     

10 Dec, 2016

1 commit

  • A clone is a perfectly fine implementation of a file copy, so most
    file systems just implement the copy that way. Instead of duplicating
    this logic move it to the VFS. Currently btrfs and XFS implement copies
    the same way as clones and there is no behavior change for them, cifs
    only implements clones and gains support for copy_file_range with this
    patch. NFS implements both, so this will allow copy_file_range to work
    on servers that only implement CLONE and be a lot more efficient on servers
    that implement CLONE and COPY.
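
The "clone first, then copy" order described above can be sketched from userspace. This is only an illustration under assumptions: copy_clone_first() is a hypothetical helper, it uses the FICLONE ioctl from <linux/fs.h> for the clone attempt, and it treats any clone failure as "fall back to copying", which is coarser than what the VFS does.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FICLONE */

/*
 * Copy all of src_fd into dst_fd. Try a clone (reflink) first and fall
 * back to a plain read/write loop when the clone is not possible.
 * Returns 0 on success, -1 with errno set on failure.
 * Hypothetical helper, not a kernel or libc API.
 */
static int copy_clone_first(int src_fd, int dst_fd)
{
    char buf[65536];
    ssize_t nread;

    assert(src_fd >= 0 && dst_fd >= 0);

    /* A successful clone shares extents, so no data is read or written. */
    if (ioctl(dst_fd, FICLONE, src_fd) == 0)
        return 0;

    /* Clone unsupported or refused: copy the bytes instead. */
    if (lseek(src_fd, 0, SEEK_SET) < 0 || lseek(dst_fd, 0, SEEK_SET) < 0)
        return -1;
    while ((nread = read(src_fd, buf, sizeof(buf))) > 0) {
        ssize_t done = 0;

        while (done < nread) {
            ssize_t n = write(dst_fd, buf + done, (size_t)(nread - done));

            if (n < 0)
                return -1;
            done += n;
        }
    }
    return nread < 0 ? -1 : 0;
}
```

On a filesystem without reflink support the ioctl fails and the loop does the work; on btrfs or XFS with reflink the first call is expected to succeed without moving data.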

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     

06 Dec, 2016

5 commits


30 Nov, 2016

4 commits

  • The hole punching can result in adding new leafs (and as a consequence
    new nodes) to the tree because when we find file extent items that span
    beyond the hole range we may end up not deleting them (just adjusting
    them, reducing their range by reducing their length or increasing their
    offset field) and add new file extent items representing holes.

    So after splitting a leaf (therefore creating a new one) to insert a new
    file extent item representing a hole, a new node might be added to each
    level of the tree in the worst case scenario (since there's a new key
    and every parent node was full).

    For example, if a file has an extent item representing the range 0 to 64MB
    and we punch a hole in the range 1MB to 20MB, the existing extent item is
    duplicated and one of the copies is adjusted to represent the range 0 to
    1MB, the other copy adjusted to represent the range 20MB to 64MB, and a
    new file extent item representing a hole in the range 1MB to 20MB is
    inserted.
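
The split in that example can be written out as a small model. The struct and punch_hole() below are hypothetical simplifications (real file extent items carry far more state); the sketch only shows that one item can turn into three, which is why more metadata space may be needed than a pure deletion would use.

```c
#include <assert.h>
#include <stdint.h>

enum item_kind { ITEM_DATA, ITEM_HOLE };

/* Simplified stand-in for a file extent item. */
struct file_extent {
    uint64_t start;        /* file offset in bytes */
    uint64_t len;          /* length in bytes */
    enum item_kind kind;
};

/*
 * Punch [hole_start, hole_end) out of a single data extent that spans it.
 * Writes the resulting items to out[] and returns how many there are.
 */
static int punch_hole(struct file_extent ext, uint64_t hole_start,
                      uint64_t hole_end, struct file_extent out[3])
{
    uint64_t ext_end = ext.start + ext.len;
    int n = 0;

    assert(ext.start <= hole_start && hole_start < hole_end &&
           hole_end <= ext_end);

    if (hole_start > ext.start)   /* head copy: length reduced */
        out[n++] = (struct file_extent){ ext.start, hole_start - ext.start, ITEM_DATA };
    /* new item representing the hole */
    out[n++] = (struct file_extent){ hole_start, hole_end - hole_start, ITEM_HOLE };
    if (hole_end < ext_end)       /* tail copy: offset increased */
        out[n++] = (struct file_extent){ hole_end, ext_end - hole_end, ITEM_DATA };
    return n;
}
```

With an extent covering 0 to 64MB and a hole from 1MB to 20MB, the function returns three items: data for 0 to 1MB, a hole for 1MB to 20MB, and data for 20MB to 64MB.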

    Fix this by using btrfs_calc_trans_metadata_size() instead of
    btrfs_calc_trunc_metadata_size(), so that enough metadata space is
    reserved for the worst possible case.

    Signed-off-by: Robbie Ko
    Reviewed-by: Filipe Manana
    Signed-off-by: Filipe Manana
    [Modified changelog for clarity and correctness]

    Robbie Ko
     
  • At this point we will have dropped extent entries from the file, so if we fail
    to insert the new hole entries then we are leaving the fs in a corrupt state
    (albeit an easily fixed one). Abort the transaction if this happens so we can
    avoid corrupting the fs. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • In order to do hole punching we have a block reserve to hold the reservation we
    need to drop the extents in our range. Since we could end up dropping a lot of
    extents we set rsv->failfast so we can just loop around again and drop the
    remainder of the range. Unfortunately we unconditionally fill the hole extents
    in and start from the last extent we encountered, which we may or may not have
    dropped. So this can result in overlapping file extent entries, which can be
    tripped over in a variety of ways, either by hitting BUG_ON(!ret) in
    fill_holes() after the search, or in btrfs_set_item_key_safe() in
    btrfs_drop_extent() at a later time by an unrelated task. Fix this by only
    setting drop_end to the last extent we did actually drop. This way our holes
    are filled in properly for the range that we did drop, and the rest of the range
    that remains to be dropped is actually dropped. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Signed-off-by: David Sterba

    David Sterba
     

12 Oct, 2016

1 commit

  • Pull btrfs updates from Chris Mason:
    "This is a big variety of fixes and cleanups.

    Liu Bo continues to fixup fuzzer related problems, and some of Josef's
    cleanups are prep for his bigger extent buffer changes (slated for
    v4.10)"

    * 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (39 commits)
    Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"
    Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf
    Btrfs: don't BUG() during drop snapshot
    btrfs: fix btrfs_no_printk stub helper
    Btrfs: memset to avoid stale content in btree leaf
    btrfs: parent_start initialization cleanup
    btrfs: Remove already completed TODO comment
    btrfs: Do not reassign count in btrfs_run_delayed_refs
    btrfs: fix a possible umount deadlock
    Btrfs: fix memory leak in do_walk_down
    btrfs: btrfs_debug should consume fs_info when DEBUG is not defined
    btrfs: convert send's verbose_printk to btrfs_debug
    btrfs: convert pr_* to btrfs_* where possible
    btrfs: convert printk(KERN_* to use pr_* calls
    btrfs: unsplit printed strings
    btrfs: clean the old superblocks before freeing the device
    Btrfs: kill BUG_ON in run_delayed_tree_ref
    Btrfs: don't leak reloc root nodes on error
    btrfs: squash lines for simple wrapper functions
    Btrfs: improve check_node to avoid reading corrupted nodes
    ...

    Linus Torvalds
     

11 Oct, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    ">rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     

28 Sep, 2016

1 commit

  • current_fs_time() uses struct super_block* as an argument.
    As per Linus's suggestion, this is changed to take struct
    inode* as a parameter instead. This is because the function
    is primarily meant for vfs inode timestamps.
    Also the function was renamed as per Arnd's suggestion.

    Change all calls to current_fs_time() to use the new
    current_time() function instead. current_fs_time() will be
    deleted.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Al Viro

    Deepa Dinamani
     

26 Sep, 2016

2 commits


16 Sep, 2016

1 commit


25 Aug, 2016

2 commits

  • Commit 44f714dae50a ("Btrfs: improve performance on fsync against new
    inode after rename/unlink"), which landed in 4.8-rc2, introduced a
    possibility for a deadlock due to double locking of an inode's log mutex
    by the same task, which lockdep reports with:

    [23045.433975] =============================================
    [23045.434748] [ INFO: possible recursive locking detected ]
    [23045.435426] 4.7.0-rc6-btrfs-next-34+ #1 Not tainted
    [23045.436044] ---------------------------------------------
    [23045.436044] xfs_io/3688 is trying to acquire lock:
    [23045.436044] (&ei->log_mutex){+.+...}, at: [] btrfs_log_inode+0x13a/0xc95 [btrfs]
    [23045.436044]
    but task is already holding lock:
    [23045.436044] (&ei->log_mutex){+.+...}, at: [] btrfs_log_inode+0x13a/0xc95 [btrfs]
    [23045.436044]
    other info that might help us debug this:
    [23045.436044] Possible unsafe locking scenario:

    [23045.436044] CPU0
    [23045.436044] ----
    [23045.436044] lock(&ei->log_mutex);
    [23045.436044] lock(&ei->log_mutex);
    [23045.436044]
    *** DEADLOCK ***

    [23045.436044] May be due to missing lock nesting notation

    [23045.436044] 3 locks held by xfs_io/3688:
    [23045.436044] #0: (&sb->s_type->i_mutex_key#15){+.+...}, at: [] btrfs_sync_file+0x14e/0x425 [btrfs]
    [23045.436044] #1: (sb_internal#2){.+.+.+}, at: [] __sb_start_write+0x5f/0xb0
    [23045.436044] #2: (&ei->log_mutex){+.+...}, at: [] btrfs_log_inode+0x13a/0xc95 [btrfs]
    [23045.436044]
    stack backtrace:
    [23045.436044] CPU: 4 PID: 3688 Comm: xfs_io Not tainted 4.7.0-rc6-btrfs-next-34+ #1
    [23045.436044] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
    [23045.436044] 0000000000000000 ffff88022f5f7860 ffffffff8127074d ffffffff82a54b70
    [23045.436044] ffffffff82a54b70 ffff88022f5f7920 ffffffff81092897 ffff880228015d68
    [23045.436044] 0000000000000000 ffffffff82a54b70 ffffffff829c3f00 ffff880228015d68
    [23045.436044] Call Trace:
    [23045.436044] [] dump_stack+0x67/0x90
    [23045.436044] [] __lock_acquire+0xcbb/0xe4e
    [23045.436044] [] ? mark_lock+0x24/0x201
    [23045.436044] [] ? mark_held_locks+0x5e/0x74
    [23045.436044] [] lock_acquire+0x12f/0x1c3
    [23045.436044] [] ? lock_acquire+0x12f/0x1c3
    [23045.436044] [] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
    [23045.436044] [] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
    [23045.436044] [] mutex_lock_nested+0x77/0x3a7
    [23045.436044] [] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
    [23045.436044] [] ? btrfs_release_delayed_node+0xb/0xd [btrfs]
    [23045.436044] [] btrfs_log_inode+0x13a/0xc95 [btrfs]
    [23045.436044] [] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
    [23045.436044] [] ? vprintk_emit+0x453/0x465
    [23045.436044] [] btrfs_log_inode+0x66e/0xc95 [btrfs]
    [23045.436044] [] log_new_dir_dentries+0x26c/0x359 [btrfs]
    [23045.436044] [] btrfs_log_inode_parent+0x4a6/0x628 [btrfs]
    [23045.436044] [] btrfs_log_dentry_safe+0x5a/0x75 [btrfs]
    [23045.436044] [] btrfs_sync_file+0x304/0x425 [btrfs]
    [23045.436044] [] vfs_fsync_range+0x8c/0x9e
    [23045.436044] [] vfs_fsync+0x1c/0x1e
    [23045.436044] [] do_fsync+0x31/0x4a
    [23045.436044] [] SyS_fsync+0x10/0x14
    [23045.436044] [] entry_SYSCALL_64_fastpath+0x18/0xa8
    [23045.436044] [] ? trace_hardirqs_off_caller+0x3f/0xaa

    An example reproducer for this is:

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt
    $ mkdir /mnt/dir
    $ touch /mnt/dir/foo
    $ sync
    $ mv /mnt/dir/foo /mnt/dir/bar
    $ touch /mnt/dir/foo
    $ xfs_io -c "fsync" /mnt/dir/bar

    This is because while logging the inode of file bar we end up logging its
    parent directory (since its inode has an unlink_trans field matching the
    current transaction id due to the rename operation), which in turn logs
    the inodes for all its new dentries, so that the new inode for the new
    file named foo gets logged which in turn triggered another logging attempt
    for the inode we are fsync'ing, since that inode had an old name that
    corresponds to the name of the new inode.

    So fix this by ensuring that when logging the inode for a new dentry that
    has a name matching an old name of some other inode, we don't log again
    the original inode that we are fsync'ing.

    Fixes: 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink")
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • This patch fixes some false ENOSPC errors. The test script below
    reproduces one of them:
    #!/bin/bash
    dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
    dev=$(losetup --show -f fs.img)
    mkfs.btrfs -f -M $dev
    mkdir /tmp/mntpoint
    mount $dev /tmp/mntpoint
    cd /tmp/mntpoint
    xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile

    The script above fails with ENOSPC even though the filesystem still has
    enough free space to satisfy the request. See the call graph:
    btrfs_fallocate()
    |-> btrfs_alloc_data_chunk_ondemand()
    |     bytes_may_use += 64M
    |-> btrfs_prealloc_file_range()
        |-> btrfs_reserve_extent()
            |-> btrfs_add_reserved_bytes()
            |     alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
            |     change bytes_may_use, and bytes_reserved += 64M. Now
            |     bytes_may_use + bytes_reserved == 128M, which is greater
            |     than btrfs_space_info's total_bytes, false enospc occurs.
            |     Note, the bytes_may_use decrease operation will be done in
            |     end of btrfs_fallocate(), which is too late.

    Here is another simple case for buffered write:
    CPU 1                                  | CPU 2
                                           |
    |-> cow_file_range()                   |-> __btrfs_buffered_write()
        |-> btrfs_reserve_extent()         |   |
        |                                  |   |
        |                                  |   |
        |   .....                          |   |-> btrfs_check_data_free_space()
        |                                  |
        |                                  |
        |-> extent_clear_unlock_delalloc() |

    On CPU 1, btrfs_reserve_extent()->find_free_extent()->
    btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease
    is delayed until extent_clear_unlock_delalloc().
    Assume that in this case btrfs_reserve_extent() reserved 128MB of data and
    that btrfs_check_data_free_space() on CPU 2 tries to reserve 100MB of data
    space.
    If
    100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
    data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
    data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
    then btrfs_check_data_free_space() will try to allocate a new data chunk,
    call btrfs_start_delalloc_roots(), or commit the current transaction in
    order to reserve some free space, which is obviously a lot of work. None of
    it is necessary: if bytes_may_use were decreased in time (by 128MB here),
    there would still be free space.

    To fix this issue, this patch updates bytes_may_use for both data and
    metadata in btrfs_add_reserved_bytes(). On the compress path the real
    extent length may not equal the file content length, and bytes_may_use
    is increased by the file content length, so introduce a ram_bytes
    argument for btrfs_reserve_extent(), find_free_extent() and
    btrfs_add_reserved_bytes(). The compress path can then update
    bytes_may_use correctly. RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
    and RESERVE_FREE can now be discarded as well.
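
The double counting can be reproduced with a toy model of the counters. Everything below is a simplified sketch: the struct keeps only the fields named in the message, the function names are hypothetical, and the 100MB total used in the usage note is an assumed figure, not one taken from the report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the counters named in the commit message. */
struct space_info {
    uint64_t total_bytes;
    uint64_t bytes_used;
    uint64_t bytes_reserved;
    uint64_t bytes_pinned;
    uint64_t bytes_readonly;
    uint64_t bytes_may_use;
};

/* Everything that currently counts against total_bytes. */
static uint64_t committed(const struct space_info *s)
{
    return s->bytes_used + s->bytes_reserved + s->bytes_pinned +
           s->bytes_readonly + s->bytes_may_use;
}

/* Step 1 of a write or fallocate: promise the space. */
static bool check_data_free_space(struct space_info *s, uint64_t bytes)
{
    if (committed(s) + bytes > s->total_bytes)
        return false;               /* would surface as ENOSPC */
    s->bytes_may_use += bytes;
    return true;
}

/* Step 2, old behaviour: the promise stays in place until much later. */
static void add_reserved_bytes_old(struct space_info *s, uint64_t num_bytes)
{
    s->bytes_reserved += num_bytes;
}

/*
 * Step 2, patched behaviour: convert the promise while reserving.
 * ram_bytes is the file content length, num_bytes the extent length
 * (the two differ on the compress path).
 */
static void add_reserved_bytes_new(struct space_info *s, uint64_t ram_bytes,
                                   uint64_t num_bytes)
{
    assert(s->bytes_may_use >= ram_bytes);
    s->bytes_may_use -= ram_bytes;
    s->bytes_reserved += num_bytes;
}
```

With total_bytes at 100MB, promising and then reserving 64MB the old way leaves 128MB counted and every further request refused; the patched order leaves 64MB counted, so a 32MB request still fits.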

    EXTENT_DO_ACCOUNTING is usually used on the error path. In
    run_delalloc_nocow(), for an inode marked NODATACOW or an extent marked
    PREALLOC, we also need to update bytes_may_use but cannot pass
    EXTENT_DO_ACCOUNTING, because it also clears the metadata reservation. So
    introduce an EXTENT_CLEAR_DATA_RESV flag that tells btrfs_clear_bit_hook()
    to update btrfs_space_info's bytes_may_use.

    Meanwhile, __btrfs_prealloc_file_range() now calls
    btrfs_free_reserved_data_space() internally on both the successful and the
    failed path, so btrfs_prealloc_file_range()'s callers no longer need to call
    btrfs_free_reserved_data_space().

    Signed-off-by: Wang Xiaoguang
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Wang Xiaoguang
     

06 Aug, 2016

1 commit


01 Aug, 2016

1 commit

  • When we start an fsync we start ordered extents for all delalloc ranges.
    However before attempting to log the inode, we only wait for those ordered
    extents if we are not doing a full sync (bit BTRFS_INODE_NEEDS_FULL_SYNC
    is set in the inode's flags). This means that if an ordered extent
    completes with an IO error before we check if we can skip logging the
    inode, we will not catch and report the IO error to user space. This is
    because on an IO error, when the ordered extent completes we do not
    update the inode, so if the inode was not previously updated by the
    current transaction we end up not logging it through calls to fsync and
    therefore not check its mapping flags for the presence of IO errors.

    Fix this by checking for errors in the flags of the inode's mapping when
    we notice we can skip logging the inode.
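
A toy model of the fix, assuming a mapping that records writeback errors as flag bits. All names here are hypothetical and are not kernel identifiers; the point is only that the "skip logging" branch must also look at the mapping's error flags.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; not kernel identifiers. */
enum { MAPPING_EIO = 1 << 0, MAPPING_ENOSPC = 1 << 1 };

struct mapping { unsigned int flags; };

struct inode_model {
    struct mapping mapping;
    bool updated_in_trans;   /* changed in the current transaction? */
};

/* Report and clear any writeback error recorded on the mapping. */
static int check_mapping_errors(struct mapping *m)
{
    int ret = 0;

    if (m->flags & MAPPING_ENOSPC)
        ret = -ENOSPC;
    if (m->flags & MAPPING_EIO)
        ret = -EIO;
    m->flags &= ~(unsigned int)(MAPPING_EIO | MAPPING_ENOSPC);
    return ret;
}

/* Model of the fsync decision: logging is skipped for unchanged inodes. */
static int fsync_model(struct inode_model *inode, bool check_on_skip)
{
    assert(inode);
    if (!inode->updated_in_trans) {
        /* Before the fix this branch returned 0 unconditionally. */
        return check_on_skip ? check_mapping_errors(&inode->mapping) : 0;
    }
    /* Logging path: errors are noticed while the inode is logged. */
    return check_mapping_errors(&inode->mapping);
}
```

Calling fsync_model() with check_on_skip set to false on an unchanged inode whose mapping carries MAPPING_EIO returns 0 and hides the error; with it set to true the call returns -EIO.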

    This caused sporadic failures in the test generic/331 (which explicitly
    tests for IO errors during an fsync call).

    Signed-off-by: Filipe Manana
    Reviewed-by: Liu Bo

    Filipe Manana
     

26 Jul, 2016

3 commits

  • __btrfs_abort_transaction doesn't use its root parameter except to
    obtain an fs_info pointer. We can obtain that from trans->root->fs_info
    for now and from trans->fs_info in a later patch.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
     
  • btrfs_test_opt and friends only use the root pointer to access
    the fs_info. Let's pass the fs_info directly in preparation to
    eliminate similar patterns all over btrfs.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
     
  • BTRFS is using a variety of slab caches to satisfy internal needs.
    Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT,
    meaning allocations from the caches are going to be accounted as
    SReclaimable. At the same time btrfs is not registering any shrinkers
    whatsoever, so memory held by those slabs can never be shrunk. This
    means those caches are not in fact reclaimable.

    To fix this remove the SLAB_RECLAIM_ACCOUNT on all caches apart from the
    inode cache, since this one is being freed by the generic VFS super_block
    shrinker. Also set the transaction related caches as SLAB_TEMPORARY,
    to better document the lifetime of the objects (it just translates
    to SLAB_RECLAIM_ACCOUNT).

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     

21 Jul, 2016

1 commit

  • Commit 56244ef151c3cd11 was almost but not quite enough to fix the
    reservation math after btrfs_copy_from_user returned partial copies.

    Some users are still seeing warnings in btrfs_destroy_inode, and with a
    long enough test run I'm able to trigger them as well.

    This patch fixes the accounting math again, bringing it much closer to
    the way it was before the sectorsize conversion Chandan did. The
    problem is accounting for the offset into the page/sector when we do a
    partial copy. This one just uses the dirty_sectors variable which
    should already be updated properly.
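
The arithmetic at issue can be sketched as follows. This is a simplified model under assumptions: account_copy() and its struct are hypothetical names, and the real write path tracks more state. It shows why the offset into the first sector has to be included when working out how much of the reservation a short copy actually dirtied.

```c
#include <assert.h>
#include <stdint.h>

struct copy_accounting {
    uint64_t reserved_bytes;   /* sector-aligned space reserved up front */
    uint64_t dirty_bytes;      /* sector-aligned space actually dirtied */
    uint64_t release_bytes;    /* reservation to hand back */
};

static uint64_t round_up_to(uint64_t x, uint64_t align)
{
    return (x + align - 1) / align * align;
}

/*
 * pos: file offset of the write; write_bytes: bytes requested;
 * copied: bytes the copy actually transferred (may be short, or 0).
 */
static struct copy_accounting account_copy(uint64_t pos, uint64_t write_bytes,
                                           uint64_t copied, uint64_t sectorsize)
{
    struct copy_accounting a;
    uint64_t sector_offset;

    assert(sectorsize > 0 && copied <= write_bytes);
    sector_offset = pos % sectorsize;

    a.reserved_bytes = round_up_to(write_bytes + sector_offset, sectorsize);
    /* The offset into the first sector counts towards what was dirtied. */
    a.dirty_bytes = copied ? round_up_to(copied + sector_offset, sectorsize) : 0;
    a.release_bytes = a.reserved_bytes - a.dirty_bytes;
    return a;
}
```

For a write of 8192 bytes at offset 4000 with 4096-byte sectors, 12288 bytes are reserved. If only 100 bytes are copied they straddle a sector boundary, so 8192 bytes are dirty and 4096 are released; ignoring the offset would release 8192 and leave the accounting short.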

    Signed-off-by: Chris Mason
    cc: stable@vger.kernel.org # v4.6+

    Chris Mason
     

08 Jul, 2016

1 commit

  • So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
    Not only this but it unconditionally changes the size of the block_rsv. This
    isn't a bug strictly speaking, but it makes truncate block rsv's look funny
    because every time we migrate bytes over its size grows, even though we only
    want it to be a specific size. So collapse this into one function that takes an
    update_size argument and make truncate and evict not update the size for
    consistency's sake. Thanks,
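
A minimal model of the collapsed helper, assuming a reserve is just a target size plus the bytes it currently holds. The struct and function are simplified, hypothetical stand-ins for the btrfs ones.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified: only the two counters discussed in the message. */
struct block_rsv {
    uint64_t size;       /* how much the reserve is meant to hold */
    uint64_t reserved;   /* how much it currently holds */
};

/*
 * Move num_bytes from src to dst. The destination's target size grows
 * only when the caller asks for it via update_size.
 */
static int block_rsv_migrate(struct block_rsv *src, struct block_rsv *dst,
                             uint64_t num_bytes, bool update_size)
{
    assert(src && dst);
    if (src->reserved < num_bytes)
        return -ENOSPC;
    src->reserved -= num_bytes;
    dst->reserved += num_bytes;
    if (update_size)
        dst->size += num_bytes;
    return 0;
}
```

A truncate-style caller passes false and keeps a fixed-size reserve no matter how often bytes are migrated in; other callers pass true and get the previous growing behaviour.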

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     

25 Jun, 2016

1 commit

  • Pull btrfs fixes from Chris Mason:
    "I have a two part pull this time because one of the patches Dave
    Sterba collected needed to be against v4.7-rc2 or higher (we used
    rc4). I try to make my for-linus-xx branch testable on top of the
    last major so we can hand fixes to people on the list more easily, so
    I've split this pull in two.

    This first part has some fixes and two performance improvements that
    we've been testing for some time.

    Josef's two performance fixes are most notable. The transid tracking
    patch makes a big improvement on pretty much every workload"

    * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: Force stripesize to the value of sectorsize
    btrfs: fix disk_i_size update bug when fallocate() fails
    Btrfs: fix error handling in map_private_extent_buffer
    Btrfs: fix error return code in btrfs_init_test_fs()
    Btrfs: don't do nocow check unless we have to
    btrfs: fix deadlock in delayed_ref_async_start
    Btrfs: track transid for delayed ref flushing

    Linus Torvalds
     

23 Jun, 2016

1 commit

  • Before we write into prealloc/nocow space we have to make sure that there are no
    references to the extents we are writing into, which means checking the extent
    tree and csum tree in the case of nocow. So we don't want to do the nocow dance
    unless we can't reserve data space, since it's a serious drag on performance.
    With the following sequence

    fallocate -l10737418240 /mnt/btrfs-test/file
    cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link
    fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \
    --end_fsync=1

    we get the worst case scenario where we have to fall back on to doing the check
    anyway.

    Without this patch
    lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
    write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec

    With this patch
    lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
    write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec

    Throughput is about 1.7 times higher, runtime drops by about 40%, and the
    average latency is roughly halved. Thanks,

    Signed-off-by: Josef Bacik
    [ PAGE_CACHE_ removal related fixups ]
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Josef Bacik
     

28 May, 2016

1 commit

  • Pull btrfs cleanups and fixes from Chris Mason:
    "We have another round of fixes and a few cleanups.

    I have a fix for short returns from btrfs_copy_from_user, which
    finally nails down a very hard to find regression we added in v4.6.

    Dave is pushing around gfp parameters, mostly to cleanup internal apis
    and make it a little more consistent.

    The rest are smaller fixes, and one speelling fixup patch"

    * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (22 commits)
    Btrfs: fix handling of faults from btrfs_copy_from_user
    btrfs: fix string and comment grammatical issues and typos
    btrfs: scrub: Set bbio to NULL before calling btrfs_map_block
    Btrfs: fix unexpected return value of fiemap
    Btrfs: free sys_array eb as soon as possible
    btrfs: sink gfp parameter to convert_extent_bit
    btrfs: make state preallocation more speculative in __set_extent_bit
    btrfs: untangle gotos a bit in convert_extent_bit
    btrfs: untangle gotos a bit in __clear_extent_bit
    btrfs: untangle gotos a bit in __set_extent_bit
    btrfs: sink gfp parameter to set_record_extent_bits
    btrfs: sink gfp parameter to set_extent_new
    btrfs: sink gfp parameter to set_extent_defrag
    btrfs: sink gfp parameter to set_extent_delalloc
    btrfs: sink gfp parameter to clear_extent_dirty
    btrfs: sink gfp parameter to clear_record_extent_bits
    btrfs: sink gfp parameter to clear_extent_bits
    btrfs: sink gfp parameter to set_extent_bits
    btrfs: make find_workspace warn if there are no workspaces
    btrfs: make find_workspace always succeed
    ...

    Linus Torvalds
     

27 May, 2016

1 commit

  • When btrfs_copy_from_user isn't able to copy all of the pages, we need
    to adjust our accounting to reflect the work that was actually done.

    Commit 2e78c927d79 changed around the decisions a little and we ended up
    skipping the accounting adjustments some of the time. This commit makes
    sure that when we don't copy anything at all, we still hop into
    the adjustments, and switches to release_bytes instead of write_bytes,
    since write_bytes isn't aligned.

    The accounting errors led to warnings during btrfs_destroy_inode:

    [ 70.847532] WARNING: CPU: 10 PID: 514 at fs/btrfs/inode.c:9350 btrfs_destroy_inode+0x2b3/0x2c0
    [ 70.847536] Modules linked in: i2c_piix4 virtio_net i2c_core input_leds button led_class serio_raw acpi_cpufreq sch_fq_codel autofs4 virtio_blk
    [ 70.847538] CPU: 10 PID: 514 Comm: umount Tainted: G W 4.6.0-rc6_00062_g2997da1-dirty #23
    [ 70.847539] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
    [ 70.847542] 0000000000000000 ffff880ff5cafab8 ffffffff8149d5e9 0000000000000202
    [ 70.847543] 0000000000000000 0000000000000000 0000000000000000 ffff880ff5cafb08
    [ 70.847547] ffffffff8107bdfd ffff880ff5cafaf8 000024868120013d ffff880ff5cafb28
    [ 70.847547] Call Trace:
    [ 70.847550] [] dump_stack+0x51/0x78
    [ 70.847551] [] __warn+0xfd/0x120
    [ 70.847553] [] warn_slowpath_null+0x1d/0x20
    [ 70.847555] [] btrfs_destroy_inode+0x2b3/0x2c0
    [ 70.847556] [] ? __destroy_inode+0x71/0x140
    [ 70.847558] [] destroy_inode+0x43/0x70
    [ 70.847559] [] ? wake_up_bit+0x2f/0x40
    [ 70.847560] [] evict+0x148/0x1d0
    [ 70.847562] [] ? start_transaction+0x3de/0x460
    [ 70.847564] [] dispose_list+0x59/0x80
    [ 70.847565] [] evict_inodes+0x180/0x190
    [ 70.847566] [] ? __sync_filesystem+0x3f/0x50
    [ 70.847568] [] generic_shutdown_super+0x48/0x100
    [ 70.847569] [] ? woken_wake_function+0x20/0x20
    [ 70.847571] [] kill_anon_super+0x16/0x30
    [ 70.847573] [] btrfs_kill_super+0x1e/0x130
    [ 70.847574] [] deactivate_locked_super+0x4e/0x90
    [ 70.847576] [] deactivate_super+0x51/0x70
    [ 70.847577] [] cleanup_mnt+0x3f/0x80
    [ 70.847579] [] __cleanup_mnt+0x12/0x20
    [ 70.847581] [] task_work_run+0x68/0xa0
    [ 70.847582] [] exit_to_usermode_loop+0xd6/0xe0
    [ 70.847583] [] do_syscall_64+0xbd/0x170
    [ 70.847586] [] entry_SYSCALL64_slow_path+0x25/0x25

    This is the test program I used to force short returns from
    btrfs_copy_from_user

    #define _GNU_SOURCE        /* for sync_file_range() */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /*
     * The message does not show how BUFSIZE, ROUNDS and MAXSIZE were
     * defined; the values below are assumptions.
     */
    #define BUFSIZE (1024 * 1024)
    #define ROUNDS  128
    #define MAXSIZE (1024UL * 1024 * 1024)

    /* Keep discarding the start of the buffer so that copies from it fault. */
    void *dontneed(void *arg)
    {
        char *p = arg;
        int ret;

        while(1) {
            ret = madvise(p, BUFSIZE/4, MADV_DONTNEED);
            if (ret) {
                perror("madvise");
                exit(1);
            }
        }
    }

    int main(int ac, char **av) {
        int ret;
        int fd;
        char *filename;
        unsigned long offset;
        char *buf;
        int i;
        pthread_t tid;

        if (ac != 2) {
            fprintf(stderr, "usage: dammitdave filename\n");
            exit(1);
        }

        buf = mmap(NULL, BUFSIZE, PROT_READ|PROT_WRITE,
                   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }
        memset(buf, 'a', BUFSIZE);
        filename = av[1];

        ret = pthread_create(&tid, NULL, dontneed, buf);
        if (ret) {
            fprintf(stderr, "error %d from pthread_create\n", ret);
            exit(1);
        }

        ret = pthread_detach(tid);
        if (ret) {
            fprintf(stderr, "pthread detach failed %d\n", ret);
            exit(1);
        }

        /* Write at random offsets while the other thread drops the pages. */
        while (1) {
            fd = open(filename, O_RDWR | O_CREAT, 0600);
            if (fd < 0) {
                perror("open");
                exit(1);
            }

            for (i = 0; i < ROUNDS; i++) {
                int this_write = BUFSIZE;

                offset = rand() % MAXSIZE;
                ret = pwrite(fd, buf, this_write, offset);
                if (ret < 0) {
                    perror("pwrite");
                    exit(1);
                } else if (ret != this_write) {
                    fprintf(stderr, "short write to %s offset %lu ret %d\n",
                            filename, offset, ret);
                    exit(1);
                }
                if (i == ROUNDS - 1) {
                    ret = sync_file_range(fd, offset, 4096,
                                          SYNC_FILE_RANGE_WRITE);
                    if (ret < 0) {
                        perror("sync_file_range");
                        exit(1);
                    }
                }
            }
            ret = ftruncate(fd, 0);
            if (ret < 0) {
                perror("ftruncate");
                exit(1);
            }
            ret = close(fd);
            if (ret) {
                perror("close");
                exit(1);
            }
            ret = unlink(filename);
            if (ret) {
                perror("unlink");
                exit(1);
            }

        }
        return 0;
    }

    Signed-off-by: Chris Mason
    Reported-by: Dave Jones
    Fixes: 2e78c927d79333f299a8ac81c2fd2952caeef335
    cc: stable@vger.kernel.org # v4.6
    Signed-off-by: Chris Mason

    Chris Mason
     

26 May, 2016

2 commits


22 May, 2016

1 commit

  • Pull btrfs updates from Chris Mason:
    "This has our merge window series of cleanups and fixes. These target
    a wide range of issues, but do include some important fixes for
    qgroups, O_DIRECT, and fsync handling. Jeff Mahoney moved around a
    few definitions to make them easier for userland to consume.

    Also whiteout support is included now that issues with overlayfs have
    been cleared up.

    I have one more fix pending for page faults during btrfs_copy_from_user,
    but I wanted to get this bulk out the door first"

    * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (90 commits)
    btrfs: fix memory leak during RAID 5/6 device replacement
    Btrfs: add semaphore to synchronize direct IO writes with fsync
    Btrfs: fix race between block group relocation and nocow writes
    Btrfs: fix race between fsync and direct IO writes for prealloc extents
    Btrfs: fix number of transaction units for renames with whiteout
    Btrfs: pin logs earlier when doing a rename exchange operation
    Btrfs: unpin logs if rename exchange operation fails
    Btrfs: fix inode leak on failure to setup whiteout inode in rename
    btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT
    Btrfs: pin log earlier when renaming
    Btrfs: unpin log if rename operation fails
    Btrfs: don't do unnecessary delalloc flushes when relocating
    Btrfs: don't wait for unrelated IO to finish before relocation
    Btrfs: fix empty symlink after creating symlink and fsync parent dir
    Btrfs: fix for incorrect directory entries after fsync log replay
    btrfs: build fixup for qgroup_account_snapshot
    btrfs: qgroup: Fix qgroup accounting when creating snapshot
    Btrfs: fix fspath error deallocation
    btrfs: make find_workspace warn if there are no workspaces
    btrfs: make find_workspace always succeed
    ...

    Linus Torvalds
     

02 May, 2016

3 commits


28 Apr, 2016

2 commits


10 Apr, 2016

1 commit

  • Pull btrfs fixes from Chris Mason:
    "These are bug fixes, including a really old fsync bug, and a few trace
    points to help us track down problems in the quota code"

    * 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: fix file/data loss caused by fsync after rename and new inode
    btrfs: Reset IO error counters before start of device replacing
    btrfs: Add qgroup tracing
    Btrfs: don't use src fd for printk
    btrfs: fallback to vmalloc in btrfs_compare_tree
    btrfs: handle non-fatal errors in btrfs_qgroup_inherit()
    btrfs: Output more info for enospc_debug mount option
    Btrfs: fix invalid reference in replace_path
    Btrfs: Improve FL_KEEP_SIZE handling in fallocate

    Linus Torvalds