16 Aug, 2014

1 commit

  • Pull btrfs updates from Chris Mason:
    "These are all fixes I'd like to get out to a broader audience.

    The biggest of the bunch is Mark's quota fix, which is also in the
    SUSE kernel, and makes our subvolume quotas dramatically more
    accurate.

    I've been running xfstests with these against your current git
    overnight, but I'm queueing up longer tests as well"

    * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    btrfs: disable strict file flushes for renames and truncates
    Btrfs: fix csum tree corruption, duplicate and outdated checksums
    Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch
    Btrfs: fix compressed write corruption on enospc
    btrfs: correctly handle return from ulist_add
    btrfs: qgroup: account shared subtrees during snapshot delete
    Btrfs: read lock extent buffer while walking backrefs
    Btrfs: __btrfs_mod_ref should always use no_quota
    btrfs: adjust statfs calculations according to raid profiles

    Linus Torvalds
     

15 Aug, 2014

9 commits

  • Truncates and renames are often used to replace old versions of a file
    with new versions. Applications often expect this to be an atomic
    replacement, even if they haven't done anything to make sure the new
    version is fully on disk.

    Btrfs has strict flushing in place to make sure that renaming over an
    old file with a new file will fully flush out the new file before
    allowing the transaction commit with the rename to complete.

    This ordering means the commit code needs to be able to lock file pages,
    and there are a few paths in the filesystem where we will try to end a
    transaction with the page lock held. It's rare, but these things can
    deadlock.

    This patch removes the ordered flushes and switches to a best effort
    filemap_flush like ext4 uses. It's not perfect, but it should fix the
    deadlocks.
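
    For applications that want the replacement to be durable without
    depending on rename-time flushing, the usual pattern is to fsync the new
    file before renaming it. A minimal userspace sketch in Python, not part of
    this patch; the helper name replace_file_durably is hypothetical:

```python
import os
import tempfile

def replace_file_durably(path, data):
    """Replace `path` with `data` (bytes) without relying on rename-time flushing."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename stays
    # within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # new contents reach disk before the rename
        os.replace(tmp_path, path)  # atomically swap in the new version
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself is durable (POSIX systems).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

    With this ordering the application no longer depends on how eagerly the
    filesystem flushes data at rename time.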

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Under rare circumstances we can end up leaving 2 versions of a checksum
    for the same file extent range.

    The reason for this is that after calling btrfs_next_leaf we process
    slot 0 of the leaf it returns, instead of processing the slot set in
    path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
    btrfs_next_leaf() releases the path and before it searches for the next
    leaf, another task might cause a split of the next leaf, which migrates
    some of its keys to the leaf we were processing before calling
    btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
    same leaf but with path->slots[0] having a slot number corresponding
    to the first new key it got, that is, a slot number that didn't exist
    before calling btrfs_next_leaf(), as the leaf now has more keys than
    it had before. So we must really process the returned leaf starting at
    path->slots[0] always, as it isn't always 0, and the key at slot 0 can
    have an offset much lower than our search offset/bytenr.

    For example, consider the following scenario, where we have:

    sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
    four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472

    Leaf N:

    slot = 0                            slot = btrfs_header_nritems() - 1
    |-------------------------------------------------------------------|
    | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
    |-------------------------------------------------------------------|

    Leaf N + 1:

    slot = 0                             slot = btrfs_header_nritems() - 1
    |--------------------------------------------------------------------|
    | [(CSUM CSUM 40161280), size 32] ... [(CSUM CSUM 40615936), size 8] |
    |--------------------------------------------------------------------|

    Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
    find the next highest key, which releases the current path and then searches
    for that next key. However after releasing the path and before finding that
    next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
    to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
    btrfs_next_leaf() will return us a path again with leaf N but with the slot
    pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
    is then:

    slot = 0                          slot = btrfs_header_nritems() - 2 slot = btrfs_header_nritems() - 1
    |----------------------------------------------------------------------------------------------------|
    | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] [(CSUM CSUM 40161280), size 32] |
    |----------------------------------------------------------------------------------------------------|

    And incorrectly using slot 0 makes us set next_offset to 39239680 and we jump
    into the "insert:" label, which will set tmp to:

    tmp = min((sums->len - total_bytes) >> blocksize_bits,
    (next_offset - file_key.offset) >> blocksize_bits) =
    min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
    min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4

    and

    ins_size = csum_size * tmp = 4 * 4 = 16 bytes.

    In other words, we insert a new csum item in the tree with key
    (CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
    for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
    because the item with key (CSUM CSUM 40161280) (the one that was moved from
    leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
    bytes of our data and won't get those old checksums removed.

    So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
    and breaks the logical rule:

    Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover

    An obvious bad effect of this is that a subsequent csum tree lookup to get
    the checksum of any of the blocks with logical offset of 40161280, 40165376
    or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.
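
    The arithmetic above can be replayed in a few lines of Python. This is
    only a model of the quoted calculation, not the kernel code: the function
    names are hypothetical, and the "correct" case assumes next_offset is
    taken from the key at path->slots[0], i.e. 40161280.

```python
U64_MASK = (1 << 64) - 1
BLOCKSIZE_BITS = 12   # 4 KiB blocks, as in the example
CSUM_SIZE = 4         # bytes per checksum, as in the example

def csum_insert_size(sums_bytenr, sums_len, next_offset, total_bytes=0):
    """Return (tmp, ins_size) following the arithmetic quoted above."""
    remaining = (sums_len - total_bytes) >> BLOCKSIZE_BITS
    # The kernel subtracts in u64, so a next_offset below the search
    # offset wraps around to a huge value instead of going negative.
    gap = ((next_offset - sums_bytenr) & U64_MASK) >> BLOCKSIZE_BITS
    tmp = min(remaining, gap)
    return tmp, CSUM_SIZE * tmp

def violates_ordering(key_n_offset, blocks_covered, key_n1_offset):
    """True if Key_N+1.offset < Key_N.offset + length covered by Key_N."""
    return key_n1_offset < key_n_offset + (blocks_covered << BLOCKSIZE_BITS)
```

    Using slot 0 (next_offset = 39239680) yields tmp = 4 and a 16 byte item
    that overlaps the existing key at 40161280; using 40161280 yields tmp = 1
    and a 4 byte item that stops right before it.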

    Cc: stable@vger.kernel.org
    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • We've got bug reports that btrfs crashes when quota is enabled on a
    32bit kernel, typically with an Oops like the one below:
    BUG: unable to handle kernel NULL pointer dereference at 00000004
    IP: [] find_parent_nodes+0x360/0x1380 [btrfs]
    *pde = 00000000
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S W 3.15.2-1.gd43d97e-default #1
    Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs]
    task: f1478130 ti: f147c000 task.ti: f147c000
    EIP: 0060:[] EFLAGS: 00010213 CPU: 0
    EIP is at find_parent_nodes+0x360/0x1380 [btrfs]
    EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000
    ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38
    DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690
    Stack:
    00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050
    00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000
    00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000
    Call Trace:
    [] __btrfs_find_all_roots+0x9d/0xf0 [btrfs]
    [] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs]
    [] normal_work_helper+0xc8/0x270 [btrfs]
    [] process_one_work+0x11b/0x390
    [] worker_thread+0x101/0x340
    [] kthread+0x9b/0xb0
    [] ret_from_kernel_thread+0x21/0x30
    [] kthread_create_on_node+0x110/0x110

    This indicates a NULL corruption in the prefs_delayed list. Further
    investigation and bisection showed that the call to ulist_add_merge()
    causes the corruption.

    ulist_add_merge() takes a u64 as aux and writes a 64bit value into
    old_aux. The callers of this function in backref.c, however, pass a
    pointer to a pointer as old_aux. That is, the function writes a
    64bit value over a 32bit pointer. This caused a NULL in the adjacent
    variable, in this case, prefs_delayed.

    Here is a quick attempt to band-aid over this: a new function,
    ulist_add_merge_ptr(), is introduced to properly pass/store a pointer
    value instead of a u64. There are still ugly void ** casts remaining
    in the callers because void ** cannot be taken implicitly. But it's
    safer than an explicit cast to u64, anyway.
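
    The effect of a 64bit store landing on a 32bit slot can be modelled with
    a byte buffer. The sketch below is a Python illustration, not kernel code:
    the two-slot layout, the little-endian byte order and the pointer values
    are assumptions made for the example.

```python
import struct

def store_old_aux(stack, offset, value, width):
    """Store `value` at `offset` using a 4- or 8-byte little-endian write."""
    fmt = "<Q" if width == 8 else "<I"
    struct.pack_into(fmt, stack, offset, value)

def simulate(width, aux=0xD3059000, neighbour=0xF147DDA4):
    # Two adjacent 32-bit pointer slots: [old_aux target][neighbouring variable].
    stack = bytearray(8)
    struct.pack_into("<II", stack, 0, 0, neighbour)
    store_old_aux(stack, 0, aux, width)
    # Returns (old_aux target, neighbouring variable) after the store.
    return struct.unpack_from("<II", stack, 0)
```

    With width 8 the upper half of the store zeroes the neighbouring slot,
    which matches the NULL seen next to old_aux; with width 4 the neighbour
    is left alone.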

    Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046
    Cc: [v3.11+]
    Signed-off-by: Takashi Iwai
    Signed-off-by: Chris Mason

    Takashi Iwai
     
  • When we fail to allocate space for the whole compressed extent, we
    fall back to uncompressed IO, but we forgot to redirty the pages
    that belong to this compressed extent. These 'clean' pages simply
    skip the 'submit' part and go to endio directly, so we end up with
    data corruption because nothing is written.

    Signed-off-by: Liu Bo
    Tested-By: Martin Steigerwald
    Signed-off-by: Chris Mason

    Liu Bo
     
  • ulist_add() can return '1' on success, which qgroup_subtree_accounting()
    doesn't take into account. As a result, that value can be bubbled up to
    callers, causing an error to be printed. Fix this by only returning the
    value of ulist_add() when it indicates an error.
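
    The pattern is small enough to sketch in Python. Both functions are
    hypothetical stand-ins rather than the kernel implementation; the
    0 / 1 / negative return convention of the model is assumed from the
    description above, and passing None is just a contrived way to force
    the error path.

```python
import errno

def ulist_add_model(ulist, val):
    """Stand-in: 1 if newly added, 0 if already present, negative errno on failure."""
    if val is None:
        return -errno.ENOMEM
    if val in ulist:
        return 0
    ulist.add(val)
    return 1

def account_extent(ulist, val):
    """Propagate only errors; a positive 'added' result is still success."""
    ret = ulist_add_model(ulist, val)
    if ret < 0:
        return ret
    return 0
```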

    Signed-off-by: Mark Fasheh
    Signed-off-by: Chris Mason

    Mark Fasheh
     
  • During its tree walk, btrfs_drop_snapshot() will skip any shared
    subtrees it encounters. This is incorrect when we have qgroups
    turned on as those subtrees need to have their contents
    accounted. In particular, the case we're concerned with is when
    removing our snapshot root leaves the subtree with only one root
    reference.

    In those cases we need to find the last remaining root and add
    each extent in the subtree to the corresponding qgroup exclusive
    counts.

    This patch implements the shared subtree walk and a new qgroup
    operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
    this type is encountered during qgroup accounting, we search for
    any root references to that extent and in the case that we find
    only one reference left, we go ahead and do the math on its
    exclusive counts.

    Signed-off-by: Mark Fasheh
    Reviewed-by: Josef Bacik
    Signed-off-by: Chris Mason

    Mark Fasheh
     
  • Before processing the extent buffer, acquire a read lock on it, so
    that we're safe against concurrent updates on the extent buffer.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • Previously I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
    understand how snapshot delete was using it and assumed that we needed the
    quota operations there. With Mark's work this has turned out not to be the
    case: we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
    argument and make __btrfs_mod_ref call its process function with no_quota
    always set. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • This has been discussed in thread:
    http://thread.gmane.org/gmane.comp.file-systems.btrfs/32528

    and this patch implements this proposal:
    http://thread.gmane.org/gmane.comp.file-systems.btrfs/32536

    Works fine for "clean" raid profiles where the raid factor correction
    does the right job. Otherwise it's pessimistic and may show low space
    although there's still some left.

    The df numbers are slightly wrong in case of mixed block groups, but this
    is not a major usecase and can be addressed later.

    The RAID56 numbers are wrong almost the same way as before and will be
    addressed separately.
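
    As a rough illustration of what a "raid factor correction" means for the
    reported free space, here is a Python sketch. The factor table is an
    assumption of this example (two-copy profiles halve the usable space);
    the message itself does not spell out the values, and RAID56 is left out
    on purpose.

```python
# Hypothetical factors: profiles that keep two copies of the data
# halve the space that df should report as usable.
RAID_FACTOR = {"single": 1, "raid0": 1, "dup": 2, "raid1": 2, "raid10": 2}

def usable_free_bytes(raw_free_bytes, profile):
    """Estimate the free space to report for a 'clean' profile."""
    return raw_free_bytes // RAID_FACTOR[profile]
```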

    CC: Hugo Mills
    CC: cwillu
    CC: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     

12 Aug, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "Stuff in here:

    - acct.c fixes and general rework of mnt_pin mechanism. That allows
    to go for delayed-mntput stuff, which will permit mntput() on deep
    stack without worrying about stack overflows - fs shutdown will
    happen on shallow stack. IOW, we can do Eric's umount-on-rmdir
    series without introducing tons of stack overflows on new mntput()
    call chains it introduces.
    - Bruce's d_splice_alias() patches
    - more Miklos' rename() stuff.
    - a couple of regression fixes (stable fodder, in the end of branch)
    and a fix for API idiocy in iov_iter.c.

    There definitely will be another pile, maybe even two. I'd like to
    get Eric's series in this time, but even if we miss it, it'll go right
    in the beginning of for-next in the next cycle - the tricky part of
    prereqs is in this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    fix copy_tree() regression
    __generic_file_write_iter(): fix handling of sync error after DIO
    switch iov_iter_get_pages() to passing maximal number of pages
    fs: mark __d_obtain_alias static
    dcache: d_splice_alias should detect loops
    exportfs: update Exporting documentation
    dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
    dcache: remove unused d_find_alias parameter
    dcache: d_obtain_alias callers don't all want DISCONNECTED
    dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
    dcache: d_splice_alias mustn't create directory aliases
    dcache: close d_move race in d_splice_alias
    dcache: move d_splice_alias
    namei: trivial fix to vfs_rename_dir comment
    VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
    cifs: support RENAME_NOREPLACE
    hostfs: support rename flags
    shmem: support RENAME_EXCHANGE
    shmem: support RENAME_NOREPLACE
    btrfs: add RENAME_NOREPLACE
    ...

    Linus Torvalds
     

08 Aug, 2014

2 commits

  • There are a few d_obtain_alias callers that are using it to get the
    root of a filesystem which may already have an alias somewhere else.

    This is not the same as the filehandle-lookup case, and none of them
    actually need DCACHE_DISCONNECTED set.

    It isn't really a serious problem, but it would really be clearer if we
    reserved DCACHE_DISCONNECTED for those cases where it's actually needed.

    In the btrfs case this was causing a spurious printk from
    nfsd/nfsfh.c:fh_verify when it found an unexpected DCACHE_DISCONNECTED
    dentry. Josef worked around this by unsetting DCACHE_DISCONNECTED
    manually in 3a0dfa6a12e "Btrfs: unset DCACHE_DISCONNECTED when mounting
    default subvol", and this replaces that workaround.

    Cc: Josef Bacik
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • RENAME_NOREPLACE is trivial to implement for most filesystems: switch over
    to ->rename2() and check for the supported flags. The rest is done by the
    VFS.
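
    The filesystem side of that check can be modelled as below. This is a
    Python sketch, not the btrfs code: rename2_model is a hypothetical name,
    and the flag values are the ones from the Linux uapi headers.

```python
import errno

RENAME_NOREPLACE = 1 << 0
RENAME_EXCHANGE = 1 << 1

def rename2_model(flags, supported=RENAME_NOREPLACE):
    """Filesystem side of ->rename2(): reject flags it does not support."""
    if flags & ~supported:
        return -errno.EINVAL
    # The VFS takes care of the RENAME_NOREPLACE semantics, so the
    # filesystem can proceed with its ordinary rename here.
    return 0
```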

    Signed-off-by: Miklos Szeredi
    Cc: Chris Mason
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Miklos Szeredi
     

28 Jul, 2014

1 commit


21 Jul, 2014

1 commit


20 Jul, 2014

2 commits

  • commit 99994cd btrfs: dev delete should remove sysfs entry
    added a btrfs_kobj_rm_device, which dereferences device->bdev...
    right after we check whether device->bdev might be NULL.

    I don't honestly know if it's possible to have a NULL device->bdev
    here, but assuming that it is (given the test), we need to move
    the kobject removal to be under that test.

    (Coverity spotted this)

    Signed-off-by: Eric Sandeen
    Signed-off-by: Chris Mason

    Eric Sandeen
     
  • xfstests generic/127 detected this problem.

    With commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8, fsync now only flushes
    data within the passed range. This is the cause of the above problem:
    btrfs's fsync has a stage called 'sync log' which waits for all the
    ordered extents it has recorded to finish.

    In xfstests/generic/127, mixed operations such as truncate, fallocate,
    punch hole, and mapwrite leave us with some pre-allocated extents, and
    mapwrite will mmap and then msync. I found that msync waits for quite a
    long time (about 20s in my case). Thanks to ftrace, it turns out that the
    previous fallocate calls 'btrfs_wait_ordered_range()' to flush dirty pages,
    but as the range of dirty pages may be larger than the range
    'btrfs_wait_ordered_range()' asks for, some ordered extents can be created
    without their corresponding pages being flushed. They are then left in
    memory until we fsync, which runs into the 'sync log' stage, and fsync will
    just wait for the system writeback thread to flush those pages and finish
    the ordered extents, so the latency is inevitable.

    This adds a flush similar to btrfs_start_ordered_extent() in
    btrfs_wait_logged_extents() to fix that.

    Reviewed-by: Miao Xie
    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     

16 Jul, 2014

1 commit

  • The current "wait_on_bit" interface requires an 'action'
    function to be provided which does the actual waiting.
    There are over 20 such functions, many of them identical.
    Most cases can be satisfied by one of just two functions, one
    which uses io_schedule() and one which just uses schedule().

    So:
    Rename wait_on_bit and wait_on_bit_lock to
    wait_on_bit_action and wait_on_bit_lock_action
    to make it explicit that they need an action function.

    Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
    which are *not* given an action function but implicitly use
    a standard one.
    The decision to error-out if a signal is pending is now made
    based on the 'mode' argument rather than being encoded in the action
    function.

    All instances of the old wait_on_bit and wait_on_bit_lock which
    can use the new version have been changed accordingly and their
    action functions have been discarded.
    wait_on_bit{,_lock} does not return any specific error code in the
    event of a signal so the caller must check for non-zero and
    supply their own error code as appropriate.

    The wait_on_bit() call in __fscache_wait_on_invalidate() was
    ambiguous as it specified TASK_UNINTERRUPTIBLE but used
    fscache_wait_bit_interruptible as an action function.
    David Howells confirms this should be uniformly
    "uninterruptible"

    The main remaining user of wait_on_bit{,_lock}_action is NFS
    which needs to use a freezer-aware schedule() call.

    A comment in fs/gfs2/glock.c notes that having multiple 'action'
    functions is useful as they display differently in the 'wchan'
    field of 'ps' (and in /proc/$PID/wchan).
    As the new bit_wait{,_io} functions are tagged "__sched", they
    will not show up at all, but something higher in the stack will. So
    the distinction will still be visible, only with different
    function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
    gfs2/glock.c case).

    Since the first version of this patch (against 3.15) two new action
    functions appeared, one in NFS and one in CIFS. CIFS also now
    uses an action function that makes the same freezer-aware
    schedule call as NFS.

    Signed-off-by: NeilBrown
    Acked-by: David Howells (fscache, keys)
    Acked-by: Steven Whitehouse (gfs2)
    Acked-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Steve French
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
    Signed-off-by: Ingo Molnar

    NeilBrown
     

04 Jul, 2014

1 commit

  • Pull btrfs fixes from Chris Mason:
    "We've queued up a few fixes in my for-linus branch"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: fix crash when starting transaction
    Btrfs: fix btrfs_print_leaf for skinny metadata
    Btrfs: fix race of using total_bytes_pinned
    btrfs: use E2BIG instead of EIO if compression does not help
    btrfs: remove stale comment from btrfs_flush_all_pending_stuffs
    Btrfs: fix use-after-free when cloning a trailing file hole
    btrfs: fix null pointer dereference in btrfs_show_devname when name is null
    btrfs: fix null pointer dereference in clone_fs_devices when name is null
    btrfs: fix nossd and ssd_spread mount option regression
    Btrfs: fix race between balance recovery and root deletion
    Btrfs: atomically set inode->i_flags in btrfs_update_iflags
    btrfs: only unlock block in verify_parent_transid if we locked it
    Btrfs: assert send doesn't attempt to start transactions
    btrfs compression: reuse recently used workspace
    Btrfs: fix crash when mounting raid5 btrfs with missing disks
    btrfs: create sprout should rename fsid on the sysfs as well
    btrfs: dev replace should replace the sysfs entry
    btrfs: dev add should add its sysfs entry
    btrfs: dev delete should remove sysfs entry
    btrfs: rename add_device_membership to btrfs_kobj_add_device

    Linus Torvalds
     

03 Jul, 2014

11 commits

  • Often when starting a transaction we commit the currently running transaction,
    which can end up writing block group caches when the current process has its
    journal_info set to NULL (and not to a transaction). This makes our assertion
    at btrfs_check_data_free_space() (current_journal != NULL) fail, resulting
    in a crash/hang. Therefore fix it by setting journal_info.

    Two different traces of this issue follow below.

    1)

    [51502.241936] BTRFS: assertion failed: current->journal_info, file: fs/btrfs/extent-tree.c, line: 3670
    [51502.242213] ------------[ cut here ]------------
    [51502.242493] kernel BUG at fs/btrfs/ctree.h:3964!
    [51502.242669] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    (...)
    [51502.244010] Call Trace:
    [51502.244010] [] btrfs_check_data_free_space+0x395/0x3a0 [btrfs]
    [51502.244010] [] btrfs_write_dirty_block_groups+0x4ac/0x640 [btrfs]
    [51502.244010] [] commit_cowonly_roots+0x164/0x226 [btrfs]
    [51502.244010] [] btrfs_commit_transaction+0x4ed/0xab0 [btrfs]
    [51502.244010] [] ? _raw_spin_unlock+0x2b/0x40
    [51502.244010] [] start_transaction+0x459/0x620 [btrfs]
    [51502.244010] [] btrfs_start_transaction+0x1b/0x20 [btrfs]
    [51502.244010] [] __unlink_start_trans+0x31/0xe0 [btrfs]
    [51502.244010] [] btrfs_unlink+0x37/0xc0 [btrfs]
    [51502.244010] [] ? do_unlinkat+0x114/0x2a0
    [51502.244010] [] vfs_unlink+0xcc/0x150
    [51502.244010] [] do_unlinkat+0x260/0x2a0
    [51502.244010] [] ? filp_close+0x64/0x90
    [51502.244010] [] ? trace_hardirqs_on_caller+0x16/0x1e0
    [51502.244010] [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [51502.244010] [] SyS_unlinkat+0x1b/0x40
    [51502.244010] [] system_call_fastpath+0x16/0x1b
    [51502.244010] Code: 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 89 f1 48 c7 c2 71 13 36 a0 48 89 fe 31 c0 48 c7 c7 b8 43 36 a0 48 89 e5 e8 5d b0 32 e1 0b 0f 1f 44 00 00 55 b9 11 00 00 00 48 89 e5 41 55 49 89 f5
    [51502.244010] RIP [] assfail.constprop.88+0x1e/0x20 [btrfs]

    2)

    [25405.097230] BTRFS: assertion failed: current->journal_info, file: fs/btrfs/extent-tree.c, line: 3670
    [25405.097488] ------------[ cut here ]------------
    [25405.097767] kernel BUG at fs/btrfs/ctree.h:3964!
    [25405.097940] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    (...)
    [25405.100008] Call Trace:
    [25405.100008] [] btrfs_check_data_free_space+0x395/0x3a0 [btrfs]
    [25405.100008] [] btrfs_write_dirty_block_groups+0x4ac/0x640 [btrfs]
    [25405.100008] [] commit_cowonly_roots+0x164/0x226 [btrfs]
    [25405.100008] [] btrfs_commit_transaction+0x4ed/0xab0 [btrfs]
    [25405.100008] [] ? bit_waitqueue+0xc0/0xc0
    [25405.100008] [] start_transaction+0x459/0x620 [btrfs]
    [25405.100008] [] btrfs_start_transaction+0x1b/0x20 [btrfs]
    [25405.100008] [] btrfs_create+0x47/0x210 [btrfs]
    [25405.100008] [] ? btrfs_permission+0x3c/0x80 [btrfs]
    [25405.100008] [] vfs_create+0x9b/0x130
    [25405.100008] [] do_last+0x849/0xe20
    [25405.100008] [] ? link_path_walk+0x79/0x820
    [25405.100008] [] path_openat+0xc5/0x690
    [25405.100008] [] ? trace_hardirqs_on+0xd/0x10
    [25405.100008] [] ? __alloc_fd+0x32/0x1d0
    [25405.100008] [] do_filp_open+0x43/0xa0
    [25405.100008] [] ? __alloc_fd+0x151/0x1d0
    [25405.100008] [] do_sys_open+0x13c/0x230
    [25405.100008] [] ? trace_hardirqs_on_caller+0x16/0x1e0
    [25405.100008] [] SyS_open+0x22/0x30
    [25405.100008] [] system_call_fastpath+0x16/0x1b
    [25405.100008] Code: 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 89 f1 48 c7 c2 51 13 36 a0 48 89 fe 31 c0 48 c7 c7 d0 43 36 a0 48 89 e5 e8 6d b5 32 e1 0b 0f 1f 44 00 00 55 b9 11 00 00 00 48 89 e5 41 55 49 89 f5
    [25405.100008] RIP [] assfail.constprop.88+0x1e/0x20 [btrfs]

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • We wouldn't actually print the extent information if we had a skinny metadata
    item; this fixes that. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • The percpu counter @total_bytes_pinned was introduced to skip unnecessary
    'commit transaction' operations; it accounts for space that we may free
    but that is stuck in delayed refs.

    And we zero out @space_info->total_bytes_pinned every transaction period so
    we have a better idea of how much space we'll actually free up by committing
    this transaction. However, we do the 'zero out' part a little earlier, before
    we actually unpin space, so we end up returning ENOSPC when we actually have
    free space that's just unpinned from committing transaction.

    xfstests/generic/074 complained then.

    This fixes it by actually accounting the percpu pinned number when 'unpin',
    and since it's protected by space_info->lock, the race is gone now.

    Signed-off-by: Liu Bo
    Reviewed-by: Miao Xie
    Signed-off-by: Chris Mason

    Liu Bo
     
  • Return codes got updated in 60e1975acb48fc3d74a3422b21dde74c977ac3d5
    (btrfs: return errno instead of -1 from compression).
    The lzo wrapper returns E2BIG in this case; do the same for zlib.

    Signed-off-by: David Sterba

    David Sterba
     
  • Commit fcebe4562dec83b3f8d3088d77584727b09130b2 (Btrfs: rework qgroup
    accounting) removed the qgroup accounting after delayed refs.

    Signed-off-by: David Sterba

    David Sterba
     
  • The transaction handle was being used after being freed.

    Cc: Chris Mason
    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • dev->name is null but missing flag is not set.
    Strictly speaking the missing flag should have been set, but there
    are more places where code just checks if name is null. For now this
    patch does the same.

    stack:
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000064
    IP: [] btrfs_show_devname+0x58/0xf0 [btrfs]

    [] show_vfsmnt+0x39/0x130
    [] m_show+0x16/0x20
    [] seq_read+0x296/0x390
    [] vfs_read+0x9d/0x160
    [] SyS_read+0x49/0x90
    [] system_call_fastpath+0x16/0x1b

    reproducer:
    mkfs.btrfs -draid1 -mraid1 /dev/sdg1 /dev/sdg2
    btrfstune -S 1 /dev/sdg1
    modprobe -r btrfs && modprobe btrfs
    mount -o degraded /dev/sdg1 /btrfs
    btrfs dev add /dev/sdg3 /btrfs

    Signed-off-by: Anand Jain
    Signed-off-by: Chris Mason

    Anand Jain
     
  • When one of the device paths is missing, the btrfs_device name is null,
    so this patch checks for that.

    stack:
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
    IP: [] strlen+0x0/0x30
    [] ? clone_fs_devices+0xaa/0x160 [btrfs]
    [] btrfs_init_new_device+0x317/0xca0 [btrfs]
    [] ? __kmalloc_track_caller+0x15a/0x1a0
    [] btrfs_ioctl+0xaa3/0x2860 [btrfs]
    [] ? handle_mm_fault+0x48c/0x9c0
    [] ? __blkdev_put+0x171/0x180
    [] ? __do_page_fault+0x4ac/0x590
    [] ? blkdev_put+0x106/0x110
    [] ? mntput+0x35/0x40
    [] do_vfs_ioctl+0x460/0x4a0
    [] ? ____fput+0xe/0x10
    [] ? task_work_run+0xb3/0xd0
    [] SyS_ioctl+0x57/0x90
    [] ? do_page_fault+0xe/0x10
    [] system_call_fastpath+0x16/0x1b

    reproducer:
    mkfs.btrfs -draid1 -mraid1 /dev/sdg1 /dev/sdg2
    btrfstune -S 1 /dev/sdg1
    modprobe -r btrfs && modprobe btrfs
    mount -o degraded /dev/sdg1 /btrfs
    btrfs dev add /dev/sdg3 /btrfs

    Signed-off-by: Anand Jain
    Signed-off-by: Chris Mason

    Anand Jain
     
  • The commit

    0780253 btrfs: Cleanup the btrfs_parse_options for remount.

    broke ssd options quite badly; it stopped making ssd_spread
    imply ssd, and it made "nossd" unsettable.

    Put things back at least as well as they were before
    (though ssd mount option handling is still pretty odd:
    # mount -o "nossd,ssd_spread" works?)
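
    The two properties being restored (ssd_spread implies ssd, and nossd can
    be set) can be modelled with a tiny option parser. This Python sketch is
    a simplified illustration only; the flag names and the left-to-right
    handling are assumptions, not the actual btrfs_parse_options() logic.

```python
def parse_ssd_options(option_string):
    """Return the set of flags left after processing options left to right."""
    flags = set()
    for opt in option_string.split(","):
        if opt == "ssd":
            flags.add("SSD")
        elif opt == "ssd_spread":
            # ssd_spread implies ssd
            flags.update({"SSD", "SSD_SPREAD"})
        elif opt == "nossd":
            # nossd is settable and turns the plain ssd flag back off
            flags.add("NOSSD")
            flags.discard("SSD")
    return flags
```

    In this model "nossd,ssd_spread" ends up with all three flags set, which
    is the kind of odd combination the message points at.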

    Reported-by: Roman Mamedov
    Signed-off-by: Eric Sandeen
    Signed-off-by: Chris Mason

    Eric Sandeen
     
  • Balance recovery is called when mounting RW or remounting from
    RO to RW; it is called to finish merging roots.

    When doing balance recovery, the relocation root's corresponding
    fs root (whose root refs is 0) might be destroyed by the cleaner
    thread, which makes btrfs fail to mount.

    Fix this problem by holding @cleaner_mutex when doing balance
    recovery.

    Signed-off-by: Wang Shilong
    Signed-off-by: Chris Mason

    Wang Shilong
     
  • This change is based on the corresponding recent change for ext4:

    ext4: atomically set inode->i_flags in ext4_set_inode_flags()

    That has the following commit message that applies to btrfs as well:

    "Use cmpxchg() to atomically set i_flags instead of clearing out the
    S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
    EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race
    where an immutable file has the immutable flag cleared for a brief
    window of time."

    Replacing EXT4_IMMUTABLE_FL and EXT4_APPEND_FL with BTRFS_INODE_IMMUTABLE
    and BTRFS_INODE_APPEND, respectively.
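
    The compare-and-exchange loop described in the quoted message can be
    sketched as follows. This is a single-threaded Python model meant only to
    show the shape of the loop: the Inode class, the cmpxchg helper and the
    bit values are hypothetical stand-ins, not kernel definitions.

```python
S_APPEND = 0x4      # illustrative bit values
S_IMMUTABLE = 0x8
MANAGED = S_APPEND | S_IMMUTABLE

class Inode:
    def __init__(self, i_flags):
        self.i_flags = i_flags

def cmpxchg(inode, old, new):
    """Store `new` only if the current value equals `old`; return the prior value."""
    current = inode.i_flags
    if current == old:
        inode.i_flags = new
    return current

def set_flags_atomically(inode, new_fl):
    while True:
        old = inode.i_flags
        # Compute the final value up front; the managed bits are never
        # cleared in the inode as a separate intermediate step.
        new = (old & ~MANAGED) | new_fl
        if cmpxchg(inode, old, new) == old:
            return new
```

    Because the only store is the final value, a file that is immutable both
    before and after the update never appears without the immutable bit.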

    Reviewed-by: David Sterba
    Reviewed-by: Satoru Takeuchi
    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

29 Jun, 2014

9 commits

  • This is a regression from my patch a26e8c9f75b0bfd8cccc9e8f110737b136eb5994, we
    need to only unlock the block if we were the one who locked it. Otherwise this
    will trip BUG_ON()'s in locking.c. Thanks,

    cc: stable@vger.kernel.org
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • When starting a transaction just assert that current->journal_info
    doesn't contain a send transaction stub, since send isn't supposed
    to start transactions and when it finishes (either successfully or
    not) it's supposed to set current->journal_info to NULL.

    This is motivated by the change titled:

    Btrfs: fix crash when starting transaction

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • In free_workspace(), add the compression `workspace' to the head of
    the `idle_workspace' list instead of the tail, so we have a better
    chance of reusing the most recently used `workspace'.
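
    The head-versus-tail choice amounts to treating the idle list as a stack.
    A minimal Python model, assuming the lookup side takes workspaces from the
    head of the list; the WorkspacePool class is hypothetical:

```python
from collections import deque

class WorkspacePool:
    def __init__(self):
        self.idle = deque()

    def free_workspace(self, ws):
        # Put the just-used workspace at the head of the idle list.
        self.idle.appendleft(ws)

    def find_workspace(self):
        # Take from the head, so the most recently freed one is reused first.
        return self.idle.popleft() if self.idle else None
```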

    Signed-off-by: Sergey Senozhatsky
    Reviewed-by: David Sterba
    Signed-off-by: Chris Mason

    Sergey Senozhatsky
     
  • The reproducer is

    $ mkfs.btrfs D1 D2 D3 -mraid5
    $ mkfs.ext4 D2 && mkfs.ext4 D3
    $ mount D1 /btrfs -odegraded

    -------------------

    [ 87.672992] ------------[ cut here ]------------
    [ 87.673845] kernel BUG at fs/btrfs/raid56.c:1828!
    ...
    [ 87.673845] RIP: 0010:[] [] __raid_recover_end_io+0x4ae/0x4d0
    ...
    [ 87.673845] Call Trace:
    [ 87.673845] [] ? mempool_free+0x36/0xa0
    [ 87.673845] [] raid_recover_end_io+0x75/0xa0
    [ 87.673845] [] bio_endio+0x5b/0xa0
    [ 87.673845] [] bio_endio_nodec+0x12/0x20
    [ 87.673845] [] end_workqueue_fn+0x41/0x50
    [ 87.673845] [] normal_work_helper+0xca/0x2c0
    [ 87.673845] [] process_one_work+0x1eb/0x530
    [ 87.673845] [] ? process_one_work+0x189/0x530
    [ 87.673845] [] worker_thread+0x11b/0x4f0
    [ 87.673845] [] ? rescuer_thread+0x290/0x290
    [ 87.673845] [] kthread+0xe4/0x100
    [ 87.673845] [] ? kthread_create_on_node+0x220/0x220
    [ 87.673845] [] ret_from_fork+0x7c/0xb0
    [ 87.673845] [] ? kthread_create_on_node+0x220/0x220

    -------------------

    This happens because we miscalculate @rbio->bbio->error, so it doesn't
    reach the maximum number of tolerable errors when it should.

    Signed-off-by: Liu Bo
    Tested-by: Satoru Takeuchi
    Signed-off-by: Chris Mason

    Liu Bo
     
  • Creating a sprout will change the fsid of the mounted root.
    Do the same in sysfs as well.

    reproducer:
    mount /dev/sdb /btrfs (seed disk)
    btrfs dev add /dev/sdc /btrfs
    mount -o rw,remount /btrfs
    btrfs dev del /dev/sdb /btrfs
    mount /dev/sdb /btrfs

    Error:
    kobject_add_internal failed for fe350492-dc28-4051-a601-e017b17e6145 with -EEXIST, don't try to register things with the same name in the same directory.

    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: Chris Mason

    Anand Jain
     
  • When we replace a device, its corresponding sysfs
    entry has to be replaced as well.

    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: Chris Mason

    Anand Jain
     
  • We need the device links to be created
    when a device is added.

    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: Chris Mason

    Anand Jain
     
  • When we delete a device from the mounted btrfs,
    its corresponding sysfs entry needs to
    be removed as well.

    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: Chris Mason

    Anand Jain
     
  • Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: Chris Mason

    Anand Jain
     

22 Jun, 2014

1 commit

  • Pull btrfs fixes from Chris Mason:
    "This fixes some lockups in btrfs reported with rc1. It probably has
    some performance impact because it is backing off our spinning locks
    more often and switching to a blocking lock. I'll be able to nail
    that down next week, but for now I want to get the lockups taken care
    of.

    Otherwise some more stack reduction and assorted fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: fix wrong error handle when the device is missing or is not writeable
    Btrfs: fix deadlock when mounting a degraded fs
    Btrfs: use bio_endio_nodec instead of open code
    Btrfs: fix NULL pointer crash when running balance and scrub concurrently
    btrfs: Skip scrubbing removed chunks to avoid -ENOENT.
    Btrfs: fix broken free space cache after the system crashed
    Btrfs: make free space cache write out functions more readable
    Btrfs: remove unused wait queue in struct extent_buffer
    Btrfs: fix deadlocks with trylock on tree nodes

    Linus Torvalds