07 Mar, 2015

3 commits

  • Pull btrfs fixes from Chris Mason:
    "Outside of misc fixes, Filipe has a few fsync corners and we're
    pulling in one more of Josef's fixes from production use here"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.
    Btrfs: fix data loss in the fast fsync path
    Btrfs: remove extra run_delayed_refs in update_cowonly_root
    Btrfs: incremental send, don't rename a directory too soon
    btrfs: fix lost return value due to variable shadowing
    Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr
    Btrfs: fix off-by-one logic error in btrfs_realloc_node
    Btrfs: add missing inode update when punching hole
    Btrfs: abort the transaction if we fail to update the free space cache inode
    Btrfs: fix fsync race leading to ordered extent memory leaks

    Linus Torvalds
     
  • Pull file locking fix from Jeff Layton:
    "Just a single patch to fix a memory leak that Daniel Wagner discovered
    while doing some testing with leases"

    * tag 'locks-v4.0-3' of git://git.samba.org/jlayton/linux:
    locks: fix fasync_struct memory leak in lease upgrade/downgrade handling

    Linus Torvalds
     
  • Pull NFS client bugfixes from Trond Myklebust:
    "Highlights include:

    - Fix a regression in the NFSv4 open state recovery code
    - Fix a regression in the NFSv4 close code
    - Fix regressions and side-effects of the loop-back mounted NFS fixes
    in 3.18, that cause the NFS read() syscall to return EBUSY.
    - Fix regressions around the readdirplus code and how it interacts
    with the VFS lazy unmount changes that went into v3.18.
    - Fix issues with out-of-order RPC call replies replacing updated
    attributes with stale ones (particularly after a truncate()).
    - Fix an underflow checking issue with RPC/RDMA credits
    - Fix a number of issues with the NFSv4 delegation return/free code.
    - Fix issues around stale NFSv4.1 leases when doing a mount"

    * tag 'nfs-for-4.0-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (24 commits)
    NFSv4.1: Clear the old state by our client id before establishing a new lease
    NFSv4: Fix a race in NFSv4.1 server trunking discovery
    NFS: Don't write enable new pages while an invalidation is proceeding
    NFS: Fix a regression in the read() syscall
    NFSv4: Ensure we skip delegations that are already being returned
    NFSv4: Pin the superblock while we're returning the delegation
    NFSv4: Ensure we honour NFS_DELEGATION_RETURNING in nfs_inode_set_delegation()
    NFSv4: Ensure that we don't reap a delegation that is being returned
    NFS: Fix stateid used for NFS v4 closes
    NFSv4: Don't call put_rpccred() under the rcu_read_lock()
    NFS: Don't require a filehandle to refresh the inode in nfs_prime_dcache()
    NFSv3: Use the readdir fileid as the mounted-on-fileid
    NFS: Don't invalidate a submounted dentry in nfs_prime_dcache()
    NFSv4: Set a barrier in the update_changeattr() helper
    NFS: Fix nfs_post_op_update_inode() to set an attribute barrier
    NFS: Remove size hack in nfs_inode_attrs_need_update()
    NFSv4: Add attribute update barriers to delegreturn and pNFS layoutcommit
    NFS: Add attribute update barriers to NFS writebacks
    NFS: Set an attribute barrier on all updates
    NFS: Add attribute update barriers to nfs_setattr_update_inode()
    ...

    Linus Torvalds
     

06 Mar, 2015

3 commits

  • Improper arithmetics when calculting the address of the extended ref could
    lead to an out of bounds memory read and kernel panic.

    Signed-off-by: Quentin Casasnovas
    Reviewed-by: David Sterba
    cc: stable@vger.kernel.org # v3.7+
    Signed-off-by: Chris Mason

    Quentin Casasnovas
     
  • When using the fast file fsync code path we can miss the fact that new
    writes happened since the last file fsync and therefore return without
    waiting for the IO to finish and write the new extents to the fsync log.

    Here's an example scenario where the fsync will miss the fact that new
    file data exists that wasn't yet durably persisted:

    1. fs_info->last_trans_committed == N - 1 and current transaction is
    transaction N (fs_info->generation == N);

    2. do a buffered write;

    3. fsync our inode, this clears our inode's full sync flag, starts
    an ordered extent and waits for it to complete - when it completes
    at btrfs_finish_ordered_io(), the inode's last_trans is set to the
    value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
    btrfs_set_inode_last_trans);

    4. transaction N is committed, so fs_info->last_trans_committed is now
    set to the value N and fs_info->generation remains with the value N;

    5. do another buffered write, when this happens btrfs_file_write_iter
    sets our inode's last_trans to the value N + 1 (that is
    fs_info->generation + 1 == N + 1);

    6. transaction N + 1 is started and fs_info->generation now has the
    value N + 1;

    7. transaction N + 1 is committed, so fs_info->last_trans_committed
    is set to the value N + 1;

    8. fsync our inode - because it doesn't have the full sync flag set,
    we only start the ordered extent, we don't wait for it to complete
    (only in a later phase) therefore its last_trans field has the
    value N + 1 set previously by btrfs_file_write_iter(), and so we
    have:

    inode->last_trans last_trans_committed
    (N + 1) (N + 1)

    Which made us not log the last buffered write and exit the fsync
    handler immediately, returning success (0) to user space and resulting
    in data loss after a crash.

    This can actually be triggered deterministically and the following excerpt
    from a testcase I made for xfstests triggers the issue. It moves a dummy
    file across directories and then fsyncs the old parent directory - this
    is just to trigger a transaction commit, so moving files around isn't
    directly related to the issue but it was chosen because running 'sync' for
    example does more than just committing the current transaction, as it
    flushes/waits for all file data to be persisted. The issue can also happen
    at random periods, since the transaction kthread periodicaly commits the
    current transaction (about every 30 seconds by default).
    The body of the test is:

    _scratch_mkfs >> $seqres.full 2>&1
    _init_flakey
    _mount_flakey

    # Create our main test file 'foo', the one we check for data loss.
    # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
    # bit from its flags (btrfs inode specific flags).
    $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
    -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io

    # Now create one other file and 2 directories. We will move this second file
    # from one directory to the other later because it forces btrfs to commit its
    # currently open transaction if we fsync the old parent directory. This is
    # necessary to trigger the data loss bug that affected btrfs.
    mkdir $SCRATCH_MNT/testdir_1
    touch $SCRATCH_MNT/testdir_1/bar
    mkdir $SCRATCH_MNT/testdir_2

    # Make sure everything is durably persisted.
    sync

    # Write more 8Kb of data to our file.
    $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io

    # Move our 'bar' file into a new directory.
    mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar

    # Fsync our first directory. Because it had a file moved into some other
    # directory, this made btrfs commit the currently open transaction. This is
    # a condition necessary to trigger the data loss bug.
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1

    # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
    # data we wrote previously to be persisted and available if a crash happens.
    # This did not happen with btrfs, because of the transaction commit that
    # happened when we fsynced the parent directory.
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

    # Simulate a crash/power loss.
    _load_flakey_table $FLAKEY_DROP_WRITES
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    _mount_flakey

    # Now check that all data we wrote before are available.
    echo "File content after log replay:"
    od -t x1 $SCRATCH_MNT/foo

    status=0
    exit

    The expected golden output for the test, which is what we get with this
    fix applied (or when running against ext3/4 and xfs), is:

    wrote 8192/8192 bytes at offset 0
    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    wrote 8192/8192 bytes at offset 8192
    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    File content after log replay:
    0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
    *
    0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
    *
    0040000

    Without this fix applied, the output shows the test file does not have
    the second 8Kb extent that we successfully fsynced:

    wrote 8192/8192 bytes at offset 0
    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    wrote 8192/8192 bytes at offset 8192
    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    File content after log replay:
    0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
    *
    0020000

    So fix this by skipping the fsync only if we're doing a full sync and
    if the inode's last_trans is last_trans_committed, or if
    the inode is already in the log. Also remove setting the inode's
    last_trans in btrfs_file_write_iter since it's useless/unreliable.

    Also because btrfs_file_write_iter no longer sets inode->last_trans to
    fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
    bail out if last_trans is 0, otherwise something as simple as the following
    example wouldn't log the second write on the last fsync:

    1. write to file

    2. fsync file

    3. fsync file
    |--> btrfs_inode_in_log() returns true and it set last_trans to 0

    4. write to file
    |--> btrfs_file_write_iter() no longers sets last_trans, so it
    remained with a value of 0
    5. fsync
    |--> inode->last_trans == 0, so it bails out without logging the
    second write

    A test case for xfstests will be sent soon.

    CC:
    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • This got added with my dirty_bgs patch, it's not needed. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

05 Mar, 2015

2 commits

  • Commit 8634b51f6ca2 (locks: convert lease handling to file_lock_context)
    introduced a regression in the handling of lease upgrade/downgrades.

    In the event that we already have a lease on a file and are going to
    either upgrade or downgrade it, we skip doing any list insertion or
    deletion and simply re-call lm_setup on the existing lease.

    As of commit 8634b51f6ca2 however, we end up calling lm_setup on the
    lease that was passed in, instead of on the existing lease. This causes
    us to leak the fasync_struct that was allocated in the event that there
    was not already an existing one (as it always appeared that there
    wasn't one).

    Fixes: 8634b51f6ca2 (locks: convert lease handling to file_lock_context)
    Reported-and-Tested-by: Daniel Wagner
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • Pull eCryptfs fixes from Tyler Hicks:
    "Fixes for proper ioctl handling and an untriggerable buffer overflow

    - The eCryptfs ioctl handling functions should only pass known-good
    ioctl commands to the lower filesystem

    - A static checker found a potential buffer overflow. Upon
    inspection, it is not triggerable due to input validation performed
    on the mount parameters"

    * tag 'ecryptfs-4.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
    eCryptfs: don't pass fs-specific ioctl commands through
    eCryptfs: ensure copy to crypt_stat->cipher does not overrun

    Linus Torvalds
     

04 Mar, 2015

5 commits

  • If the call to exchange-id returns with the EXCHGID4_FLAG_CONFIRMED_R flag
    set, then that means our lease was established by a previous mount instance.
    Ensure that we detect this situation, and that we clear the state held by
    that mount.

    Reported-by: Jorge Mora
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • We do not want to allow a race with another NFS mount to cause
    nfs41_walk_client_list() to establish a lease on our nfs_client before
    we're done checking for trunking.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Pull nfsd fixes from Bruce Fields:
    "Three miscellaneous bugfixes, most importantly the clp->cl_revoked
    bug, which we've seen several reports of people hitting"

    * 'for-4.0' of git://linux-nfs.org/~bfields/linux:
    sunrpc: integer underflow in rsc_parse()
    nfsd: fix clp->cl_revoked list deletion causing softlock in nfsd
    svcrpc: fix memory leak in gssp_accept_sec_context_upcall

    Linus Torvalds
     
  • nfs_vm_page_mkwrite() should wait until the page cache invalidation
    is finished. This is the second patch in a 2 patch series to deprecate
    the NFS client's reliance on nfs_release_page() in the context of
    nfs_invalidate_mapping().

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • When invalidating the page cache for a regular file, we want to first
    sync all dirty data to disk and then call invalidate_inode_pages2().
    The latter relies on nfs_launder_page() and nfs_release_page() to deal
    respectively with dirty pages, and unstable written pages.

    When commit 9590544694bec ("NFS: avoid deadlocks with loop-back mounted
    NFS filesystems.") changed the behaviour of nfs_release_page(), then it
    made it possible for invalidate_inode_pages2() to fail with an EBUSY.
    Unfortunately, that error is then propagated back to read().

    Let's therefore work around the problem for now by protecting the call
    to sync the data and invalidate_inode_pages2() so that they are atomic
    w.r.t. the addition of new writes.
    Later on, we can revisit whether or not we still need nfs_launder_page()
    and nfs_release_page().

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

03 Mar, 2015

13 commits

  • eCryptfs can't be aware of what to expect when after passing an
    arbitrary ioctl command through to the lower filesystem. The ioctl
    command may trigger an action in the lower filesystem that is
    incompatible with eCryptfs.

    One specific example is when one attempts to use the Btrfs clone
    ioctl command when the source file is in the Btrfs filesystem that
    eCryptfs is mounted on top of and the destination fd is from a new file
    created in the eCryptfs mount. The ioctl syscall incorrectly returns
    success because the command is passed down to Btrfs which thinks that it
    was able to do the clone operation. However, the result is an empty
    eCryptfs file.

    This patch allows the trim, {g,s}etflags, and {g,s}etversion ioctl
    commands through and then copies up the inode metadata from the lower
    inode to the eCryptfs inode to catch any changes made to the lower
    inode's metadata. Those five ioctl commands are mostly common across all
    filesystems but the whitelist may need to be further pruned in the
    future.

    https://bugzilla.kernel.org/show_bug.cgi?id=93691
    https://launchpad.net/bugs/1305335

    Signed-off-by: Tyler Hicks
    Cc: Rocko
    Cc: Colin Ian King
    Cc: stable@vger.kernel.org # v2.6.36+: c43f7b8 eCryptfs: Handle ioctl calls with unlocked and compat functions

    Tyler Hicks
     
  • In nfs_client_return_marked_delegations() and nfs_delegation_reap_unclaimed()
    we want to optimise the loop traversal by skipping delegations that are
    already in the process of being returned.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • This patch ensures that the superblock doesn't go ahead and disappear
    underneath us while the state manager thread is returning delegations.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Ensure that nfs_inode_set_delegation() doesn't inadvertently detach a
    delegation that is already in the process of being returned.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • After 566fcec60 the client uses the "current stateid" from the
    nfs4_state structure to close a file. This could potentially contain a
    delegation stateid, which is disallowed by the protocol and causes
    servers to return NFS4ERR_BAD_STATEID. This patch restores the
    (correct) behavior of sending the open stateid to close a file.

    Reported-by: Olga Kornievskaia
    Fixes: 566fcec60 (NFSv4: Fix an atomicity problem in CLOSE)
    Signed-off-by: Anna Schumaker
    Signed-off-by: Trond Myklebust

    Anna Schumaker
     
  • There's one more case where we can't issue a rename operation for a
    directory as soon as we process it. We used to delay directory renames
    only if they have some ancestor directory with a higher inode number
    that got renamed too, but there's another case where we need to delay
    the rename too - when a directory A is renamed to the old name of a
    directory B but that directory B has its rename delayed because it
    has now (in the send root) an ancestor with a higher inode number that
    was renamed. If we don't delay the directory rename in this case, the
    receiving end of the send stream will attempt to rename A to the old
    name of B before B got renamed to its new name, which results in a
    "directory not empty" error. So fix this by delaying directory renames
    for this case too.

    Steps to reproduce:

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt

    $ mkdir /mnt/a
    $ mkdir /mnt/b
    $ mkdir /mnt/c
    $ touch /mnt/a/file

    $ btrfs subvolume snapshot -r /mnt /mnt/snap1

    $ mv /mnt/c /mnt/x
    $ mv /mnt/a /mnt/x/y
    $ mv /mnt/b /mnt/a

    $ btrfs subvolume snapshot -r /mnt /mnt/snap2

    $ btrfs send /mnt/snap1 -f /tmp/1.send
    $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send

    $ mkfs.btrfs -f /dev/sdc
    $ mount /dev/sdc /mnt2
    $ btrfs receive /mnt2 -f /tmp/1.send
    $ btrfs receive /mnt2 -f /tmp/2.send
    ERROR: rename b -> a failed. Directory not empty

    A test case for xfstests follows soon.

    Reported-by: Ames Cornish
    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • A block-local variable stores error code but btrfs_get_blocks_direct may
    not return it in the end as there's a ret defined in the function scope.

    CC: # 3.6+
    Fixes: d187663ef24c ("Btrfs: lock extents as we map them in DIO")
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     
  • The return value from btrfs_lookup_xattr() can be a pointer encoding an
    error, therefore deal with it. This fixes commit 5f5bc6b1e2d5
    ("Btrfs: make xattr replace operations atomic").

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • The end_slot variable actually matches the number of pointers in the
    node and not the last slot (which is 'nritems - 1'). Therefore in order
    to check that the current slot in the for loop doesn't match the last
    one, the correct logic is to check if 'i' is less than 'end_slot - 1'
    and not 'end_slot - 2'.

    Fix this and set end_slot to be 'nritems - 1', as it's less confusing
    since the variable name implies it's inclusive rather then exclusive.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • When punching a file hole if we endup only zeroing parts of a page,
    because the start offset isn't a multiple of the sector size or the
    start offset and length fall within the same page, we were not updating
    the inode item. This prevented an fsync from doing anything, if no other
    file changes happened in the current transaction, because the fields
    in btrfs_inode used to check if the inode needs to be fsync'ed weren't
    updated.

    This issue is easy to reproduce and the following excerpt from the
    xfstest case I made shows how to trigger it:

    _scratch_mkfs >> $seqres.full 2>&1
    _init_flakey
    _mount_flakey

    # Create our test file.
    $XFS_IO_PROG -f -c "pwrite -S 0x22 -b 16K 0 16K" \
    $SCRATCH_MNT/foo | _filter_xfs_io

    # Fsync the file, this makes btrfs update some btrfs inode specific fields
    # that are used to track if the inode needs to be written/updated to the fsync
    # log or not. After this fsync, the new values for those fields indicate that
    # a subsequent fsync does not need to touch the fsync log.
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

    # Force a commit of the current transaction. After this point, any operation
    # that modifies the data or metadata of our file, should update those fields in
    # the btrfs inode with values that make the next fsync operation write to the
    # fsync log.
    sync

    # Punch a hole in our file. This small range affects only 1 page.
    # This made the btrfs hole punching implementation write only some zeroes in
    # one page, but it did not update the btrfs inode fields used to determine if
    # the next fsync needs to write to the fsync log.
    $XFS_IO_PROG -c "fpunch 8000 4K" $SCRATCH_MNT/foo

    # Another variation of the previously mentioned case.
    $XFS_IO_PROG -c "fpunch 15000 100" $SCRATCH_MNT/foo

    # Now fsync the file. This was a no-operation because the previous hole punch
    # operation didn't update the inode's fields mentioned before, so they remained
    # with the values they had after the first fsync - that is, they indicate that
    # it is not needed to write to fsync log.
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

    echo "File content before:"
    od -t x1 $SCRATCH_MNT/foo

    # Simulate a crash/power loss.
    _load_flakey_table $FLAKEY_DROP_WRITES
    _unmount_flakey

    # Enable writes and mount the fs. This makes the fsync log replay code run.
    _load_flakey_table $FLAKEY_ALLOW_WRITES
    _mount_flakey

    # Because the last fsync didn't do anything, here the file content matched what
    # it was after the first fsync, before the holes were punched, and not what it
    # was after the holes were punched.
    echo "File content after:"
    od -t x1 $SCRATCH_MNT/foo

    This issue has been around since 2012, when the punch hole implementation
    was added, commit 2aaa66558172 ("Btrfs: add hole punching").

    A test case for xfstests follows soon.

    Signed-off-by: Filipe Manana
    Reviewed-by: Liu Bo
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • Our gluster boxes were hitting a problem where they'd run out of space when
    updating the block group cache and therefore wouldn't be able to update the free
    space inode. This is a problem because this is how we invalidate the cache and
    protect ourselves from errors further down the stack, so if this fails we have
    to abort the transaction so we make sure we don't end up with stale free space
    cache. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • We can have multiple fsync operations against the same file during the
    same transaction and they can collect the same ordered extents while they
    don't complete (still accessible from the inode's ordered tree). If this
    happens, those ordered extents will never get their reference counts
    decremented to 0, leading to memory leaks and inode leaks (an iput for an
    ordered extent's inode is scheduled only when the ordered extent's refcount
    drops to 0). The following sequence diagram explains this race:

    CPU 1 CPU 2

    btrfs_sync_file()

    btrfs_sync_file()

    mutex_lock(inode->i_mutex)
    btrfs_log_inode()
    btrfs_get_logged_extents()
    --> collects ordered extent X
    --> increments ordered
    extent X's refcount
    btrfs_submit_logged_extents()
    mutex_unlock(inode->i_mutex)

    mutex_lock(inode->i_mutex)
    btrfs_sync_log()
    btrfs_wait_logged_extents()
    --> list_del_init(&ordered->log_list)
    btrfs_log_inode()
    btrfs_get_logged_extents()
    --> Adds ordered extent X
    to logged_list because
    at this point:
    list_empty(&ordered->log_list)
    && test_bit(BTRFS_ORDERED_LOGGED,
    &ordered->flags) == 0
    --> Increments ordered extent
    X's refcount
    --> check if ordered extent's io is
    finished or not, start it if
    necessary and wait for it to finish
    --> sets bit BTRFS_ORDERED_LOGGED
    on ordered extent X's flags
    and adds it to trans->ordered
    btrfs_sync_log() finishes

    btrfs_submit_logged_extents()
    btrfs_log_inode() finishes
    mutex_unlock(inode->i_mutex)

    btrfs_sync_file() finishes

    btrfs_sync_log()
    btrfs_wait_logged_extents()
    --> Sees ordered extent X has the
    bit BTRFS_ORDERED_LOGGED set in
    its flags
    --> X's refcount is untouched
    btrfs_sync_log() finishes

    btrfs_sync_file() finishes

    btrfs_commit_transaction()
    --> called by transaction kthread for e.g.
    btrfs_wait_pending_ordered()
    --> waits for ordered extent X to
    complete
    --> decrements ordered extent X's
    refcount by 1 only, corresponding
    to the increment done by the fsync
    task ran by CPU 1

    In the scenario of the above diagram, after the transaction commit,
    the ordered extent will remain with a refcount of 1 forever, leaking
    the ordered extent structure and preventing the i_count of its inode
    from ever decreasing to 0, since the delayed iput is scheduled only
    when the ordered extent's refcount drops to 0, preventing the inode
    from ever being evicted by the VFS.

    Fix this by using the flag BTRFS_ORDERED_LOGGED differently. Use it to
    mean that an ordered extent is already being processed by an fsync call,
    which will attach it to the current transaction, preventing it from being
    collected by subsequent fsync operations against the same inode.

    This race was introduced with the following change (added in 3.19 and
    backported to stable 3.18 and 3.17):

    Btrfs: make sure logged extents complete in the current transaction V3
    commit 50d9aa99bd35c77200e0e3dd7a72274f8304701f

    I ran into this issue while running xfstests/generic/113 in a loop, which
    failed about 1 out of 10 runs with the following warning in dmesg:

    [ 2612.440038] WARNING: CPU: 4 PID: 22057 at fs/btrfs/disk-io.c:3558 free_fs_root+0x36/0x133 [btrfs]()
    [ 2612.442810] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop processor parport_pc parport psmouse therma
    l_sys i2c_piix4 serio_raw pcspkr evdev microcode button i2c_core ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic virtio_pci ata_piix virtio_ring libata virtio flo
    ppy e1000 scsi_mod [last unloaded: btrfs]
    [ 2612.452711] CPU: 4 PID: 22057 Comm: umount Tainted: G W 3.19.0-rc5-btrfs-next-4+ #1
    [ 2612.454921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    [ 2612.457709] 0000000000000009 ffff8801342c3c78 ffffffff8142425e ffff88023ec8f2d8
    [ 2612.459829] 0000000000000000 ffff8801342c3cb8 ffffffff81045308 ffff880046460000
    [ 2612.461564] ffffffffa036da56 ffff88003d07b000 ffff880046460000 ffff880046460068
    [ 2612.463163] Call Trace:
    [ 2612.463719] [] dump_stack+0x4c/0x65
    [ 2612.464789] [] warn_slowpath_common+0xa1/0xbb
    [ 2612.466026] [] ? free_fs_root+0x36/0x133 [btrfs]
    [ 2612.467247] [] warn_slowpath_null+0x1a/0x1c
    [ 2612.468416] [] free_fs_root+0x36/0x133 [btrfs]
    [ 2612.469625] [] btrfs_drop_and_free_fs_root+0x93/0x9b [btrfs]
    [ 2612.471251] [] btrfs_free_fs_roots+0xa4/0xd6 [btrfs]
    [ 2612.472536] [] ? wait_for_completion+0x24/0x26
    [ 2612.473742] [] close_ctree+0x1f3/0x33c [btrfs]
    [ 2612.475477] [] ? destroy_workqueue+0x148/0x1ba
    [ 2612.476695] [] btrfs_put_super+0x19/0x1b [btrfs]
    [ 2612.477911] [] generic_shutdown_super+0x73/0xef
    [ 2612.479106] [] kill_anon_super+0x13/0x1e
    [ 2612.480226] [] btrfs_kill_super+0x17/0x23 [btrfs]
    [ 2612.481471] [] deactivate_locked_super+0x3b/0x50
    [ 2612.482686] [] deactivate_super+0x3f/0x43
    [ 2612.483791] [] cleanup_mnt+0x59/0x78
    [ 2612.484842] [] __cleanup_mnt+0x12/0x14
    [ 2612.485900] [] task_work_run+0x8f/0xbc
    [ 2612.486960] [] do_notify_resume+0x5a/0x6b
    [ 2612.488083] [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [ 2612.489333] [] int_signal+0x12/0x17
    [ 2612.490353] ---[ end trace 54a960a6bdcb8d93 ]---
    [ 2612.557253] VFS: Busy inodes after unmount of sdb. Self-destruct in 5 seconds. Have a nice day...

    Kmemleak confirmed the ordered extent leak (and btrfs inode specific
    structures such as delayed nodes):

    $ cat /sys/kernel/debug/kmemleak
    unreferenced object 0xffff880154290db0 (size 576):
    comm "btrfsck", pid 21980, jiffies 4295542503 (age 1273.412s)
    hex dump (first 32 bytes):
    01 40 00 00 01 00 00 00 b0 1d f1 4e 01 88 ff ff .@.........N....
    00 00 00 00 00 00 00 00 c8 0d 29 54 01 88 ff ff ..........)T....
    backtrace:
    [] kmemleak_update_trace+0x4c/0x6a
    [] radix_tree_node_alloc+0x6d/0x83
    [] __radix_tree_create+0x109/0x190
    [] radix_tree_insert+0x30/0xac
    [] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
    [] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
    [] __btrfs_unlink_inode+0xee/0x288 [btrfs]
    [] btrfs_unlink_inode+0x1e/0x40 [btrfs]
    [] btrfs_unlink+0x60/0x9b [btrfs]
    [] vfs_unlink+0x9c/0xed
    [] do_unlinkat+0x12c/0x1fa
    [] SyS_unlinkat+0x29/0x2b
    [] system_call_fastpath+0x12/0x17
    [] 0xffffffffffffffff
    unreferenced object 0xffff88014ef11db0 (size 576):
    comm "rm", pid 22009, jiffies 4295542593 (age 1273.052s)
    hex dump (first 32 bytes):
    02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 c8 1d f1 4e 01 88 ff ff ...........N....
    backtrace:
    [] kmemleak_update_trace+0x4c/0x6a
    [] radix_tree_node_alloc+0x6d/0x83
    [] __radix_tree_create+0x109/0x190
    [] radix_tree_insert+0x30/0xac
    [] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
    [] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
    [] __btrfs_unlink_inode+0xee/0x288 [btrfs]
    [] btrfs_unlink_inode+0x1e/0x40 [btrfs]
    [] btrfs_unlink+0x60/0x9b [btrfs]
    [] vfs_unlink+0x9c/0xed
    [] do_unlinkat+0x12c/0x1fa
    [] SyS_unlinkat+0x29/0x2b
    [] system_call_fastpath+0x12/0x17
    [] 0xffffffffffffffff
    unreferenced object 0xffff8800336feda8 (size 584):
    comm "aio-stress", pid 22031, jiffies 4295543006 (age 1271.400s)
    hex dump (first 32 bytes):
    00 40 3e 00 00 00 00 00 00 00 8f 42 00 00 00 00 .@>........B....
    00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 ................
    backtrace:
    [] create_object+0x172/0x29a
    [] kmemleak_alloc+0x25/0x41
    [] kmemleak_alloc_recursive.constprop.52+0x16/0x18
    [] kmem_cache_alloc+0xf7/0x198
    [] __btrfs_add_ordered_extent+0x43/0x309 [btrfs]
    [] btrfs_add_ordered_extent_dio+0x12/0x14 [btrfs]
    [] btrfs_get_blocks_direct+0x3ef/0x571 [btrfs]
    [] do_blockdev_direct_IO+0x62a/0xb47
    [] __blockdev_direct_IO+0x34/0x36
    [] btrfs_direct_IO+0x16a/0x1e8 [btrfs]
    [] generic_file_direct_write+0xb8/0x12d
    [] btrfs_file_write_iter+0x24b/0x42f [btrfs]
    [] aio_run_iocb+0x2b7/0x32e
    [] do_io_submit+0x26e/0x2ff
    [] SyS_io_submit+0x10/0x12
    [] system_call_fastpath+0x12/0x17

    CC: # 3.19, 3.18 and 3.17
    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

02 Mar, 2015

13 commits


01 Mar, 2015

1 commit

  • Pull xfs fixes from Dave Chinner:
    "These are fixes for regressions/bugs introduced in the 4.0 merge cycle
    and problems discovered during the merge window that need to be pushed
    back to stable kernels ASAP.

    This contains:
    - ensure quota type is reset in on-disk dquots
    - fix missing partial EOF block data flush on truncate extension
    - fix transaction leak in error handling for new pnfs block layout
    support
    - add missing target_ip check to RENAME_EXCHANGE"

    * tag 'xfs-for-linus-4.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
    xfs: cancel failed transaction in xfs_fs_commit_blocks()
    xfs: Ensure we have target_ip for RENAME_EXCHANGE
    xfs: ensure truncate forces zeroed blocks to disk
    xfs: Fix quota type in quota structures when reusing quota file

    Linus Torvalds