21 Sep, 2013

1 commit

  • This is a leftover from how we used to wait for ordered extents, which was to
    grab the inode and then run a filemap flush on it. However, if we have an
    ordered extent then we are already holding a ref on the inode, and we just use
    btrfs_start_ordered_extent anyway, so there is no reason to take an extra ref
    on the inode to start work on the ordered extent. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

01 Sep, 2013

3 commits

  • We currently have a problem where you can truncate pages that have not yet
    been written for an ordered extent. We do this because the truncate will be
    coming along behind to clean us up anyway, so what's the harm, right? Well, if
    the truncate fails for whatever reason, we leave an orphan item around for the
    file to be cleaned up later. But if the user then truncates the file back up
    and tries to read from the area that had been discarded previously, they will
    get a csum error because we never actually wrote that data out.

    This patch fixes this by allowing us to either discard the ordered extent
    completely, by which I mean we just free up the space we had allocated and do
    not add the file extent, or adjust the length of the file extent we write. We
    do this by setting the length we truncated down to in the ordered extent, and
    then we set the file extent length and ram bytes to this length. The total
    disk space stays unchanged since we may be compressed and we can't just chop
    off the disk space, but at least this way the file extent only points to the
    valid data. Then when the file extent is freed, the extent and csums will be
    freed normally.
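
    As a rough sketch of the mechanism (hypothetical struct and helper names, not
    the patch's actual code), the ordered extent just remembers the shortest
    length it was truncated down to:

    struct ordered_extent {
            u64 file_offset;
            u64 len;            /* bytes originally allocated */
            u64 truncated_len;  /* starts equal to len; valid data length
                                 * after any truncation */
    };

    /* Only ever shrink: several truncates may hit before completion. The
     * completion path then writes a file extent of truncated_len bytes,
     * or drops the file extent entirely when truncated_len is 0. */
    static void ordered_extent_truncate(struct ordered_extent *ordered,
                                        u64 new_len)
    {
            if (new_len < ordered->truncated_len)
                    ordered->truncated_len = new_len;
    }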

    This patch is needed for the next series which will give us more graceful
    recovery of failed truncates. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • u64 is "unsigned long long" on all architectures now, so there's no need to
    cast it when formatting it using the "ll" length modifier.
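
    For illustration, a before-and-after sketch (hypothetical variable; printk is
    just the obvious consumer of the "ll" length modifier):

    u64 bytenr = 12345;

    /* before: a cast was needed while u64 could be "unsigned long" */
    printk(KERN_INFO "bytenr %llu\n", (unsigned long long)bytenr);

    /* after: u64 is "unsigned long long" everywhere, the cast is redundant */
    printk(KERN_INFO "bytenr %llu\n", bytenr);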

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Geert Uytterhoeven
     
  • I added a patch where we started taking the ordered operations mutex when we
    waited on ordered extents. We need this because we splice the list and process
    it, so if a flusher came in during this scenario it would think the list was
    empty and we'd usually get an early ENOSPC. The problem with this is that this
    lock is used in transaction committing. So we end up with something like
    this:

    Transaction commit
    -> wait on writers

    Delalloc flusher
    -> run_ordered_operations (holds mutex)
    -> wait for filemap-flush to do its thing

    flush task
    -> cow_file_range
    -> wait on btrfs_join_transaction because we're committing

    some other task
    -> commit_transaction because we notice trans->transaction->flush is set
    -> run_ordered_operations (hang on mutex)

    We need to disentangle the ordered operations flushing from the delalloc
    flushing, since they are separate things. This solves the deadlock issue I was
    seeing. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

02 Jul, 2013

1 commit

  • Using the structure btrfs_sector_sum to keep the checksum value is
    unnecessary, because the extents that btrfs_sector_sum points to are
    contiguous: we can find the expected checksums from btrfs_ordered_sum's
    bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
    removing bytenr, there is only one member left in the structure, so it makes
    no sense to keep the structure; just remove it, and use a u32 array to
    store the checksum values.

    With this change, we no longer use a while loop to get the checksums one by
    one. Now we can get several checksum values at a time, which improved the
    performance by ~74% on my SSD (31 MB/s -> 54 MB/s).

    test command:
    # dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
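
    A minimal sketch of the lookup this layout enables (field names assumed for
    illustration; btrfs_ordered_sum does carry a starting bytenr):

    /* checksums for a contiguous range: the csum of any sector is found
     * by offset arithmetic, no per-sector bytenr needed */
    struct ordered_sum {
            u64 bytenr;     /* disk bytenr covered by sums[0] */
            int len;        /* number of checksums in the array */
            u32 sums[];     /* one crc32c per sector, in disk order */
    };

    static u32 lookup_csum(const struct ordered_sum *sums, u64 bytenr,
                           u32 sectorsize)
    {
            return sums->sums[(bytenr - sums->bytenr) / sectorsize];
    }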

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     

07 May, 2013

1 commit

  • It is very likely that there are several blocks in a bio, and it is very
    inefficient to get their csums one by one. This patch improves on that by
    getting the csums in a batch.

    According to the result of the following test, the execution time of
    __btrfs_lookup_bio_sums() is down by ~28% (300us -> 217us).

    # dd if=/file of=/dev/null bs=1M count=1024

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     

28 Mar, 2013

1 commit

  • We need to hold the ordered_operations mutex while waiting on ordered extents
    since we splice and run the ordered extents list. We need to make sure anybody
    else who wants to wait on ordered extents does actually wait for them to be
    completed. This will keep us from bailing out of flushing in case somebody is
    already waiting on ordered extents to complete. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

21 Feb, 2013

1 commit

  • Miao made the ordered operations stuff run async, which introduced a
    deadlock where somebody (sync) could race in and commit the transaction
    while a commit was already happening. The new committer would try to
    flush ordered operations, which would hang waiting for the commit to
    finish because it is done asynchronously and no longer inherits the
    caller's trans handle. To fix this we need to make the ordered operations
    list a per-transaction list. We can get new inodes added to the ordered
    operations list by truncating them and then having another process write
    to them, so this makes it so that anybody trying to add an ordered
    operation _must_ start a transaction in order to add itself to the list,
    which will keep new inodes from getting added to the ordered operations
    list after we start committing. This should fix the deadlock and also
    keeps us from doing a lot more work than we need to during commit. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

20 Feb, 2013

2 commits

  • btrfs_run_ordered_operations() needn't traverse the ordered operation list
    repeatedly, because the transaction committer will invoke it again when
    there is no other writer in the transaction; at that point no one can add
    new objects to the ordered operation list.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     
  • Since we don't actually copy the extent information from the source tree in
    the fast case, we don't need to wait for ordered io to be completed in order
    to fsync; we just need to wait for the io itself to be completed. So when
    we're logging our file, just attach all of the ordered extents to the log,
    and then when the log syncs just wait for IO_DONE on the ordered extents and
    then write the super. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

06 Feb, 2013

2 commits

  • We specifically do not update the disk i_size if there are ordered extents
    outstanding for any area between the current disk_i_size and our ordered
    extent, so that we do not expose stale data. The problem is that the check
    we have only looks at whether the ordered extent starts at or after the
    current disk_i_size, which doesn't take into account an ordered extent that
    starts before the current disk_i_size and ends past it. Fix this by
    checking whether the extent ends past the disk_i_size. Thanks,
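
    A sketch of the corrected test (plain u64 parameters standing in for the
    ordered extent's file_offset and len; illustrative, not the literal patch):

    /* before: only extents starting at or after disk_i_size counted;
     * after: any extent that *ends* past disk_i_size blocks the update */
    static bool blocks_i_size_update(u64 file_offset, u64 len, u64 disk_i_size)
    {
            return file_offset + len > disk_i_size;
    }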

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • If we have an ordered extent before the ordered extent we are currently
    completing that is after the current disk_i_size, we will put our i_size
    update into that ordered extent so that we do not expose stale data. The
    problem is that if our disk i_size is updated past the previous ordered
    extent, we won't apply the pending i_size update. So check the pending
    i_size update, and if it's above the current disk i_size, go ahead and try
    to update. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

02 Oct, 2012

2 commits

  • The ordered extent allocation is in the fast path of the IO, so use a slab
    to improve the speed of the allocation.

    "Size of the struct is 280, so this will fall into the size-512 bucket,
    giving 8 objects per page, while own slab will pack 14 objects into a page.

    Another benefit I see is to check for leaked objects when the module is
    removed (and the cache destroy takes place)."
    -- David Sterba
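
    A minimal sketch of setting up such a dedicated slab cache (the cache name
    and flags here are plausible but assumed):

    static struct kmem_cache *btrfs_ordered_extent_cache;

    int __init ordered_data_init(void)
    {
            /* a dedicated cache packs objects tightly, and destroying it
             * on module unload reports any leaked objects */
            btrfs_ordered_extent_cache = kmem_cache_create(
                            "btrfs_ordered_extent",
                            sizeof(struct btrfs_ordered_extent), 0,
                            SLAB_RECLAIM_ACCOUNT, NULL);
            if (!btrfs_ordered_extent_cache)
                    return -ENOMEM;
            return 0;
    }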

    Signed-off-by: Miao Xie

    Miao Xie
     
  • If a snapshot is created while we are writing some data into the file,
    the i_size of the corresponding file in the snapshot will be wrong: it will
    be beyond the end of the last file extent, and btrfsck will report:
    root 256 inode 257 errors 100

    Steps to reproduce:
    # mkfs.btrfs <dev>
    # mount <dev> <mnt>
    # cd <mnt>
    # dd if=/dev/zero of=tmpfile bs=4M count=1024 &
    # for ((i=0; i<N; i++))
    > do
    > btrfs sub snap . $i
    > done

    This is because the algorithm for the disk_i_size update is wrong. Though
    there are some ordered extents behind the current one which we use to update
    disk_i_size, it doesn't mean those extents will be dealt with in the same
    transaction, so we shouldn't use the offsets of those extents to update
    disk_i_size, or we will get the wrong i_size in the snapshot.

    We fix this problem by recording the max real i_size. If we find there is an
    ordered extent in front of the current one which hasn't completed, we record
    the end of the current one into that ordered extent. And if the current
    extent already holds the end of some other extent (it must be greater than
    the current one because that extent is behind the current one), we record
    that number instead. In this way, we can exclude the ordered extents that
    may not be dealt with in the same transaction, and it is easy to know the
    real disk_i_size.

    Signed-off-by: Miao Xie

    Miao Xie
     

15 Jun, 2012

1 commit

  • I removed this in an earlier commit and I was wrong. Because compression
    can return from filemap_fdatawrite() without having actually marked any of
    its pages for writeback, it can make filemap_fdatawait() do essentially
    nothing, and then we won't find any ordered extents because they may not
    have been created yet. So not only does this make fsync() completely
    useless, but it will also screw up if you truncate on a non-page-aligned
    offset, since we zero out the end, then wait on ordered extents, and then
    call drop caches. We can drop the cache before the io completes, and then
    when we try to unpin the extent we just wrote, we won't find it and
    everything goes sideways. So fix this by putting it back, with a giant
    comment there to keep me from trying to remove it in the future. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

30 May, 2012

3 commits

  • We noticed that ordered extent completion doesn't really rely on having
    a page, and that it can be done independently of ending the writeback on a
    page. This patch makes us not do the threaded endio stuff for normal
    buffered writes and direct writes, so we can end page writeback as soon as
    possible (in irq context) and only start threads to do the ordered work
    when it is actually done. Compression needs to be reworked somewhat to take
    advantage of this as well, but at the moment it has to do a find_get_page
    in its endio handler, so it must be done in its own thread. This makes
    direct writes quite a bit faster. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We are checking delalloc to see if it is ok to update the i_size. There are
    two cases where it stops us from updating:

    1) If there is delalloc between our current disk_i_size and this ordered
    extent

    2) If there is delalloc between our current ordered extent and the next
    ordered extent

    These tests are racy, however, since we can set delalloc for these ranges
    at any time. Also, for the first case, if we notice there is delalloc
    between disk_i_size and our ordered extent, we will not update disk_i_size
    and will assume that when that delalloc bit gets written out it will update
    everything properly. However, if we crash before that, we will have file
    extents outside of our i_size, which is not good, so this test is dangerous
    as well as racy. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • In btrfs_wait_ordered_range we have been calling filemap_fdatawrite() twice,
    because compression does strange things, and then waiting. Then we look up
    ordered extents, and if we find any we will always schedule_timeout() once
    and then loop back around and do it all again. We will even check to see if
    there are delalloc pages in this range and loop again. So this patch gets
    rid of the multiple fdatawrite() calls and just does
    filemap_write_and_wait(). In the compression case we will still find the
    ordered extents and start those individually if we need to, so that is ok,
    but in the normal buffered case we avoid all this weird overhead.

    Then in the case of the schedule_timeout(1), we don't need it. All callers
    either 1) don't care, they just want to make sure what they just wrote makes
    it to disk, or 2) are doing the lock()->lookup ordered->unlock->flush thing,
    in which case they will lock and check for ordered extents _anyway_, so get
    back to them as quickly as possible. The delalloc check is simply not
    needed: it only catches the case where we write to the file again after
    doing the filemap_write_and_wait(), and if the caller truly cares about
    that, it will take care of everything itself. Thanks,
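
    A rough outline of the simplified wait path (real helper names from that
    era, but with locking details and corner cases elided; an outline, not the
    exact patch):

    /* flush and wait once; compression can complete writeback while its
     * ordered extents are still pending, so finish those explicitly */
    filemap_write_and_wait_range(inode->i_mapping, start, end);

    while ((ordered = btrfs_lookup_first_ordered_extent(inode, end))) {
            if (ordered->file_offset > end ||
                ordered->file_offset + ordered->len <= start) {
                    btrfs_put_ordered_extent(ordered);
                    break;
            }
            btrfs_start_ordered_extent(inode, ordered, 1);  /* 1 == wait */
            btrfs_put_ordered_extent(ordered);
    }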

    Signed-off-by: Josef Bacik

    Josef Bacik
     

28 Mar, 2011

1 commit

  • Tracepoints can provide insight into why btrfs hits bugs and are greatly
    helpful for debugging, e.g.:
    dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
    dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
    btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
    btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
    btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
    flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
    flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
    flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
    flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
    btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
    btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)

    Here is what I have added:

    1) ordered_extent:
    btrfs_ordered_extent_add
    btrfs_ordered_extent_remove
    btrfs_ordered_extent_start
    btrfs_ordered_extent_put

    These provide critical information to understand how ordered_extents are
    updated.

    2) extent_map:
    btrfs_get_extent

    extent_map is used in both the read and write cases, and it is useful for
    tracking how btrfs-specific IO is running.

    3) writepage:
    __extent_writepage
    btrfs_writepage_end_io_hook

    Pages are critical resources and produce a lot of corner cases during
    writeback, so it is valuable to know how a page is written to disk.

    4) inode:
    btrfs_inode_new
    btrfs_inode_request
    btrfs_inode_evict

    These can show where and when an inode is created, and when an inode is evicted.

    5) sync:
    btrfs_sync_file
    btrfs_sync_fs

    These show sync arguments.

    6) transaction:
    btrfs_transaction_commit

    In a transaction-based filesystem, it is useful to know the generation and
    who does the commit.

    7) back reference and cow:
    btrfs_delayed_tree_ref
    btrfs_delayed_data_ref
    btrfs_delayed_ref_head
    btrfs_cow_block

    Btrfs natively supports back references; these tracepoints are helpful for
    understanding btrfs's COW mechanism.

    8) chunk:
    btrfs_chunk_alloc
    btrfs_chunk_free

    A chunk is a link between a physical offset and a logical offset, and
    stands for space information in btrfs; these are helpful for tracing space
    usage.

    9) reserved_extent:
    btrfs_reserved_extent_alloc
    btrfs_reserved_extent_free

    These can show how btrfs uses its space.
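
    For reference, a typical way to consume these events through the kernel's
    tracing interface (paths assume debugfs is mounted at /sys/kernel/debug, as
    was usual at the time):

    # echo 1 > /sys/kernel/debug/tracing/events/btrfs/btrfs_ordered_extent_add/enable
    # cat /sys/kernel/debug/tracing/trace_pipe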

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    liubo
     

29 Nov, 2010

1 commit

  • The new DIO bio splitting code has problems when the bio
    spans more than one ordered extent. This will happen as the
    generic DIO code merges our get_blocks calls together into
    a bigger single bio.

    This fixes things by walking forward in the ordered extent code, finding
    all the overlapping ordered extents and completing them all at once.

    Signed-off-by: Chris Mason

    Chris Mason
     

30 Oct, 2010

1 commit

  • These are all cases where a variable is set but not read, which are
    not bugs as far as I can see, but simply leftovers.

    Still needs more review.

    Found by gcc 4.6's new warnings.

    Signed-off-by: Andi Kleen
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Chris Mason

    Andi Kleen
     

25 May, 2010

2 commits

  • This provides basic DIO support for reading and writing. It does not do the
    work to recover from mismatching checksums; that will come later. A few
    design changes have been made from Jim's code (sorry Jim!):

    1) Use the generic direct-io code. Jim originally re-wrote all the generic
    DIO code in order to account for all of BTRFS's oddities, but thanks to that
    work it seems like the best bet is to just ignore compression and such and
    simply fall back on buffered IO.

    2) Fall back on buffered IO for compressed or inline extents. Jim's code did
    its own buffering to make dio with compressed extents work. Now we just
    fall back onto normal buffered IO.

    3) Use ordered extents for the writes so that all of the

    lock_extent()
    lookup_ordered()

    type checks continue to work.

    4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with
    DIO writes.

    I've tested this with fsx and everything works great. This patch depends on my
    dio and filemap.c patches to work. Thanks,
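
    A simplified sketch of the lock_extent()/lookup_ordered() pattern from
    points 3) and 4), with details such as cached extent state and GFP flags
    omitted (the helper signatures are approximations):

    /* take the extent range lock, but back off and flush if ordered io
     * is still in flight for the range, then retry */
    while (1) {
            lock_extent(&BTRFS_I(inode)->io_tree, start, end);
            ordered = btrfs_lookup_ordered_extent(inode, start);
            if (!ordered)
                    break;  /* range locked with no ordered extent: safe */
            unlock_extent(&BTRFS_I(inode)->io_tree, start, end);
            btrfs_start_ordered_extent(inode, ordered, 1);
            btrfs_put_ordered_extent(ordered);
    }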

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Introduce a metadata reservation context for delayed allocation
    and update various related functions.

    This patch also introduces the EXTENT_FIRST_DELALLOC control bit for
    set/clear_extent_bit. It tells set/clear_bit_hook whether they
    are processing the first extent_state with the EXTENT_DELALLOC bit
    set. This change is important when set/clear_extent_bit involves
    multiple extent_states.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

06 Apr, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: add check for changed leaves in setup_leaf_for_split
    Btrfs: create snapshot references in same commit as snapshot
    Btrfs: fix small race with delalloc flushing waitqueue's
    Btrfs: use add_to_page_cache_lru, use __page_cache_alloc
    Btrfs: fix chunk allocate size calculation
    Btrfs: kill max_extent mount option
    Btrfs: fail to mount if we have problems reading the block groups
    Btrfs: check btrfs_get_extent return for IS_ERR()
    Btrfs: handle kmalloc() failure in inode lookup ioctl
    Btrfs: dereferencing freed memory
    Btrfs: Simplify num_stripes's calculation logical for __btrfs_alloc_chunk()
    Btrfs: Add error handle for btrfs_search_slot() in btrfs_read_chunk_tree()
    Btrfs: Remove unnecessary finish_wait() in wait_current_trans()
    Btrfs: add NULL check for do_walk_down()
    Btrfs: remove duplicate include in ioctl.c

    Fix trivial conflict in fs/btrfs/compression.c due to slab.h include
    cleanups.

    Linus Torvalds
     

31 Mar, 2010

1 commit

  • As Yan pointed out, there's not much reason for all this complicated math to
    account for file extents being split up into max_extent chunks, since they
    are likely to all end up in the same leaf anyway. Since there isn't much
    reason to use max_extent, just remove the option altogether so we have one
    less thing we need to test.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script was
    used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered:
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, while for others adding it to an
    implementation .h or embedding .c file was more appropriate. This
    step added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on the arch to make things
    build (like ipr on powerpc/64, which failed due to a missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as a bisection point.

    Given the fact that I had only a couple of failures from the tests in step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers, which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

15 Mar, 2010

2 commits

  • When finishing io we run btrfs_dec_test_ordered_pending and then immediately
    run btrfs_lookup_ordered_extent, but btrfs_dec_test_ordered_pending does
    that lookup already, so we're searching twice when we don't have to. This
    patch lets us pass a btrfs_ordered_extent in to btrfs_dec_test_ordered_pending,
    so if we do complete io on that ordered extent we can just use the one we
    found instead of having to do another btrfs_lookup_ordered_extent. This took
    my fio job with the other patch from 24 MB/s to 29 MB/s.
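
    A sketch of the resulting call pattern (the out-parameter matches the
    description above; the exact signature and the completion helper are
    approximations):

    struct btrfs_ordered_extent *ordered = NULL;

    /* returns nonzero once io on the whole ordered extent is complete,
     * handing back the entry it found so no second lookup is needed */
    if (btrfs_dec_test_ordered_pending(inode, &ordered, file_offset, io_size))
            finish_ordered_io(ordered);     /* hypothetical completion step */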

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • The ordered tree used to need a mutex, but currently all we use it for is to
    protect the rb_tree, and a spin_lock is just fine for that. Using a
    spin_lock instead makes dbench run a little faster, 58 MB/s instead of
    51 MB/s, with less latency, 3445.138 ms instead of 3820.633 ms.
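
    A minimal sketch of the locking shape this relies on (names assumed):
    nothing inside the critical section sleeps, so a spinlock can replace the
    mutex.

    struct ordered_tree {
            spinlock_t lock;        /* was a struct mutex */
            struct rb_root tree;
    };

    static void ordered_tree_remove(struct ordered_tree *t, struct rb_node *n)
    {
            /* pure pointer manipulation, nothing sleeps: a spinlock (and
             * thus a shorter, cheaper critical section) is sufficient */
            spin_lock(&t->lock);
            rb_erase(n, &t->tree);
            spin_unlock(&t->lock);
    }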

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

18 Dec, 2009

1 commit

  • iput() can trigger new transactions if we are dropping the
    final reference, so calling it in btrfs_commit_transaction
    may end up deadlocking. This patch adds delayed iput to avoid
    the issue.
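
    A minimal sketch of the delayed-iput idea (hypothetical names and a global
    list for brevity; allocation failure handling omitted): queue the inode
    instead of calling iput() while committing, then run the iputs from a safe
    context once the commit is done.

    struct delayed_iput {
            struct list_head list;
            struct inode *inode;
    };

    static LIST_HEAD(delayed_iputs);
    static DEFINE_SPINLOCK(delayed_iput_lock);

    static void add_delayed_iput(struct inode *inode)
    {
            struct delayed_iput *di = kmalloc(sizeof(*di), GFP_NOFS);

            di->inode = inode;
            spin_lock(&delayed_iput_lock);
            list_add_tail(&di->list, &delayed_iputs);
            spin_unlock(&delayed_iput_lock);
    }

    static void run_delayed_iputs(void)
    {
            struct delayed_iput *di, *tmp;
            LIST_HEAD(list);

            spin_lock(&delayed_iput_lock);
            list_splice_init(&delayed_iputs, &list);
            spin_unlock(&delayed_iput_lock);

            /* safe here: no transaction is held, so iput() may start one */
            list_for_each_entry_safe(di, tmp, &list, list) {
                    list_del(&di->list);
                    iput(di->inode);
                    kfree(di);
            }
    }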

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng