21 Nov, 2014

4 commits

  • We try to allocate an extent state structure before acquiring the extent
    state tree's spinlock, as we might need a new one later, and thereby avoid
    doing an atomic allocation later while holding the tree's spinlock. However,
    we returned -ENOMEM if that initial non-atomic allocation failed, which is
    a bit excessive since we might end up not needing the pre-allocated extent
    state at all - for the case where the tree doesn't have any extent states
    that cover the input range while also covering any other range. Therefore
    don't return -ENOMEM if that pre-allocation fails.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • We try to allocate an extent state before acquiring the tree's spinlock,
    just in case we end up needing to split an existing extent state into two.
    If that allocation failed, we would return -ENOMEM.
    However, our only caller (the transaction/log commit code) passes in an
    extent state that was cached from a call to find_first_extent_bit(), and
    that cached state is very likely to match the input range exactly (always
    true for a transaction commit and very often, but not always, true for a
    log commit) - in which case we don't need the initial extent state reserved
    for an eventual split at all. Therefore just don't return -ENOMEM if we
    can't allocate the temporary extent state, since we might not need it at
    all, and if we do end up needing one, we'll allocate it later anyway.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • Right now the only caller of find_first_extent_bit() that is interested
    in caching extent states (transaction or log commit) never gets an extent
    state cached. This is because find_first_extent_bit() only caches states
    that have at least one of the flags EXTENT_IOBITS or EXTENT_BOUNDARY, and
    the transaction/log commit caller always passes a tree that never has
    extent states with any of those flags (they can only have one of the
    following flags: EXTENT_DIRTY, EXTENT_NEW or EXTENT_NEED_WAIT).

    This change, together with the following one in the patch series (titled
    "Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early"), will
    significantly reduce the chances of calls to convert_extent_bit()
    failing with -ENOMEM when called from the transaction/log commit code.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(),
    we start and end the writeback for the pages (clear their dirty flag, unlock them, etc)
    but we don't tag the pages, nor the inode's mapping, with an error. This makes it
    impossible for a caller of filemap_fdatawait_range() (e.g. fsync or a transaction
    commit) to know that there was an error.

    Note that the return value of submit_compressed_extents() is useless, as that function
    is executed by a workqueue task and not directly by the fill_delalloc callback. This
    means the writepage/s callbacks of the inode's address space operations don't get that
    return value.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

05 Oct, 2014

1 commit


04 Oct, 2014

3 commits

  • While we have a transaction ongoing, the VM might decide at any time
    to call btree_inode->i_mapping->a_ops->writepages(), which will start
    writeback of dirty pages belonging to btree nodes/leafs. This call
    might return an error or the writeback might finish with an error
    before we attempt to commit the running transaction. If this happens,
    we might have no way of knowing that such error happened when we are
    committing the transaction - because the pages might no longer be
    marked dirty nor tagged for writeback (if a subsequent modification
    to the extent buffer didn't happen before the transaction commit) which
    makes filemap_fdata[write|wait]_range unable to find such pages (even
    if they're marked with SetPageError).
    So if this happens we must abort the transaction, otherwise we commit
    a super block with btree roots that point to btree nodes/leafs whose
    content on disk is invalid - either garbage or the content of some
    node/leaf from a past generation that got cowed or deleted and is no
    longer valid (for this latter case we end up getting error messages like
    "parent transid verify failed on 10826481664 wanted 25748 found 29562"
    when reading btree nodes/leafs from disk).

    Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
    i_mapping would not be enough because we need to distinguish between
    log tree extents (not fatal) vs non-log tree extents (fatal) and
    because the next call to filemap_fdatawait_range() will catch and clear
    such errors in the mapping - and that call might be from a log sync and
    not from a transaction commit, which means we would not know about the
    error at transaction commit time. Also, checking for the eb flag
    EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
    not be completely reliable, as the eb might be removed from memory and
    read back when trying to get it, which clears that flag right before
    reading the eb's pages from disk, making us not know about the previous
    write error.

    Using the new 3 flags for the btree inode also handles the case that
    AS_EIO/AS_ENOSPC alone could not: writepages() returns success after
    starting writeback for all dirty pages, and before filemap_fdatawait_range()
    is called that writeback finishes with errors - since we were not using
    AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success, as it
    could not know that writeback errors happened (the pages were no longer
    tagged for writeback).

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • This is actually inspired by Filipe's patch. When write_one_eb() fails on
    submit_extent_page(), it gives up writing this eb and marks it with
    EXTENT_BUFFER_IOERR. So if it's not the last page that encounters the
    failure, some of the remaining pages stay DIRTY, and if a later COW of this
    eb happens (i.e. the eb is COWed and freed), we run into the BUG_ON in
    btrfs_release_extent_buffer_page() for the DIRTY page, i.e. BUG_ON(PageDirty(page)).

    This adds the missing clear_page_dirty_for_io() for the remaining pages of the eb.

    Signed-off-by: Liu Bo
    Reviewed-by: Filipe Manana
    Signed-off-by: Chris Mason

    Liu Bo
     
  • If submit_extent_page() fails in write_one_eb(), we end up with the current
    page not marked dirty anymore, unlocked and marked for writeback. But we never
    end up calling end_page_writeback() against the page, which will make calls to
    filemap_fdatawait_range (e.g. at transaction commit time) hang forever waiting
    for the writeback bit to be cleared from the page.

    Signed-off-by: Filipe Manana
    Reviewed-by: Liu Bo
    Signed-off-by: Chris Mason

    Filipe Manana
     

02 Oct, 2014

2 commits


18 Sep, 2014

13 commits

  • After the data is written successfully, we should clean up the read failure record
    in that range because:
    - If data COW is set for the file, the range that the failure record pointed to is
    mapped to a new place, so it is invalid.
    - If no data COW is set for the file, and if there is no error during writing,
    the corrupted data is corrected, so the failure record can be removed. And if
    some errors happen on the mirrors, we needn't worry about it either, because
    the failure record will be recreated if we read the same place again.

    Sometimes we may fail to correct the data, so failure records will be left
    in the tree; we need to free them when we free the inode, or the memory leaks.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • This patch implements the data repair function for when a direct read fails.

    The details of the implementation are:
    - When we find the data is not right, we try to read the data from the other
    mirror.
    - When the io on the mirror ends, we insert the endio work into a
    dedicated btrfs workqueue, not the common read endio workqueue, because the
    original endio work is still blocked in the btrfs endio workqueue; if we
    inserted the endio work of the io on the mirror into that workqueue, a
    deadlock would happen.
    - After we get the right data, we write it back to the corrupted mirror.
    - If the data on the new mirror is still corrupted, we try the next
    mirror until we read the right data or all the mirrors are traversed.
    - After the above work, we set the uptodate flag according to the result.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • We could not use clean_io_failure in the direct IO path because it got the
    filesystem information from the page structure, but the pages in a direct
    IO bio don't carry that filesystem information in their structure. So we
    need to modify it to take all the information it needs as parameters.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • The original repair_io_failure code was only used for buffered reads;
    it got some filesystem data from the page structure, which is safe for
    pages in the page cache. But when we do a direct read, the pages in the bio
    are not in the page cache, that is, there is no filesystem data in the page
    structure. In order to implement direct read data repair, we need to modify
    repair_io_failure to take all the filesystem data it needs as function
    parameters.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • The data repair function for direct reads will be implemented later, and
    some code in bio_readpage_error will be reused, so split bio_readpage_error
    into several functions which will be used in direct read repair later.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • We forgot to free the failure record and the bio when submitting the
    re-read bio fails; fix it.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Direct IO splits the original bio into several sub-bios because of the
    raid stripe limit, and the filesystem waits for all sub-bios before running
    the final end io process.

    But it was very hard to implement data repair when a dio read failure happens,
    because in the final end io function we didn't know which mirror the data was
    read from. So in order to implement data repair, we have to move the file data
    check from the final end io function to the sub-bio end io function, in which
    we can get the mirror number of the device we accessed. This patch does that
    work as the first step of the direct io data repair implementation.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • The current code would load checksum data several times when we split
    a whole direct read io because of the raid stripe limit, making us search
    the csum tree several times. In fact this just wasted time and worsened
    contention on the csum tree root. This patch improves the situation by
    loading the data at once.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • We have been iterating all references for each extent we have in a file when we
    do fiemap to see if it is shared. This is fine when you have a few clones or a
    few snapshots, but when you have 5k snapshots suddenly fiemap just sits there
    and stares at you. So add btrfs_check_shared which will use the backref walking
    code but will short circuit as soon as it finds a root or inode that doesn't
    match the one we currently have. This makes fiemap on my testbox go from
    looking at me blankly for a day to spitting out actual output in a reasonable
    amount of time. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • We've defined an 'offset' outside of bio_for_each_segment_all.

    This is just a clean rename, no functional changes.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • The tree field of struct extent_state was only used to figure out if
    an extent state was connected to an inode's io tree or not. For this
    we can just use the rb_node field itself.

    On an x86_64 system, with this change the sizeof(struct extent_state) is
    reduced from 96 bytes down to 88 bytes, meaning that with a page size
    of 4096 bytes we can now store 46 extent states per page instead of 42.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • btrfs_set_key_type and btrfs_key_type are used inconsistently along with
    open coded variants. Other members of btrfs_key are accessed directly
    without any helpers anyway.

    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     

28 Aug, 2014

1 commit

  • Pull btrfs fixes from Chris Mason:
    "The biggest of these comes from Liu Bo, who tracked down a hang we've
    been hitting since moving to kernel workqueues (it's a btrfs bug, not
    in the generic code). His patch needs backporting to 3.16 and 3.15
    stable, which I'll send once this is in.

    Otherwise these are assorted fixes. Most were integrated last week
    during KS, but I wanted to give everyone the chance to test the
    result, so I waited for rc2 to come out before sending"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
    Btrfs: fix task hang under heavy compressed write
    Btrfs: fix filemap_flush call in btrfs_file_release
    Btrfs: fix crash on endio of reading corrupted block
    btrfs: fix leak in qgroup_subtree_accounting() error path
    btrfs: Use right extent length when inserting overlap extent map.
    Btrfs: clone, don't create invalid hole extent map
    Btrfs: don't monopolize a core when evicting inode
    Btrfs: fix hole detection during file fsync
    Btrfs: ensure tmpfile inode is always persisted with link count of 0
    Btrfs: race free update of commit root for ro snapshots
    Btrfs: fix regression of btrfs device replace
    Btrfs: don't consider the missing device when allocating new chunks
    Btrfs: Fix wrong device size when we are resizing the device
    Btrfs: don't write any data into a readonly device when scrub
    Btrfs: Fix the problem that the replace destroys the seed filesystem
    btrfs: Return right extent when fiemap gives unaligned offset and len.
    Btrfs: fix wrong extent mapping for DirectIO
    Btrfs: fix wrong write range for filemap_fdatawrite_range()
    Btrfs: fix wrong missing device counter decrease
    Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs
    ...

    Linus Torvalds
     

21 Aug, 2014

1 commit

  • The crash is

    ------------[ cut here ]------------
    kernel BUG at fs/btrfs/extent_io.c:2124!
    [...]
    Workqueue: btrfs-endio normal_work_helper [btrfs]
    RIP: 0010:[] [] end_bio_extent_readpage+0xb45/0xcd0 [btrfs]

    This is in fact a regression.

    It is because we forgot to increase @offset properly when reading a
    corrupted block, so that the stale @offset remains, and this leads to
    checksum errors while reading the remaining blocks queued up in the same
    bio, and then ends up hitting the above BUG_ON.

    Reported-by: Chris Murphy
    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     

19 Aug, 2014

1 commit

  • When a page-aligned start and len are passed to extent_fiemap(), the result
    is good, but when start and len are not aligned, e.g. start = 1 and len =
    4095 is passed to extent_fiemap(), it returns no extent.

    The problem is that both start and len are rounded down. This patch rounds
    down start and rounds up (start + len) so that the right extent is
    returned.

    Reported-by: Chandan Rajendra
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: Chris Mason

    Qu Wenruo
     

16 Jul, 2014

1 commit

  • The current "wait_on_bit" interface requires an 'action'
    function to be provided which does the actual waiting.
    There are over 20 such functions, many of them identical.
    Most cases can be satisfied by one of just two functions, one
    which uses io_schedule() and one which just uses schedule().

    So:
    Rename wait_on_bit and wait_on_bit_lock to
    wait_on_bit_action and wait_on_bit_lock_action
    to make it explicit that they need an action function.

    Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
    which are *not* given an action function but implicitly use
    a standard one.
    The decision to error-out if a signal is pending is now made
    based on the 'mode' argument rather than being encoded in the action
    function.

    All instances of the old wait_on_bit and wait_on_bit_lock which
    can use the new version have been changed accordingly and their
    action functions have been discarded.
    wait_on_bit{_lock} does not return any specific error code in the
    event of a signal so the caller must check for non-zero and
    interpolate their own error code as appropriate.

    The wait_on_bit() call in __fscache_wait_on_invalidate() was
    ambiguous, as it specified TASK_UNINTERRUPTIBLE but used
    fscache_wait_bit_interruptible as an action function.
    David Howells confirms this should be uniformly
    "uninterruptible".

    The main remaining user of wait_on_bit{,_lock}_action is NFS
    which needs to use a freezer-aware schedule() call.

    A comment in fs/gfs2/glock.c notes that having multiple 'action'
    functions is useful as they display differently in the 'wchan'
    field of 'ps' (and /proc/$PID/wchan).
    As the new bit_wait{,_io} functions are tagged "__sched", they
    will not show up at all, but something higher in the stack will. So
    the distinction will still be visible, only with different
    function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
    gfs2/glock.c case).

    Since the first version of this patch (against 3.15), two new action
    functions appeared, one in NFS and one in CIFS. CIFS also now
    uses an action function that makes the same freezer-aware
    schedule call as NFS.

    Signed-off-by: NeilBrown
    Acked-by: David Howells (fscache, keys)
    Acked-by: Steven Whitehouse (gfs2)
    Acked-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Steve French
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
    Signed-off-by: Ingo Molnar

    NeilBrown
     

15 Jun, 2014

1 commit

  • Pull more btrfs updates from Chris Mason:
    "This has a few fixes since our last pull and a new ioctl for doing
    btree searches from userland. It's very similar to the existing
    ioctl, but lets us return larger items back down to the app"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    btrfs: fix error handling in create_pending_snapshot
    btrfs: fix use of uninit "ret" in end_extent_writepage()
    btrfs: free ulist in qgroup_shared_accounting() error path
    Btrfs: fix qgroups sanity test crash or hang
    btrfs: prevent RCU warning when dereferencing radix tree slot
    Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting
    btrfs: new ioctl TREE_SEARCH_V2
    btrfs: tree_search, search_ioctl: direct copy to userspace
    btrfs: new function read_extent_buffer_to_user
    btrfs: tree_search, copy_to_sk: return needed size on EOVERFLOW
    btrfs: tree_search, copy_to_sk: return EOVERFLOW for too small buffer
    btrfs: tree_search, search_ioctl: accept varying buffer
    btrfs: tree_search: eliminate redundant nr_items check

    Linus Torvalds
     

14 Jun, 2014

1 commit

  • If this condition in end_extent_writepage() is false:

    if (tree->ops && tree->ops->writepage_end_io_hook)

    we will then test an uninitialized "ret" at:

    ret = ret < 0 ? ret : -EIO;

    The test for ret is for the case where ->writepage_end_io_hook
    failed, and we'd choose that ret as the error; but if
    there is no ->writepage_end_io_hook, nothing sets ret.

    Initializing ret to 0 should be sufficient; if
    writepage_end_io_hook wasn't set, (!uptodate) means
    non-zero err was passed in, so we choose -EIO in that case.

    Signed-off-by: Eric Sandeen

    Signed-off-by: Chris Mason

    Eric Sandeen
     

13 Jun, 2014

1 commit


12 Jun, 2014

1 commit

  • Pull btrfs updates from Chris Mason:
    "The biggest change here is Josef's rework of the btrfs quota
    accounting, which improves the in-memory tracking of delayed extent
    operations.

    I had been working on Btrfs stack usage for a while, mostly because it
    had become impossible to do long stress runs with slab, lockdep and
    pagealloc debugging turned on without blowing the stack. Even though
    you upgraded us to a nice king sized stack, I kept most of the
    patches.

    We also have some very hard to find corruption fixes, an awesome sysfs
    use after free, and the usual assortment of optimizations, cleanups
    and other fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (80 commits)
    Btrfs: convert smp_mb__{before,after}_clear_bit
    Btrfs: fix scrub_print_warning to handle skinny metadata extents
    Btrfs: make fsync work after cloning into a file
    Btrfs: use right type to get real comparison
    Btrfs: don't check nodes for extent items
    Btrfs: don't release invalid page in btrfs_page_exists_in_range()
    Btrfs: make sure we retry if page is a retriable exception
    Btrfs: make sure we retry if we couldn't get the page
    btrfs: replace EINVAL with EOPNOTSUPP for dev_replace raid56
    trivial: fs/btrfs/ioctl.c: fix typo s/substract/subtract/
    Btrfs: fix leaf corruption after __btrfs_drop_extents
    Btrfs: ensure btrfs_prev_leaf doesn't miss 1 item
    Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled
    btrfs: free delayed node outside of root->inode_lock
    btrfs: replace EINVAL with ERANGE for resize when ULLONG_MAX
    Btrfs: fix transaction leak during fsync call
    btrfs: Avoid trucating page or punching hole in a already existed hole.
    Btrfs: update commit root on snapshot creation after orphan cleanup
    Btrfs: ioctl, don't re-lock extent range when not necessary
    Btrfs: avoid visiting all extent items when cloning a range
    ...

    Linus Torvalds
     

10 Jun, 2014

6 commits

  • __extent_writepage has two unrelated parts. First it does the delayed
    allocation dance and second it does the mapping and IO for the page
    we're actually writing.

    This splits it up into those two parts so the stack from one doesn't
    impact the stack from the other.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This adds noinline_for_stack to two helpers used by
    btree_write_cache_pages. It shaves us down from 424 bytes on the
    stack to 280.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • We need to NULL the cached_state after freeing it, otherwise
    we might free it again if find_delalloc_range doesn't find anything.

    Signed-off-by: Chris Mason
    cc: stable@vger.kernel.org

    Chris Mason
     
  • This exercises the various parts of the new qgroup accounting code. We do some
    basic stuff and do some things with the shared refs to make sure all that code
    works. I had to add a bunch of infrastructure because I needed to be able to
    insert items into a fake tree without having to do all the hard work myself;
    hopefully this will be useful in the future. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • According to commit 865ffef3797da2cac85b3354b5b6050dc9660978
    (fs: fix fsync() error reporting),
    it's not reliable to just check error pages, because pages can be
    truncated or invalidated; we should also mark the mapping with an error
    flag so that a later fsync can catch the error.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • When running low on available disk space and having several processes
    doing buffered file IO, I got the following trace in dmesg:

    [ 4202.720152] INFO: task kworker/u8:1:5450 blocked for more than 120 seconds.
    [ 4202.720401] Not tainted 3.13.0-fdm-btrfs-next-26+ #1
    [ 4202.720596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 4202.720874] kworker/u8:1 D 0000000000000001 0 5450 2 0x00000000
    [ 4202.720904] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs]
    [ 4202.720908] ffff8801f62ddc38 0000000000000082 ffff880203ac2490 00000000001d3f40
    [ 4202.720913] ffff8801f62ddfd8 00000000001d3f40 ffff8800c4f0c920 ffff880203ac2490
    [ 4202.720918] 00000000001d4a40 ffff88020fe85a40 ffff88020fe85ab8 0000000000000001
    [ 4202.720922] Call Trace:
    [ 4202.720931] [] schedule+0x29/0x70
    [ 4202.720950] [] btrfs_start_ordered_extent+0x6d/0x110 [btrfs]
    [ 4202.720956] [] ? bit_waitqueue+0xc0/0xc0
    [ 4202.720972] [] btrfs_run_ordered_extent_work+0x29/0x40 [btrfs]
    [ 4202.720988] [] normal_work_helper+0x137/0x2c0 [btrfs]
    [ 4202.720994] [] process_one_work+0x1f5/0x530
    (...)
    [ 4202.721027] 2 locks held by kworker/u8:1/5450:
    [ 4202.721028] #0: (%s-%s){++++..}, at: [] process_one_work+0x193/0x530
    [ 4202.721037] #1: ((&work->normal_work)){+.+...}, at: [] process_one_work+0x193/0x530
    [ 4202.721054] INFO: task btrfs:7891 blocked for more than 120 seconds.
    [ 4202.721258] Not tainted 3.13.0-fdm-btrfs-next-26+ #1
    [ 4202.721444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 4202.721699] btrfs D 0000000000000001 0 7891 7890 0x00000001
    [ 4202.721704] ffff88018c2119e8 0000000000000086 ffff8800a33d2490 00000000001d3f40
    [ 4202.721710] ffff88018c211fd8 00000000001d3f40 ffff8802144b0000 ffff8800a33d2490
    [ 4202.721714] ffff8800d8576640 ffff88020fe85bc0 ffff88020fe85bc8 7fffffffffffffff
    [ 4202.721718] Call Trace:
    [ 4202.721723] [] schedule+0x29/0x70
    [ 4202.721727] [] schedule_timeout+0x1dc/0x270
    [ 4202.721732] [] ? mark_held_locks+0xb9/0x140
    [ 4202.721736] [] ? _raw_spin_unlock_irq+0x30/0x40
    [ 4202.721740] [] ? trace_hardirqs_on_caller+0x10d/0x1d0
    [ 4202.721744] [] wait_for_completion+0xdf/0x120
    [ 4202.721749] [] ? try_to_wake_up+0x310/0x310
    [ 4202.721765] [] btrfs_wait_ordered_extents+0x1f4/0x280 [btrfs]
    [ 4202.721781] [] btrfs_mksubvol.isra.62+0x30e/0x5a0 [btrfs]
    [ 4202.721786] [] ? bit_waitqueue+0xc0/0xc0
    [ 4202.721799] [] btrfs_ioctl_snap_create_transid+0x1a9/0x1b0 [btrfs]
    [ 4202.721813] [] btrfs_ioctl_snap_create_v2+0x10a/0x170 [btrfs]
    (...)

    It turns out that extent_io.c:__extent_writepage(), which ends up being called
    through filemap_fdatawrite_range() in btrfs_start_ordered_extent(), was getting
    -ENOSPC when calling the fill_delalloc callback. In this situation, it returned
    without the writepage_end_io_hook callback (inode.c:btrfs_writepage_end_io_hook)
    ever being called for the respective page, which prevents the ordered extent's
    bytes_left count from ever reaching 0, and therefore a finish_ordered_fn work
    is never queued into the endio_write_workers queue. This makes the task that
    called btrfs_start_ordered_extent() hang forever on the wait queue of the ordered
    extent.

    This is fairly easy to reproduce using a small filesystem and fsstress on
    a quad core vm:

    mkfs.btrfs -f -b `expr 2100 \* 1024 \* 1024` /dev/sdd
    mount /dev/sdd /mnt

    fsstress -p 6 -d /mnt -n 100000 -x \
    "btrfs subvolume snapshot -r /mnt /mnt/mysnap" \
    -f allocsp=0 \
    -f bulkstat=0 \
    -f bulkstat1=0 \
    -f chown=0 \
    -f creat=1 \
    -f dread=0 \
    -f dwrite=0 \
    -f fallocate=1 \
    -f fdatasync=0 \
    -f fiemap=0 \
    -f freesp=0 \
    -f fsync=0 \
    -f getattr=0 \
    -f getdents=0 \
    -f link=0 \
    -f mkdir=0 \
    -f mknod=0 \
    -f punch=1 \
    -f read=0 \
    -f readlink=0 \
    -f rename=0 \
    -f resvsp=0 \
    -f rmdir=0 \
    -f setxattr=0 \
    -f stat=0 \
    -f symlink=0 \
    -f sync=0 \
    -f truncate=1 \
    -f unlink=0 \
    -f unresvsp=0 \
    -f write=4

    So just ensure that if an error happens while writing the extent page
    we call the writepage_end_io_hook callback. Also make it return the
    error code and ensure the caller (extent_write_cache_pages) processes
    all pages in the page vector even if an error happens only for some
    of them, so that ordered extents end up released.

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

05 Jun, 2014

1 commit

  • aops->write_begin may allocate a new page and make it visible only to have
    mark_page_accessed called almost immediately after. Once the page is
    visible the atomic operations are necessary, which is noticeable overhead
    when writing to an in-memory filesystem like tmpfs but should also be
    noticeable with fast storage. The objective of the patch is to initialise
    the accessed information with non-atomic operations before the page is
    visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may
    be called before the page is visible and can be done non-atomically.

    The primary APIs of concern in this case are the following and are used
    by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail, so the patch creates a core helper
    pagecache_get_page() which takes a flags parameter that affects its
    behavior, such as whether the page should be marked accessed or not. The
    old API is preserved but is basically a thin wrapper around this core
    function.

    Each of the filesystems are then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of the
    mark_page_accessed() has now changed so in rare cases it's possible a page
    gets to the end of the LRU as PageReferenced where as previously it might
    have been repromoted. This is expected to be rare but it's worth the
    filesystem people thinking about it in case they see a problem with the
    timing change. It is also the case that some filesystems may be marking
    pages accessed that previously did not but it makes sense that filesystems
    have consistent behaviour in this regard.

    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iteration. The size of the
    file is 1/10th of physical memory to avoid dirty page balancing. In the
    async case it is possible that the workload completes without even
    hitting the disk and will have variable results, but it highlights the
    impact of mark_page_accessed for async IO. The sync results are expected
    to be more stable. The exception is tmpfs, where the normal case is for
    the "IO" to not hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or NUMA
    artifacts. Throughput and wall times are presented for sync IO; only wall
    times are shown for async, as the granularity reported by dd and the
    variability are unsuitable for comparison. As async results were variable
    due to writeback timings, I'm only reporting the maximum figures. The sync
    results were stable enough to make the mean and stddev uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd
                            3.15.0-rc3          3.15.0-rc3
                               vanilla         accessed-v2
    ext3   Max elapsed  13.9900 ( 0.00%)  11.5900 ( 17.16%)
    tmpfs  Max elapsed   0.5100 ( 0.00%)   0.4900 (  3.92%)
    btrfs  Max elapsed  12.8100 ( 0.00%)  12.7800 (  0.23%)
    ext4   Max elapsed  18.6000 ( 0.00%)  13.3400 ( 28.28%)
    xfs    Max elapsed  12.5600 ( 0.00%)   2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

           samples  percentage
    ext3     86107      0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
    ext3     23833      0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext3      5036      0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    ext4     64566      0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
    ext4      5322      0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext4      2869      0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs      62126      1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
    xfs       1904      0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs        103      0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    btrfs    10655      0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
    btrfs     2020      0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    btrfs      587      0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    tmpfs    59562      3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
    tmpfs     1210      0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    tmpfs       94      0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

18 Apr, 2014

1 commit

  • Mostly scripted conversion of the smp_mb__* barriers.
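
    This conversion collapsed the family of per-operation barrier macros
    (e.g. smp_mb__before_clear_bit(), smp_mb__after_atomic_inc()) into the
    generic smp_mb__before_atomic()/smp_mb__after_atomic() pair. A minimal
    userspace sketch of the pattern, using GCC builtins as simplified
    stand-ins for the kernel's per-architecture definitions:

    ```c
    #include <assert.h>

    /* Illustrative stand-ins; the real kernel macros are arch-specific
     * and may be no-ops where atomics already imply full barriers. */
    #define smp_mb__before_atomic() __sync_synchronize()
    #define smp_mb__after_atomic()  __sync_synchronize()

    static int bump(int *counter)
    {
        smp_mb__before_atomic();        /* order earlier stores before the RMW */
        __sync_fetch_and_add(counter, 1);
        smp_mb__after_atomic();         /* order the RMW before later loads */
        return *counter;
    }

    int main(void)
    {
        int counter = 0;
        assert(bump(&counter) == 1);
        return 0;
    }
    ```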

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
    Cc: Linus Torvalds
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

12 Apr, 2014

1 commit

  • Pull second set of btrfs updates from Chris Mason:
    "The most important changes here are from Josef, fixing a btrfs
    regression in 3.14 that can cause corruptions in the extent allocation
    tree when snapshots are in use.

    Josef also fixed some deadlocks in send/recv and other assorted races
    when balance is running"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (23 commits)
    Btrfs: fix compile warnings on on avr32 platform
    btrfs: allow mounting btrfs subvolumes with different ro/rw options
    btrfs: export global block reserve size as space_info
    btrfs: fix crash in remount(thread_pool=) case
    Btrfs: abort the transaction when we don't find our extent ref
    Btrfs: fix EINVAL checks in btrfs_clone
    Btrfs: fix unlock in __start_delalloc_inodes()
    Btrfs: scrub raid56 stripes in the right way
    Btrfs: don't compress for a small write
    Btrfs: more efficient io tree navigation on wait_extent_bit
    Btrfs: send, build path string only once in send_hole
    btrfs: filter invalid arg for btrfs resize
    Btrfs: send, fix data corruption due to incorrect hole detection
    Btrfs: kmalloc() doesn't return an ERR_PTR
    Btrfs: fix snapshot vs nocow writting
    btrfs: Change the expanding write sequence to fix snapshot related bug.
    btrfs: make device scan less noisy
    btrfs: fix lockdep warning with reclaim lock inversion
    Btrfs: hold the commit_root_sem when getting the commit root during send
    Btrfs: remove transaction from send
    ...

    Linus Torvalds