23 Jan, 2016

1 commit

  • Pull more btrfs updates from Chris Mason:
    "These are mostly fixes that we've been testing, but also we grabbed
    and tested a few small cleanups that had been on the list for a while.

    Zhao Lei's patchset also fixes some early ENOSPC buglets"

    * 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (21 commits)
    btrfs: raid56: Use raid_write_end_io for scrub
    btrfs: Remove unnecessary ClearPageUptodate for raid56
    btrfs: use rbio->nr_pages to reduce calculation
    btrfs: Use unified stripe_page's index calculation
    btrfs: Fix calculation of rbio->dbitmap's size calculation
    btrfs: Fix no_space in write and rm loop
    btrfs: merge functions for wait snapshot creation
    btrfs: delete unused argument in btrfs_copy_from_user
    btrfs: Use direct way to determine raid56 write/recover mode
    btrfs: Small cleanup for get index_srcdev loop
    btrfs: Enhance chunk validation check
    btrfs: Enhance super validation check
    Btrfs: fix deadlock running delayed iputs at transaction commit time
    Btrfs: fix typo in log message when starting a balance
    btrfs: remove duplicate const specifier
    btrfs: initialize the seq counter in struct btrfs_device
    Btrfs: clean up an error code in btrfs_init_space_info()
    btrfs: fix iterator with update error in backref.c
    Btrfs: fix output of compression message in btrfs_parse_options()
    Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots
    ...

    Linus Torvalds
     

20 Jan, 2016

2 commits

  • wait_for_snapshot_creation() is in the same group as the other two:
    btrfs_start_write_no_snapshoting()
    btrfs_end_write_no_snapshoting()

    Rename wait_for_snapshot_creation() and move it to the same place
    as the other two.

    Signed-off-by: Zhao Lei
    Signed-off-by: Chris Mason

    Zhao Lei
     
  • While running a stress test I ran into a deadlock when running the delayed
    iputs at transaction commit time, which produced the following report and
    trace:

    [ 886.399989] =============================================
    [ 886.400871] [ INFO: possible recursive locking detected ]
    [ 886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
    [ 886.402384] ---------------------------------------------
    [ 886.403182] fio/8277 is trying to acquire lock:
    [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
    [ 886.403568]
    [ 886.403568] but task is already holding lock:
    [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
    [ 886.403568]
    [ 886.403568] other info that might help us debug this:
    [ 886.403568] Possible unsafe locking scenario:
    [ 886.403568]
    [ 886.403568] CPU0
    [ 886.403568] ----
    [ 886.403568] lock(&fs_info->delayed_iput_sem);
    [ 886.403568] lock(&fs_info->delayed_iput_sem);
    [ 886.403568]
    [ 886.403568] *** DEADLOCK ***
    [ 886.403568]
    [ 886.403568] May be due to missing lock nesting notation
    [ 886.403568]
    [ 886.403568] 3 locks held by fio/8277:
    [ 886.403568] #0: (sb_writers#11){.+.+.+}, at: [] __sb_start_write+0x5f/0xb0
    [ 886.403568] #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [] btrfs_file_write_iter+0x73/0x408 [btrfs]
    [ 886.403568] #2: (&fs_info->delayed_iput_sem){++++..}, at: [] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
    [ 886.403568]
    [ 886.403568] stack backtrace:
    [ 886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
    [ 886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
    [ 886.403568] 0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
    [ 886.403568] ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
    [ 886.403568] 0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
    [ 886.403568] Call Trace:
    [ 886.403568] [] dump_stack+0x4e/0x79
    [ 886.403568] [] __lock_acquire+0xd42/0xf0b
    [ 886.403568] [] ? __module_address+0xdf/0x108
    [ 886.403568] [] lock_acquire+0x10d/0x194
    [ 886.403568] [] ? lock_acquire+0x10d/0x194
    [ 886.403568] [] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
    [ 886.489542] [] down_read+0x3e/0x4d
    [ 886.489542] [] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
    [ 886.489542] [] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
    [ 886.489542] [] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
    [ 886.489542] [] flush_space+0x435/0x44a [btrfs]
    [ 886.489542] [] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
    [ 886.489542] [] reserve_metadata_bytes+0x28d/0x384 [btrfs]
    [ 886.489542] [] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
    [ 886.489542] [] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
    [ 886.489542] [] btrfs_evict_inode+0x394/0x55a [btrfs]
    [ 886.489542] [] evict+0xa7/0x15c
    [ 886.489542] [] iput+0x1d3/0x266
    [ 886.489542] [] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
    [ 886.489542] [] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
    [ 886.489542] [] ? signal_pending_state+0x31/0x31
    [ 886.489542] [] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
    [ 886.489542] [] btrfs_check_data_free_space+0x40/0x59 [btrfs]
    [ 886.489542] [] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
    [ 886.489542] [] btrfs_direct_IO+0x10c/0x27e [btrfs]
    [ 886.489542] [] generic_file_direct_write+0xb3/0x128
    [ 886.489542] [] btrfs_file_write_iter+0x229/0x408 [btrfs]
    [ 886.489542] [] ? __lock_is_held+0x38/0x50
    [ 886.489542] [] __vfs_write+0x7c/0xa5
    [ 886.489542] [] vfs_write+0xa0/0xe4
    [ 886.489542] [] SyS_write+0x50/0x7e
    [ 886.489542] [] entry_SYSCALL_64_fastpath+0x12/0x6f
    [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
    [ 1081.854348] Not tainted 4.4.0-rc6-btrfs-next-18+ #1
    [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 1081.863227] fio D ffff880213f9bb28 0 8244 8240 0x00000000
    [ 1081.868719] ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
    [ 1081.872499] ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
    [ 1081.876834] ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
    [ 1081.880782] Call Trace:
    [ 1081.881793] [] schedule+0x7f/0x97
    [ 1081.883340] [] rwsem_down_write_failed+0x2d5/0x325
    [ 1081.895525] [] ? trace_hardirqs_on_caller+0x16/0x1ab
    [ 1081.897419] [] call_rwsem_down_write_failed+0x13/0x20
    [ 1081.899251] [] ? call_rwsem_down_write_failed+0x13/0x20
    [ 1081.901063] [] ? __down_write_nested.isra.0+0x1f/0x21
    [ 1081.902365] [] down_write+0x43/0x57
    [ 1081.903846] [] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
    [ 1081.906078] [] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
    [ 1081.908846] [] ? mark_held_locks+0x56/0x6c
    [ 1081.910409] [] btrfs_check_data_free_space+0x40/0x59 [btrfs]
    [ 1081.912482] [] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
    [ 1081.914597] [] btrfs_direct_IO+0x10c/0x27e [btrfs]
    [ 1081.919037] [] generic_file_direct_write+0xb3/0x128
    [ 1081.920754] [] btrfs_file_write_iter+0x229/0x408 [btrfs]
    [ 1081.922496] [] ? __lock_is_held+0x38/0x50
    [ 1081.923922] [] __vfs_write+0x7c/0xa5
    [ 1081.925275] [] vfs_write+0xa0/0xe4
    [ 1081.926584] [] SyS_write+0x50/0x7e
    [ 1081.927968] [] entry_SYSCALL_64_fastpath+0x12/0x6f
    [ 1081.985293] INFO: lockdep is turned off.
    [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
    [ 1081.987434] Not tainted 4.4.0-rc6-btrfs-next-18+ #1
    [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 1081.990147] fio D ffff880218febbb8 0 8249 8240 0x00000000
    [ 1081.991626] ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
    [ 1081.993258] ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
    [ 1081.994850] ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
    [ 1081.996485] Call Trace:
    [ 1081.997037] [] schedule+0x7f/0x97
    [ 1081.998017] [] rwsem_down_write_failed+0x2d5/0x325
    [ 1081.999241] [] ? finish_wait+0x6d/0x76
    [ 1082.000306] [] call_rwsem_down_write_failed+0x13/0x20
    [ 1082.001533] [] ? call_rwsem_down_write_failed+0x13/0x20
    [ 1082.002776] [] ? __down_write_nested.isra.0+0x1f/0x21
    [ 1082.003995] [] down_write+0x43/0x57
    [ 1082.005000] [] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
    [ 1082.007403] [] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
    [ 1082.008988] [] btrfs_fallocate+0x7c1/0xc2f [btrfs]
    [ 1082.010193] [] ? percpu_down_read+0x4e/0x77
    [ 1082.011280] [] ? __sb_start_write+0x5f/0xb0
    [ 1082.012265] [] ? __sb_start_write+0x5f/0xb0
    [ 1082.013021] [] vfs_fallocate+0x170/0x1ff
    [ 1082.013738] [] ioctl_preallocate+0x89/0x9b
    [ 1082.014778] [] do_vfs_ioctl+0x40a/0x4ea
    [ 1082.015778] [] ? SYSC_newfstat+0x25/0x2e
    [ 1082.016806] [] ? __fget_light+0x4d/0x71
    [ 1082.017789] [] SyS_ioctl+0x57/0x79
    [ 1082.018706] [] entry_SYSCALL_64_fastpath+0x12/0x6f

    This happens because we can recursively acquire the semaphore
    fs_info->delayed_iput_sem when attempting to allocate space to satisfy
    a file write request, as shown in the first trace above - when committing
    a transaction we acquire (down_read) the semaphore before running the
    delayed iputs, and when running a delayed iput() we can end up calling
    an inode's eviction handler, which in turn commits another transaction
    and attempts to acquire (down_read) the semaphore again to run more
    delayed iput operations.
    This results in a deadlock because if a task acquires a semaphore
    multiple times it should invoke down_read_nested() with a different
    lockdep class for each level of recursion.

    Fix this by simplifying the implementation: use a mutex instead,
    acquired by the cleaner kthread before it runs the delayed iputs,
    rather than always acquiring a semaphore before delayed iputs are
    run from anywhere.

    Fixes: d7c151717a1e (btrfs: Fix NO_SPACE bug caused by delayed-iput)
    Cc: stable@vger.kernel.org # 4.1+
    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

19 Jan, 2016

1 commit

  • Pull btrfs updates from Chris Mason:
    "This has our usual assortment of fixes and cleanups, but the biggest
    change included is Omar Sandoval's free space tree. It's not the
    default yet, mounting -o space_cache=v2 enables it and sets a readonly
    compat bit. The tree can actually be deleted and regenerated if there
    are any problems, but it has held up really well in testing so far.

    For very large filesystems (30T+) our existing free space caching code
    can end up taking a huge amount of time during commits. The new tree
    based code is faster and less work overall to update as the commit
    progresses.

    Omar worked on this during the summer and we'll hammer on it in
    production here at FB over the next few months"

    * 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (73 commits)
    Btrfs: fix fitrim discarding device area reserved for boot loader's use
    Btrfs: Check metadata redundancy on balance
    btrfs: statfs: report zero available if metadata are exhausted
    btrfs: preallocate path for snapshot creation at ioctl time
    btrfs: allocate root item at snapshot ioctl time
    btrfs: do an allocation earlier during snapshot creation
    btrfs: use smaller type for btrfs_path locks
    btrfs: use smaller type for btrfs_path lowest_level
    btrfs: use smaller type for btrfs_path reada
    btrfs: cleanup, use enum values for btrfs_path reada
    btrfs: constify static arrays
    btrfs: constify remaining structs with function pointers
    btrfs tests: replace whole ops structure for free space tests
    btrfs: use list_for_each_entry* in backref.c
    btrfs: use list_for_each_entry_safe in free-space-cache.c
    btrfs: use list_for_each_entry* in check-integrity.c
    Btrfs: use linux/sizes.h to represent constants
    btrfs: cleanup, remove stray return statements
    btrfs: zero out delayed node upon allocation
    btrfs: pass proper enum type to start_transaction()
    ...

    Linus Torvalds
     

11 Jan, 2016

2 commits


07 Jan, 2016

7 commits

  • The values of btrfs_path::locks are 0 to 4 and fit into a u8. Let's see:

    * overall size of btrfs_path drops from 136 to 112 bytes (-24 bytes)
    * better packing in a slab page (+6 objects)
    * the whole structure now fits into 2 cachelines
    * slight decrease in code size:

    text    data   bss    dec      hex    filename
    938731  43670  23144  1005545  f57e9  fs/btrfs/btrfs.ko.before
    938203  43670  23144  1005017  f55d9  fs/btrfs/btrfs.ko.after

    (and the generated assembly does not change much)

    The main purpose is to decrease the size of the structure without
    affecting performance. Byte access is usually well behaved across
    arches; the locks are not accessed frequently and are sometimes just
    compared to zero.

    Note for further size-reduction attempts: the slots could be made u16,
    but this might generate worse code on some arches (non-byte and non-int
    access). Also, the range of operations on slots is wider than for locks,
    so the potential performance drop should be evaluated first.

    Signed-off-by: David Sterba

    David Sterba
     
  • The level is 0..7, so we can use a smaller type. The size of btrfs_path
    is now 136 bytes, down from 144, which is +2 objects that fit into a
    4k slab.

    Signed-off-by: David Sterba

    David Sterba
     
  • The possible values for reada are all positive and bounded, so we can
    later save some bytes by storing it in a u8.

    Signed-off-by: David Sterba

    David Sterba
     
  • Replace the integers with enums for better readability. The value 2 has
    not had any meaning since a717531942f488209dded30f6bc648167bcefa72
    "Btrfs: do less aggressive btree readahead" (2009-01-22).

    Signed-off-by: David Sterba

    David Sterba
     
  • There are a few statically initialized arrays that can be made const.
    The remaining (like file_system_type, sysfs attributes or prop handlers)
    do not allow that due to type mismatch when passed to the APIs or
    because the structures are modified through other members.

    Signed-off-by: David Sterba

    David Sterba
     
  • We use many constants to represent size and offset values, and to make
    the code readable we use '256 * 1024 * 1024' instead of '268435456' to
    represent '256MB'. However, we can make this far more readable with
    'SZ_256MB', which is defined in 'linux/sizes.h'.

    So this patch replaces 'xxx * 1024 * 1024' kinds of expressions with a
    single 'SZ_xxxMB' when 'xxx' is a power of 2, or with 'xxx * SZ_1M' when
    it is not. I haven't touched '4096' and '8192' because they are more
    intuitive than 'SZ_4KB' and 'SZ_8KB'.

    Signed-off-by: Byongho Lee
    Signed-off-by: David Sterba

    Byongho Lee
     
  • Conform to __btrfs_fs_incompat() cast-to-bool (!!) by explicitly
    returning boolean not int.

    Signed-off-by: Alexandru Moise
    Signed-off-by: David Sterba

    Alexandru Moise
     

01 Jan, 2016

1 commit


30 Dec, 2015

1 commit


24 Dec, 2015

2 commits


19 Dec, 2015

1 commit


18 Dec, 2015

5 commits

  • Now we can finally hook up everything so we can actually use the free
    space tree. The free space tree is enabled by passing the space_cache=v2
    mount option. On the first mount with this option set, the free space
    tree will be created and the FREE_SPACE_TREE read-only compat bit will be
    set. Any time the filesystem is mounted from then on, we must use the
    free space tree. The clear_cache option will also clear the free space
    tree.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Chris Mason

    Omar Sandoval
     
  • The free space cache has turned out to be a scalability bottleneck on
    large, busy filesystems. When the cache for a lot of block groups needs
    to be written out, we can get extremely long commit times; if this
    happens in the critical section, things are especially bad because we
    block new transactions from happening.

    The main problem with the free space cache is that it has to be written
    out in its entirety and is managed in an ad hoc fashion. Using a B-tree
    to store free space fixes this: updates can be done as needed and we get
    all of the benefits of using a B-tree: checksumming, RAID handling,
    well-understood behavior.

    With the free space tree, we get commit times that are about the same as
    the no-cache case, with load times slower than the free space cache case
    but still much faster than the no-cache case. Free space is represented
    with extents until it becomes more space-efficient to use bitmaps,
    giving us a similar space overhead to the free space cache.

    The operations on the free space tree are: adding and removing free
    space, handling the creation and deletion of block groups, and loading
    the free space for a block group. We can also create the free space tree
    by walking the extent tree and clear the free space tree.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Chris Mason

    Omar Sandoval
     
  • The on-disk format for the free space tree is straightforward. Each
    block group is represented in the free space tree by a free space info
    item that stores accounting information: whether the free space for this
    block group is stored as bitmaps or extents and how many extents of free
    space exist for this block group (regardless of which format is being
    used in the tree). Extents are (start, FREE_SPACE_EXTENT, length) keys
    with no corresponding item, and bitmaps instead have the
    FREE_SPACE_BITMAP type and have a bitmap item attached, which is just an
    array of bytes.

    Reviewed-by: Josef Bacik
    Signed-off-by: Omar Sandoval
    Signed-off-by: Chris Mason

    Omar Sandoval
     
  • We're also going to load the free space tree from caching_thread(), so
    we should refactor some of the common code.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Chris Mason

    Omar Sandoval
     
  • We're finally going to add one of these for the free space tree, so
    let's add the same nice helpers that we have for the incompat bits.
    While we're at it, also add helpers to clear the bits.

    Reviewed-by: Josef Bacik
    Signed-off-by: Omar Sandoval
    Signed-off-by: Chris Mason

    Omar Sandoval
     

08 Dec, 2015

1 commit

  • The btrfs clone ioctls are now adopted by other file systems, with NFS
    and CIFS already having support for them, and XFS being under active
    development. To avoid growth of various slightly incompatible
    implementations, add one to the VFS. Note that clones are different from
    file copies in several ways:

    - they are atomic vs other writers
    - they support whole file clones
    - they support 64-bit length clones
    - they do not allow partial success (aka short writes)
    - clones are expected to be a fast metadata operation

    Because of that it would be rather cumbersome to try to piggyback them on
    top of the recent copy_file_range infrastructure. The converse isn't
    true, and the copy_file_range system call could try a clone as a first
    attempt to copy, something that further patches will enable.

    Based on earlier work from Peng Tao.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

03 Dec, 2015

2 commits


02 Dec, 2015

1 commit

  • This rearranges the existing COPY_RANGE ioctl implementation so that the
    .copy_file_range file operation can call the core loop that copies file
    data extent items.

    The extent copying loop is lifted up into its own function. It retains
    the core btrfs error checks that should be shared.

    Signed-off-by: Zach Brown
    [Anna Schumaker: Make flags an unsigned int,
    Check for COPY_FR_REFLINK]
    Signed-off-by: Anna Schumaker
    Reviewed-by: Josef Bacik
    Reviewed-by: David Sterba
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Zach Brown
     

28 Nov, 2015

1 commit

  • Pull btrfs fixes from Chris Mason:
    "This has Mark Fasheh's patches to fix quota accounting during subvol
    deletion, which we've been working on for a while now. The patch is
    pretty small but it's a key fix.

    Otherwise it's a random assortment"

    * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    btrfs: fix balance range usage filters in 4.4-rc
    btrfs: qgroup: account shared subtree during snapshot delete
    Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
    btrfs: qgroup: fix quota disable during rescan
    Btrfs: fix race between cleaner kthread and space cache writeout
    Btrfs: fix scrub preventing unused block groups from being deleted
    Btrfs: fix race between scrub and block group deletion
    btrfs: fix rcu warning during device replace
    btrfs: Continue replace when set_block_ro failed
    btrfs: fix clashing number of the enhanced balance usage filter
    Btrfs: fix the number of transaction units needed to remove a block group
    Btrfs: use global reserve when deleting unused block group after ENOSPC
    Btrfs: tests: checking for NULL instead of IS_ERR()
    btrfs: fix signed overflows in btrfs_sync_file

    Linus Torvalds
     

25 Nov, 2015

3 commits

  • Currently scrub can race with the cleaner kthread when the latter attempts
    to delete an unused block group, and the result is that the cleaner
    kthread is prevented from ever deleting the block group later - unless the
    block group becomes used and unused again. The following diagram
    illustrates the race:

          CPU 1                              CPU 2

                                      cleaner kthread
                                      btrfs_delete_unused_bgs()

                                        gets block group X from
                                        fs_info->unused_bgs and
                                        removes it from that list

     scrub_enumerate_chunks()

       searches device tree using
       its commit root

       finds device extent for
       block group X

       gets block group X from the tree
       fs_info->block_group_cache_tree
       (via btrfs_lookup_block_group())

       sets bg X to RO

                                        sees the block group is
                                        already RO and therefore
                                        doesn't delete it nor adds
                                        it back to unused list

    So fix this by making scrub add the block group again to the list of
    unused block groups if the block group is still unused when it finished
    scrubbing it and it hasn't been removed already.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • We were using only 1 transaction unit when attempting to delete an unused
    block group but in reality we need 3 + N units, where N corresponds to the
    number of stripes. We were accounting only for the addition of the orphan
    item (for the block group's free space cache inode) but we were not
    accounting that we need to delete one block group item from the extent
    tree, one free space item from the tree of tree roots and N device extent
    items from the device tree.

    While one unit is not enough, it worked most of the time because for each
    single unit we are too pessimistic and assume an entire tree path, with
    the highest possible height (8), needs to be COWed with eventual node
    splits at every possible level in the tree, so there was usually enough
    reserved space for removing all the items and adding the orphan item.

    However, after adding the orphan item, writepages() can be called by the
    VM subsystem against the btree inode when we are under memory pressure,
    which causes writeback to start for the nodes we COWed before; this
    forces the operation that removes the free space item to COW again some
    (or all) of the same nodes (in the tree of tree roots). Even without
    writepages() being called, we could fail with ENOSPC because these items
    are located in multiple trees and one of them might have a greater height
    and require node/leaf splits at many levels, exhausting all the reserved
    space before removing all the items and adding the orphan item.

    In the kernel 4.0 release, commit 3d84be799194 ("Btrfs: fix BUG_ON in
    btrfs_orphan_add() when delete unused block group"), we attempted to fix
    a BUG_ON due to ENOSPC when trying to add the orphan item by making the
    cleaner kthread reserve one transaction unit before attempting to remove
    the block group, but this was not enough. We had a couple user reports
    still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on
    a 4.2-rc6 kernel for example:

    http://www.spinics.net/lists/linux-btrfs/msg46070.html

    So fix this by reserving all the necessary units of metadata.

    Reported-by: Stefan Priebe
    Fixes: 3d84be799194 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group")
    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • It's possible to reach a state where the cleaner kthread isn't able to
    start a transaction to delete an unused block group due to lack of enough
    free metadata space and due to lack of unallocated device space to allocate
    a new metadata block group as well. If this happens try to use space from
    the global block group reserve just like we do for unlink operations, so
    that we don't reach a permanent state where starting a transaction for
    filesystem operations (file creation, renames, etc) keeps failing with
    -ENOSPC. Such an unfortunate state was observed on a machine where over
    a dozen unused data block groups existed and the cleaner kthread was
    failing to delete them due to ENOSPC error when attempting to start a
    transaction, and even running balance with a -dusage=0 filter failed with
    ENOSPC as well. Unmounting and mounting the filesystem again didn't help
    either. Allowing the cleaner kthread to use the global block reserve to
    delete the unused data block groups fixed the problem.

    Signed-off-by: Filipe Manana
    Signed-off-by: Jeff Mahoney
    Signed-off-by: Chris Mason

    Filipe Manana
     

08 Nov, 2015

1 commit

  • Merge second patch-bomb from Andrew Morton:

    - most of the rest of MM

    - procfs

    - lib/ updates

    - printk updates

    - bitops infrastructure tweaks

    - checkpatch updates

    - nilfs2 update

    - signals

    - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
    dma-debug, dma-mapping, ...

    * emailed patches from Andrew Morton : (102 commits)
    ipc,msg: drop dst nil validation in copy_msg
    include/linux/zutil.h: fix usage example of zlib_adler32()
    panic: release stale console lock to always get the logbuf printed out
    dma-debug: check nents in dma_sync_sg*
    dma-mapping: tidy up dma_parms default handling
    pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
    kexec: use file name as the output message prefix
    fs, seqfile: always allow oom killer
    seq_file: reuse string_escape_str()
    fs/seq_file: use seq_* helpers in seq_hex_dump()
    coredump: change zap_threads() and zap_process() to use for_each_thread()
    coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
    signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
    signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
    signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
    signals: kill block_all_signals() and unblock_all_signals()
    nilfs2: fix gcc uninitialized-variable warnings in powerpc build
    nilfs2: fix gcc unused-but-set-variable warnings
    MAINTAINERS: nilfs2: add header file for tracing
    nilfs2: add tracepoints for analyzing reading and writing metadata files
    ...

    Linus Torvalds
     

07 Nov, 2015

1 commit

  • There are many places which use mapping_gfp_mask to restrict a more
    generic gfp mask that would be used for allocations not directly related
    to the page cache but performed in the same context.

    Let's introduce a helper function which makes the restriction explicit and
    easier to track. This patch doesn't introduce any functional changes.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

27 Oct, 2015

4 commits

  • Between btrfs_alloc_reserved_file_extent() and
    btrfs_add_delayed_qgroup_reserve(), there is a window in which
    delayed_refs are run and the delayed ref head may be freed before
    btrfs_add_delayed_qgroup_reserve().

    This causes btrfs_add_delayed_qgroup_reserve() to return -ENOENT,
    and causes the transaction to be aborted.

    This patch will record qgroup reserve space info into delayed_ref_head
    at btrfs_add_delayed_ref(), to eliminate the race window.

    Reported-by: Filipe Manana
    Signed-off-by: Qu Wenruo
    Signed-off-by: Chris Mason

    Qu Wenruo
     
  • Similar to the 'limit' filter, we can enhance the 'usage' filter to
    accept a range. The change is backward compatible; the range is applied
    only in connection with the BTRFS_BALANCE_ARGS_USAGE_RANGE flag.

    We don't have a use case yet; the current syntax has been sufficient.
    The enhancement should provide parity with other range-like filters.

    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     
  • Balance block groups which have the given number of stripes, defined by
    a range min..max. This is useful to selectively rebalance only chunks
    that do not span enough devices; applies to RAID0/10/5/6.

    Signed-off-by: Gabríel Arthúr Pétursson
    [ renamed bargs members, added to the UAPI, wrote the changelog ]
    Signed-off-by: David Sterba

    Signed-off-by: Chris Mason

    Gabríel Arthúr Pétursson
     
  • The 'limit' filter is underdesigned; it should have been a range
    [min,max], with some relaxed semantics when one of the bounds is
    missing. Besides that, using a full u64 for a single value is a waste of
    bytes.

    Let's fix both by extending the use of the u64 bytes for the [min,max]
    range. This can be done in a backward-compatible way; the range will be
    interpreted only if the appropriate flag is set
    (BTRFS_BALANCE_ARGS_LIMIT_RANGE).

    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     

26 Oct, 2015

1 commit

  • In the kernel 4.2 merge window we had big changes to the implementation
    of delayed references and qgroups, which made the no_quota field of
    delayed references unused. More specifically, the no_quota field has not
    been used since:

    commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")

    Leaving the no_quota field in place actually prevents delayed references
    from getting merged, which in turn causes the following BUG_ON(), at
    fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:

    static int run_delayed_tree_ref(...)
    {
        (...)
        BUG_ON(node->ref_mod != 1);
        (...)
    }

    This happens on a scenario like the following:

    1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.

    2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
    It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.

    3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
    It's not merged with the reference at the tail of the list of refs
    for bytenr X because the reference at the tail, Ref2 is incompatible
    due to Ref2->no_quota != Ref3->no_quota.

    4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
    It's not merged with the reference at the tail of the list of refs
    for bytenr X because the reference at the tail, Ref3 is incompatible
    due to Ref3->no_quota != Ref4->no_quota.

    5) We run delayed references, trigger merging of delayed references,
    through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().

    6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
    all other conditions are satisfied too. So Ref1 gets a ref_mod
    value of 2.

    7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
    all other conditions are satisfied too. So Ref2 gets a ref_mod
    value of 2.

    8) Ref1 and Ref2 aren't merged, because they have different values
    for their no_quota field.

    9) Delayed reference Ref1 is picked for running (select_delayed_ref()
    always prefers references with an action == BTRFS_ADD_DELAYED_REF).
    So run_delayed_tree_ref() is called for Ref1, which triggers the
    BUG_ON because Ref1->ref_mod != 1 (it equals 2).

    So fix this by removing the no_quota field, as it's not used anymore as
    of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented
    qgroup mechanism.").

    The use of no_quota was also buggy in at least two places:

    1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
    no_quota to 0 instead of 1 when the following condition was true:
    is_fstree(ref_root) || !fs_info->quota_enabled

    2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
    reset a node's no_quota when the condition "!is_fstree(root_objectid)
    || !root->fs_info->quota_enabled" was true but we did it only in
    an unused local stack variable, that is, we never reset the no_quota
    value in the node itself.

    This fixes the remainder of problems several people have been having when
    running delayed references, mostly while a balance is running in parallel,
    on a 4.2+ kernel.

    Very special thanks to Stéphane Lesimple for helping debugging this issue
    and testing this fix on his multi terabyte filesystem (which took more
    than one day to balance alone, plus fsck, etc).

    Also, this fixes deadlock issue when using the clone ioctl with qgroups
    enabled, as reported by Elias Probst in the mailing list. The deadlock
    happens because after calling btrfs_insert_empty_item we have our path
    holding a write lock on a leaf of the fs/subvol tree and then before
    releasing the path we called check_ref() which did backref walking, when
    qgroups are enabled, and tried to read lock the same leaf. The trace for
    this case is the following:

    INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
    (...)
    Call Trace:
    [] schedule+0x74/0x83
    [] btrfs_tree_read_lock+0xc0/0xea
    [] ? wait_woken+0x74/0x74
    [] btrfs_search_old_slot+0x51a/0x810
    [] btrfs_next_old_leaf+0xdf/0x3ce
    [] ? ulist_add_merge+0x1b/0x127
    [] __resolve_indirect_refs+0x62a/0x667
    [] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
    [] find_parent_nodes+0xaf3/0xfc6
    [] __btrfs_find_all_roots+0x92/0xf0
    [] btrfs_find_all_roots+0x45/0x65
    [] ? btrfs_get_tree_mod_seq+0x2b/0x88
    [] check_ref+0x64/0xc4
    [] btrfs_clone+0x66e/0xb5d
    [] btrfs_ioctl_clone+0x48f/0x5bb
    [] ? native_sched_clock+0x28/0x77
    [] btrfs_ioctl+0xabc/0x25cb
    (...)

    The problem goes away by eliminating check_ref(), which is no longer
    needed as its purpose was to get a value for the no_quota field of
    a delayed reference (this patch removes the no_quota field, as mentioned
    earlier).

    Reported-by: Stéphane Lesimple
    Tested-by: Stéphane Lesimple
    Reported-by: Elias Probst
    Reported-by: Peter Becker
    Reported-by: Malte Schröder
    Reported-by: Derek Dongray
    Reported-by: Erkki Seppala
    Cc: stable@vger.kernel.org # 4.2+
    Signed-off-by: Filipe Manana
    Reviewed-by: Qu Wenruo

    Filipe Manana
     

22 Oct, 2015

2 commits