23 Jan, 2016

1 commit

  • Pull more btrfs updates from Chris Mason:
    "These are mostly fixes that we've been testing, but also we grabbed
    and tested a few small cleanups that had been on the list for a while.

    Zhao Lei's patchset also fixes some early ENOSPC buglets"

    * 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (21 commits)
    btrfs: raid56: Use raid_write_end_io for scrub
    btrfs: Remove unnecessary ClearPageUptodate for raid56
    btrfs: use rbio->nr_pages to reduce calculation
    btrfs: Use unified stripe_page's index calculation
    btrfs: Fix calculation of rbio->dbitmap's size calculation
    btrfs: Fix no_space in write and rm loop
    btrfs: merge functions for wait snapshot creation
    btrfs: delete unused argument in btrfs_copy_from_user
    btrfs: Use direct way to determine raid56 write/recover mode
    btrfs: Small cleanup for get index_srcdev loop
    btrfs: Enhance chunk validation check
    btrfs: Enhance super validation check
    Btrfs: fix deadlock running delayed iputs at transaction commit time
    Btrfs: fix typo in log message when starting a balance
    btrfs: remove duplicate const specifier
    btrfs: initialize the seq counter in struct btrfs_device
    Btrfs: clean up an error code in btrfs_init_space_info()
    btrfs: fix iterator with update error in backref.c
    Btrfs: fix output of compression message in btrfs_parse_options()
    Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots
    ...

    Linus Torvalds
     

20 Jan, 2016

2 commits

  • wait_for_snapshot_creation() is in the same group as the other two:
    btrfs_start_write_no_snapshoting()
    btrfs_end_write_no_snapshoting()

    Rename wait_for_snapshot_creation() and move it to the same place
    as the other two.

    Signed-off-by: Zhao Lei
    Signed-off-by: Chris Mason

    Zhao Lei
     
  • While running a stress test I ran into a deadlock when running the delayed
    iputs at transaction commit time, which produced the following report and
    trace:

    [ 886.399989] =============================================
    [ 886.400871] [ INFO: possible recursive locking detected ]
    [ 886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
    [ 886.402384] ---------------------------------------------
    [ 886.403182] fio/8277 is trying to acquire lock:
    [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
    [ 886.403568]
    [ 886.403568] but task is already holding lock:
    [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
    [ 886.403568]
    [ 886.403568] other info that might help us debug this:
    [ 886.403568] Possible unsafe locking scenario:
    [ 886.403568]
    [ 886.403568] CPU0
    [ 886.403568] ----
    [ 886.403568] lock(&fs_info->delayed_iput_sem);
    [ 886.403568] lock(&fs_info->delayed_iput_sem);
    [ 886.403568]
    [ 886.403568] *** DEADLOCK ***
    [ 886.403568]
    [ 886.403568] May be due to missing lock nesting notation
    [ 886.403568]
    [ 886.403568] 3 locks held by fio/8277:
    [ 886.403568] #0: (sb_writers#11){.+.+.+}, at: [] __sb_start_write+0x5f/0xb0
    [ 886.403568] #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [] btrfs_file_write_iter+0x73/0x408 [btrfs]
    [ 886.403568] #2: (&fs_info->delayed_iput_sem){++++..}, at: [] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
    [ 886.403568]
    [ 886.403568] stack backtrace:
    [ 886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
    [ 886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
    [ 886.403568] 0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
    [ 886.403568] ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
    [ 886.403568] 0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
    [ 886.403568] Call Trace:
    [ 886.403568] [] dump_stack+0x4e/0x79
    [ 886.403568] [] __lock_acquire+0xd42/0xf0b
    [ 886.403568] [] ? __module_address+0xdf/0x108
    [ 886.403568] [] lock_acquire+0x10d/0x194
    [ 886.403568] [] ? lock_acquire+0x10d/0x194
    [ 886.403568] [] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
    [ 886.489542] [] down_read+0x3e/0x4d
    [ 886.489542] [] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
    [ 886.489542] [] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
    [ 886.489542] [] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
    [ 886.489542] [] flush_space+0x435/0x44a [btrfs]
    [ 886.489542] [] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
    [ 886.489542] [] reserve_metadata_bytes+0x28d/0x384 [btrfs]
    [ 886.489542] [] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
    [ 886.489542] [] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
    [ 886.489542] [] btrfs_evict_inode+0x394/0x55a [btrfs]
    [ 886.489542] [] evict+0xa7/0x15c
    [ 886.489542] [] iput+0x1d3/0x266
    [ 886.489542] [] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
    [ 886.489542] [] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
    [ 886.489542] [] ? signal_pending_state+0x31/0x31
    [ 886.489542] [] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
    [ 886.489542] [] btrfs_check_data_free_space+0x40/0x59 [btrfs]
    [ 886.489542] [] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
    [ 886.489542] [] btrfs_direct_IO+0x10c/0x27e [btrfs]
    [ 886.489542] [] generic_file_direct_write+0xb3/0x128
    [ 886.489542] [] btrfs_file_write_iter+0x229/0x408 [btrfs]
    [ 886.489542] [] ? __lock_is_held+0x38/0x50
    [ 886.489542] [] __vfs_write+0x7c/0xa5
    [ 886.489542] [] vfs_write+0xa0/0xe4
    [ 886.489542] [] SyS_write+0x50/0x7e
    [ 886.489542] [] entry_SYSCALL_64_fastpath+0x12/0x6f
    [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
    [ 1081.854348] Not tainted 4.4.0-rc6-btrfs-next-18+ #1
    [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 1081.863227] fio D ffff880213f9bb28 0 8244 8240 0x00000000
    [ 1081.868719] ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
    [ 1081.872499] ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
    [ 1081.876834] ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
    [ 1081.880782] Call Trace:
    [ 1081.881793] [] schedule+0x7f/0x97
    [ 1081.883340] [] rwsem_down_write_failed+0x2d5/0x325
    [ 1081.895525] [] ? trace_hardirqs_on_caller+0x16/0x1ab
    [ 1081.897419] [] call_rwsem_down_write_failed+0x13/0x20
    [ 1081.899251] [] ? call_rwsem_down_write_failed+0x13/0x20
    [ 1081.901063] [] ? __down_write_nested.isra.0+0x1f/0x21
    [ 1081.902365] [] down_write+0x43/0x57
    [ 1081.903846] [] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
    [ 1081.906078] [] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
    [ 1081.908846] [] ? mark_held_locks+0x56/0x6c
    [ 1081.910409] [] btrfs_check_data_free_space+0x40/0x59 [btrfs]
    [ 1081.912482] [] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
    [ 1081.914597] [] btrfs_direct_IO+0x10c/0x27e [btrfs]
    [ 1081.919037] [] generic_file_direct_write+0xb3/0x128
    [ 1081.920754] [] btrfs_file_write_iter+0x229/0x408 [btrfs]
    [ 1081.922496] [] ? __lock_is_held+0x38/0x50
    [ 1081.923922] [] __vfs_write+0x7c/0xa5
    [ 1081.925275] [] vfs_write+0xa0/0xe4
    [ 1081.926584] [] SyS_write+0x50/0x7e
    [ 1081.927968] [] entry_SYSCALL_64_fastpath+0x12/0x6f
    [ 1081.985293] INFO: lockdep is turned off.
    [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
    [ 1081.987434] Not tainted 4.4.0-rc6-btrfs-next-18+ #1
    [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 1081.990147] fio D ffff880218febbb8 0 8249 8240 0x00000000
    [ 1081.991626] ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
    [ 1081.993258] ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
    [ 1081.994850] ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
    [ 1081.996485] Call Trace:
    [ 1081.997037] [] schedule+0x7f/0x97
    [ 1081.998017] [] rwsem_down_write_failed+0x2d5/0x325
    [ 1081.999241] [] ? finish_wait+0x6d/0x76
    [ 1082.000306] [] call_rwsem_down_write_failed+0x13/0x20
    [ 1082.001533] [] ? call_rwsem_down_write_failed+0x13/0x20
    [ 1082.002776] [] ? __down_write_nested.isra.0+0x1f/0x21
    [ 1082.003995] [] down_write+0x43/0x57
    [ 1082.005000] [] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
    [ 1082.007403] [] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
    [ 1082.008988] [] btrfs_fallocate+0x7c1/0xc2f [btrfs]
    [ 1082.010193] [] ? percpu_down_read+0x4e/0x77
    [ 1082.011280] [] ? __sb_start_write+0x5f/0xb0
    [ 1082.012265] [] ? __sb_start_write+0x5f/0xb0
    [ 1082.013021] [] vfs_fallocate+0x170/0x1ff
    [ 1082.013738] [] ioctl_preallocate+0x89/0x9b
    [ 1082.014778] [] do_vfs_ioctl+0x40a/0x4ea
    [ 1082.015778] [] ? SYSC_newfstat+0x25/0x2e
    [ 1082.016806] [] ? __fget_light+0x4d/0x71
    [ 1082.017789] [] SyS_ioctl+0x57/0x79
    [ 1082.018706] [] entry_SYSCALL_64_fastpath+0x12/0x6f

    This happens because we can recursively acquire the semaphore
    fs_info->delayed_iput_sem when attempting to allocate space to satisfy
    a file write request, as shown in the first trace above - when committing
    a transaction we acquire (down_read) the semaphore before running the
    delayed iputs, and when running a delayed iput() we can end up calling
    an inode's eviction handler, which in turn commits another transaction
    and attempts to acquire (down_read) the semaphore again to run more
    delayed iput operations.
    This results in a deadlock because if a task acquires a semaphore
    multiple times it should invoke down_read_nested() with a different
    lockdep class for each level of recursion.

    Fix this by simplifying the implementation: use a mutex instead,
    acquired by the cleaner kthread before it runs the delayed iputs,
    rather than always acquiring a semaphore before delayed iputs are
    run from anywhere.

    Fixes: d7c151717a1e (btrfs: Fix NO_SPACE bug caused by delayed-iput)
    Cc: stable@vger.kernel.org # 4.1+
    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

19 Jan, 2016

1 commit

  • Pull btrfs updates from Chris Mason:
    "This has our usual assortment of fixes and cleanups, but the biggest
    change included is Omar Sandoval's free space tree. It's not the
    default yet, mounting -o space_cache=v2 enables it and sets a readonly
    compat bit. The tree can actually be deleted and regenerated if there
    are any problems, but it has held up really well in testing so far.

    For very large filesystems (30T+) our existing free space caching code
    can end up taking a huge amount of time during commits. The new tree
    based code is faster and less work overall to update as the commit
    progresses.

    Omar worked on this during the summer and we'll hammer on it in
    production here at FB over the next few months"

    * 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (73 commits)
    Btrfs: fix fitrim discarding device area reserved for boot loader's use
    Btrfs: Check metadata redundancy on balance
    btrfs: statfs: report zero available if metadata are exhausted
    btrfs: preallocate path for snapshot creation at ioctl time
    btrfs: allocate root item at snapshot ioctl time
    btrfs: do an allocation earlier during snapshot creation
    btrfs: use smaller type for btrfs_path locks
    btrfs: use smaller type for btrfs_path lowest_level
    btrfs: use smaller type for btrfs_path reada
    btrfs: cleanup, use enum values for btrfs_path reada
    btrfs: constify static arrays
    btrfs: constify remaining structs with function pointers
    btrfs tests: replace whole ops structure for free space tests
    btrfs: use list_for_each_entry* in backref.c
    btrfs: use list_for_each_entry_safe in free-space-cache.c
    btrfs: use list_for_each_entry* in check-integrity.c
    Btrfs: use linux/sizes.h to represent constants
    btrfs: cleanup, remove stray return statements
    btrfs: zero out delayed node upon allocation
    btrfs: pass proper enum type to start_transaction()
    ...

    Linus Torvalds
     

11 Jan, 2016

2 commits


07 Jan, 2016

7 commits

  • The values of btrfs_path::locks are 0 to 4 and fit into a u8. Let's see:

    * overall size of btrfs_path drops from 136 to 112 bytes (-24 bytes)
    * better packing in a slab page (+6 objects)
    * the whole structure now fits into 2 cachelines
    * slight decrease in code size:

    text    data   bss    dec      hex    filename
    938731  43670  23144  1005545  f57e9  fs/btrfs/btrfs.ko.before
    938203  43670  23144  1005017  f55d9  fs/btrfs/btrfs.ko.after

    (and the generated assembly does not change much)

    The main purpose is to decrease the size of the structure without
    affecting performance. Byte access is usually well behaved across
    arches; the locks are not accessed frequently and are sometimes just
    compared to zero.

    Note for further size-reduction attempts: the slots could be made u16,
    but this might generate worse code on some arches (non-byte and non-int
    access). Also, the range of operations on slots is wider than for locks,
    so the potential performance drop should be evaluated first.

    Signed-off-by: David Sterba

    David Sterba
     
  • The level is 0..7, so we can use a smaller type. The size of btrfs_path
    is now 136 bytes, down from 144, which is +2 objects that fit into a
    4k slab.

    Signed-off-by: David Sterba

    David Sterba
     
  • The possible values for reada are all positive and bounded, so we can
    later save some bytes by storing it in a u8.

    Signed-off-by: David Sterba

    David Sterba
     
  • Replace the integers with enums for better readability. The value 2 has
    not had any meaning since a717531942f488209dded30f6bc648167bcefa72
    "Btrfs: do less aggressive btree readahead" (2009-01-22).

    Signed-off-by: David Sterba

    David Sterba
     
  • There are a few statically initialized arrays that can be made const.
    The remaining (like file_system_type, sysfs attributes or prop handlers)
    do not allow that due to type mismatch when passed to the APIs or
    because the structures are modified through other members.

    Signed-off-by: David Sterba

    David Sterba
     
  • We use many constants to represent size and offset values, and to make
    the code readable we use '256 * 1024 * 1024' instead of '268435456' to
    represent '256MB'. However, we can make this far more readable with
    'SZ_256MB', which is defined in 'linux/sizes.h'.

    So this patch replaces 'xxx * 1024 * 1024' kinds of expressions with a
    single 'SZ_xxxMB' when 'xxx' is a power of 2, or with 'xxx * SZ_1M' when
    it is not. I haven't touched '4096' and '8192' because they are more
    intuitive than 'SZ_4KB' and 'SZ_8KB'.

    Signed-off-by: Byongho Lee
    Signed-off-by: David Sterba

    Byongho Lee
     
  • Conform to __btrfs_fs_incompat() cast-to-bool (!!) by explicitly
    returning boolean not int.

    Signed-off-by: Alexandru Moise
    Signed-off-by: David Sterba

    Alexandru Moise
     

01 Jan, 2016

1 commit


30 Dec, 2015

1 commit


24 Dec, 2015

2 commits


19 Dec, 2015

1 commit


18 Dec, 2015

5 commits

  • Now we can finally hook up everything so we can actually use the free
    space tree. The free space tree is enabled by passing the space_cache=v2
    mount option. On the first mount with this option set, the free space
    tree will be created and the FREE_SPACE_TREE read-only compat bit will be
    set. Any time the filesystem is mounted from then on, we must use the
    free space tree. The clear_cache option will also clear the free space
    tree.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Chris Mason

    Omar Sandoval
     
  • The free space cache has turned out to be a scalability bottleneck on
    large, busy filesystems. When the cache for a lot of block groups needs
    to be written out, we can get extremely long commit times; if this
    happens in the critical section, things are especially bad because we
    block new transactions from happening.

    The main problem with the free space cache is that it has to be written
    out in its entirety and is managed in an ad hoc fashion. Using a B-tree
    to store free space fixes this: updates can be done as needed and we get
    all of the benefits of using a B-tree: checksumming, RAID handling,
    well-understood behavior.

    With the free space tree, we get commit times that are about the same as
    the no-cache case, with load times slower than the free space cache case
    but still much faster than the no-cache case. Free space is represented
    with extents until it becomes more space-efficient to use bitmaps,
    giving us a similar space overhead to the free space cache.

    The operations on the free space tree are: adding and removing free
    space, handling the creation and deletion of block groups, and loading
    the free space for a block group. We can also create the free space tree
    by walking the extent tree and clear the free space tree.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Chris Mason

    Omar Sandoval
     
  • The on-disk format for the free space tree is straightforward. Each
    block group is represented in the free space tree by a free space info
    item that stores accounting information: whether the free space for this
    block group is stored as bitmaps or extents and how many extents of free
    space exist for this block group (regardless of which format is being
    used in the tree). Extents are (start, FREE_SPACE_EXTENT, length) keys
    with no corresponding item, and bitmaps instead have the
    FREE_SPACE_BITMAP type and have a bitmap item attached, which is just an
    array of bytes.

    Reviewed-by: Josef Bacik
    Signed-off-by: Omar Sandoval
    Signed-off-by: Chris Mason

    Omar Sandoval
     
  • We're also going to load the free space tree from caching_thread(), so
    we should refactor some of the common code.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Chris Mason

    Omar Sandoval
     
  • We're finally going to add one of these for the free space tree, so
    let's add the same nice helpers that we have for the incompat bits.
    While we're at it, also add helpers to clear the bits.

    Reviewed-by: Josef Bacik
    Signed-off-by: Omar Sandoval
    Signed-off-by: Chris Mason

    Omar Sandoval
     

08 Dec, 2015

1 commit

  • The btrfs clone ioctls are now adopted by other file systems, with NFS
    and CIFS already having support for them, and XFS being under active
    development. To avoid growth of various slightly incompatible
    implementations, add one to the VFS. Note that clones are different from
    file copies in several ways:

    - they are atomic vs other writers
    - they support whole file clones
    - they support 64-bit length clones
    - they do not allow partial success (aka short writes)
    - clones are expected to be a fast metadata operation

    Because of that it would be rather cumbersome to try to piggyback them on
    top of the recent copy_file_range infrastructure. The converse isn't
    true, and the copy_file_range system call could try a clone as a first
    attempt to copy, something that further patches will enable.

    Based on earlier work from Peng Tao.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

03 Dec, 2015

2 commits


02 Dec, 2015

1 commit

  • This rearranges the existing COPY_RANGE ioctl implementation so that the
    .copy_file_range file operation can call the core loop that copies file
    data extent items.

    The extent copying loop is lifted up into its own function. It retains
    the core btrfs error checks that should be shared.

    Signed-off-by: Zach Brown
    [Anna Schumaker: Make flags an unsigned int,
    Check for COPY_FR_REFLINK]
    Signed-off-by: Anna Schumaker
    Reviewed-by: Josef Bacik
    Reviewed-by: David Sterba
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Zach Brown
     

28 Nov, 2015

1 commit

  • Pull btrfs fixes from Chris Mason:
    "This has Mark Fasheh's patches to fix quota accounting during subvol
    deletion, which we've been working on for a while now. The patch is
    pretty small but it's a key fix.

    Otherwise it's a random assortment"

    * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    btrfs: fix balance range usage filters in 4.4-rc
    btrfs: qgroup: account shared subtree during snapshot delete
    Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
    btrfs: qgroup: fix quota disable during rescan
    Btrfs: fix race between cleaner kthread and space cache writeout
    Btrfs: fix scrub preventing unused block groups from being deleted
    Btrfs: fix race between scrub and block group deletion
    btrfs: fix rcu warning during device replace
    btrfs: Continue replace when set_block_ro failed
    btrfs: fix clashing number of the enhanced balance usage filter
    Btrfs: fix the number of transaction units needed to remove a block group
    Btrfs: use global reserve when deleting unused block group after ENOSPC
    Btrfs: tests: checking for NULL instead of IS_ERR()
    btrfs: fix signed overflows in btrfs_sync_file

    Linus Torvalds
     

25 Nov, 2015

3 commits

  • Currently scrub can race with the cleaner kthread when the latter attempts
    to delete an unused block group, and the result is that the cleaner
    kthread is prevented from ever deleting the block group later - unless the
    block group becomes used and unused again. The following diagram
    illustrates the race:

          CPU 1                              CPU 2

                                      cleaner kthread
                                      btrfs_delete_unused_bgs()

                                        gets block group X from
                                        fs_info->unused_bgs and
                                        removes it from that list

     scrub_enumerate_chunks()

       searches device tree using
       its commit root

       finds device extent for
       block group X

       gets block group X from the tree
       fs_info->block_group_cache_tree
       (via btrfs_lookup_block_group())

       sets bg X to RO

                                        sees the block group is
                                        already RO and therefore
                                        doesn't delete it nor adds
                                        it back to unused list

    So fix this by making scrub add the block group again to the list of
    unused block groups if the block group is still unused when it finished
    scrubbing it and it hasn't been removed already.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • We were using only 1 transaction unit when attempting to delete an unused
    block group but in reality we need 3 + N units, where N corresponds to the
    number of stripes. We were accounting only for the addition of the orphan
    item (for the block group's free space cache inode) but we were not
    accounting that we need to delete one block group item from the extent
    tree, one free space item from the tree of tree roots and N device extent
    items from the device tree.

    While one unit is not enough, it worked most of the time because for each
    single unit we are too pessimistic and assume an entire tree path, with
    the highest possible height (8), needs to be COWed with eventual node
    splits at every possible level in the tree, so there was usually enough
    reserved space for removing all the items and adding the orphan item.

    However, after adding the orphan item, writepages() can be called by the
    VM subsystem against the btree inode when we are under memory pressure,
    which causes writeback to start for the nodes we COWed before; this
    forces the operation that removes the free space item to COW again some
    (or all) of the same nodes (in the tree of tree roots). Even without
    writepages() being called, we could fail with ENOSPC because these items
    are located in multiple trees and one of them might have a greater height
    and require node/leaf splits at many levels, exhausting all the reserved
    space before removing all the items and adding the orphan item.

    In the kernel 4.0 release, commit 3d84be799194 ("Btrfs: fix BUG_ON in
    btrfs_orphan_add() when delete unused block group"), we attempted to fix
    a BUG_ON due to ENOSPC when trying to add the orphan item by making the
    cleaner kthread reserve one transaction unit before attempting to remove
    the block group, but this was not enough. We had a couple user reports
    still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on
    a 4.2-rc6 kernel for example:

    http://www.spinics.net/lists/linux-btrfs/msg46070.html

    So fix this by reserving all the necessary units of metadata.

    Reported-by: Stefan Priebe
    Fixes: 3d84be799194 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group")
    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • It's possible to reach a state where the cleaner kthread isn't able to
    start a transaction to delete an unused block group due to lack of enough
    free metadata space and due to lack of unallocated device space to allocate
    a new metadata block group as well. If this happens try to use space from
    the global block group reserve just like we do for unlink operations, so
    that we don't reach a permanent state where starting a transaction for
    filesystem operations (file creation, renames, etc) keeps failing with
    -ENOSPC. Such an unfortunate state was observed on a machine where over
    a dozen unused data block groups existed and the cleaner kthread was
    failing to delete them due to ENOSPC error when attempting to start a
    transaction, and even running balance with a -dusage=0 filter failed with
    ENOSPC as well. Unmounting and mounting the filesystem again didn't help
    either. Allowing the cleaner kthread to use the global block reserve to
    delete the unused data block groups fixed the problem.

    Signed-off-by: Filipe Manana
    Signed-off-by: Jeff Mahoney
    Signed-off-by: Chris Mason

    Filipe Manana
     

08 Nov, 2015

1 commit

  • Merge second patch-bomb from Andrew Morton:

    - most of the rest of MM

    - procfs

    - lib/ updates

    - printk updates

    - bitops infrastructure tweaks

    - checkpatch updates

    - nilfs2 update

    - signals

    - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
    dma-debug, dma-mapping, ...

    * emailed patches from Andrew Morton : (102 commits)
    ipc,msg: drop dst nil validation in copy_msg
    include/linux/zutil.h: fix usage example of zlib_adler32()
    panic: release stale console lock to always get the logbuf printed out
    dma-debug: check nents in dma_sync_sg*
    dma-mapping: tidy up dma_parms default handling
    pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
    kexec: use file name as the output message prefix
    fs, seqfile: always allow oom killer
    seq_file: reuse string_escape_str()
    fs/seq_file: use seq_* helpers in seq_hex_dump()
    coredump: change zap_threads() and zap_process() to use for_each_thread()
    coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
    signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
    signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
    signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
    signals: kill block_all_signals() and unblock_all_signals()
    nilfs2: fix gcc uninitialized-variable warnings in powerpc build
    nilfs2: fix gcc unused-but-set-variable warnings
    MAINTAINERS: nilfs2: add header file for tracing
    nilfs2: add tracepoints for analyzing reading and writing metadata files
    ...

    Linus Torvalds
     

07 Nov, 2015

1 commit

  • There are many places which use mapping_gfp_mask to restrict a more
    generic gfp mask that would be used for allocations not directly related
    to the page cache but performed in the same context.

    Let's introduce a helper function which makes the restriction explicit and
    easier to track. This patch doesn't introduce any functional changes.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

27 Oct, 2015

4 commits

  • Between btrfs_alloc_reserved_file_extent() and
    btrfs_add_delayed_qgroup_reserve(), there is a window in which
    delayed_refs are run and the delayed ref head may be freed before
    btrfs_add_delayed_qgroup_reserve().

    This causes btrfs_add_delayed_qgroup_reserve() to return -ENOENT,
    and causes the transaction to be aborted.

    This patch will record qgroup reserve space info into delayed_ref_head
    at btrfs_add_delayed_ref(), to eliminate the race window.

    Reported-by: Filipe Manana
    Signed-off-by: Qu Wenruo
    Signed-off-by: Chris Mason

    Qu Wenruo
     
  • Similar to the 'limit' filter, we can enhance the 'usage' filter to
    accept a range. The change is backward compatible; the range is applied
    only in connection with the BTRFS_BALANCE_ARGS_USAGE_RANGE flag.

    We don't have a use case yet; the current syntax has been sufficient.
    The enhancement should provide parity with other range-like filters.

    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     
  • Balance block groups which have the given number of stripes, defined by
    a range min..max. This is useful to selectively rebalance only chunks
    that do not span enough devices; applies to RAID0/10/5/6.

    Signed-off-by: Gabríel Arthúr Pétursson
    [ renamed bargs members, added to the UAPI, wrote the changelog ]
    Signed-off-by: David Sterba

    Signed-off-by: Chris Mason

    Gabríel Arthúr Pétursson
     
  • The 'limit' filter is underdesigned; it should have been a range
    [min,max], with some relaxed semantics when one of the bounds is
    missing. Besides that, using a full u64 for a single value is a waste of
    bytes.

    Let's fix both by extending the use of the u64 bytes for the [min,max]
    range. This can be done in a backward-compatible way; the range will be
    interpreted only if the appropriate flag is set
    (BTRFS_BALANCE_ARGS_LIMIT_RANGE).

    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     

26 Oct, 2015

1 commit

  • In the kernel 4.2 merge window we had big changes to the implementation
    of delayed references and qgroups, which made the no_quota field of
    delayed references unused. More specifically, the no_quota field has not
    been used since:

    commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")

    Leaving the no_quota field in place actually prevents delayed references
    from getting merged, which in turn causes the following BUG_ON(), at
    fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:

    static int run_delayed_tree_ref(...)
    {
        (...)
        BUG_ON(node->ref_mod != 1);
        (...)
    }

    This happens on a scenario like the following:

    1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.

    2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
    It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.

    3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
    It's not merged with the reference at the tail of the list of refs
    for bytenr X because the reference at the tail, Ref2 is incompatible
    due to Ref2->no_quota != Ref3->no_quota.

    4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
    It's not merged with the reference at the tail of the list of refs
    for bytenr X because the reference at the tail, Ref3 is incompatible
    due to Ref3->no_quota != Ref4->no_quota.

    5) We run delayed references, trigger merging of delayed references,
    through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().

    6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
    all other conditions are satisfied too. So Ref1 gets a ref_mod
    value of 2.

    7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
    all other conditions are satisfied too. So Ref2 gets a ref_mod
    value of 2.

    8) Ref1 and Ref2 aren't merged, because they have different values
    for their no_quota field.

    9) Delayed reference Ref1 is picked for running (select_delayed_ref()
    always prefers references with an action == BTRFS_ADD_DELAYED_REF).
    So run_delayed_tree_ref() is called for Ref1, which triggers the
    BUG_ON because Ref1->ref_mod != 1 (it equals 2).

    So fix this by removing the no_quota field, as it's not used anymore as
    of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented
    qgroup mechanism.").

    The use of no_quota was also buggy in at least two places:

    1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
    no_quota to 0 instead of 1 when the following condition was true:
    is_fstree(ref_root) || !fs_info->quota_enabled

    2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
    reset a node's no_quota when the condition "!is_fstree(root_objectid)
    || !root->fs_info->quota_enabled" was true but we did it only in
    an unused local stack variable, that is, we never reset the no_quota
    value in the node itself.

    This fixes the remainder of problems several people have been having when
    running delayed references, mostly while a balance is running in parallel,
    on a 4.2+ kernel.

    Very special thanks to Stéphane Lesimple for helping debugging this issue
    and testing this fix on his multi terabyte filesystem (which took more
    than one day to balance alone, plus fsck, etc).

    Also, this fixes deadlock issue when using the clone ioctl with qgroups
    enabled, as reported by Elias Probst in the mailing list. The deadlock
    happens because after calling btrfs_insert_empty_item we have our path
    holding a write lock on a leaf of the fs/subvol tree and then before
    releasing the path we called check_ref() which did backref walking, when
    qgroups are enabled, and tried to read lock the same leaf. The trace for
    this case is the following:

    INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
    (...)
    Call Trace:
    [] schedule+0x74/0x83
    [] btrfs_tree_read_lock+0xc0/0xea
    [] ? wait_woken+0x74/0x74
    [] btrfs_search_old_slot+0x51a/0x810
    [] btrfs_next_old_leaf+0xdf/0x3ce
    [] ? ulist_add_merge+0x1b/0x127
    [] __resolve_indirect_refs+0x62a/0x667
    [] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
    [] find_parent_nodes+0xaf3/0xfc6
    [] __btrfs_find_all_roots+0x92/0xf0
    [] btrfs_find_all_roots+0x45/0x65
    [] ? btrfs_get_tree_mod_seq+0x2b/0x88
    [] check_ref+0x64/0xc4
    [] btrfs_clone+0x66e/0xb5d
    [] btrfs_ioctl_clone+0x48f/0x5bb
    [] ? native_sched_clock+0x28/0x77
    [] btrfs_ioctl+0xabc/0x25cb
    (...)

    The problem goes away by eliminating check_ref(), which is no longer
    needed as its purpose was to get a value for the no_quota field of
    a delayed reference (this patch removes the no_quota field, as mentioned
    earlier).

    Reported-by: Stéphane Lesimple
    Tested-by: Stéphane Lesimple
    Reported-by: Elias Probst
    Reported-by: Peter Becker
    Reported-by: Malte Schröder
    Reported-by: Derek Dongray
    Reported-by: Erkki Seppala
    Cc: stable@vger.kernel.org # 4.2+
    Signed-off-by: Filipe Manana
    Reviewed-by: Qu Wenruo

    Filipe Manana
     

22 Oct, 2015

2 commits