27 Jul, 2020
1 commit
-
Instead of recording stripe_index and using that to access correct
btrfs_device from btrfs_bio::stripes record the btrfs_device in
btrfs_io_bio. This will enable endio handlers to increment device
error counters on checksum errors.Reviewed-by: Johannes Thumshirn
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba
02 Jul, 2020
1 commit
-
Convert fall through comments to the pseudo-keyword which is now the
preferred way.Signed-off-by: Marcos Paulo de Souza
Reviewed-by: David Sterba
Signed-off-by: David Sterba
24 Mar, 2020
5 commits
-
Introduce chunk allocation policy for btrfs. This policy controls how
chunks and device extents are allocated from devices.Reviewed-by: Johannes Thumshirn
Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
It's used only during filesystem mount as such it can be made private to
disk-io.c file. Also use the occasion to move btrfs_uuid_rescan_kthread
as btrfs_check_uuid_tree is its sole caller.Reviewed-by: Josef Bacik
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
Super-block reading in BTRFS is done using buffer_heads. Buffer_heads
have some drawbacks, like not being able to propagate errors from the
lower layers.Directly use the page cache for reading the super blocks from disk or
invalidating an on-disk super block. We have to use the page cache so to
avoid races between mkfs and udev. See also 6f60cbd3ae44 ("btrfs: access
superblock via pagecache in scan_one_device").This patch unwraps the buffer head API and does not change the way the
super block is actually read.Signed-off-by: Johannes Thumshirn
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
btrfs_scratch_superblocks() isn't used anywhere outside volumes.c so
remove it from the header file and mark it as static. Also move it
above it's callers so we don't need a forward declaration.Reviewed-by: Anand Jain
Signed-off-by: Johannes Thumshirn
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
Preparatory patch for removal of buffer_head usage in btrfs.
Signed-off-by: Nikolay Borisov
Signed-off-by: Johannes Thumshirn
Reviewed-by: David Sterba
Signed-off-by: David Sterba
17 Feb, 2020
1 commit
-
Pull btrfs fixes from David Sterba:
"Two races fixed, memory leak fix, sysfs directory fixup and two new
log messages:- two fixed race conditions: extent map merging and truncate vs
fiemap- create the right sysfs directory with device information and move
the individual device dirs under it- print messages when the tree-log is replayed at mount time or
cannot be replayed on remount"* tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: sysfs, move device id directories to UUID/devinfo
btrfs: sysfs, add UUID/devinfo kobject
Btrfs: fix race between shrinking truncate and fiemap
btrfs: log message when rw remount is attempted with unclean tree-log
btrfs: print message when tree-log replay starts
Btrfs: fix race between using extent maps and merging them
btrfs: ref-verify: fix memory leaks
13 Feb, 2020
1 commit
-
Create directory /sys/fs/btrfs/UUID/devinfo to hold devices directories
by the id (unlike /devices).Signed-off-by: Anand Jain
Reviewed-by: David Sterba
Signed-off-by: David Sterba
29 Jan, 2020
1 commit
-
Pull btrfs updates from David Sterba:
"Features, highlights:- async discard
- "mount -o discard=async" to enable it
- freed extents are not discarded immediatelly, but grouped
together and trimmed later, with IO rate limiting
- the "sync" mode submits short extents that could have been
ignored completely by the device, for SATA prior to 3.1 the
requests are unqueued and have a big impact on performance
- the actual discard IO requests have been moved out of
transaction commit to a worker thread, improving commit latency
- IO rate and request size can be tuned by sysfs files, for now
enabled only with CONFIG_BTRFS_DEBUG as we might need to
add/delete the files and don't have a stable-ish ABI for
general use, defaults are conservative- export device state info in sysfs, eg. missing, writeable
- no discard of extents known to be untouched on disk (eg. after
reservation)- device stats reset is logged with process name and PID that called
the ioctlFixes:
- fix missing hole after hole punching and fsync when using NO_HOLES
- writeback: range cyclic mode could miss some dirty pages and lead
to OOM- two more corner cases for metadata_uuid change after power loss
during the change- fix infinite loop during fsync after mix of rename operations
Core changes:
- qgroup assign returns ENOTCONN when quotas not enabled, used to
return EINVAL that was confusing- device closing does not need to allocate memory anymore
- snapshot aware code got removed, disabled for years due to
performance problems, reimplmentation will allow to select wheter
defrag breaks or does not break COW on shared extents- tree-checker:
- check leaf chunk item size, cross check against number of
stripes
- verify location keys for DIR_ITEM, DIR_INDEX and XATTR items- new self test for physical -> logical mapping code, used for super
block range exclusion- assertion helpers/macros updated to avoid objtool "unreachable
code" reports on older compilers or config option combinations"* tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (84 commits)
btrfs: free block groups after free'ing fs trees
btrfs: Fix split-brain handling when changing FSID to metadata uuid
btrfs: Handle another split brain scenario with metadata uuid feature
btrfs: Factor out metadata_uuid code from find_fsid.
btrfs: Call find_fsid from find_fsid_inprogress
Btrfs: fix infinite loop during fsync after rename operations
btrfs: set trans->drity in btrfs_commit_transaction
btrfs: drop log root for dropped roots
btrfs: sysfs, add devid/dev_state kobject and device attributes
btrfs: Refactor btrfs_rmap_block to improve readability
btrfs: Add self-tests for btrfs_rmap_block
btrfs: selftests: Add support for dummy devices
btrfs: Move and unexport btrfs_rmap_block
btrfs: separate definition of assertion failure handlers
btrfs: device stats, log when stats are zeroed
btrfs: fix improper setting of scanned for range cyclic write cache pages
btrfs: safely advance counter when looking up bio csums
btrfs: remove unused member btrfs_device::work
btrfs: remove unnecessary wrapper get_alloc_profile
btrfs: add correction to handle -1 edge case in async discard
...
24 Jan, 2020
2 commits
-
New sysfs attributes that track the filesystem status of devices, stored
in the per-filesystem directory in /sys/fs/btrfs/FSID/devinfo . There's
a directory for each device, with name corresponding to the numerical
device id.in_fs_metadata - device is in the list of fs metadata
missing - device is missing (no device node or block device)
replace_target - device is target of replace
writeable - writes from fs are allowedThese attributes reflect the state of the device::dev_state and created
at mount time.Sample output:
$ pwd
/sys/fs/btrfs/6e1961f1-5918-4ecc-a22f-948897b409f7/devinfo/1/
$ ls
in_fs_metadata missing replace_target writeable
$ cat missing
0The output from these attributes are 0 or 1. 0 indicates unset and 1
indicates set. These attributes are readonly.It is observed that the device delete thread and sysfs read thread will
not race because the delete thread calls sysfs kobject_put() which in
turn waits for existing sysfs read to complete.Note for device replace devid swap:
During the replace the target device temporarily assumes devid 0 before
assigning the devid of the soruce device.In btrfs_dev_replace_finishing() we remove source sysfs devid using the
function btrfs_sysfs_remove_devices_attr(), so after that call
kobject_rename() to update the devid in the sysfs. This adds and calls
btrfs_sysfs_update_devid() helper function to update the device id.Signed-off-by: Anand Jain
Reviewed-by: David Sterba
[ update changelog ]
Signed-off-by: David Sterba -
It's used only during initial block group reading to map physical
address of super block to a list of logical ones. Make it private to
block-group.c, add proper kernel doc and ensure it's exported only for
tests.Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba
20 Jan, 2020
2 commits
-
This is a leftover from recently removed bio scheduling framework.
Fixes: ba8a9d079543 ("Btrfs: delete the entire async bio submission framework")
Reviewed-by: Anand Jain
Signed-off-by: David Sterba -
The struct member btrfs_device::device_dir_kobj holds the kobj of the
sysfs directory /sys/fs/btrfs/UUID/devices, so rename it from
device_dir_kobj to devices_kobj. No functional changes.Signed-off-by: Anand Jain
Reviewed-by: David Sterba
Signed-off-by: David Sterba
08 Dec, 2019
1 commit
-
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.Switch the btrfs_device_set_…() macro over to use CONFIG_PREEMPTION.
Signed-off-by: Thomas Gleixner
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
Acked-by: David Sterba
Cc: Chris Mason
Cc: Josef Bacik
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: linux-btrfs@vger.kernel.org
Link: https://lore.kernel.org/r/20191015191821.11479-25-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar
19 Nov, 2019
5 commits
-
struct btrfs_fs_devices::rotating currently is declared as an integer
variable but only used as a boolean.Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.Reviewed-by: Qu Wenruo
Reviewed-by: Anand Jain
Signed-off-by: Johannes Thumshirn
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
struct btrfs_fs_devices::seeding currently is declared as an integer
variable but only used as a boolean.Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.Reviewed-by: Qu Wenruo
Reviewed-by: Anand Jain
Signed-off-by: Johannes Thumshirn
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
Add new block group profile to store 4 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the 2-
and 3-copy RAID1.The minimum number of devices is 4, the maximum number of devices/chunks
that can be lost/damaged is 3. There is no comparable traditional RAID
level, the profile is added for future needs to accompany triple-parity
and beyond.Signed-off-by: David Sterba
-
Add new block group profile to store 3 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the
2-copy RAID1.The minimum number of devices is 3, the maximum number of devices/chunks
that can be lost/damaged is 2. Like RAID6 but with 33% space
utilization.Signed-off-by: David Sterba
-
The last user of btrfs_bio::flags was removed in commit 326e1dbb5736
("block: remove management of bi_remaining when restoring original
bi_end_io"), remove it.(Tagged for stable as the structure is heavily used and space savings
are desirable.)CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo
Reviewed-by: David Sterba
Signed-off-by: David Sterba
18 Nov, 2019
3 commits
-
Now that we're not using btrfs_schedule_bio() anymore, delete all the
code that supported it.Reviewed-by: Josef Bacik
Reviewed-by: Nikolay Borisov
Signed-off-by: Chris Mason
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
btrfs_schedule_bio() hands IO off to a helper thread to do the actual
submit_bio() call. This has been used to make sure async crc and
compression helpers don't get stuck on IO submission. To maintain good
performance, over time the IO submission threads duplicated some IO
scheduler characteristics such as high and low priority IOs and they
also made some ugly assumptions about request allocation batch sizes.All of this cost at least one extra context switch during IO submission,
and doesn't fit well with the modern blkmq IO stack. So, this commit stops
using btrfs_schedule_bio(). We may need to adjust the number of async
helper threads for crcs and compression, but long term it's a better
path.Reviewed-by: Josef Bacik
Reviewed-by: Nikolay Borisov
Signed-off-by: Chris Mason
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
For some reason the attribute is called __attribute_const__ and not
__const, marks functions that have no observable effects on program
state, IOW not reading pointers, just the arguments and calculating a
value. Allows the compiler to do some optimizations, based on
-Wsuggest-attribute=const . The effects are rather small, though, about
60 bytes decrese of btrfs.ko.Reviewed-by: Nikolay Borisov
Signed-off-by: David Sterba
09 Sep, 2019
3 commits
-
btrfs_dev_stat_reset() is an overdo in terms of wrapping. So this patch
open codes btrfs_dev_stat_reset().Signed-off-by: Anand Jain
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
The status of flush bio is tracked as a status bit, changed in commit
1c3063b6dbfa ("btrfs: cleanup device states define
BTRFS_DEV_STATE_FLUSH_SENT"), the flush_bio_sent was forgotten.Reviewed-by: Anand Jain
Signed-off-by: David Sterba -
This function is only used locally in find_free_dev_extent(), no
external callers.So unexport it.
Reviewed-by: Johannes Thumshirn
Signed-off-by: Qu Wenruo
Reviewed-by: David Sterba
Signed-off-by: David Sterba
02 Jul, 2019
2 commits
-
Presently btrfs_map_block is used not only to do everything necessary to
map a bio to the underlying allocation profile but it's also used to
identify how much data could be written based on btrfs' stripe logic
without actually submitting anything. This is achieved by passing NULL
for 'bbio_ret' parameter.This patch refactors all callers that require just the mapping length
by switching them to using btrfs_io_geometry instead of calling
btrfs_map_block with a special NULL value for 'bbio_ret'. No functional
change.Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
Add a structure that holds various parameters for IO calculations and a
helper that fills the values. This will help further refactoring and
reduction of functions that in some way open-coded the calculations.Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba
01 Jul, 2019
4 commits
-
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
The helper lacks the btrfs_ prefix and the parameter is the raw
blockgroup type, so none of the callers has to do the flags -> index
conversion.Signed-off-by: David Sterba
-
The raid_attr table is now 7 * 56 = 392 bytes long, consisting of just
small numbers so we don't have to use ints. New size is 7 * 32 = 224,
saving 3 cachelines.Signed-off-by: David Sterba
-
fs_info::mapping_tree is the physicallogical mapping tree and uses
the same underlying structure as extents, but is embedded to another
structure. There are no other members and this indirection is useless.
No functional change.Signed-off-by: David Sterba
08 May, 2019
1 commit
-
…kernel/git/gustavoars/linux
Pull Wimplicit-fallthrough updates from Gustavo A. R. Silva:
"Mark switch cases where we are expecting to fall through.This is part of the ongoing efforts to enable -Wimplicit-fallthrough.
Most of them have been baking in linux-next for a whole development
cycle. And with Stephen Rothwell's help, we've had linux-next
nag-emails going out for newly introduced code that triggers
-Wimplicit-fallthrough to avoid gaining more of these cases while we
work to remove the ones that are already present.We are getting close to completing this work. Currently, there are
only 32 of 2311 of these cases left to be addressed in linux-next. I'm
auditing every case; I take a look into the code and analyze it in
order to determine if I'm dealing with an actual bug or a false
positive, as explained here:https://lore.kernel.org/lkml/c2fad584-1705-a5f2-d63c-824e9b96cf50@embeddedor.com/
While working on this, I've found and fixed the several missing
break/return bugs, some of them introduced more than 5 years ago.Once this work is finished, we'll be able to universally enable
"-Wimplicit-fallthrough" to avoid any of these kinds of bugs from
entering the kernel again"* tag 'Wimplicit-fallthrough-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (27 commits)
memstick: mark expected switch fall-throughs
drm/nouveau/nvkm: mark expected switch fall-throughs
NFC: st21nfca: Fix fall-through warnings
NFC: pn533: mark expected switch fall-throughs
block: Mark expected switch fall-throughs
ASN.1: mark expected switch fall-through
lib/cmdline.c: mark expected switch fall-throughs
lib: zstd: Mark expected switch fall-throughs
scsi: sym53c8xx_2: sym_nvram: Mark expected switch fall-through
scsi: sym53c8xx_2: sym_hipd: mark expected switch fall-throughs
scsi: ppa: mark expected switch fall-through
scsi: osst: mark expected switch fall-throughs
scsi: lpfc: lpfc_scsi: Mark expected switch fall-throughs
scsi: lpfc: lpfc_nvme: Mark expected switch fall-through
scsi: lpfc: lpfc_nportdisc: Mark expected switch fall-through
scsi: lpfc: lpfc_hbadisc: Mark expected switch fall-throughs
scsi: lpfc: lpfc_els: Mark expected switch fall-throughs
scsi: lpfc: lpfc_ct: Mark expected switch fall-throughs
scsi: imm: mark expected switch fall-throughs
scsi: csiostor: csio_wr: mark expected switch fall-through
...
30 Apr, 2019
7 commits
-
We can read fs_info from the device and can drop it from the parameters.
Signed-off-by: David Sterba
-
We can read fs_info from the transaction and can drop it from the
parameters.Signed-off-by: David Sterba
-
Now that these functions no longer require a handle to transaction to
inspect pending/pinned chunks the argument can be removed. At the same
time also remove any surrounding code which acquired the handle.Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
The pending chunks list contains chunks that are allocated in the
current transaction but haven't been created yet. The pinned chunks
list contains chunks that are being released in the current transaction.
Both describe chunks that are not reflected on disk as in use but are
unavailable just the same.The pending chunks list is anchored by the transaction handle, which
means that we need to hold a reference to a transaction when working
with the list.The way we use them is by iterating over both lists to perform
comparisons on the stripes they describe for each device. This is
backwards and requires that we keep a transaction handle open while
we're trimming.This patchset adds an extent_io_tree to btrfs_device that maintains
the allocation state of the device. Extents are set dirty when
chunks are first allocated -- when the extent maps are added to the
mapping tree. They're cleared when last removed -- when the extent
maps are removed from the mapping tree. This matches the lifespan
of the pending and pinned chunks list and allows us to do trims
on unallocated space safely without pinning the transaction for what
may be a lengthy operation. We can also use this io tree to mark
which chunks have already been trimmed so we don't repeat the operation.Signed-off-by: Jeff Mahoney
Signed-off-by: Nikolay Borisov
Signed-off-by: David Sterba -
btrfs_device structs are freed from RCU context since device iteration
is protected by RCU. Currently this is achieved by using call_rcu since
no blocking functions are called within btrfs_free_device. Future
refactoring of pending/pinned chunks will require calling sleeping
functions.This patch is in preparation for these changes by simply switching from
RCU callbacks to explicit calls of synchronize_rcu and calling
btrfs_free_device directly. This is functionally equivalent, making sure
that there are no readers at that time.Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
We currently overload the pending_chunks list to handle updating
btrfs_device->commit_bytes used. We don't actually care about the
extent mapping or even the device mapping for the chunk - we just need
the device, and we can end up processing it multiple times. The
fs_devices->resized_list does more or less the same thing, but with the
disk size. They are called consecutively during commit and have more or
less the same purpose.We can combine the two lists into a single list that attaches to the
transaction and contains a list of devices that need updating. Since we
always add the device to a list when we change bytes_used or
disk_total_size, there's no harm in copying both values at once.Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
[BUG]
For fuzzed image whose DEV_ITEM has invalid total_bytes as 0, then
kernel will just panic:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
#PF error: [normal kernel read fault]
PGD 800000022b2bd067 P4D 800000022b2bd067 PUD 22b2bc067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 1106 Comm: mount Not tainted 5.0.0-rc8+ #9
RIP: 0010:btrfs_verify_dev_extents+0x2a5/0x5a0
Call Trace:
open_ctree+0x160d/0x2149
btrfs_mount_root+0x5b2/0x680[CAUSE]
If device extent verification finds a deivce with 0 total_bytes, then it
assumes it's a seed dummy, then search for seed devices.But in this case, there is no seed device at all, causing NULL pointer.
[FIX]
Since this is caused by fuzzed image, let's go the tree-check way, just
add a new verification for device item.Reported-by: Yoon Jungyeon
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202691
Reviewed-by: Nikolay Borisov
Signed-off-by: Qu Wenruo
Reviewed-by: Johannes Thumshirn
Signed-off-by: David Sterba