27 Jul, 2020

1 commit

  • Instead of recording stripe_index and using that to access the correct
    btrfs_device from btrfs_bio::stripes, record the btrfs_device in
    btrfs_io_bio. This will enable endio handlers to increment device
    error counters on checksum errors.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
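
The idea in the 27 Jul, 2020 commit above can be sketched in plain C. This is a minimal illustration only: the struct and function names below (sketch_device, sketch_io_bio, sketch_submit, sketch_endio) are hypothetical stand-ins, not the kernel's definitions. It shows why storing the device pointer in the per-IO structure lets the completion handler charge a checksum error to the right device without looking up a stripe index.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sketch_device {
	unsigned long long devid;
	unsigned int csum_errors;	/* stand-in for a per-device error counter */
};

struct sketch_io_bio {
	struct sketch_device *device;	/* recorded when the IO is submitted */
	int csum_ok;			/* outcome of checksum verification */
};

/* Submit path: remember which device this IO is sent to. */
void sketch_submit(struct sketch_io_bio *io_bio, struct sketch_device *dev)
{
	io_bio->device = dev;
}

/* Completion path: charge a checksum failure to the recorded device. */
void sketch_endio(struct sketch_io_bio *io_bio)
{
	if (!io_bio->csum_ok && io_bio->device != NULL)
		io_bio->device->csum_errors++;
}
```

With the device recorded at submit time, the completion handler needs no access to the stripe array at all.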
     

02 Jul, 2020

1 commit


24 Mar, 2020

5 commits

  • Introduce chunk allocation policy for btrfs. This policy controls how
    chunks and device extents are allocated from devices.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Josef Bacik
    Signed-off-by: Naohiro Aota
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Naohiro Aota
     
  • It's used only during filesystem mount, so it can be made private to
    the disk-io.c file. Also use the occasion to move btrfs_uuid_rescan_kthread
    as btrfs_check_uuid_tree is its sole caller.

    Reviewed-by: Josef Bacik
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Super-block reading in BTRFS is done using buffer_heads. Buffer_heads
    have some drawbacks, like not being able to propagate errors from the
    lower layers.

    Directly use the page cache for reading the super blocks from disk or
    invalidating an on-disk super block. We have to use the page cache to
    avoid races between mkfs and udev. See also 6f60cbd3ae44 ("btrfs: access
    superblock via pagecache in scan_one_device").

    This patch unwraps the buffer head API and does not change the way the
    super block is actually read.

    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • btrfs_scratch_superblocks() isn't used anywhere outside volumes.c so
    remove it from the header file and mark it as static. Also move it
    above its callers so we don't need a forward declaration.

    Reviewed-by: Anand Jain
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • Preparatory patch for removal of buffer_head usage in btrfs.

    Signed-off-by: Nikolay Borisov
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     

17 Feb, 2020

1 commit

  • Pull btrfs fixes from David Sterba:
    "Two races fixed, memory leak fix, sysfs directory fixup and two new
    log messages:

    - two fixed race conditions: extent map merging and truncate vs
    fiemap

    - create the right sysfs directory with device information and move
    the individual device dirs under it

    - print messages when the tree-log is replayed at mount time or
    cannot be replayed on remount"

    * tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
    btrfs: sysfs, move device id directories to UUID/devinfo
    btrfs: sysfs, add UUID/devinfo kobject
    Btrfs: fix race between shrinking truncate and fiemap
    btrfs: log message when rw remount is attempted with unclean tree-log
    btrfs: print message when tree-log replay starts
    Btrfs: fix race between using extent maps and merging them
    btrfs: ref-verify: fix memory leaks

    Linus Torvalds
     

13 Feb, 2020

1 commit


29 Jan, 2020

1 commit

  • Pull btrfs updates from David Sterba:
    "Features, highlights:

    - async discard
    - "mount -o discard=async" to enable it
    - freed extents are not discarded immediately, but grouped
    together and trimmed later, with IO rate limiting
    - the "sync" mode submits short extents that could have been
    ignored completely by the device, for SATA prior to 3.1 the
    requests are unqueued and have a big impact on performance
    - the actual discard IO requests have been moved out of
    transaction commit to a worker thread, improving commit latency
    - IO rate and request size can be tuned by sysfs files, for now
    enabled only with CONFIG_BTRFS_DEBUG as we might need to
    add/delete the files and don't have a stable-ish ABI for
    general use, defaults are conservative

    - export device state info in sysfs, eg. missing, writeable

    - no discard of extents known to be untouched on disk (eg. after
    reservation)

    - device stats reset is logged with process name and PID that called
    the ioctl

    Fixes:

    - fix missing hole after hole punching and fsync when using NO_HOLES

    - writeback: range cyclic mode could miss some dirty pages and lead
    to OOM

    - two more corner cases for metadata_uuid change after power loss
    during the change

    - fix infinite loop during fsync after mix of rename operations

    Core changes:

    - qgroup assign returns ENOTCONN when quotas not enabled, used to
    return EINVAL that was confusing

    - device closing does not need to allocate memory anymore

    - snapshot aware code got removed, disabled for years due to
    performance problems, reimplementation will allow to select whether
    defrag breaks or does not break COW on shared extents

    - tree-checker:
    - check leaf chunk item size, cross check against number of
    stripes
    - verify location keys for DIR_ITEM, DIR_INDEX and XATTR items

    - new self test for physical -> logical mapping code, used for super
    block range exclusion

    - assertion helpers/macros updated to avoid objtool "unreachable
    code" reports on older compilers or config option combinations"

    * tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (84 commits)
    btrfs: free block groups after free'ing fs trees
    btrfs: Fix split-brain handling when changing FSID to metadata uuid
    btrfs: Handle another split brain scenario with metadata uuid feature
    btrfs: Factor out metadata_uuid code from find_fsid.
    btrfs: Call find_fsid from find_fsid_inprogress
    Btrfs: fix infinite loop during fsync after rename operations
    btrfs: set trans->drity in btrfs_commit_transaction
    btrfs: drop log root for dropped roots
    btrfs: sysfs, add devid/dev_state kobject and device attributes
    btrfs: Refactor btrfs_rmap_block to improve readability
    btrfs: Add self-tests for btrfs_rmap_block
    btrfs: selftests: Add support for dummy devices
    btrfs: Move and unexport btrfs_rmap_block
    btrfs: separate definition of assertion failure handlers
    btrfs: device stats, log when stats are zeroed
    btrfs: fix improper setting of scanned for range cyclic write cache pages
    btrfs: safely advance counter when looking up bio csums
    btrfs: remove unused member btrfs_device::work
    btrfs: remove unnecessary wrapper get_alloc_profile
    btrfs: add correction to handle -1 edge case in async discard
    ...

    Linus Torvalds
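
The "grouped together and trimmed later" behaviour of async discard in the 29 Jan, 2020 pull request above can be illustrated with a small sketch. It assumes a hypothetical array of freed ranges already sorted by start offset; the names freed_extent and coalesce_freed_extents are made up for this example. The kernel's discard code works on its own free space structures, so this only shows the grouping idea.

```c
#include <stddef.h>

/* Hypothetical description of one freed byte range. */
struct freed_extent {
	unsigned long long start;
	unsigned long long len;
};

/*
 * Merge adjacent or overlapping ranges in place.
 * The array must be sorted by start offset.
 * Returns the number of ranges left after merging.
 */
size_t coalesce_freed_extents(struct freed_extent *ext, size_t n)
{
	size_t out = 0;
	size_t i;

	if (n == 0)
		return 0;

	for (i = 1; i < n; i++) {
		unsigned long long end = ext[out].start + ext[out].len;

		if (ext[i].start <= end) {
			/* Touches or overlaps the current range: extend it. */
			unsigned long long new_end = ext[i].start + ext[i].len;

			if (new_end > end)
				ext[out].len = new_end - ext[out].start;
		} else {
			/* Gap found: start a new output range. */
			out++;
			ext[out] = ext[i];
		}
	}
	return out + 1;
}
```

Fewer, larger ranges mean fewer discard requests, which matters for the short extents that the pull request says a device could otherwise ignore.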
     

24 Jan, 2020

2 commits

  • New sysfs attributes that track the filesystem status of devices, stored
    in the per-filesystem directory in /sys/fs/btrfs/FSID/devinfo . There's
    a directory for each device, with name corresponding to the numerical
    device id.

    in_fs_metadata - device is in the list of fs metadata
    missing - device is missing (no device node or block device)
    replace_target - device is target of replace
    writeable - writes from fs are allowed

    These attributes reflect the state of the device::dev_state and are
    created at mount time.

    Sample output:
    $ pwd
    /sys/fs/btrfs/6e1961f1-5918-4ecc-a22f-948897b409f7/devinfo/1/
    $ ls
    in_fs_metadata missing replace_target writeable
    $ cat missing
    0

    The output from these attributes is 0 or 1. 0 indicates unset and 1
    indicates set. These attributes are readonly.

    It is observed that the device delete thread and sysfs read thread will
    not race because the delete thread calls sysfs kobject_put() which in
    turn waits for existing sysfs read to complete.

    Note for device replace devid swap:

    During the replace the target device temporarily assumes devid 0 before
    assigning the devid of the source device.

    In btrfs_dev_replace_finishing() we remove the source sysfs devid using
    the function btrfs_sysfs_remove_devices_attr(), so after that we call
    kobject_rename() to update the devid in sysfs. This adds and calls the
    btrfs_sysfs_update_devid() helper function to update the device id.

    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    [ update changelog ]
    Signed-off-by: David Sterba

    Anand Jain
     
  • It's used only during initial block group reading to map physical
    address of super block to a list of logical ones. Make it private to
    block-group.c, add proper kernel doc and ensure it's exported only for
    tests.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
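
The device state attributes described in the first 24 Jan, 2020 commit above contain a single 0 or 1, so reading one from userspace is straightforward. The helper below is a minimal sketch; the function name read_dev_state_flag is hypothetical, and the caller is assumed to pass the full path to an attribute such as the "missing" file under the devinfo directory shown in the sample output.

```c
#include <stdio.h>

/*
 * Read a 0/1 attribute file.
 * Returns 0 or 1 as stored in the file, or -1 if the file cannot be
 * opened or does not start with '0' or '1'.
 */
int read_dev_state_flag(const char *path)
{
	FILE *f = fopen(path, "r");
	int c;

	if (f == NULL)
		return -1;

	c = fgetc(f);
	fclose(f);

	if (c == '0')
		return 0;
	if (c == '1')
		return 1;
	return -1;
}
```

Returning -1 for anything unexpected keeps a missing device directory distinct from a device whose flag is simply unset.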
     

20 Jan, 2020

2 commits


08 Dec, 2019

1 commit

  • CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
    Both PREEMPT and PREEMPT_RT require the same functionality which today
    depends on CONFIG_PREEMPT.

    Switch the btrfs_device_set_…() macro over to use CONFIG_PREEMPTION.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Acked-by: David Sterba
    Cc: Chris Mason
    Cc: Josef Bacik
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: linux-btrfs@vger.kernel.org
    Link: https://lore.kernel.org/r/20191015191821.11479-25-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

19 Nov, 2019

5 commits

  • struct btrfs_fs_devices::rotating currently is declared as an integer
    variable but only used as a boolean.

    Change the variable definition to bool and update the code touching it to
    set 'true' and 'false'.

    Reviewed-by: Qu Wenruo
    Reviewed-by: Anand Jain
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • struct btrfs_fs_devices::seeding currently is declared as an integer
    variable but only used as a boolean.

    Change the variable definition to bool and update the code touching it to
    set 'true' and 'false'.

    Reviewed-by: Qu Wenruo
    Reviewed-by: Anand Jain
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • Add new block group profile to store 4 copies in a similar way that
    current RAID1 does. The profile attributes and constraints are defined
    in the raid table and used by the same code that already handles the 2-
    and 3-copy RAID1.

    The minimum number of devices is 4, the maximum number of devices/chunks
    that can be lost/damaged is 3. There is no comparable traditional RAID
    level, the profile is added for future needs to accompany triple-parity
    and beyond.

    Signed-off-by: David Sterba

    David Sterba
     
  • Add new block group profile to store 3 copies in a similar way that
    current RAID1 does. The profile attributes and constraints are defined
    in the raid table and used by the same code that already handles the
    2-copy RAID1.

    The minimum number of devices is 3, the maximum number of devices/chunks
    that can be lost/damaged is 2. Like RAID6 but with 33% space
    utilization.

    Signed-off-by: David Sterba

    David Sterba
     
  • The last user of btrfs_bio::flags was removed in commit 326e1dbb5736
    ("block: remove management of bi_remaining when restoring original
    bi_end_io"), remove it.

    (Tagged for stable as the structure is heavily used and space savings
    are desirable.)

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
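
The 3-copy and 4-copy RAID1 profiles added on 19 Nov, 2019 are described as entries in a raid table. The sketch below is a simplified, hypothetical table (sketch_raid_attr is not the kernel's structure and carries only a few fields). The 3-copy and 4-copy values come from the commit messages above; the 2-copy row uses the usual RAID1 values for comparison. It also shows where the "33% space utilization" figure comes from.

```c
/* Hypothetical, reduced version of a raid attribute table entry. */
struct sketch_raid_attr {
	const char *name;
	int ncopies;		/* number of copies of each block */
	int devs_min;		/* minimum number of devices */
	int tolerated_failures;	/* devices/chunks that may be lost */
};

const struct sketch_raid_attr sketch_raid_table[] = {
	{ "raid1",   2, 2, 1 },
	{ "raid1c3", 3, 3, 2 },
	{ "raid1c4", 4, 4, 3 },
};

/* Usable share of raw space in percent, rounded down. */
int sketch_space_utilization_pct(const struct sketch_raid_attr *attr)
{
	return 100 / attr->ncopies;
}
```

Because every profile is just a row of attributes, code that handles the 2-copy profile can handle the new ones without special cases.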
     

18 Nov, 2019

3 commits

  • Now that we're not using btrfs_schedule_bio() anymore, delete all the
    code that supported it.

    Reviewed-by: Josef Bacik
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Chris Mason
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Chris Mason
     
  • btrfs_schedule_bio() hands IO off to a helper thread to do the actual
    submit_bio() call. This has been used to make sure async crc and
    compression helpers don't get stuck on IO submission. To maintain good
    performance, over time the IO submission threads duplicated some IO
    scheduler characteristics such as high and low priority IOs and they
    also made some ugly assumptions about request allocation batch sizes.

    All of this cost at least one extra context switch during IO submission,
    and doesn't fit well with the modern blkmq IO stack. So, this commit stops
    using btrfs_schedule_bio(). We may need to adjust the number of async
    helper threads for crcs and compression, but long term it's a better
    path.

    Reviewed-by: Josef Bacik
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Chris Mason
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Chris Mason
     
  • For some reason the attribute is called __attribute_const__ and not
    __const. It marks functions that have no observable effects on program
    state, IOW not reading pointers, just the arguments and calculating a
    value. This allows the compiler to do some optimizations. The
    annotations are based on -Wsuggest-attribute=const. The effects are
    rather small, though, about a 60 byte decrease of btrfs.ko.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: David Sterba

    David Sterba
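
As an illustration of the const attribute discussed in the last 18 Nov, 2019 commit above, here is a userspace sketch using the plain GCC/Clang spelling __attribute__((const)). The function sketch_stripe_index is a hypothetical example, not one of the functions annotated in btrfs. It qualifies because its result depends only on its arguments and it reads no memory through pointers.

```c
/*
 * The result depends only on the two arguments, so the compiler may
 * merge repeated calls with the same arguments into one.
 */
__attribute__((const))
unsigned long long sketch_stripe_index(unsigned long long offset,
				       unsigned int stripe_shift)
{
	return offset >> stripe_shift;
}
```

A function that dereferences a pointer argument would not qualify, which matches the "not reading pointers" wording in the commit message.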
     

09 Sep, 2019

3 commits


02 Jul, 2019

2 commits

  • Presently btrfs_map_block is used not only to do everything necessary to
    map a bio to the underlying allocation profile but it's also used to
    identify how much data could be written based on btrfs' stripe logic
    without actually submitting anything. This is achieved by passing NULL
    for 'bbio_ret' parameter.

    This patch refactors all callers that require just the mapping length
    by switching them to using btrfs_io_geometry instead of calling
    btrfs_map_block with a special NULL value for 'bbio_ret'. No functional
    change.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Add a structure that holds various parameters for IO calculations and a
    helper that fills the values. This will help further refactoring and
    reduction of functions that in some way open-coded the calculations.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
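
The two 2 Jul, 2019 commits above introduce a structure of IO parameters and a helper that fills it, so callers can ask how much can be written without mapping a bio. The sketch below is a heavily simplified assumption of what such a helper computes for a single stripe length; the names sketch_io_geometry and sketch_get_io_geometry are hypothetical, and RAID profile handling is deliberately left out.

```c
/* Hypothetical, reduced set of IO parameters. */
struct sketch_io_geometry {
	unsigned long long stripe_nr;		/* index of the stripe hit */
	unsigned long long stripe_offset;	/* offset inside that stripe */
	unsigned long long len;			/* bytes usable before the boundary */
};

/*
 * Fill geom for an IO of io_len bytes starting at offset (relative to
 * the start of the chunk). stripe_len must be non-zero.
 */
void sketch_get_io_geometry(unsigned long long offset,
			    unsigned long long stripe_len,
			    unsigned long long io_len,
			    struct sketch_io_geometry *geom)
{
	unsigned long long left;

	geom->stripe_nr = offset / stripe_len;
	geom->stripe_offset = offset - geom->stripe_nr * stripe_len;

	/* An IO must not cross the end of the current stripe. */
	left = stripe_len - geom->stripe_offset;
	geom->len = io_len < left ? io_len : left;
}
```

A caller that only needs the length reads geom->len and never has to pass a special NULL argument to a mapping function.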
     

01 Jul, 2019

4 commits


08 May, 2019

1 commit

  • …kernel/git/gustavoars/linux

    Pull Wimplicit-fallthrough updates from Gustavo A. R. Silva:
    "Mark switch cases where we are expecting to fall through.

    This is part of the ongoing efforts to enable -Wimplicit-fallthrough.

    Most of them have been baking in linux-next for a whole development
    cycle. And with Stephen Rothwell's help, we've had linux-next
    nag-emails going out for newly introduced code that triggers
    -Wimplicit-fallthrough to avoid gaining more of these cases while we
    work to remove the ones that are already present.

    We are getting close to completing this work. Currently, there are
    only 32 of 2311 of these cases left to be addressed in linux-next. I'm
    auditing every case; I take a look into the code and analyze it in
    order to determine if I'm dealing with an actual bug or a false
    positive, as explained here:

    https://lore.kernel.org/lkml/c2fad584-1705-a5f2-d63c-824e9b96cf50@embeddedor.com/

    While working on this, I've found and fixed several missing
    break/return bugs, some of them introduced more than 5 years ago.

    Once this work is finished, we'll be able to universally enable
    "-Wimplicit-fallthrough" to avoid any of these kinds of bugs from
    entering the kernel again"

    * tag 'Wimplicit-fallthrough-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (27 commits)
    memstick: mark expected switch fall-throughs
    drm/nouveau/nvkm: mark expected switch fall-throughs
    NFC: st21nfca: Fix fall-through warnings
    NFC: pn533: mark expected switch fall-throughs
    block: Mark expected switch fall-throughs
    ASN.1: mark expected switch fall-through
    lib/cmdline.c: mark expected switch fall-throughs
    lib: zstd: Mark expected switch fall-throughs
    scsi: sym53c8xx_2: sym_nvram: Mark expected switch fall-through
    scsi: sym53c8xx_2: sym_hipd: mark expected switch fall-throughs
    scsi: ppa: mark expected switch fall-through
    scsi: osst: mark expected switch fall-throughs
    scsi: lpfc: lpfc_scsi: Mark expected switch fall-throughs
    scsi: lpfc: lpfc_nvme: Mark expected switch fall-through
    scsi: lpfc: lpfc_nportdisc: Mark expected switch fall-through
    scsi: lpfc: lpfc_hbadisc: Mark expected switch fall-throughs
    scsi: lpfc: lpfc_els: Mark expected switch fall-throughs
    scsi: lpfc: lpfc_ct: Mark expected switch fall-throughs
    scsi: imm: mark expected switch fall-throughs
    scsi: csiostor: csio_wr: mark expected switch fall-through
    ...

    Linus Torvalds
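
For readers unfamiliar with the -Wimplicit-fallthrough work in the 08 May, 2019 pull request above, the sketch below shows what marking an expected fall-through looks like. The function sketch_level_flags is a made-up example. The comment form shown is one that GCC recognizes at its default warning level; other compilers may need a different annotation.

```c
/* Each level also enables everything the lower levels enable. */
int sketch_level_flags(int level)
{
	int flags = 0;

	switch (level) {
	case 2:
		flags |= 4;
		/* fall through */
	case 1:
		flags |= 2;
		/* fall through */
	case 0:
		flags |= 1;
		break;
	default:
		flags = -1;
		break;
	}
	return flags;
}
```

An unmarked case that runs into the next one would be reported, which is how the missing break/return bugs mentioned in the pull request get caught.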
     

30 Apr, 2019

7 commits

  • We can read fs_info from the device and can drop it from the parameters.

    Signed-off-by: David Sterba

    David Sterba
     
  • We can read fs_info from the transaction and can drop it from the
    parameters.

    Signed-off-by: David Sterba

    David Sterba
     
  • Now that these functions no longer require a handle to transaction to
    inspect pending/pinned chunks the argument can be removed. At the same
    time also remove any surrounding code which acquired the handle.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • The pending chunks list contains chunks that are allocated in the
    current transaction but haven't been created yet. The pinned chunks
    list contains chunks that are being released in the current transaction.
    Both describe chunks that are not reflected on disk as in use but are
    unavailable just the same.

    The pending chunks list is anchored by the transaction handle, which
    means that we need to hold a reference to a transaction when working
    with the list.

    The way we use them is by iterating over both lists to perform
    comparisons on the stripes they describe for each device. This is
    backwards and requires that we keep a transaction handle open while
    we're trimming.

    This patchset adds an extent_io_tree to btrfs_device that maintains
    the allocation state of the device. Extents are set dirty when
    chunks are first allocated -- when the extent maps are added to the
    mapping tree. They're cleared when last removed -- when the extent
    maps are removed from the mapping tree. This matches the lifespan
    of the pending and pinned chunks list and allows us to do trims
    on unallocated space safely without pinning the transaction for what
    may be a lengthy operation. We can also use this io tree to mark
    which chunks have already been trimmed so we don't repeat the operation.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba

    Jeff Mahoney
     
  • btrfs_device structs are freed from RCU context since device iteration
    is protected by RCU. Currently this is achieved by using call_rcu since
    no blocking functions are called within btrfs_free_device. Future
    refactoring of pending/pinned chunks will require calling sleeping
    functions.

    This patch is in preparation for these changes by simply switching from
    RCU callbacks to explicit calls of synchronize_rcu and calling
    btrfs_free_device directly. This is functionally equivalent, making sure
    that there are no readers at that time.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • We currently overload the pending_chunks list to handle updating
    btrfs_device->commit_bytes_used. We don't actually care about the
    extent mapping or even the device mapping for the chunk - we just need
    the device, and we can end up processing it multiple times. The
    fs_devices->resized_list does more or less the same thing, but with the
    disk size. They are called consecutively during commit and have more or
    less the same purpose.

    We can combine the two lists into a single list that attaches to the
    transaction and contains a list of devices that need updating. Since we
    always add the device to a list when we change bytes_used or
    disk_total_size, there's no harm in copying both values at once.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • [BUG]
    For a fuzzed image whose DEV_ITEM has an invalid total_bytes of 0, the
    kernel will just panic:
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
    #PF error: [normal kernel read fault]
    PGD 800000022b2bd067 P4D 800000022b2bd067 PUD 22b2bc067 PMD 0
    Oops: 0000 [#1] SMP PTI
    CPU: 0 PID: 1106 Comm: mount Not tainted 5.0.0-rc8+ #9
    RIP: 0010:btrfs_verify_dev_extents+0x2a5/0x5a0
    Call Trace:
    open_ctree+0x160d/0x2149
    btrfs_mount_root+0x5b2/0x680

    [CAUSE]
    If device extent verification finds a device with 0 total_bytes, it
    assumes it's a seed dummy and then searches for seed devices.

    But in this case, there is no seed device at all, causing a NULL pointer
    dereference.

    [FIX]
    Since this is caused by a fuzzed image, let's go the tree-checker way and
    just add a new verification for the device item.

    Reported-by: Yoon Jungyeon
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=202691
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Qu Wenruo
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    Qu Wenruo
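
The fix in the last commit above adds a verification step for the device item before it is trusted. The commit message does not list the individual checks, so the sketch below is an assumption of what such a sanity check might look like, with hypothetical names (sketch_dev_item, sketch_dev_item_valid). It rejects an item whose device id disagrees with its key or whose used bytes exceed its total bytes.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of an on-disk device item. */
struct sketch_dev_item {
	unsigned long long devid;
	unsigned long long total_bytes;
	unsigned long long bytes_used;
};

/*
 * Assumed checks, for illustration only:
 *  - the devid stored in the item must match the one from its key
 *  - a device cannot use more bytes than it has in total
 */
bool sketch_dev_item_valid(const struct sketch_dev_item *item,
			   unsigned long long key_devid)
{
	if (item->devid != key_devid)
		return false;
	if (item->bytes_used > item->total_bytes)
		return false;
	return true;
}
```

Rejecting such an item when it is read means later code never sees the inconsistent device, instead of crashing on it during mount.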