25 Jan, 2021

1 commit


17 Jan, 2021

1 commit

  • [ Upstream commit f2f121ab500d0457cc9c6f54269d21ffdf5bd304 ]

    Every time we log an inode we search the fs/subvol tree for xattrs and,
    if there are any, log them into the log tree. However, it is very common
    to have inodes without any xattrs, so doing the search wastes time, but
    more importantly it adds contention on the fs/subvol tree locks, either
    making the logging code block and wait for tree locks or making other
    concurrent operations block and wait.

    The most typical use cases where xattrs are used are when capabilities or
    ACLs are defined for an inode, or when SELinux is enabled.

    This change makes the logging code detect when an inode does not have
    xattrs and skip the xattrs search the next time the inode is logged,
    unless the inode is evicted and loaded again or an xattr is added to the
    inode. This avoids the search for xattrs on inodes that never have
    xattrs and are fsynced with some frequency.
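
The caching idea described above can be sketched in a few lines of C. This is a toy model under stated assumptions, not the btrfs code: `struct toy_inode`, `toy_search_xattrs()`, `toy_log_inode()` and `toy_add_xattr()` are hypothetical names, and the "search" is just a counter standing in for the lock-taking tree lookup.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy in-memory inode; every name here is hypothetical, not btrfs's. */
struct toy_inode {
    size_t nr_xattrs;        /* xattrs stored in the "subvolume tree" */
    bool known_no_xattrs;    /* hint: the last search found no xattrs */
    unsigned long searches;  /* how many expensive searches were done */
};

/* Stand-in for the expensive, lock-taking xattr search. */
static size_t toy_search_xattrs(struct toy_inode *inode)
{
    inode->searches++;
    return inode->nr_xattrs;
}

/* Log an inode; returns the number of xattrs copied to the log. */
static size_t toy_log_inode(struct toy_inode *inode)
{
    size_t found;

    if (inode->known_no_xattrs)
        return 0;                      /* skip the search entirely */

    found = toy_search_xattrs(inode);
    if (found == 0)
        inode->known_no_xattrs = true; /* remember for the next log */
    return found;
}

/* Adding an xattr invalidates the hint, so the next log searches again. */
static void toy_add_xattr(struct toy_inode *inode)
{
    inode->nr_xattrs++;
    inode->known_no_xattrs = false;
}
```

In this model the hint lives only in memory, which mirrors the commit's note that evicting and reloading the inode forces a fresh search.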

    The following script that calls dbench was used to measure the impact of
    this change on a VM with 8 CPUs, 16GB of RAM, using a raw NVMe device
    directly (no intermediary filesystem on the host) and using a non-debug
    kernel (default configuration on Debian distributions):

    $ cat test.sh
    #!/bin/bash

    DEV=/dev/sdk
    MNT=/mnt/sdk
    MOUNT_OPTIONS="-o ssd"

    mkfs.btrfs -f -m single -d single $DEV
    mount $MOUNT_OPTIONS $DEV $MNT

    dbench -D $MNT -t 200 40

    umount $MNT

    The results before this change:

    Operation       Count   AvgLat    MaxLat
    ----------------------------------------
    NTCreateX     5761605    0.172   312.057
    Close         4232452    0.002    10.927
    Rename         243937    1.406   277.344
    Unlink        1163456    0.631   298.402
    Deltree           160   11.581   221.107
    Mkdir              80    0.003     0.005
    Qpathinfo     5221410    0.065   122.309
    Qfileinfo      915432    0.001     3.333
    Qfsinfo        957555    0.003     3.992
    Sfileinfo      469244    0.023    20.494
    Find          2018865    0.448   123.659
    WriteX        2874851    0.049   118.529
    ReadX         9030579    0.004    21.654
    LockX           18754    0.003     4.423
    UnlockX         18754    0.002     0.331
    Flush          403792   10.944   359.494

    Throughput 908.444 MB/sec 40 clients 40 procs max_latency=359.500 ms

    The results after this change:

    Operation       Count   AvgLat    MaxLat
    ----------------------------------------
    NTCreateX     6442521    0.159   230.693
    Close         4732357    0.002    10.972
    Rename         272809    1.293   227.398
    Unlink        1301059    0.563   218.500
    Deltree           160    7.796    54.887
    Mkdir              80    0.008     0.478
    Qpathinfo     5839452    0.047   124.330
    Qfileinfo     1023199    0.001     4.996
    Qfsinfo       1070760    0.003     5.709
    Sfileinfo      524790    0.033    21.765
    Find          2257658    0.314   125.611
    WriteX        3211520    0.040   232.135
    ReadX        10098969    0.004    25.340
    LockX           20974    0.003     1.569
    UnlockX         20974    0.002     3.475
    Flush          451553   10.287   331.037

    Throughput 1011.77 MB/sec 40 clients 40 procs max_latency=331.045 ms

    +10.8% throughput, -8.2% max latency
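
The commit does not say how the two summary percentages were derived. They appear consistent with a symmetric percentage difference (the change divided by the mean of the two values) rather than a change relative to the "before" figure; that reading is an inference from the numbers. A minimal sketch, with `pct_diff_vs_mean()` as a hypothetical helper name:

```c
/*
 * Symmetric percentage difference: the change from "before" to "after",
 * expressed relative to the mean of the two values.
 */
static double pct_diff_vs_mean(double before, double after)
{
    return 100.0 * (after - before) / ((before + after) / 2.0);
}
```

Under this assumption, the throughput pair (908.444, 1011.77) gives roughly +10.8 and the max latency pair (359.500, 331.045) gives roughly -8.2. Measured against the "before" value alone, the same pairs would give about +11.4 and -7.9.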

    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana
     

06 Nov, 2019

1 commit

  • Add a flags argument to the xattr get method so that a bit flag,
    XATTR_NOSECURITY, can be passed to it. XATTR_NOSECURITY is then
    generally set in the __vfs_getxattr path when called by the security
    infrastructure.

    This handles the case of a union filesystem driver that is being
    requested by the security layer to report back the xattr data, for
    the use case where access is to be blocked by the security layer.

    The path then could be security(dentry) ->
    __vfs_getxattr(dentry...XATTR_NOSECURITY) ->
    handler->get(dentry...XATTR_NOSECURITY) ->
    __vfs_getxattr(lower_dentry...XATTR_NOSECURITY) ->
    lower_handler->get(lower_dentry...XATTR_NOSECURITY)
    which would report back through the chain data and success as
    expected, the logging security layer at the top would have the
    data to determine the access permissions and report back the target
    context that was blocked.

    Without the get handler flag, the path on a union filesystem would be
    the errant security(dentry) -> __vfs_getxattr(dentry) ->
    handler->get(dentry) -> vfs_getxattr(lower_dentry) -> nested ->
    security(lower_dentry, log off) -> lower_handler->get(lower_dentry)
    which would report back through the chain no data, and -EACCES.
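
The two paths above can be modelled with a small C sketch. Everything in it is a toy under stated assumptions: the `toy_` names, the struct layout and the flag's bit value are illustrative only and do not reproduce the kernel's xattr handler API. It only shows why forwarding a flag lets the union layer avoid the nested security check.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_XATTR_NOSECURITY 0x4 /* illustrative bit value only */

struct toy_dentry;

/* A handler's get() receives the flags, as the change proposes. */
typedef int (*toy_get_fn)(struct toy_dentry *d, char *buf, size_t size,
                          int flags);

struct toy_dentry {
    toy_get_fn get;           /* xattr handler of this filesystem */
    struct toy_dentry *lower; /* lower layer of a union filesystem */
    const char *value;        /* stored xattr value (lower layer only) */
    int deny;                 /* nonzero if the security layer blocks access */
};

/* Unchecked path: dispatch to the handler and pass the flags along. */
static int toy_getxattr_unchecked(struct toy_dentry *d, char *buf,
                                  size_t size, int flags)
{
    return d->get(d, buf, size, flags);
}

/* Checked path: consult the (toy) security layer first. */
static int toy_getxattr_checked(struct toy_dentry *d, char *buf, size_t size)
{
    if (d->deny)
        return -EACCES;
    return toy_getxattr_unchecked(d, buf, size, 0);
}

/* Lower filesystem: return the stored value and its length. */
static int toy_lower_get(struct toy_dentry *d, char *buf, size_t size,
                         int flags)
{
    size_t len = strlen(d->value);

    (void)flags;
    if (len + 1 > size)
        return -ERANGE;
    memcpy(buf, d->value, len + 1);
    return (int)len;
}

/* Union filesystem: forward to the lower layer, honouring the flag. */
static int toy_union_get(struct toy_dentry *d, char *buf, size_t size,
                         int flags)
{
    if (flags & TOY_XATTR_NOSECURITY)
        return toy_getxattr_unchecked(d->lower, buf, size, flags);
    return toy_getxattr_checked(d->lower, buf, size); /* nested check */
}
```

With the flag set, a caller standing in for the security layer gets the data back through the whole chain; without it, the nested check on the lower dentry fails with -EACCES and no data is returned.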

    For selinux for both cases, this would translate to a correctly
    determined blocked access. In the first case with this change a correct avc
    log would be reported, in the second legacy case an incorrect avc log
    would be reported against an uninitialized u:object_r:unlabeled:s0
    context making the logs cosmetically useless for audit2allow.

    This patch series is inert and is the wide-spread addition of the
    flags option for xattr functions, and a replacement of __vfs_getxattr
    with __vfs_getxattr(...XATTR_NOSECURITY).

    Signed-off-by: Mark Salyzyn
    Reviewed-by: Jan Kara
    Acked-by: Jan Kara
    Acked-by: Jeff Layton
    Acked-by: David Sterba
    Acked-by: Darrick J. Wong
    Acked-by: Mike Marshall
    Cc: Stephen Smalley
    Cc: linux-kernel@vger.kernel.org
    Cc: kernel-team@android.com
    Cc: linux-security-module@vger.kernel.org

    (cherry picked from (rejected from archive because of too many recipients))
    Signed-off-by: Mark Salyzyn
    Bug: 133515582
    Bug: 136124883
    Bug: 129319403
    Change-Id: Iabbb8771939d5f66667a26bb23ddf4c562c349a1

    Mark Salyzyn
     

17 Jun, 2019

1 commit

  • After the recent series of cleanups in the properties and xattrs modules
    that landed in the 5.2 merge window, we ended up with a regression where
    after deleting the compression xattr property through the setflags ioctl,
    we don't set the BTRFS_INODE_COPY_EVERYTHING flag in the inode anymore.
    As a consequence, if the inode was fsync'ed when it had the compression
    property set, after deleting the compression property through the setflags
    ioctl and fsync'ing again the inode, the log will still contain the
    compression xattr, because the inode did not have that bit set, which
    made the fsync not delete all xattrs from the log and copy all xattrs
    from the subvolume tree to the log tree.

    This regression happens due to the fact that that series of cleanups
    made btrfs_set_prop() call the old function do_setxattr() (which is now
    named btrfs_setxattr()), and not the old version of btrfs_setxattr(),
    which is now called btrfs_setxattr_trans().

    Fix this by setting the BTRFS_INODE_COPY_EVERYTHING bit in the current
    btrfs_setxattr() function and remove it from everywhere else, including
    its setup at btrfs_ioctl_setflags(). This is cleaner, avoids similar
    regressions in the future, and centralizes the setup of the bit. After
    all, the need to setup this bit should only be in the xattrs module,
    since it is an implementation of xattrs.

    Fixes: 04e6863b19c722 ("btrfs: split btrfs_setxattr calls regarding transaction")
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     

30 Apr, 2019

14 commits


17 Dec, 2018

1 commit


12 Apr, 2018

1 commit


26 Mar, 2018

2 commits


30 Jan, 2018

1 commit

  • Pull btrfs updates from David Sterba:
    "Features or user visible changes:

    - fallocate: implement zero range mode

    - avoid losing data raid profile when deleting a device

    - tree item checker: more checks for directory items and xattrs

    Notable fixes:

    - raid56 recovery: don't use cached stripes, that could be
    potentially changed and a later RMW or recovery would lead to
    corruptions or failures

    - let raid56 try harder to rebuild damaged data, reading from all
    stripes if necessary

    - fix scrub to repair raid56 in a similar way as in the case above

    Other:

    - cleanups: device freeing, removed some call indirections, redundant
    bio_put/_get, unused parameters, refactorings and renames

    - RCU list traversal fixups

    - simplify mount callchain, remove recursing back when mounting a
    subvolume

    - plug for fsync, may improve bio merging on multiple devices

    - compression heuristic: replace heap sort with radix sort, gains some
    performance

    - add extent map selftests, buffered write vs dio"

    * tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (155 commits)
    btrfs: drop devid as device_list_add() arg
    btrfs: get device pointer from device_list_add()
    btrfs: set the total_devices in device_list_add()
    btrfs: move pr_info into device_list_add
    btrfs: make btrfs_free_stale_devices() to match the path
    btrfs: rename btrfs_free_stale_devices() arg to skip_dev
    btrfs: make btrfs_free_stale_devices() argument optional
    btrfs: make btrfs_free_stale_device() to iterate all stales
    btrfs: no need to check for btrfs_fs_devices::seeding
    btrfs: Use IS_ALIGNED in btrfs_truncate_block instead of opencoding it
    Btrfs: noinline merge_extent_mapping
    Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping
    Btrfs: extent map selftest: dio write vs dio read
    Btrfs: extent map selftest: buffered write vs dio read
    Btrfs: add extent map selftests
    Btrfs: move extent map specific code to extent_map.c
    Btrfs: add helper for em merge logic
    Btrfs: fix unexpected EEXIST from btrfs_get_extent
    Btrfs: fix incorrect block_len in merge_extent_mapping
    btrfs: Remove unused readahead spinlock
    ...

    Linus Torvalds
     

29 Jan, 2018

1 commit

  • Add a documentation blob that explains what the i_version field is, how
    it is expected to work, and how it is currently implemented by various
    filesystems.

    We already have inode_inc_iversion. Add several other functions for
    manipulating and accessing the i_version counter. For now, the
    implementation is trivial and basically works the way that all of the
    open-coded i_version accesses work today.

    Future patches will convert existing users of i_version to use the new
    API, and then convert the backend implementation to do things more
    efficiently.

    Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara

    Jeff Layton
     

22 Jan, 2018

1 commit

  • Since tree-checker has verified leaf when reading from disk, we don't
    need the existing verify_dir_item() or btrfs_is_name_len_valid() checks.

    Signed-off-by: Qu Wenruo
    Reviewed-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     

22 Jun, 2017

1 commit

  • Originally, verify_dir_item checked the name_len of a dir_item against
    fixed limits but not against the item boundary.
    If a corrupted name_len was not bigger than the fixed limit, for example
    255, the function considered the dir_item fine, and the subsequent read
    beyond the item boundary caused a crash.

    Example:
    1. Corrupt one dir_item name_len to be 255.
    2. Run 'ls -lar /mnt/test/ > /dev/null'
    dmesg:
    [ 48.451449] BTRFS info (device vdb1): disk space caching is enabled
    [ 48.451453] BTRFS info (device vdb1): has skinny extents
    [ 48.489420] general protection fault: 0000 [#1] SMP
    [ 48.489571] Modules linked in: ext4 jbd2 mbcache btrfs xor raid6_pq
    [ 48.489716] CPU: 1 PID: 2710 Comm: ls Not tainted 4.10.0-rc1 #5
    [ 48.489853] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
    [ 48.490008] task: ffff880035df1bc0 task.stack: ffffc90004800000
    [ 48.490008] RIP: 0010:read_extent_buffer+0xd2/0x190 [btrfs]
    [ 48.490008] RSP: 0018:ffffc90004803d98 EFLAGS: 00010202
    [ 48.490008] RAX: 000000000000001b RBX: 000000000000001b RCX: 0000000000000000
    [ 48.490008] RDX: ffff880079dbf36c RSI: 0005080000000000 RDI: ffff880079dbf368
    [ 48.490008] RBP: ffffc90004803dc8 R08: ffff880078e8cc48 R09: ffff880000000000
    [ 48.490008] R10: 0000160000000000 R11: 0000000000001000 R12: ffff880079dbf288
    [ 48.490008] R13: ffff880078e8ca88 R14: 0000000000000003 R15: ffffc90004803e20
    [ 48.490008] FS: 00007fef50c60800(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000
    [ 48.490008] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 48.490008] CR2: 000055f335ac2ff8 CR3: 000000007356d000 CR4: 00000000001406e0
    [ 48.490008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 48.490008] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 48.490008] Call Trace:
    [ 48.490008] btrfs_real_readdir+0x3b7/0x4a0 [btrfs]
    [ 48.490008] iterate_dir+0x181/0x1b0
    [ 48.490008] SyS_getdents+0xa7/0x150
    [ 48.490008] ? fillonedir+0x150/0x150
    [ 48.490008] entry_SYSCALL_64_fastpath+0x18/0xad
    [ 48.490008] RIP: 0033:0x7fef5032546b
    [ 48.490008] RSP: 002b:00007ffeafcdb830 EFLAGS: 00000206 ORIG_RAX: 000000000000004e
    [ 48.490008] RAX: ffffffffffffffda RBX: 00007fef5061db38 RCX: 00007fef5032546b
    [ 48.490008] RDX: 0000000000008000 RSI: 000055f335abaff0 RDI: 0000000000000003
    [ 48.490008] RBP: 00007fef5061dae0 R08: 00007fef5061db48 R09: 0000000000000000
    [ 48.490008] R10: 000055f335abafc0 R11: 0000000000000206 R12: 00007fef5061db38
    [ 48.490008] R13: 0000000000008040 R14: 00007fef5061db38 R15: 000000000000270e
    [ 48.490008] RIP: read_extent_buffer+0xd2/0x190 [btrfs] RSP: ffffc90004803d98
    [ 48.499455] ---[ end trace 321920d8e8339505 ]---

    Fix it by adding a parameter @slot and check name_len with item boundary
    by calling btrfs_is_name_len_valid.
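
A boundary check of this kind can be sketched as follows. This is a minimal illustration, not the body of btrfs_is_name_len_valid: `toy_name_len_valid()` and its offset-based parameters are hypothetical, and the item is modelled simply as a byte range.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if a name of name_len bytes starting at name_off lies entirely
 * inside an item that occupies [item_off, item_off + item_size).
 */
static bool toy_name_len_valid(size_t item_off, size_t item_size,
                               size_t name_off, size_t name_len)
{
    size_t item_end = item_off + item_size;

    if (name_off < item_off || name_off > item_end)
        return false;
    /* Compare with the space left, so name_off + name_len cannot overflow. */
    return name_len <= item_end - name_off;
}
```

A name_len of 255 passes a fixed-limit check but fails here whenever fewer than 255 bytes remain in the item, which is the case the example above triggers.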

    Signed-off-by: Su Yue
    Signed-off-by: David Sterba

    Su Yue
     

14 Feb, 2017

2 commits

  • This goes as a separate patch because fixing that inside the patches
    caused too many conflicts.

    Signed-off-by: David Sterba

    David Sterba
     
  • Currently btrfs_ino takes a struct inode and this causes a lot of
    internal btrfs functions which consume this ino to take a VFS inode,
    rather than btrfs' own struct btrfs_inode. In order to fix this "leak"
    of VFS structs into the internals of btrfs, it is first necessary to
    eliminate all uses of struct inode for the purpose of obtaining the
    inode number. This patch does that by using BTRFS_I to convert an inode
    to a btrfs_inode. With this problem eliminated, subsequent patches will
    start eliminating the passing of struct inode altogether, eventually
    resulting in much cleaner code.

    Signed-off-by: Nikolay Borisov
    [ fix btrfs_get_extent tracepoint prototype ]
    Signed-off-by: David Sterba

    Nikolay Borisov
     

06 Dec, 2016

3 commits


28 Sep, 2016

1 commit

  • current_fs_time() uses struct super_block* as an argument.
    As per Linus's suggestion, this is changed to take struct
    inode* as a parameter instead. This is because the function
    is primarily meant for vfs inode timestamps.
    Also the function was renamed as per Arnd's suggestion.

    Change all calls to current_fs_time() to use the new
    current_time() function instead. current_fs_time() will be
    deleted.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Al Viro

    Deepa Dinamani
     

28 May, 2016

1 commit


18 May, 2016

1 commit

  • The btrfs_{set,remove}xattr inode operations check for a read-only root
    (btrfs_root_readonly) before calling into generic_{set,remove}xattr. If
    this check is moved into __btrfs_setxattr, we can get rid of
    btrfs_{set,remove}xattr.

    This patch applies to mainline, I would like to keep it together with
    the other xattr cleanups if possible, though. Could you please review?

    Thanks,
    Andreas

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     

11 Apr, 2016

1 commit


02 Mar, 2016

1 commit

  • In the listxattrs handler, we were not listing all the xattrs that are
    packed in the same btree item, which happens when multiple xattrs have
    names that produce the same checksum value when hashed with crc32c.

    Fix this by processing them all.
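
The shape of the fix, walking every entry packed into one item instead of stopping at the first, can be sketched like this. The on-item layout (`struct toy_entry_hdr` followed by the name and then the value) and the function name are assumptions made for illustration; they are not the btrfs dir item format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy per-entry header; the layout is an assumption for this sketch. */
struct toy_entry_hdr {
    uint16_t name_len;
    uint16_t data_len;
};

/*
 * Walk every entry packed into one item and copy each name, NUL-terminated,
 * into out. Returns the number of bytes written, or -1 if the item is
 * malformed or out is too small.
 */
static long toy_list_names(const unsigned char *item, size_t item_size,
                           char *out, size_t out_size)
{
    size_t cur = 0, written = 0;

    while (cur < item_size) {
        struct toy_entry_hdr hdr;

        if (item_size - cur < sizeof(hdr))
            return -1;
        memcpy(&hdr, item + cur, sizeof(hdr));
        cur += sizeof(hdr);

        /* The name and value must fit in what is left of the item. */
        if ((size_t)hdr.name_len + hdr.data_len > item_size - cur)
            return -1;
        if ((size_t)hdr.name_len + 1 > out_size - written)
            return -1;

        memcpy(out + written, item + cur, hdr.name_len);
        written += hdr.name_len;
        out[written++] = '\0';

        /* Advance past this entry's name and value to the next entry. */
        cur += (size_t)hdr.name_len + hdr.data_len;
    }
    return (long)written;
}
```

Listing only the first entry corresponds to returning after one pass of this loop, which is how colliding names such as the first three in the test case below would go missing.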

    The following test case for xfstests reproduces the issue:

    seq=`basename $0`
    seqres=$RESULT_DIR/$seq
    echo "QA output created by $seq"
    tmp=/tmp/$$
    status=1 # failure is the default!
    trap "_cleanup; exit \$status" 0 1 2 3 15

    _cleanup()
    {
    cd /
    rm -f $tmp.*
    }

    # get standard environment, filters and checks
    . ./common/rc
    . ./common/filter
    . ./common/attr

    # real QA test starts here
    _supported_fs generic
    _supported_os Linux
    _require_scratch
    _require_attrs

    rm -f $seqres.full

    _scratch_mkfs >>$seqres.full 2>&1
    _scratch_mount

    # Create our test file with a few xattrs. The first 3 xattrs have a name
    # that when given as input to a crc32c function result in the same checksum.
    # This made btrfs list only one of the xattrs through listxattrs system call
    # (because it packs xattrs with the same name checksum into the same btree
    # item).
    touch $SCRATCH_MNT/testfile
    $SETFATTR_PROG -n user.foobar -v 123 $SCRATCH_MNT/testfile
    $SETFATTR_PROG -n user.WvG1c1Td -v qwerty $SCRATCH_MNT/testfile
    $SETFATTR_PROG -n user.J3__T_Km3dVsW_ -v hello $SCRATCH_MNT/testfile
    $SETFATTR_PROG -n user.something -v pizza $SCRATCH_MNT/testfile
    $SETFATTR_PROG -n user.ping -v pong $SCRATCH_MNT/testfile

    # Now call getfattr with --dump, which calls the listxattrs system call.
    # It should list all the xattrs we have set before.
    $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/testfile | _filter_scratch

    status=0
    exit

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

18 Feb, 2016

1 commit

  • CURRENT_TIME macro is not appropriate for filesystems as it
    doesn't use the right granularity for filesystem timestamps.
    Use current_fs_time() instead.

    Signed-off-by: Deepa Dinamani
    Cc: Chris Mason
    Cc: Josef Bacik
    Cc: linux-btrfs@vger.kernel.org
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Deepa Dinamani
     

23 Jan, 2016

1 commit

  • Add wrappers for ->i_mutex access, parallel to
    mutex_{lock,unlock,trylock,is_locked,lock_nested}, with
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

19 Jan, 2016

1 commit

  • Pull btrfs updates from Chris Mason:
    "This has our usual assortment of fixes and cleanups, but the biggest
    change included is Omar Sandoval's free space tree. It's not the
    default yet, mounting -o space_cache=v2 enables it and sets a readonly
    compat bit. The tree can actually be deleted and regenerated if there
    are any problems, but it has held up really well in testing so far.

    For very large filesystems (30T+) our existing free space caching code
    can end up taking a huge amount of time during commits. The new tree
    based code is faster and less work overall to update as the commit
    progresses.

    Omar worked on this during the summer and we'll hammer on it in
    production here at FB over the next few months"

    * 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (73 commits)
    Btrfs: fix fitrim discarding device area reserved for boot loader's use
    Btrfs: Check metadata redundancy on balance
    btrfs: statfs: report zero available if metadata are exhausted
    btrfs: preallocate path for snapshot creation at ioctl time
    btrfs: allocate root item at snapshot ioctl time
    btrfs: do an allocation earlier during snapshot creation
    btrfs: use smaller type for btrfs_path locks
    btrfs: use smaller type for btrfs_path lowest_level
    btrfs: use smaller type for btrfs_path reada
    btrfs: cleanup, use enum values for btrfs_path reada
    btrfs: constify static arrays
    btrfs: constify remaining structs with function pointers
    btrfs tests: replace whole ops structure for free space tests
    btrfs: use list_for_each_entry* in backref.c
    btrfs: use list_for_each_entry_safe in free-space-cache.c
    btrfs: use list_for_each_entry* in check-integrity.c
    Btrfs: use linux/sizes.h to represent constants
    btrfs: cleanup, remove stray return statements
    btrfs: zero out delayed node upon allocation
    btrfs: pass proper enum type to start_transaction()
    ...

    Linus Torvalds
     

11 Jan, 2016

1 commit