09 Jun, 2018

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "This adds a user for the new 'bytes-remaining' updates to
    memcpy_mcsafe() that you already received through Ingo via the
    x86-dax-for-linus pull.

    Not included here, but still targeting this cycle, is support for
    handling memory media errors (poison) consumed via userspace dax
    mappings.

    Summary:

    - DAX broke a fundamental assumption of truncate of file mapped
    pages. The truncate path assumed that it is safe to disconnect a
    pinned page from a file and let the filesystem reclaim the physical
    block. With DAX the page is equivalent to the filesystem block.
    Introduce dax_layout_busy_page() to enable filesystems to wait for
    pinned DAX pages to be released. Without this wait a filesystem
    could allocate blocks under active device-DMA to a new file.

    - DAX arranges for the block layer to be bypassed and uses
    dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
    However, the memcpy_mcsafe() facility is available through the pmem
    block driver. In order to safely handle media errors, via the DAX
    block-layer bypass, introduce copy_to_iter_mcsafe().

    - Fix cache management policy relative to the ACPI NFIT Platform
    Capabilities Structure to properly elide cache flushes when they
    are not necessary. The table indicates whether CPU caches are
    power-fail protected. Clarify that a deep flush is always performed
    on REQ_{FUA,PREFLUSH} requests"

    * tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
    dax: Use dax_write_cache* helpers
    libnvdimm, pmem: Do not flush power-fail protected CPU caches
    libnvdimm, pmem: Unconditionally deep flush on *sync
    libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
    acpi, nfit: Remove ecc_unit_size
    dax: dax_insert_mapping_entry always succeeds
    libnvdimm, e820: Register all pmem resources
    libnvdimm: Debug probe times
    linvdimm, pmem: Preserve read-only setting for pmem devices
    x86, nfit_test: Add unit test for memcpy_mcsafe()
    pmem: Switch to copy_to_iter_mcsafe()
    dax: Report bytes remaining in dax_iomap_actor()
    dax: Introduce a ->copy_to_iter dax operation
    uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
    xfs, dax: introduce xfs_break_dax_layouts()
    xfs: prepare xfs_break_layouts() for another layout type
    xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
    mm, fs, dax: handle layout changes to pinned dax mappings
    mm: fix __gup_device_huge vs unmap
    mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
    ...

    Linus Torvalds
     

31 May, 2018

2 commits

  • The function's return values are confusing given its name. We expect a
    true or false return value, but it actually returns 0/-errno, which
    makes the calling code hard to read. Change the function to return a
    bool: true if DAX is supported, false if it is not.

    Signed-off-by: Dave Jiang
    Signed-off-by: Ross Zwisler
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Jiang
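
The return-value confusion described in the commit above can be sketched as follows. Both check functions are hypothetical stand-ins, not the kernel's bdev_dax_supported(), and the choice of EOPNOTSUPP as the error code is an assumption made only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical old-style check: 0 means "supported", -errno means "not". */
static int dax_supported_errno(bool device_has_dax)
{
    return device_has_dax ? 0 : -EOPNOTSUPP;
}

/* Hypothetical new-style check: the return value reads like the name. */
static bool dax_supported_bool(bool device_has_dax)
{
    return device_has_dax;
}

/*
 * A caller that reads the old return value as a truth value gets the
 * answer backwards: 0 (supported) is false, -errno (unsupported) is true.
 */
static bool naive_if_check(bool device_has_dax)
{
    if (dax_supported_errno(device_has_dax))
        return true;    /* reached only when DAX is NOT supported */
    return false;
}
```

With a bool return, `if (dax_supported_bool(x))` means what it says, and no caller has to remember to negate the result.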
     
  • Change bdev_dax_supported so it takes a bdev parameter. This enables
    multi-device filesystems like xfs to check that a dax device can work for
    the particular filesystem. Once that's in place, actually fix all the
    parts of XFS where we need to be able to distinguish between datadev and
    rtdev.

    This patch fixes the problem where we screw up the dax support checking
    in xfs if the datadev and rtdev have different dax capabilities.

    Signed-off-by: Darrick J. Wong
    [rez: Re-added __bdev_dax_supported() for !CONFIG_FS_DAX cases]
    Signed-off-by: Ross Zwisler
    Reviewed-by: Eric Sandeen

    Darrick J. Wong
     

22 May, 2018

2 commits

  • When xfs is operating as the back-end of a pNFS block server, it
    prevents collisions between local and remote operations by requiring a
    lease to be held for remotely accessed blocks. Local filesystem
    operations break those leases before writing or mutating the extent map
    of the file.

    A similar mechanism is needed to prevent operations on pinned dax
    mappings, like device-DMA, from colliding with extent unmap operations.

    BREAK_WRITE and BREAK_UNMAP are introduced as two distinct levels of
    layout breaking.

    Layouts are broken in the BREAK_WRITE case to ensure that layout-holders
    do not collide with local writes. Additionally, layouts are broken in
    the BREAK_UNMAP case to make sure the layout-holder has a consistent
    view of the file's extent map. While BREAK_WRITE breaks can be satisfied
    by recalling FL_LAYOUT leases, BREAK_UNMAP breaks additionally require
    waiting for busy dax-pages to go idle while holding XFS_MMAPLOCK_EXCL.

    After this refactoring xfs_break_layouts() becomes the entry point for
    coordinating both types of breaks. Finally, xfs_break_leased_layouts()
    becomes just the BREAK_WRITE handler.

    Note that the unlock tracking is needed in a follow-on change. That will
    coordinate retrying either break handler until both successfully test
    for a lease break while maintaining the lock state.

    Cc: Ross Zwisler
    Cc: "Darrick J. Wong"
    Reported-by: Dave Chinner
    Reported-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
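
The two break levels described in the commit above can be sketched as a single dispatch. This is a simplified illustration, not the XFS implementation: the enum, the counters, and the order of the two steps are all assumptions, and the real handlers also deal with locking and retries.

```c
#include <assert.h>

/* Hypothetical stand-ins for BREAK_WRITE and BREAK_UNMAP. */
enum demo_break_reason { DEMO_BREAK_WRITE, DEMO_BREAK_UNMAP };

/* Counters standing in for the two kinds of work a break can require. */
struct demo_counts {
    int dax_waits;      /* waited for busy dax pages to go idle */
    int lease_recalls;  /* recalled layout leases */
};

/*
 * An unmap-level break does everything a write-level break does, plus
 * the wait for busy dax pages, so the unmap case falls through.
 */
static void demo_break_layouts(enum demo_break_reason reason,
                               struct demo_counts *c)
{
    switch (reason) {
    case DEMO_BREAK_UNMAP:
        c->dax_waits++;
        /* fall through */
    case DEMO_BREAK_WRITE:
        c->lease_recalls++;
        break;
    }
}
```

A write-level break only recalls leases, while an unmap-level break performs both steps.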
     
  • In preparation for adding coordination between extent unmap operations
    and busy dax-pages, update xfs_break_layouts() to permit it to be called
    with the mmap lock held. This lock scheme will be required for
    coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
    pages mapped into the file's address space). Breaking dax layouts will
    be added to xfs_break_layouts() in a future patch; for now this preps
    the unmap call sites to take and hold XFS_MMAPLOCK_EXCL over the call to
    xfs_break_layouts().

    Cc: "Darrick J. Wong"
    Cc: Ross Zwisler
    Cc: Dave Chinner
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Reviewed-by: "Darrick J. Wong"
    Signed-off-by: Dan Williams

    Dan Williams
     

16 May, 2018

1 commit

  • The GET ioctl is trivial, just return the current label.

    The SET ioctl is more involved:
    It transactionally modifies the superblock to write a new filesystem
    label to the primary super.

    A new variant of xfs_sync_sb then writes the superblock buffer
    immediately to disk so that the change is visible from userspace.

    It then invalidates any page cache that userspace might have previously
    read on the block device so that, e.g., blkid can see the change
    immediately, and updates all secondary superblocks as userspace relabel
    does.

    Signed-off-by: Eric Sandeen
    [darrick: use dchinner's new xfs_update_secondary_sbs function]
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eric Sandeen
     

09 Jan, 2018

2 commits


10 Nov, 2017

1 commit


27 Oct, 2017

3 commits

  • Scrub the fields within an inode.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner

    Darrick J. Wong
     
  • Create an ioctl that can be used to scrub internal filesystem metadata.
    The new ioctl takes the metadata type, an (optional) AG number, an
    (optional) inode number and generation, and a flags argument. This will
    be used by the upcoming XFS online scrub tool.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner

    Darrick J. Wong
     
  • Instead of passing in a formatter callback allocate the bmap buffer
    in the caller and process the entries there. Additionally replace
    the in-kernel buffer with a new much smaller structure, and unify
    the implementation of the different ioctls in a single function.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

27 Sep, 2017

1 commit

  • Currently only the blocksize is checked, but we should really be calling
    bdev_dax_supported() which also tests to make sure we can get a
    struct dax_device and that the dax_direct_access() path is working.

    This is the same check that we do for the "-o dax" mount option in
    xfs_fs_fill_super().

    This does not fix the race issues that caused the XFS DAX inode option to
    be disabled, so that option will still be disabled. If/when we re-enable
    it, though, I think we will want this issue to have been fixed. I also do
    think that we want to fix this in stable kernels.

    Signed-off-by: Ross Zwisler
    CC: stable@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Ross Zwisler
     

02 Sep, 2017

2 commits


28 Jun, 2017

1 commit

  • Remove the xfs_etest structure in favor of a per-mountpoint structure.
    This will give us the flexibility to set as many error injection points
    as we want, and later enable us to set up sysfs knobs to set the trigger
    frequency as we wish. This comes at a cost of higher memory use, but
    until we hit 1024 injection points (we're at 29) or a lot of mounts this
    shouldn't be a huge issue.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Reviewed-by: Carlos Maiolino

    Darrick J. Wong
     

20 Jun, 2017

1 commit

  • This is a purely mechanical patch that removes the private
    __{u,}int{8,16,32,64}_t typedefs in favor of using the system
    {u,}int{8,16,32,64}_t typedefs. This is the sed script used to perform
    the transformation and fix the resulting whitespace and indentation
    errors:

    s/typedef\t__uint8_t/typedef __uint8_t\t/g
    s/typedef\t__uint/typedef __uint/g
    s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
    s/__uint8_t\t/__uint8_t\t\t/g
    s/__uint/uint/g
    s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
    s/__int/int/g
    /^typedef.*int[0-9]*_t;$/d

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

19 Jun, 2017

1 commit

  • XFS_HSIZE is an extremely confusing way to calculate the size of handle_t.
    Given that handle_t always only had two sizes, and one of them isn't
    even covered by XFS_HSIZE to start with just remove the macro and use
    a constant sizeof expression.

    Note that XFS_HSIZE isn't used in xfsprogs, xfsdump or xfstests either.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Eric Sandeen
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
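
To illustrate the point of the commit above, here is a minimal sketch comparing a size computed from a field offset plus a stored length with a plain sizeof. The structures are simplified, hypothetical stand-ins for the XFS handle layout, and the sketch assumes the stored length always covers the rest of the fid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for a file identifier. */
struct demo_fid {
    uint16_t fid_len;   /* length of the fid, not counting fid_len itself */
    uint16_t fid_pad;
    uint32_t fid_gen;
    uint64_t fid_ino;
};

/* Simplified, hypothetical stand-in for a handle: fsid followed by fid. */
struct demo_handle {
    uint32_t ha_fsid[2];
    struct demo_fid ha_fid;
};

/* Build a handle whose fid_len covers the whole remainder of the fid. */
static struct demo_handle make_full_handle(void)
{
    struct demo_handle h = { {0, 0}, {0, 0, 0, 0} };

    h.ha_fid.fid_len = sizeof(struct demo_fid) - sizeof(h.ha_fid.fid_len);
    return h;
}

/* Macro-style size: offset of the field after fid_len plus the stored length. */
static size_t handle_size_computed(struct demo_handle h)
{
    return offsetof(struct demo_handle, ha_fid.fid_pad) + h.ha_fid.fid_len;
}

/* Constant-style size: a plain sizeof expression. */
static size_t handle_size_constant(void)
{
    return sizeof(struct demo_handle);
}
```

Under that assumption both functions give the same answer, so the constant form says the same thing without the pointer arithmetic.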
     

26 Apr, 2017

3 commits

  • At the end of a getfsmap call, we will set FMR_OF_LAST in the last
    struct fsmap that was handed in by userspace if we've truly run out of
    space mapping records (as opposed to simply running out of space in the
    user array). Unfortunately, fmh_entries is the wrong check for whether
    or not we've filled out anything in the user array because the ioctl
    provides that fmh_count==0 sets fmh_entries without filling out the user
    array. Therefore we end up writing things into user memory areas that we
    weren't given, and kaboom.

    Since Christoph amended the getfsmap structure to track the number of
    fsmap entries we've actually filled out, use that as part of deciding if
    we have to set the OF_LAST flag.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
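
The bug described in the commit above can be sketched with a toy query function. All names here are hypothetical stand-ins (DEMO_OF_LAST for FMR_OF_LAST, demo_head for the fsmap header), and the record source is reduced to a simple count. The point is that the "last record" flag is guarded by the number of records actually written, not by the entries field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_OF_LAST 0x1u   /* hypothetical stand-in for FMR_OF_LAST */

struct demo_rec {
    uint32_t flags;
};

struct demo_head {
    uint32_t count;         /* slots supplied by the caller; 0 = count only */
    uint32_t entries;       /* records reported; set even when count == 0 */
    struct demo_rec *recs;  /* caller's array; may be NULL when count == 0 */
};

/* Returns how many records were actually written into h->recs. */
static uint32_t demo_getfsmap(struct demo_head *h, uint32_t total)
{
    uint32_t filled = 0;

    assert(h->count == 0 || h->recs != NULL);

    if (h->count == 0) {
        h->entries = total;     /* count-only query: nothing is written */
    } else {
        while (filled < h->count && filled < total) {
            h->recs[filled].flags = 0;
            filled++;
        }
        h->entries = filled;
    }

    /*
     * Guard on what was written. Testing h->entries here would touch
     * h->recs in the count-only case, where no array was supplied.
     */
    if (filled > 0 && filled == total)
        h->recs[filled - 1].flags |= DEMO_OF_LAST;

    return filled;
}
```

A count-only call reports the number of records without touching the array, and a call that runs out of array space before running out of records leaves the flag unset.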
     
  • By passing the whole fsmap_head structure and an index we can get the
    user pointer annotations right for the embedded variable sized array
    in struct fsmap_head.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    [darrick: change idx to unsigned int]
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • Found by sparse.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

04 Apr, 2017

2 commits


02 Mar, 2017

1 commit


31 Jan, 2017

1 commit


18 Dec, 2016

1 commit

  • …/linux/kernel/git/mszeredi/vfs

    Pull partial readlink cleanups from Miklos Szeredi.

    This is the uncontroversial part of the readlink cleanup patch-set that
    simplifies the default readlink handling.

    Miklos and Al are still discussing the rest of the series.

    * git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    vfs: make generic_readlink() static
    vfs: remove ".readlink = generic_readlink" assignments
    vfs: default to generic_readlink()
    vfs: replace calling i_op->readlink with vfs_readlink()
    proc/self: use generic_readlink
    ecryptfs: use vfs_get_link()
    bad_inode: add missing i_op initializers

    Linus Torvalds
     

09 Dec, 2016

1 commit


30 Nov, 2016

1 commit

  • This patch drops the XFS-own i_iolock and uses the VFS i_rwsem which
    recently replaced i_mutex instead. This means we only have to take
    one lock instead of two in many fast path operations, and we can
    also shrink the xfs_inode structure. Thanks to the xfs_ilock family
    there is very little churn, the only thing of note is that we need
    to switch to use the lock_two_directory helper for taking the i_rwsem
    on two inodes in a few places to make sure our lock order matches
    the one used in the VFS.

    Signed-off-by: Christoph Hellwig
    Tested-by: Jens Axboe
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

08 Nov, 2016

1 commit

  • The open-coded pattern:

    ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)

    is all over the xfs code; provide a new helper
    xfs_iext_count(ifp) to count the number of inline extents
    in an inode fork.

    [dchinner: pick up several missed conversions]

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Eric Sandeen
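
The helper described in the commit above amounts to wrapping the open-coded division in one place. A minimal sketch, using hypothetical stand-in types rather than the real XFS fork and record structures:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent record (two 64-bit words). */
struct demo_bmbt_rec {
    uint64_t l0;
    uint64_t l1;
};

/* Hypothetical stand-in for an inode fork; only the byte count matters. */
struct demo_ifork {
    int if_bytes;   /* bytes of extent records held in the fork */
};

/* Number of extent records in the fork, replacing the open-coded division. */
static inline unsigned int demo_iext_count(const struct demo_ifork *ifp)
{
    assert(ifp->if_bytes >= 0);
    return ifp->if_bytes / (unsigned int)sizeof(struct demo_bmbt_rec);
}
```

Callers then write `demo_iext_count(ifp)` instead of repeating the cast and division at every site.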
     

14 Oct, 2016

1 commit

  • …kernel/git/dgc/linux-xfs

    < XFS has gained super CoW powers! >
     ----------------------------------
            \   ^__^
             \  (oo)\_______
                (__)\       )\/\
                    ||----w |
                    ||     ||

    Pull XFS support for shared data extents from Dave Chinner:
    "This is the second part of the XFS updates for this merge cycle. This
    pullreq contains the new shared data extents feature for XFS.

    Given the complexity and size of this change I am expecting - like the
    addition of reverse mapping last cycle - that there will be some
    follow-up bug fixes and cleanups around the -rc3 stage for issues that
    I'm sure will show up once the code hits a wider userbase.

    What it is:

    At the most basic level we are simply adding shared data extents to
    XFS - i.e. a single extent on disk can now have multiple owners. To do
    this we have to add new on-disk features to both track the shared
    extents and the number of times they've been shared. This is done by
    the new "refcount" btree that sits in every allocation group. When we
    share or unshare an extent, this tree gets updated.

    Along with this new tree, the reverse mapping tree needs to be updated
    to track each owner of a shared extent. This also needs to be updated
    on every share/unshare operation. These interactions at extent allocation
    and freeing time have complex ordering and recovery constraints, so
    there's a significant amount of new intent-based transaction code to
    ensure that operations are performed atomically from both the runtime
    and integrity/crash recovery perspectives.

    We also need to break sharing when writes hit a shared extent - this
    is where the new copy-on-write implementation comes in. We allocate
    new storage and copy the original data along with the overwrite data
    into the new location. We only do this for data as we don't share
    metadata at all - each inode has its own metadata that tracks the
    shared data extents, the extents undergoing CoW and its own private
    extents.

    Of course, being XFS, nothing is simple - we use delayed allocation
    for CoW similar to how we use it for normal writes. ENOSPC is a
    significant issue here - we build on the reservation code added in
    4.8-rc1 with the reverse mapping feature to ensure we don't get
    spurious ENOSPC issues part way through a CoW operation. These
    mechanisms also help minimise fragmentation due to repeated CoW
    operations. To further reduce fragmentation overhead, we've also
    introduced a CoW extent size hint, which indicates how large a region
    we should allocate when we execute a CoW operation.

    With all this functionality in place, we can hook up .copy_file_range,
    .clone_file_range and .dedupe_file_range and we gain all the
    capabilities of reflink and other vfs provided functionality that
    enable manipulation to shared extents. We also added a fallocate mode
    that explicitly unshares a range of a file, which we implemented as an
    explicit CoW of all the shared extents in a file.

    As such, it's a huge chunk of new functionality with new on-disk
    format features and internal infrastructure. It warns at mount time as
    an experimental feature and that it may eat data (as we do with all
    new on-disk features until they stabilise). We have not released
    userspace support for it yet - userspace support currently requires
    download from Darrick's xfsprogs repo and build from source, so the
    access to this feature is really developer/tester only at this point.
    Initial userspace support will be released at the same time the kernel
    with this code in it is released.

    The new code causes 5-6 new failures with xfstests - these aren't
    serious functional failures but things like the output of tests changing
    slightly due to perturbations in layouts, space usage, etc. OTOH,
    we've added 150+ new tests to xfstests that specifically exercise this
    new functionality so it's got far better test coverage than any
    functionality we've previously added to XFS.

    Darrick has done a pretty amazing job getting us to this stage, and
    special mention also needs to go to Christoph (review, testing,
    improvements and bug fixes) and Brian (caught several intricate bugs
    during review) for the effort they've also put in.

    Summary:

    - unshare range (FALLOC_FL_UNSHARE) support for fallocate

    - copy-on-write extent size hints (FS_XFLAG_COWEXTSIZE) for fsxattr
    interface

    - shared extent support for XFS

    - copy-on-write support for shared extents

    - copy_file_range support

    - clone_file_range support (implements reflink)

    - dedupe_file_range support

    - defrag support for reverse mapping enabled filesystems"

    * tag 'xfs-reflink-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (71 commits)
    xfs: convert COW blocks to real blocks before unwritten extent conversion
    xfs: rework refcount cow recovery error handling
    xfs: clear reflink flag if setting realtime flag
    xfs: fix error initialization
    xfs: fix label inaccuracies
    xfs: remove isize check from unshare operation
    xfs: reduce stack usage of _reflink_clear_inode_flag
    xfs: check inode reflink flag before calling reflink functions
    xfs: implement swapext for rmap filesystems
    xfs: refactor swapext code
    xfs: various swapext cleanups
    xfs: recognize the reflink feature bit
    xfs: simulate per-AG reservations being critically low
    xfs: don't mix reflink and DAX mode for now
    xfs: check for invalid inode reflink flags
    xfs: set a default CoW extent size of 32 blocks
    xfs: convert unwritten status of reverse mappings for shared files
    xfs: use interval query for rmap alloc operations on shared files
    xfs: add shared rmap map/unmap/convert log item types
    xfs: increase log reservations for reflink
    ...

    Linus Torvalds
     

10 Oct, 2016

1 commit


06 Oct, 2016

3 commits

  • Since we don't have a strategy for handling both DAX and reflink,
    for now we'll just prohibit both being set at the same time.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • We don't support sharing blocks on the realtime device. Flag inodes
    with the reflink or cowextsize flags set when the reflink feature is
    disabled.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Create a per-inode extent size allocator hint for copy-on-write. This
    hint is separate from the existing extent size hint so that CoW can
    take advantage of the fragmentation-reducing properties of extent size
    hints without disabling delalloc for regular writes.

    The extent size hint that's fed to the allocator during a copy on
    write operation is the greater of the cowextsize and regular extsize
    hint.

    During reflink, if we're sharing the entire source file to the entire
    destination file and the destination file doesn't already have a
    cowextsize hint, propagate the source file's cowextsize hint to the
    destination file.

    Furthermore, zero the bulkstat buffer prior to setting the fields
    so that we don't copy kernel memory contents into userspace.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
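
The hint selection rule described in the commit above ("the greater of the cowextsize and regular extsize hint") can be sketched as a one-line helper. The function name is hypothetical, and the sketch leaves out any default that applies when both hints are zero.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick the extent size hint fed to the allocator for a copy-on-write
 * allocation: the larger of the CoW hint and the regular hint, both
 * expressed in filesystem blocks. A value of 0 means "no hint set".
 */
static uint32_t demo_cow_alloc_hint(uint32_t cowextsize, uint32_t extsize)
{
    return cowextsize > extsize ? cowextsize : extsize;
}
```

For example, a file with a CoW hint of 16 blocks and a regular hint of 4 blocks would allocate CoW extents using 16, while a file with only a regular hint of 8 would use 8.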
     

22 Sep, 2016

1 commit

  • To avoid clearing of capabilities or security related extended
    attributes too early, inode_change_ok() will need to take dentry instead
    of inode. Propagate dentry down to functions calling inode_change_ok().
    This is rather straightforward except for xfs_set_mode() function which
    does not have dentry easily available. Luckily that function does not
    call inode_change_ok() anyway so we just have to do a little dance with
    function prototypes.

    Acked-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     

07 Aug, 2016

1 commit

  • In most cases, EPERM is returned on immutable inode, and there're only a
    few places returning EACCES. I noticed this when running LTP on
    overlayfs, setxattr03 failed due to unexpected EACCES on immutable
    inode.

    So convert all EACCES to EPERM on immutable inodes.

    Acked-by: Dave Chinner
    Signed-off-by: Eryu Guan
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Eryu Guan
     

03 Aug, 2016

1 commit


20 Jul, 2016

3 commits