05 Jun, 2018

1 commit

  • commit a27ba2607e60312554cbcd43fc660b2c7f29dc9c upstream.

    The struct xfs_agfl v5 header was originally introduced with
    unexpected padding that caused the AGFL to operate with one less
    slot than intended. The header has since been packed, but the fix
    left an incompatibility for users who upgrade from an old kernel
    with the unpacked header to a newer kernel with the packed header
    while the AGFL happens to wrap around the end. The newer kernel
    recognizes one extra slot at the physical end of the AGFL that the
    previous kernel did not. The new kernel will eventually attempt to
    allocate a block from that slot, which contains invalid data, causing
    a crash.

    This condition can be detected by comparing the active range of the
    AGFL to the count. While this detects a padding mismatch, it can
    also trigger false positives for unrelated flcount corruption. Since
    we cannot distinguish a size mismatch due to padding from unrelated
    corruption, we can't trust the AGFL enough to simply repopulate the
    empty slot.

    Instead, avoid unnecessarily complex detection logic and use a
    solution that can handle any form of flcount corruption that slips
    through read verifiers: distrust the entire AGFL and reset it to an
    empty state. Any valid blocks within the AGFL are intentionally
    leaked; rectifying that requires xfs_repair (which was already
    necessary given the state the AGFL was found in). The reset
    mitigates the side effect of the padding mismatch problem from a
    filesystem crash to a free space accounting inconsistency. The
    generic approach also means that this patch can be safely backported
    to kernels with or without a packed struct xfs_agfl.

    Check the AGF for an invalid freelist count on initial read from
    disk. If detected, set a flag on the xfs_perag to indicate that a
    reset is required before the AGFL can be used. In the first
    transaction that attempts to use a flagged AGFL, reset it to empty,
    warn the user about the inconsistency and allow the freelist fixup
    code to repopulate the AGFL with new blocks. The xfs_perag flag is
    cleared to eliminate the need for repeated checks on each block
    allocation operation.

    This allows kernels that include the packing fix commit 96f859d52bcb
    ("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct")
    to handle older unpacked AGFL formats without a filesystem crash.
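
    In rough C, the check reduces to verifying that the active range
    implied by the first/last indices covers exactly flcount entries. A
    minimal sketch, using struct xfs_agf field names but an assumed
    agfl_size parameter for the per-mount slot count:

    static bool agfl_needs_reset(struct xfs_agf *agf, uint32_t agfl_size)
    {
            uint32_t f = be32_to_cpu(agf->agf_flfirst);
            uint32_t l = be32_to_cpu(agf->agf_fllast);
            uint32_t c = be32_to_cpu(agf->agf_flcount);
            uint32_t active;

            /* an empty list is trivially consistent */
            if (!c)
                    return false;

            /* out-of-range indices are corruption; reset as well */
            if (f >= agfl_size || l >= agfl_size)
                    return true;

            /* the active region may wrap past the physical end */
            active = (l >= f) ? l - f + 1 : agfl_size - f + l + 1;

            /* a packed/unpacked header mismatch shows up here */
            return active != c;
    }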

    Suggested-by: Dave Chinner
    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Dave Chiluk
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     

02 Sep, 2017

2 commits

    Add a new __xfs_filemap_fault helper that implements all four page fault
    callouts, and make these methods themselves small stubs that set the
    correct write_fault flag and exit early for the non-DAX case for the
    hugepage-related ones.

    Also remove the extra size checking in the pfn_fault path, which is now
    handled in the core DAX code.

    Life would be so much simpler if we only had one method for all this.
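
    The resulting shape is roughly the following sketch (prototypes
    approximated, not the exact kernel signatures):

    /* the common helper all four callouts funnel through */
    static int __xfs_filemap_fault(struct vm_fault *vmf,
                    enum page_entry_size pe_size, bool write_fault);

    static int xfs_filemap_fault(struct vm_fault *vmf)
    {
            /* DAX can shortcut the normal fault path on write faults */
            return __xfs_filemap_fault(vmf, PE_SIZE_PTE,
                            IS_DAX(file_inode(vmf->vma->vm_file)) &&
                            (vmf->flags & FAULT_FLAG_WRITE));
    }

    static int xfs_filemap_huge_fault(struct vm_fault *vmf,
                    enum page_entry_size pe_size)
    {
            /* huge page faults only matter for DAX files */
            if (!IS_DAX(file_inode(vmf->vma->vm_file)))
                    return VM_FAULT_FALLBACK;
            return __xfs_filemap_fault(vmf, pe_size,
                            vmf->flags & FAULT_FLAG_WRITE);
    }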

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • Ordered buffers pass through the logging infrastructure without ever
    being written to the log. The way this works is that the ordered
    buffer status is transferred to the log vector at commit time via
    the ->iop_size() callback. In xlog_cil_insert_format_items(),
    ordered log vectors bypass ->iop_format() processing altogether.

    Therefore it is unnecessary for xfs_buf_item_format() to handle
    ordered buffers. Remove the unnecessary logic and assert that an
    ordered buffer never reaches this point.
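
    Schematically (signature abbreviated), the format path now just
    asserts that ordered buffers were filtered out earlier:

    static void xfs_buf_item_format(struct xfs_log_item *lip /* , lv */)
    {
            struct xfs_buf_log_item *bip = BUF_ITEM(lip);

            /*
             * Ordered buffers are tracked in the CIL but skip
             * ->iop_format() entirely, so one can never show up here.
             */
            ASSERT(!(bip->bli_flags & XFS_BLI_ORDERED));

            /* ... format dirty regions into log vectors as before ... */
    }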

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

23 Aug, 2017

1 commit

  • Torn write detection and tail overwrite detection can shift the log
    head and tail respectively in the event of CRC mismatch or
    corruption errors. Add a high-level log recovery tracepoint to dump
    the final log head/tail and make those values easily attainable in
    debug/diagnostic situations.
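
    A sketch of such a tracepoint, following the usual xfs_trace.h
    pattern (the exact event name and fields in the patch may differ):

    TRACE_EVENT(xfs_log_recover,
            TP_PROTO(struct xlog *log, xfs_daddr_t headblk,
                     xfs_daddr_t tailblk),
            TP_ARGS(log, headblk, tailblk),
            TP_STRUCT__entry(
                    __field(dev_t, dev)
                    __field(xfs_daddr_t, headblk)
                    __field(xfs_daddr_t, tailblk)
            ),
            TP_fast_assign(
                    __entry->dev = log->l_mp->m_super->s_dev;
                    __entry->headblk = headblk;
                    __entry->tailblk = tailblk;
            ),
            TP_printk("dev %d:%d headblk 0x%llx tailblk 0x%llx",
                      MAJOR(__entry->dev), MINOR(__entry->dev),
                      (unsigned long long)__entry->headblk,
                      (unsigned long long)__entry->tailblk)
    )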

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

20 Jun, 2017

1 commit

  • This is a purely mechanical patch that removes the private
    __{u,}int{8,16,32,64}_t typedefs in favor of using the system
    {u,}int{8,16,32,64}_t typedefs. This is the sed script used to perform
    the transformation and fix the resulting whitespace and indentation
    errors:

    s/typedef\t__uint8_t/typedef __uint8_t\t/g
    s/typedef\t__uint/typedef __uint/g
    s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
    s/__uint8_t\t/__uint8_t\t\t/g
    s/__uint/uint/g
    s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
    s/__int/int/g
    /^typedef.*int[0-9]*_t;$/d
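
    For instance, typedefs built on the private types end up using the
    system ones:

    /* before */
    typedef __uint32_t      xfs_agnumber_t;
    typedef __int64_t       xfs_fsize_t;

    /* after */
    typedef uint32_t        xfs_agnumber_t;
    typedef int64_t         xfs_fsize_t;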

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

19 Jun, 2017

2 commits

    The t_lsn field is no longer used, and in the current code
    t_commit_lsn serves only as temporary storage for the checkpoint
    sequence number. The start/commit LSNs are now tracked as a
    transaction group tag in the xfs_cil_ctx rather than per
    transaction, so remove both fields from the xfs_trans structure,
    along with their users, to match the design.

    Signed-off-by: Shan Hai
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Shan Hai
     
  • Reclaim during quotacheck can lead to deadlocks on the dquot flush
    lock:

    - Quotacheck populates a local delwri queue with the physical dquot
    buffers.
    - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
    dirties all of the dquots.
    - Reclaim kicks in and attempts to flush a dquot whose buffer is
    already queued on the quotacheck queue. The flush succeeds but
    queueing to the reclaim delwri queue fails as the backing buffer is
    already queued. The flush unlock is now deferred to I/O completion
    of the buffer from the quotacheck queue.
    - The dqadjust bulkstat continues and dirties the recently flushed
    dquot once again.
    - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
    the flush lock to update the backing buffers with the in-core
    recalculated values. It deadlocks on the redirtied dquot as the
    flush lock was already acquired by reclaim, but the buffer resides
    on the local delwri queue which isn't submitted until the end of
    quotacheck.

    This is reproduced by running quotacheck on a filesystem with a
    couple million inodes in low memory (512MB-1GB) situations. This is
    a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write
    buffer lists"), which removed a trylock and buffer I/O submission
    from the quotacheck dquot flush sequence.

    Quotacheck first resets and collects the physical dquot buffers in a
    delwri queue. Then, it traverses the filesystem inodes via bulkstat,
    updates the in-core dquots, flushes the corrected dquots to the
    backing buffers and finally submits the delwri queue for I/O. Since
    the backing buffers are queued across the entire quotacheck
    operation, dquot reclaim cannot possibly complete a dquot flush
    before quotacheck completes.

    Therefore, quotacheck must submit the buffer for I/O in order to
    cycle the flush lock and flush the dirty in-core dquot to the
    buffer. Add a delwri queue buffer push mechanism to submit an
    individual buffer for I/O without losing the delwri queue status and
    use it from quotacheck to avoid the deadlock. This restores the
    quotacheck behavior from before the regression was introduced.
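
    A sketch of the quotacheck side of the fix (the buffer-lookup helper
    here is invented for brevity; the push primitive is the one the
    patch adds):

    if (!xfs_dqflock_nowait(dqp)) {
            struct xfs_buf  *bp;

            /*
             * Flush locked: the backing buffer must still sit on our
             * local delwri queue, pinned in core. Find it...
             */
            bp = xfs_dquot_find_buf(mp, dqp);       /* invented helper */
            if (!bp)
                    return -EINVAL;

            /* ...submit it for I/O without losing delwri queue status... */
            xfs_buf_delwri_pushbuf(bp, buffer_list);
            xfs_buf_rele(bp);

            /* ...and retry the dquot flush once the I/O cycles the lock */
            return -EAGAIN;
    }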

    Reported-by: Martin Svec
    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

26 Apr, 2017

3 commits

  • Using bool values produces sparse warnings of this form:

    fs/xfs/./xfs_trace.h:2252:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
    fs/xfs/./xfs_trace.h:2252:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
    fs/xfs/./xfs_trace.h:2278:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
    fs/xfs/./xfs_trace.h:2278:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
    fs/xfs/./xfs_trace.h:2307:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)

    Just use a char instead to fix those up.
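
    The fix is a one-word change in the affected TP_STRUCT__entry blocks
    (field name illustrative):

    TP_STRUCT__entry(
            __field(dev_t, dev)
            __field(char, shared)   /* was: __field(bool, shared) */
    ),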

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
    The trailing newlines will lead to extra blank lines in the trace
    file, as in the following output, so remove them.
    >kworker/4:1H-1508 [004] .... 47879.101608: xfs_discard_extent: dev 8:0
    >
    >kworker/u16:2-238 [004] .... 47879.101725: xfs_extent_busy_clear: dev 8:0
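
    The fix is simply dropping the \n from the affected TP_printk format
    strings, e.g. for a tracepoint shaped like xfs_discard_extent's:

    /* before: "dev %d:%d agno %u agbno %u len %u\n" */

    /* after */
    TP_printk("dev %d:%d agno %u agbno %u len %u",
              MAJOR(__entry->dev), MINOR(__entry->dev),
              __entry->agno, __entry->agbno, __entry->len)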

    Signed-off-by: Hou Tao
    Reviewed-by: Darrick J. Wong
    [darrick: fix the getfsmap tracepoints too]
    Signed-off-by: Darrick J. Wong

    Hou Tao
     
  • The main thing that xfs_bmap_remap_alloc does is fixing the AGFL, similar
    to what we do in the space allocator. But the reflink code doesn't touch
    the allocation btree unlike the normal space allocator, so we couldn't
    care less about the state of the AGFL.

    So remove xfs_bmap_remap_alloc and just handle the di_nblocks update in
    the caller.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

25 Feb, 2017

1 commit

  • Patch series "1G transparent hugepage support for device dax", v2.

    The following series implements support for 1G transparent hugepages
    on x86 for device dax. The bulk of the code was written by Matthew
    Wilcox a while back supporting transparent 1G hugepages for fs DAX. I
    have forward ported the relevant bits to 4.10-rc. The current
    submission has only the necessary code to support device DAX.

    Comments from Dan Williams: So the motivation and intended user of this
    functionality mirrors the motivation and users of 1GB page support in
    hugetlbfs. Given expected capacities of persistent memory devices an
    in-memory database may want to reduce tlb pressure beyond what they can
    already achieve with 2MB mappings of a device-dax file. We have
    customer feedback to that effect as Willy mentioned in his previous
    version of these patches [1].

    [1]: https://lkml.org/lkml/2016/1/31/52

    Comments from Nilesh @ Oracle:

    There are applications which have a process model; and if you assume
    10,000 processes attempting to mmap all the 6TB memory available on a
    server; we are looking at the following:

    processes          : 10,000
    memory             : 6TB
    pte @ 4K page size : (6TB / 4K) entries * 8 bytes * 10,000 processes = 120,000GB
    pmd @ 2M page size : 120,000GB / 512 = ~240GB
    pud @ 1G page size : 240GB / 512 = ~480MB

    As you can see, with 2M pages this system will use up an exorbitant
    amount of DRAM to hold the page tables, but the 1G pages finally bring
    it down to a reasonable level. Memory sizes will keep increasing, so
    this number will keep increasing as well.

    An argument can be made to convert the applications from process model
    to thread model, but in the real world that may not be always practical.
    Hopefully this helps explain the use case where this is valuable.

    This patch (of 3):

    In preparation for adding the ability to handle PUD pages, convert
    vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault. The
    vm_fault structure is extended to include a union of the different page
    table pointers that may be needed, and three flag bits are reserved to
    indicate which type of pointer is in the union.
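
    A sketch of a resulting handler (the per-size handlers are
    hypothetical; the size flags are the ones this patch reserves):

    /* hypothetical per-size handlers */
    static int handle_pte(struct vm_fault *vmf);
    static int handle_pmd(struct vm_fault *vmf);
    static int handle_pud(struct vm_fault *vmf);

    static int dev_dax_huge_fault(struct vm_fault *vmf)
    {
            /* the size flag says which union member is populated */
            if (vmf->flags & FAULT_FLAG_SIZE_PUD)
                    return handle_pud(vmf);         /* vmf->pud is valid */
            if (vmf->flags & FAULT_FLAG_SIZE_PMD)
                    return handle_pmd(vmf);         /* vmf->pmd is valid */
            return handle_pte(vmf);                 /* plain PTE fault */
    }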

    [ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()]
    Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com
    [dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path]
    Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com
    Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Dave Jiang
    Signed-off-by: Ross Zwisler
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Dan Williams
    Cc: Kirill A. Shutemov
    Cc: Nilesh Choudhury
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

07 Feb, 2017

2 commits

  • Instead of preallocating all the required COW blocks in the high-level
    write code do it inside the iomap code, like we do for all other I/O.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • We currently fall back from direct to buffered writes if we detect a
    remaining shared extent in the iomap_begin callback. But by the time
    iomap_begin is called for the potentially unaligned end block we might
    have already written most of the data to disk, which we'd now write
    again using buffered I/O. To avoid this reject all writes to reflinked
    files before starting I/O so that we are guaranteed to only write the
    data once.

    The alternative would be to unshare the unaligned start and/or end block
    before doing the I/O. I think that's doable, and will actually be
    required to support reflinks on DAX file systems. But it will take a
    little more time and I'd rather get rid of the double write ASAP.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

03 Feb, 2017

1 commit

  • Christoph Hellwig pointed out that there's a potentially nasty race when
    performing simultaneous nearby directio cow writes:

    "Thread 1 writes a range from B to c

    " B --------- C
    p

    "a little later thread 2 writes from A to B

    " A --------- B
    p

    [editor's note: the 'p' denote cowextsize boundaries, which I added to
    make this more clear]

    "but the code preallocates beyond B into the range where thread
    "1 has just written, but ->end_io hasn't been called yet.
    "But once ->end_io is called thread 2 has already allocated
    "up to the extent size hint into the write range of thread 1,
    "so the end_io handler will splice the unintialized blocks from
    "that preallocation back into the file right after B."

    We can avoid this race by ensuring that thread 1 cannot accidentally
    remap the blocks that thread 2 allocated (as part of speculative
    preallocation) as part of t2's write preparation in t1's end_io handler.
    The way we make this happen is by taking advantage of the unwritten
    extent flag as an intermediate step.

    Recall that when we begin the process of writing data to shared blocks,
    we create a delayed allocation extent in the CoW fork:

    D: --RRRRRRSSSRRRRRRRR---
    C: ------DDDDDDD---------

    When a thread prepares to CoW some dirty data out to disk, it will now
    convert the delalloc reservation into an /unwritten/ allocated extent in
    the cow fork. The da conversion code tries to opportunistically
    allocate as much of a (speculatively prealloc'd) extent as possible, so
    we may end up allocating a larger extent than we're actually writing
    out:

    D: --RRRRRRSSSRRRRRRRR---
    U: ------UUUUUUU---------

    Next, we convert only the part of the extent that we're actively
    planning to write to normal (i.e. not unwritten) status:

    D: --RRRRRRSSSRRRRRRRR---
    U: ------UURRUUU---------

    If the write succeeds, the end_cow function will now scan the relevant
    range of the CoW fork for real extents and remap only the real extents
    into the data fork:

    D: --RRRRRRRRSRRRRRRRR---
    U: ------UU--UUU---------

    This ensures that we never obliterate valid data fork extents with
    unwritten blocks from the CoW fork.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

31 Jan, 2017

1 commit

  • After scratching my head looking for "xfs_busy_extent" I realized
    it's not used; it's xfs_extent_busy, and the declaration for the
    other name is bogus. Remove that and a few others as well.

    (struct xfs_log_callback is used, but the 2nd declaration is
    unnecessary).

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eric Sandeen
     

20 Oct, 2016

2 commits

    Instead of doing a full extent list search for each extent that is
    to be deleted using xfs_bmapi_read, and then doing another one inside
    of xfs_bunmapi_cow, use the same scheme that xfs_bmapi uses: look
    up the last extent to be deleted and then use the extent index to
    walk downward until we are outside the range to be deleted.
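
    Schematically (helper names invented; only the lookup-once,
    walk-down pattern matters):

    struct xfs_bmbt_irec    got;
    int                     idx;

    /* one search finds the last extent in the range... */
    idx = cow_fork_lookup_last(ifp, end_fsb, &got);     /* invented */

    /* ...then walk the in-core extent index downward */
    while (idx >= 0 && got.br_startoff + got.br_blockcount > start_fsb) {
            /* punch out the part of 'got' inside [start_fsb, end_fsb) */
            cow_fork_punch(ip, &got, start_fsb, end_fsb);   /* invented */
            if (--idx < 0)
                    break;
            cow_fork_get_extent(ifp, idx, &got);            /* invented */
    }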

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
    Instead of reserving space as the first thing in write_begin, move it
    past reading the extent in the data fork. That way we only have to read
    from the data fork once and can reuse that information for trimming the
    extent to the shared/unshared boundary. Additionally, this allows us to
    easily limit the actual write size to said boundary and avoid a
    roundtrip on the ilock.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

14 Oct, 2016

1 commit

  • …kernel/git/dgc/linux-xfs

    < XFS has gained super CoW powers! >
    ------------------------------------
           \   ^__^
            \  (oo)\_______
               (__)\       )\/\
                   ||----w |
                   ||     ||

    Pull XFS support for shared data extents from Dave Chinner:
    "This is the second part of the XFS updates for this merge cycle. This
    pullreq contains the new shared data extents feature for XFS.

    Given the complexity and size of this change I am expecting - like the
    addition of reverse mapping last cycle - that there will be some
    follow-up bug fixes and cleanups around the -rc3 stage for issues that
    I'm sure will show up once the code hits a wider userbase.

    What it is:

    At the most basic level we are simply adding shared data extents to
    XFS - i.e. a single extent on disk can now have multiple owners. To do
    this we have to add new on-disk features to both track the shared
    extents and the number of times they've been shared. This is done by
    the new "refcount" btree that sits in every allocation group. When we
    share or unshare an extent, this tree gets updated.

    Along with this new tree, the reverse mapping tree needs to be updated
    to track each owner of a shared extent. It also needs to be updated on
    every share/unshare operation. These interactions at extent allocation
    and freeing time have complex ordering and recovery constraints, so
    there's a significant amount of new intent-based transaction code to
    ensure that operations are performed atomically from both the runtime
    and integrity/crash recovery perspectives.

    We also need to break sharing when writes hit a shared extent - this
    is where the new copy-on-write implementation comes in. We allocate
    new storage and copy the original data along with the overwrite data
    into the new location. We only do this for data as we don't share
    metadata at all - each inode has its own metadata that tracks the
    shared data extents, the extents undergoing CoW and its own private
    extents.

    Of course, being XFS, nothing is simple - we use delayed allocation
    for CoW similar to how we use it for normal writes. ENOSPC is a
    significant issue here - we build on the reservation code added in
    4.8-rc1 with the reverse mapping feature to ensure we don't get
    spurious ENOSPC issues part way through a CoW operation. These
    mechanisms also help minimise fragmentation due to repeated CoW
    operations. To further reduce fragmentation overhead, we've also
    introduced a CoW extent size hint, which indicates how large a region
    we should allocate when we execute a CoW operation.

    With all this functionality in place, we can hook up .copy_file_range,
    .clone_file_range and .dedupe_file_range and we gain all the
    capabilities of reflink and other vfs provided functionality that
    enable manipulation to shared extents. We also added a fallocate mode
    that explicitly unshares a range of a file, which we implemented as an
    explicit CoW of all the shared extents in a file.

    As such, it's a huge chunk of new functionality with new on-disk
    format features and internal infrastructure. It warns at mount time
    that it is an experimental feature and that it may eat data (as we do
    with all new on-disk features until they stabilise). We have not
    released userspace support for it yet - userspace support currently
    requires downloading Darrick's xfsprogs repo and building from source,
    so access to this feature is really developer/tester only at this
    point. Initial userspace support will be released at the same time as
    the kernel containing this code.

    The new code causes 5-6 new failures with xfstests - these aren't
    serious functional failures but rather the output of tests changing
    slightly due to perturbations in layouts, space usage, etc. OTOH,
    we've added 150+ new tests to xfstests that specifically exercise this
    new functionality so it's got far better test coverage than any
    functionality we've previously added to XFS.

    Darrick has done a pretty amazing job getting us to this stage, and
    special mention also needs to go to Christoph (review, testing,
    improvements and bug fixes) and Brian (caught several intricate bugs
    during review) for the effort they've also put in.

    Summary:

    - unshare range (FALLOC_FL_UNSHARE) support for fallocate

    - copy-on-write extent size hints (FS_XFLAG_COWEXTSIZE) for fsxattr
    interface

    - shared extent support for XFS

    - copy-on-write support for shared extents

    - copy_file_range support

    - clone_file_range support (implements reflink)

    - dedupe_file_range support

    - defrag support for reverse mapping enabled filesystems"

    * tag 'xfs-reflink-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (71 commits)
    xfs: convert COW blocks to real blocks before unwritten extent conversion
    xfs: rework refcount cow recovery error handling
    xfs: clear reflink flag if setting realtime flag
    xfs: fix error initialization
    xfs: fix label inaccuracies
    xfs: remove isize check from unshare operation
    xfs: reduce stack usage of _reflink_clear_inode_flag
    xfs: check inode reflink flag before calling reflink functions
    xfs: implement swapext for rmap filesystems
    xfs: refactor swapext code
    xfs: various swapext cleanups
    xfs: recognize the reflink feature bit
    xfs: simulate per-AG reservations being critically low
    xfs: don't mix reflink and DAX mode for now
    xfs: check for invalid inode reflink flags
    xfs: set a default CoW extent size of 32 blocks
    xfs: convert unwritten status of reverse mappings for shared files
    xfs: use interval query for rmap alloc operations on shared files
    xfs: add shared rmap map/unmap/convert log item types
    xfs: increase log reservations for reflink
    ...

    Linus Torvalds
     

08 Oct, 2016

1 commit

  • Pull VFS splice updates from Al Viro:
    "There's a bunch of branches this cycle, both mine and from other folks
    and I'd rather send pull requests separately.

    This one is the conversion of ->splice_read() to ITER_PIPE iov_iter
    (and introduction of such). Gets rid of a lot of code in fs/splice.c
    and elsewhere; there will be followups, but these are for the next
    cycle... Some pipe/splice-related cleanups from Miklos in the same
    branch as well"

    * 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    pipe: fix comment in pipe_buf_operations
    pipe: add pipe_buf_steal() helper
    pipe: add pipe_buf_confirm() helper
    pipe: add pipe_buf_release() helper
    pipe: add pipe_buf_get() helper
    relay: simplify relay_file_read()
    switch default_file_splice_read() to use of pipe-backed iov_iter
    switch generic_file_splice_read() to use of ->read_iter()
    new iov_iter flavour: pipe-backed
    fuse_dev_splice_read(): switch to add_to_pipe()
    skb_splice_bits(): get rid of callback
    new helper: add_to_pipe()
    splice: lift pipe_lock out of splice_to_pipe()
    splice: switch get_iovec_page_array() to iov_iter
    splice_to_pipe(): don't open-code wakeup_pipe_readers()
    consistent treatment of EFAULT on O_DIRECT read/write

    Linus Torvalds
     

06 Oct, 2016

6 commits

  • Implement swapext for filesystems that have reverse mapping. Back in
    the reflink patches, we augmented the bmap code with a 'REMAP' flag
    that updates only the bmbt and doesn't touch the allocator and
    implemented log redo items for those two operations. Now we can
    rewrite extent swapping as a (looong) series of remap operations.

    This is far less efficient than the fork swapping method implemented
    in the past, so we only switch this on for rmap.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • When it's possible for reverse mappings to overlap (data fork extents
    of files on reflink filesystems), use the interval query function to
    find the left neighbor of an extent we're trying to add; and be
    careful to use the lookup functions to update the neighbors and/or
    add new extents.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Trim CoW reservations made on behalf of a cowextsz hint if they get too
    old or we run low on quota, so long as we don't have dirty data awaiting
    writeback or directio operations in progress.

    Garbage collection of the cowextsize extents is kept separate from
    prealloc extent reaping because setting the CoW prealloc lifetime to a
    (much) higher value than the regular prealloc extent lifetime has been
    useful for combating CoW fragmentation on VM hosts where the VMs
    experience bursty write behaviors and we can keep the utilization ratios
    low enough that we don't start to run out of space. IOWs, it benefits
    us to keep the CoW fork reservations around for as long as we can unless
    we run out of blocks or hit inode reclaim.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Due to the way the CoW algorithm in XFS works, there's an interval
    during which blocks allocated to handle a CoW can be lost -- if the FS
    goes down after the blocks are allocated but before the block
    remapping takes place. This is exacerbated by the cowextsz hint --
    allocated reservations can sit around for a while, waiting to get
    used.

    Since the refcount btree doesn't normally store records with refcount
    of 1, we can use it to record these in-progress extents. In-progress
    blocks cannot be shared because they're not user-visible, so there
    shouldn't be any conflicts with other programs. This is a better
    solution than holding EFIs during writeback because (a) EFIs can't be
    relogged currently, (b) even if they could, EFIs are bound by
    available log space, which puts an unnecessary upper bound on how much
    CoW we can have in flight, and (c) we already have a mechanism to
    track blocks.

    At mount time, read the refcount records and free anything we find
    with a refcount of 1 because those were in-progress when the FS went
    down.
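
    A sketch of that mount-time sweep (iteration helpers invented; the
    record fields follow struct xfs_refcount_irec):

    struct xfs_refcount_irec        rec;
    xfs_agnumber_t                  agno;

    for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
            /* walk this AG's refcount btree record by record */
            while (refcountbt_next_record(mp, agno, &rec)) {    /* invented */
                    if (rec.rc_refcount != 1)
                            continue;       /* genuinely shared: keep it */
                    /* refcount == 1: leftover in-progress CoW staging */
                    free_cow_staging_extent(mp, agno,           /* invented */
                                    rec.rc_startblock, rec.rc_blockcount);
            }
    }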

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • For O_DIRECT writes to shared blocks, we have to CoW them just like
    we would with buffered writes. For writes that are not block-aligned,
    just bounce them to the page cache.

    For block-aligned writes, however, we can do better than that. Use
    the same mechanisms that we employ for buffered CoW to set up a
    delalloc reservation, allocate all the blocks at once, issue the
    writes against the new blocks and use the same ioend functions to
    remap the blocks after the write. This should be fairly performant.

    Christoph discovered that xfs_reflink_allocate_cow_range may stumble
    over invalid entries in the extent array given that it drops the ilock
    but still expects the index to be stable. Simply switching it to a new
    lookup for every iteration still isn't correct given that
    xfs_bmapi_allocate will trigger a BUG_ON() if it hits a hole, and
    there is nothing preventing an xfs_bunmapi_cow call from removing
    extents once we have dropped the ilock either.

    This patch duplicates the inner loop of xfs_bmapi_allocate into a
    helper for xfs_reflink_allocate_cow_range so that it can be done under
    the same ilock critical section as our CoW fork delayed allocation.
    The directio CoW warts will be revisited in a later patch.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
     
  • ... and kill the ->splice_read() instances that can be switched to it

    Signed-off-by: Al Viro

    Al Viro
     

26 Sep, 2016

2 commits

  • Log recovery has particular rules around buffer submission along with
    tricky corner cases where independent transactions can share an LSN. As
    such, it can be difficult to follow when/why buffers are submitted
    during recovery.

    Add a couple tracepoints to post the current LSN of a record when a new
    record is being processed and when a buffer is being skipped due to LSN
    ordering. Also, update the recover item class to include the LSN of the
    current transaction for the item being processed.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • When adding a new remote attribute, we write the attribute to the
    new extent before the allocation transaction is committed. This
    means we cannot reuse busy extents as that violates crash
    consistency semantics. Hence we currently treat remote attribute
    extent allocation like userdata because it has the same overwrite
    ordering constraints as userdata.

    Unfortunately, this also allows the allocator to incorrectly apply
    extent size hints to the remote attribute extent allocation. This
    results in interesting failures, such as transaction block
    reservation overruns and in-memory inode attribute fork corruption.

    To fix this, we need to separate the busy extent reuse configuration
    from the userdata configuration. This changes the definition of
    XFS_BMAPI_METADATA slightly - it now means that the allocation is
    metadata and reuse of busy extents is acceptable due to the metadata
    ordering semantics of the journal. If this flag is not set, the
    allocation has unordered data writeback, and hence busy extent
    reuse is not allowed. It no longer implies the allocation is for
    user data, just that the data write will not be strictly ordered.
    This matches the semantics for both user data and remote attribute
    block allocation.

    As such, this patch changes the "userdata" field to a "datatype"
    field and adds a "no busy reuse" flag to the field. When we detect
    an unordered data extent allocation, we immediately set the no-reuse
    flag. We then set the "user data" flags based on the inode fork we
    are allocating the extent to. Hence we now set userdata flags only
    on data fork allocations and consider attribute fork remote extents
    to be unordered metadata extents.

    The result is that remote attribute extents now have the expected
    allocation semantics, and the data fork allocation behaviour is
    completely unchanged.
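
    In outline (flag values illustrative, per the description above):

    #define XFS_ALLOC_USERDATA      (1 << 0)  /* user data allocation */
    #define XFS_ALLOC_NOBUSY        (1 << 1)  /* busy reuse forbidden */

    /* any unordered (non-XFS_BMAPI_METADATA) allocation: no busy reuse */
    datatype = XFS_ALLOC_NOBUSY;

    /* only data fork allocations are additionally tagged as user data */
    if (whichfork == XFS_DATA_FORK)
            datatype |= XFS_ALLOC_USERDATA;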

    It should be noted that there may be other ways to fix this (e.g.
    use ordered metadata buffers for the remote attribute extent data
    write) but they are more invasive and difficult to validate both
    from a design and implementation POV. Hence this patch takes the
    simple, obvious route to fixing the problem...

    Reported-and-tested-by: Ross Zwisler
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     

19 Sep, 2016

1 commit

  • One unfortunate quirk of the reference count and reverse mapping
    btrees -- they can expand in size when blocks are written to *other*
    allocation groups if, say, one large extent becomes a lot of tiny
    extents. Since we don't want to start throwing errors in the middle
    of CoWing, we need to reserve some blocks to handle future expansion.
    The transaction block reservation counters aren't sufficient here
    because we have to have a reserve of blocks in every AG, not just
    somewhere in the filesystem.

    Therefore, create two per-AG block reservation pools. One feeds the
    AGFL so that rmapbt expansion always succeeds, and the other feeds all
    other metadata so that refcountbt expansion never fails.

    Use the count of how many reserved blocks we need to have on hand to
    create a virtual reservation in the AG. Through selective clamping of
    the maximum length of allocation requests and of the length of the
    longest free extent, we can make it look like there's less free space
    in the AG unless the reservation owner is asking for blocks.

    In other words, play some accounting tricks in-core to make sure that
    we always have blocks available. On the plus side, there's nothing to
    clean up if we crash, which is in contrast to the strategy that the
    rough draft used (actually removing extents from the freespace btrees).
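
    In-core, the effect is roughly the following (field and helper names
    are approximations, not the real perag layout):

    /*
     * Normal allocations see the AG's free space less the two reserved
     * pools; the pool's owner may dip into its own reserve.
     */
    static xfs_extlen_t ag_visible_free(struct xfs_perag *pag, bool is_owner)
    {
            xfs_extlen_t    free = pag->pagf_freeblks;
            xfs_extlen_t    resv = pag->pag_meta_resv_blocks +
                                   pag->pag_agfl_resv_blocks;

            if (is_owner)
                    return free;
            return free > resv ? free - resv : 0;
    }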

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong