05 Oct, 2016

1 commit

  • Modify the writepage handler to find and convert pending delalloc
    extents to real allocations. Furthermore, when we're doing non-CoW
    writes to a part of a file that already has a CoW reservation (the
    cowextsz hint that we set up in a subsequent patch facilitates this),
    promote the write to copy-on-write so that the entire extent can get
    written out as a single extent on disk, thereby reducing post-CoW
    fragmentation.

    Christoph moved the CoW support code in _map_blocks to a separate helper
    function, refactored other functions, and reduced the number of CoW fork
    lookups, so I merged those changes here to reduce churn.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
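
    A minimal sketch of the promotion check described above, assuming a
    CoW-fork lookup helper along the lines of the reflink series'
    xfs_reflink_find_cow_mapping(); the name and simplified signature
    here are illustrative, not the exact patch:

      /*
       * Sketch: before mapping a writeback range the usual way, see
       * whether the CoW fork already holds a reservation covering this
       * offset; if so, promote the whole write to copy-on-write.
       */
      static int
      xfs_map_cow_sketch(
              struct xfs_inode        *ip,
              xfs_off_t               offset,
              struct xfs_bmbt_irec    *imap,
              int                     *io_type)
      {
              if (!xfs_is_reflink_inode(ip))
                      return 0;

              /* assumed helper: CoW fork lookup at this file offset */
              if (xfs_reflink_find_cow_mapping(ip, offset, imap)) {
                      *io_type = XFS_IO_COW;
                      return 1;
              }
              return 0;
      }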
     

19 Sep, 2016

1 commit

  • Rename the current function to __xfs_setfilesize and add a non-static
    wrapper that also takes care of creating the transaction. This new
    helper will be used by the new iomap-based DAX path.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
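
    The shape of the refactor, roughly (a sketch based on the 4.9-era
    code; xfs_trans_alloc() and the tr_fsyncts reservation are from
    that tree):

      /*
       * Non-static wrapper: allocate the transaction, then call the
       * renamed static helper that performs the actual size update.
       */
      int
      xfs_setfilesize(
              struct xfs_inode        *ip,
              xfs_off_t               offset,
              size_t                  size)
      {
              struct xfs_mount        *mp = ip->i_mount;
              struct xfs_trans        *tp;
              int                     error;

              error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts,
                              0, 0, 0, &tp);
              if (error)
                      return error;

              return __xfs_setfilesize(ip, tp, offset, size);
      }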
     

06 Apr, 2016

3 commits

  • This patch implements two closely related changes: First it embeds
    a bio in the ioend structure so that we don't have to allocate one
    separately. Second it uses the block layer bio chaining mechanism
    to chain additional bios off this first one if needed instead of
    manually accounting for multiple bio completions in the ioend
    structure. Together this removes a memory allocation per ioend and
    greatly simplifies the ioend setup and I/O completion path.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Christoph Hellwig
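
    After this change the ioend looks roughly like the following (a
    sketch of the struct as merged for 4.7; field comments added here):

      struct xfs_ioend {
              struct list_head        io_list;        /* next ioend in chain */
              unsigned int            io_type;        /* delalloc / unwritten */
              struct inode            *io_inode;      /* file being written to */
              size_t                  io_size;        /* size of the extent */
              xfs_off_t               io_offset;      /* offset in the file */
              struct work_struct      io_work;        /* completion work item */
              struct xfs_trans        *io_append_trans; /* xact for size update */
              struct bio              *io_bio;        /* bio being built */
              struct bio              io_inline_bio;  /* MUST BE LAST! */
      };

    Because io_inline_bio is the final member, the ioend is carved out
    of the first bio's front padding from a dedicated bio_set, which is
    what removes the separate ioend allocation.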
     
  • Completion of an ioend requires us to walk the bufferhead list to
    end writeback on all the bufferheads. This, in turn, is needed so
    that we can end writeback on all the pages we just did IO on.

    To remove our dependency on bufferheads in writeback, we need to
    turn this around the other way - we need to walk the pages we've
    just completed IO on, and then walk the buffers attached to the
    pages and complete their IO. In doing this, we remove the
    requirement for the ioend to track bufferheads directly.

    To enable IO completion to walk all the pages we've submitted IO on,
    we need to keep the bios that we used for IO around until the ioend
    has been completed. We can do this simply by chaining the bios to
    the ioend at completion time, and then walking their pages directly
    just before destroying the ioend.

    Signed-off-by: Dave Chinner
    [hch: changed the xfs_finish_page_writeback calling convention]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
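
    Completion then walks the chained bios and their pages instead of a
    bufferhead list; roughly (a sketch following the shape of the
    4.7-era xfs_destroy_ioend()):

      /*
       * Sketch: finish writeback on every page segment of every bio
       * chained to this ioend, then drop the bios. The next pointer is
       * read before bio_put() because the first bio is embedded in
       * (and freed along with) the ioend itself.
       */
      static void
      xfs_destroy_ioend(struct xfs_ioend *ioend, int error)
      {
              struct inode    *inode = ioend->io_inode;
              struct bio      *bio = &ioend->io_inline_bio;
              struct bio      *last = ioend->io_bio, *next;

              for (; bio; bio = next) {
                      struct bio_vec  *bvec;
                      int             i;

                      /* bio_chain() stored the next bio in bi_private */
                      next = (bio == last) ? NULL : bio->bi_private;

                      bio_for_each_segment_all(bvec, bio, i)
                              xfs_finish_page_writeback(inode, bvec, error);
                      bio_put(bio);
              }
      }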
     
  • Currently adding a buffer to the ioend and then building a bio from
    the buffer list are two separate operations. We don't build the bios
    and submit them until the ioend is submitted, and this places a
    fixed dependency on bufferhead chaining in the ioend.

    The first step to removing the bufferhead chaining in the ioend is
    on the IO submission side. We can build the bio directly as we add
    the buffers to the ioend chain, thereby removing the need for a
    later "buffer-to-bio" submission loop. This allows us to submit
    bios on large ioends as soon as we can no longer add more data
    to the bio.

    These bios then get captured by the active plug, and hence will be
    dispatched as soon as either the plug overflows or we schedule away
    from the writeback context. This will reduce submission latency for
    large IOs, but will also allow more timely request queue based
    writeback blocking when the device becomes congested.

    Signed-off-by: Dave Chinner
    [hch: various small updates]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
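
    When the bio being built fills up mid-ioend, it is submitted on the
    spot and a fresh bio is chained in its place; roughly (a sketch
    using the later one-argument submit_bio() spelling, with the
    bdev/sector setup of the new bio elided):

      static void
      xfs_chain_bio_sketch(
              struct xfs_ioend                *ioend,
              struct writeback_control        *wbc)
      {
              struct bio *new = bio_alloc(GFP_NOFS, BIO_MAX_PAGES);

              /*
               * Chain the full bio to the new one: the new bio will not
               * complete until the submitted one has, so the ioend still
               * sees one completion at the end of the whole chain.
               */
              bio_chain(ioend->io_bio, new);

              ioend->io_bio->bi_opf = REQ_OP_WRITE |
                      (wbc->sync_mode == WB_SYNC_ALL ? REQ_SYNC : 0);
              submit_bio(ioend->io_bio);  /* captured by the active plug */
              ioend->io_bio = new;
      }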
     

22 Mar, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "There's quite a lot in this request, and there's some cross-over with
    ext4, dax and quota code due to the nature of the changes being made.

    As for the rest of the XFS changes, there are lots of little things
    all over the place, which add up to a lot of changes in the end.

    The major changes are that we've reduced the size of the struct
    xfs_inode by ~100 bytes (gives an inode cache footprint reduction of
    >10%), the writepage code now only does a single set of mapping tree
    lookups and so uses less CPU, delayed allocation reservations won't
    overrun under random write loads anymore, and we added compile time
    verification for on-disk structure sizes so we find out when a commit
    or platform/compiler change breaks the on disk structure as early as
    possible.

    Change summary:

    - error propagation fixes for direct IO failures, for both XFS and
    ext4
    - new quota interfaces and XFS implementation for iterating all the
    quota IDs in the filesystem
    - locking fixes for real-time device extent allocation
    - reduction of duplicate information in the xfs and vfs inode, saving
    roughly 100 bytes of memory per cached inode.
    - buffer flag cleanup
    - rework of the writepage code to use the generic write clustering
    mechanisms
    - several fixes for inode flag based DAX enablement
    - rework of remount option parsing
    - compile time verification of on-disk format structure sizes
    - delayed allocation reservation overrun fixes
    - lots of little error handling fixes
    - small memory leak fixes
    - enable xfsaild freezing again"

    * tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (66 commits)
    xfs: always set rvalp in xfs_dir2_node_trim_free
    xfs: ensure committed is initialized in xfs_trans_roll
    xfs: borrow indirect blocks from freed extent when available
    xfs: refactor delalloc indlen reservation split into helper
    xfs: update freeblocks counter after extent deletion
    xfs: debug mode forced buffered write failure
    xfs: remove impossible condition
    xfs: check sizes of XFS on-disk structures at compile time
    xfs: ioends require logically contiguous file offsets
    xfs: use named array initializers for log item dumping
    xfs: fix computation of inode btree maxlevels
    xfs: reinitialise per-AG structures if geometry changes during recovery
    xfs: remove xfs_trans_get_block_res
    xfs: fix up inode32/64 (re)mount handling
    xfs: fix format specifier , should be %llx and not %llu
    xfs: sanitize remount options
    xfs: convert mount option parsing to tokens
    xfs: fix two memory leaks in xfs_attr_list.c error paths
    xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
    xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
    ...

    Linus Torvalds
     

28 Feb, 2016

1 commit

  • dax_clear_blocks() needs a valid struct block_device and previously it
    was using inode->i_sb->s_bdev in all cases. This is correct for normal
    inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
    DAX raw block devices and for XFS real-time devices.

    Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
    its arguments to take a bdev and a sector instead of an inode and a
    block. This better reflects what the function does, and it allows the
    filesystem and raw block device code to pass in an appropriate struct
    block_device.

    Signed-off-by: Ross Zwisler
    Suggested-by: Dan Williams
    Reviewed-by: Jan Kara
    Cc: Theodore Ts'o
    Cc: Al Viro
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
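
    The interface change itself is small; prototypes reconstructed from
    the description above (sketch):

      /* old: bdev derived from inode->i_sb->s_bdev inside the helper,
       * which is wrong for raw block devices and XFS RT devices */
      int dax_clear_blocks(struct inode *inode, sector_t block, long size);

      /* new: the caller passes the correct device and start sector */
      int dax_clear_sectors(struct block_device *bdev, sector_t sector,
                            long size);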
     

15 Feb, 2016

2 commits

  • Currently we can build a long ioend chain during ->writepages that
    gets attached to the writepage context. IO submission only then
    occurs when we finish all the writepage processing. This means we
    can have many ioends allocated and pending, and this violates the
    mempool guarantees that we need to give about forwards progress.
    i.e. we really should only have one ioend being built at a time,
    otherwise we may drain the mempool trying to allocate a new ioend
    and that blocks submission, completion and freeing of ioends that
    are already in progress.

    To prevent this situation from happening, we need to submit ioends
    for IO as soon as they are ready for dispatch rather than queuing
    them for later submission. This means the ioends have bios built
    immediately and they get queued on any plug that is currently active.
    Hence if we schedule away from writeback, the ioends that have been
    built will make forwards progress due to the plug flushing on
    context switch. This will also prevent context switches from
    creating unnecessary IO submission latency.

    We can't completely avoid having nested IO allocation - when we have
    a block size smaller than a page size, we still need to hold the
    ioend submission until after we have marked the current page dirty.
    Hence we may need multiple ioends to be held while the current page
    is completely mapped and made ready for IO dispatch. We cannot avoid
    this problem - the current code already has this ioend chaining
    within a page so we can mostly ignore that it occurs.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • xfs_vm_writepages() calls generic_writepages to write back a range
    of a file, but then xfs_vm_writepage() clusters pages itself as it
    does not have any context it can pass between ->writepage calls
    from write_cache_pages().

    Introduce a writeback context for xfs_vm_writepages() and call
    write_cache_pages directly with our own writepage callback so that
    we can pass that context to each writepage invocation. This
    encapsulates the current mapping, whether it is valid or not, the
    current ioend and its IO type, and the ioend chain being built.

    This requires us to move the ioend submission up to the level where
    the writepage context is declared. This does mean we do not submit
    IO until we have packaged the entire writeback range, but with the
    block plugging in the writepages call this is how the IO would be
    submitted anyway.

    It also means that we need to handle discontiguous page ranges. If
    the pages sent down by write_cache_pages to the writepage callback
    are discontiguous, we need to detect this and put each discontiguous
    page range into individual ioends. This is needed to ensure that the
    ioend accurately represents the range of the file that it covers so
    that file size updates during IO completion set the size correctly.
    Failure to take into account the discontiguous ranges results in
    files being too small when writeback patterns are non-sequential.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
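
    The context introduced here is small; roughly (a sketch of the
    struct as merged for 4.6, with comments added):

      /*
       * State carried across ->writepage calls for one ->writepages
       * invocation, so each page no longer rediscovers the current
       * mapping and ioend from scratch.
       */
      struct xfs_writepage_ctx {
              struct xfs_bmbt_irec    imap;           /* current extent mapping */
              bool                    imap_valid;     /* is imap still usable? */
              unsigned int            io_type;        /* delalloc/unwritten/... */
              struct xfs_ioend        *ioend;         /* ioend chain being built */
              sector_t                last_block;     /* detects discontiguity */
      };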
     

03 Nov, 2015

2 commits

  • For DAX, we are now doing block zeroing during allocation. This
    means we no longer need a special DAX fault IO completion callback
    to do unwritten extent conversion. Because mmap never extends the
    file size (it SEGVs the process) we don't need a callback to update
    the file size, either. Hence we can remove the completion callbacks
    from the __dax_fault and __dax_mkwrite calls.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Both direct IO and DAX pass an offset and count into get_blocks that
    will overflow an s64 variable when an IO goes into the last supported
    block in a file (i.e. at offset 2^63 - 1FSB bytes). This can be seen
    from the tracing:

    xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096
    xfs_gbmap_direct: [...] offset 0x7ffffffffffff000 count 4096
    xfs_gbmap_direct_none:[...] offset 0x7ffffffffffff000 count 4096

    0x7ffffffffffff000 + 4096 = 0x8000000000000000, and hence that
    overflows the s64 offset and we fail to detect the need for a
    filesize update and an ioend is not allocated.

    This is *mostly* avoided for direct IO because such extending IOs
    occur with full block allocation, and so the "IS_UNWRITTEN()" check
    still evaluates as true and we get an ioend that way. However, doing
    single sector extending IOs to this last block will reveal that file
    size updates stop occurring after the first allocating direct IO,
    as the overflow is then exposed.

    There is one further complexity: the DAX page fault path also
    exposes the same issue in block allocation. However, page faults
    cannot extend the file size, so in this case we want to allocate the
    block but do not want to allocate an ioend to enable file size
    update at IO completion. Hence we now need to distinguish between
    the direct IO path allocation and the DAX fault path allocation to
    avoid leaking ioend structures.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
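
    The arithmetic is easy to reproduce in userspace (a standalone
    sketch of the overflow, not the actual kernel fix):

      #include <stdio.h>
      #include <stdint.h>

      int main(void)
      {
              /* last supported 4k block: offset 2^63 - 4096 */
              int64_t offset = 0x7ffffffffffff000LL;
              int64_t count = 4096;

              /* add as unsigned to avoid UB; the result has the sign
               * bit set, so interpreted as an s64 it is negative and
               * any "offset + count > isize" style check goes wrong */
              uint64_t end = (uint64_t)offset + (uint64_t)count;
              printf("end = 0x%016llx (s64: %lld)\n",
                     (unsigned long long)end, (long long)end);
              return 0;
      }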
     

04 Jun, 2015

1 commit

  • Add the initial support for DAX file operations to XFS. This
    includes the necessary block allocation and mmap page fault hooks
    for DAX to function.

    Note that there are changes to the splice interfaces to ensure that,
    for DAX, splice avoids direct page cache manipulation and instead
    takes the DAX IO paths for read/write operations.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

02 Feb, 2015

1 commit

  • Back in the days when the direct I/O ->end_io callback could be called
    from interrupt context for AIO we needed a structure to hand off to the
    workqueue, and reused the ioend structure for this purpose. These days
    ->end_io is always called from user or workqueue context, which allows us
    to avoid this memory allocation and simplify the code significantly.

    [dchinner: removed now unused xfs_finish_ioend_sync() function after
    Brian Foster did an initial review. ]

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

04 Sep, 2013

1 commit

  • Add support to the core direct-io code to defer AIO completions to user
    context using a workqueue. This replaces open-coded and less efficient
    code in XFS and ext4 (we save a memory allocation for each direct IO)
    and will be needed to properly support O_(D)SYNC for AIO.

    The communication between the filesystem and the direct I/O code requires
    a new buffer head flag, which is a bit ugly but not avoidable until the
    direct I/O code stops abusing the buffer_head structure for communicating
    with the filesystems.

    Currently this creates a per-superblock unbound workqueue for these
    completions, which is taken from an earlier patch by Jan Kara. I'm
    not really convinced about this use and would prefer a "normal" global
    workqueue with a high concurrency limit, but this needs further discussion.

    JK: Fixed ext4 part, dynamic allocation of the workqueue.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Christoph Hellwig
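
    The deferral itself boils down to queueing the final completion on
    a per-superblock workqueue; roughly (a sketch using the
    s_dio_done_wq/complete_work names this patch introduced):

      /*
       * Sketch: instead of running ->end_io from bio completion
       * context, hand the final completion to process context on the
       * superblock's unbound workqueue.
       */
      static void dio_defer_completion_sketch(struct dio *dio)
      {
              INIT_WORK(&dio->complete_work, dio_aio_complete_work);
              queue_work(dio->inode->i_sb->s_dio_done_wq,
                         &dio->complete_work);
      }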
     

14 Mar, 2012

1 commit

  • Do not use unlogged metadata updates and the VFS dirty bit for updating
    the file size after writeback. In addition to causing various problems
    with updates getting delayed for far too long, this also drags in the
    unscalable VFS dirty tracking, and is one of the few remaining unlogged
    metadata updates.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

06 Mar, 2012

1 commit

  • The new concurrency-managed workqueues are cheap enough that we can
    create per-filesystem instead of global workqueues. This allows us
    to remove the trylock-or-defer scheme on the ilock, which no longer
    helps once we hold outstanding log reservations until a size update
    finishes.

    Also allow the default concurrency on these workqueues so that I/O
    completions blocking on the ilock for one inode do not block
    processing for another inode.

    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
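
    The per-filesystem setup is cheap; roughly (a sketch along the
    lines of the patch, with the workqueue names from the XFS tree):

      int
      xfs_init_mount_workqueues(struct xfs_mount *mp)
      {
              /* named per-mount so they are easy to find in ps output */
              mp->m_data_workqueue = alloc_workqueue("xfs-data/%s",
                              WQ_MEM_RECLAIM, 0, mp->m_fsname);
              if (!mp->m_data_workqueue)
                      return -ENOMEM;

              mp->m_unwritten_workqueue = alloc_workqueue("xfs-conv/%s",
                              WQ_MEM_RECLAIM, 0, mp->m_fsname);
              if (!mp->m_unwritten_workqueue) {
                      destroy_workqueue(mp->m_data_workqueue);
                      return -ENOMEM;
              }
              return 0;
      }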
     

12 Oct, 2011

2 commits

  • We now have an i_dio_count field and surrounding infrastructure to
    wait for direct I/O completion instead of i_iocount, and we have
    never needed the iocount waits for buffered I/O given that we only
    set the page uptodate after finishing all required work. Thus remove
    i_iocount, and replace the actually needed waits with calls to
    inode_dio_wait.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
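
    The replacement is the generic VFS wait; for example (sketch):

      static void
      xfs_wait_for_dio_sketch(struct xfs_inode *ip)
      {
              /* drains inode->i_dio_count; replaces the old private
               * i_iocount wait */
              inode_dio_wait(VFS_I(ip));
      }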
     
  • We really shouldn't complete AIO or DIO requests until we have finished
    the unwritten extent conversion and size update. This means fsync never
    has to pick up any ioends as all work has been completed when signalling
    I/O completion.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

13 Aug, 2011

1 commit

  • Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the
    annoying subdirectories in the XFS source code. Besides the large
    number of file renames, the only changes are to the Makefile, a few
    files including headers with the subdirectory prefix, and the binary
    sysctl compat code that includes a header under fs/xfs/ from
    kernel/.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig