20 Jan, 2017

1 commit

  • commit 0a417b8dc1f10b03e8f558b8a831f07ec4c23795 upstream.

    Commit 99579ccec4e2 "xfs: skip dirty pages in ->releasepage()" started
    to skip dirty pages in xfs_vm_releasepage() which also has the effect
    that if a dirty page is truncated, it does not get freed by
    block_invalidatepage() and lingers in the LRU list waiting for reclaim.
    So a simple loop like:

    while true; do
        dd if=/dev/zero of=file bs=1M count=100
        rm file
    done

    will keep using more and more memory until we hit low watermarks and
    start pagecache reclaim, which will eventually reclaim the truncated
    pages as well. Keeping these truncated (and thus never usable) pages in
    memory is just a waste of memory, unnecessarily stresses page cache
    reclaim, and reportedly also leads to anonymous mmap(2) returning ENOMEM
    prematurely.

    So instead of just skipping dirty pages in xfs_vm_releasepage(), return
    to old behavior of skipping them only if they have delalloc or unwritten
    buffers and fix the spurious warnings by warning only if the page is
    clean.
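
    A minimal sketch of the resulting logic, simplified from the real
    xfs_vm_releasepage() (tracepoint and comments dropped; xfs_count_page_state()
    is the existing page-state helper in xfs_aops.c):

    STATIC int
    xfs_vm_releasepage(struct page *page, gfp_t gfp_mask)
    {
        int delalloc, unwritten;

        xfs_count_page_state(page, &delalloc, &unwritten);

        if (delalloc) {
            /* dirty pages may legitimately carry delalloc buffers */
            WARN_ON_ONCE(!PageDirty(page));
            return 0;
        }
        if (unwritten) {
            WARN_ON_ONCE(!PageDirty(page));
            return 0;
        }

        return try_to_free_buffers(page);
    }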

    CC: Brian Foster
    CC: Vlastimil Babka
    Reported-by: Petr Tůma
    Fixes: 99579ccec4e271c3d4d4e7c946058766812afdab
    Signed-off-by: Jan Kara
    Reviewed-by: Brian Foster
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

12 Jan, 2017

1 commit

  • commit 04197b341f23b908193308b8d63d17ff23232598 upstream.

    We've had reports of generic/095 causing XFS to BUG() in
    __xfs_get_blocks() due to the existence of delalloc blocks on a
    direct I/O read. generic/095 issues a mix of various types of I/O,
    including direct and memory mapped I/O to a single file. This is
    clearly not supported behavior and is known to lead to such
    problems. E.g., the lack of exclusion between the direct I/O and
    write fault paths means that a write fault can allocate delalloc
    blocks in a region of a file that was previously a hole after the
    direct read has attempted to flush/inval the file range, but before
    it actually reads the block mapping. In turn, the direct read
    discovers a delalloc extent and cannot proceed.

    While the appropriate solution here is to not mix direct and memory
    mapped I/O to the same regions of the same file, the current
    BUG_ON() behavior is probably overkill as it can crash the entire
    system. Instead, localize the failure to the I/O in question by
    returning an error for a direct I/O that cannot be handled safely
    due to delalloc blocks. Be careful to allow the case of a direct
    write to post-eof delalloc blocks. This can occur due to speculative
    preallocation and is safe as post-eof blocks are not accompanied by
    dirty pages in pagecache (conversely, preallocation within eof must
    have been zeroed, and thus dirtied, before the inode size could have
    been increased beyond said blocks).

    Finally, provide an additional warning if a direct I/O write occurs
    while the file is memory mapped. This may not catch all problematic
    scenarios, but provides a hint that some known-to-be-problematic I/O
    methods are in use.
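
    A sketch of the check this describes, as it might sit in __xfs_get_blocks()
    (control flow and variable names here are assumptions, not the exact patch):

    if (imap.br_startblock == DELAYSTARTBLOCK && direct) {
        /*
         * A direct write to post-eof delalloc (speculative
         * preallocation) is the only safe case; anything else
         * indicates mixed direct and mmap I/O, so fail just this
         * I/O instead of BUG()ing the whole system.
         */
        if (!create || offset < i_size_read(VFS_I(ip))) {
            WARN_ON_ONCE(1);
            error = -EIO;
            goto out_unlock;
        }
    }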

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     

11 Oct, 2016

1 commit

  • We need to splice COW blocks we've completed in xfs_end_io_direct_write
    into the data fork before converting unwritten extents. Otherwise
    xfs_bmapi_write might first allocate blocks for any holes in the data
    fork, which is not only unnecessary but also harmful, as it might cause
    reserved block underruns in the transaction.
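
    A sketch of the resulting completion ordering in xfs_end_io_direct_write
    (the XFS_DIO_FLAG_* names are assumed here): splice the CoW fork blocks
    into the data fork first, then convert unwritten extents.

    if (flags & XFS_DIO_FLAG_COW)
        error = xfs_reflink_end_cow(ip, offset, size);  /* splice CoW blocks first */
    if (!error && (flags & XFS_DIO_FLAG_UNWRITTEN))
        error = xfs_iomap_write_unwritten(ip, offset, size);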

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

06 Oct, 2016

3 commits

  • For O_DIRECT writes to shared blocks, we have to CoW them just like
    we would with buffered writes. For writes that are not block-aligned,
    just bounce them to the page cache.

    For block-aligned writes, however, we can do better than that. Use
    the same mechanisms that we employ for buffered CoW to set up a
    delalloc reservation, allocate all the blocks at once, issue the
    writes against the new blocks and use the same ioend functions to
    remap the blocks after the write. This should be fairly performant.
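
    A sketch of the alignment decision in the direct write path (the -EREMCHG
    "fall back to buffered I/O" convention is an assumption of this sketch):

    if ((iocb->ki_pos & mp->m_blockmask) ||
        ((iocb->ki_pos + count) & mp->m_blockmask)) {
        /*
         * Unaligned direct writes to shared extents cannot be CoWed
         * safely, so bounce them to the page cache.
         */
        if (xfs_is_reflink_inode(ip))
            return -EREMCHG;    /* caller falls back to buffered I/O */
    }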

    Christoph discovered that xfs_reflink_allocate_cow_range may stumble
    over invalid entries in the extent array given that it drops the ilock
    but still expects the index to be stable. Simply fixing it to do a new
    lookup for every iteration still isn't correct, given that
    xfs_bmapi_allocate will trigger a BUG_ON() if it hits a hole, and
    nothing prevents an xfs_bunmapi_cow call from removing extents once
    we have dropped the ilock either.

    This patch duplicates the inner loop of xfs_bmapi_allocate into a
    helper for xfs_reflink_allocate_cow_range so that it can be done under
    the same ilock critical section as our CoW fork delayed allocation.
    The directio CoW warts will be revisited in a later patch.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
     
  • Report shared extents through the iomap interface so that FIEMAP flags
    shared blocks accurately. Have xfs_vm_bmap return zero for reflinked
    files because the bmap-based swap code requires static block mappings,
    which is incompatible with copy on write.

    NOTE: Existing userspace bmap users such as lilo will have the same
    problem with reflink files.
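
    A sketch of the ->bmap change, simplified from the real xfs_vm_bmap()
    (locking and the writeback flush are omitted here):

    STATIC sector_t
    xfs_vm_bmap(struct address_space *mapping, sector_t block)
    {
        struct xfs_inode *ip = XFS_I(mapping->host);

        /*
         * The swap code (ab)uses ->bmap for static block mappings,
         * which copy on write cannot provide; 0 means "no mapping".
         */
        if (xfs_is_reflink_inode(ip))
            return 0;
        return generic_block_bmap(mapping, block, xfs_get_blocks);
    }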

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong

    Darrick J. Wong
     
  • After the write component of a copy-write operation finishes, clean up
    the bookkeeping left behind. On error, we simply free the new blocks
    and pass the error up. If we succeed, however, then we must remove
    the old data fork mapping and move the cow fork mapping to the data
    fork.
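
    A sketch of the ioend completion bookkeeping described here (the reflink
    helper names are the XFS ones; the surrounding error handling is simplified):

    if (ioend->io_type == XFS_IO_COW) {
        if (error)
            /* write failed: just free the freshly allocated blocks */
            xfs_reflink_cancel_cow_range(ip, offset, size);
        else
            /* replace the old data fork mapping with the CoW mapping */
            error = xfs_reflink_end_cow(ip, offset, size);
    }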

    Signed-off-by: Darrick J. Wong
    [hch: Call the CoW failure function during xfs_cancel_ioend]
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
     

05 Oct, 2016

2 commits

  • Modify the writepage handler to find and convert pending delalloc
    extents to real allocations. Furthermore, when we're doing non-cow
    writes to a part of a file that already has a CoW reservation (the
    cowextsz hint that we set up in a subsequent patch facilitates this),
    promote the write to copy-on-write so that the entire extent can get
    written out as a single extent on disk, thereby reducing post-CoW
    fragmentation.

    Christoph moved the CoW support code in _map_blocks to a separate helper
    function, refactored other functions, and reduced the number of CoW fork
    lookups, so I merged those changes here to reduce churn.
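
    A sketch of the writeback-time decision (the CoW fork lookup helper's
    signature is simplified here):

    /* promote the write to CoW if the range already has a CoW reservation */
    if (xfs_is_reflink_inode(ip) &&
        xfs_reflink_find_cow_mapping(ip, offset, &imap))
        wpc->io_type = XFS_IO_COW;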

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
     
  • Modify xfs_bmap_add_extent_delay_real() so that we can convert delayed
    allocation extents in the CoW fork to real allocations, and wire this
    up all the way back to xfs_iomap_write_allocate(). In a subsequent
    patch, we'll modify the writepage handler to call this.
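
    A sketch of the wired-up call (exact prototype assumed): writeback now
    passes the fork it is converting, so CoW fork delalloc can be converted too.

    /* convert a delayed allocation in the given fork to real blocks */
    error = xfs_iomap_write_allocate(ip, XFS_COW_FORK, offset, &imap);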

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

19 Sep, 2016

1 commit

  • Rename the current function to __xfs_setfilesize and add a non-static
    wrapper that also takes care of creating the transaction. This new
    helper will be used by the new iomap-based DAX path.
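
    A sketch of the wrapper described here (error handling trimmed):

    int
    xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size)
    {
        struct xfs_mount  *mp = ip->i_mount;
        struct xfs_trans  *tp;
        int               error;

        /* the wrapper owns transaction allocation ... */
        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
        if (error)
            return error;

        /* ... and the renamed static helper does the actual size update */
        return __xfs_setfilesize(ip, tp, offset, size);
    }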

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

28 Jul, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "The major addition is the new iomap based block mapping
    infrastructure. We've been kicking this about locally for years, but
    there are other filesystems that want to use it too (e.g. gfs2). Now
    it is fully working, reviewed, and ready to be merged and used by
    other filesystems.

    There are a lot of other fixes and cleanups in the tree, but those are
    XFS internal things and none are of the scale or visibility of the
    iomap changes. See below for details.

    I am likely to send another pull request next week - we're just about
    ready to merge some new functionality (on disk block->owner reverse
    mapping infrastructure), but that's a huge chunk of code (74 files
    changed, 7283 insertions(+), 1114 deletions(-)) so I'm keeping that
    separate to all the "normal" pull request changes so they don't get
    lost in the noise.

    Summary of changes in this update:
    - generic iomap based IO path infrastructure
    - generic iomap based fiemap implementation
    - xfs iomap based IO path implementation
    - buffer error handling fixes
    - tracking of in flight buffer IO for unmount serialisation
    - direct IO and DAX io path separation and simplification
    - shortform directory format definition changes for wider platform
    compatibility
    - various buffer cache fixes
    - cleanups in preparation for rmap merge
    - error injection cleanups and fixes
    - log item format buffer memory allocation restructuring to prevent
    rare OOM reclaim deadlocks
    - sparse inode chunks are now fully supported"

    * tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (53 commits)
    xfs: remove EXPERIMENTAL tag from sparse inode feature
    xfs: bufferhead chains are invalid after end_page_writeback
    xfs: allocate log vector buffers outside CIL context lock
    libxfs: directory node splitting does not have an extra block
    xfs: remove dax code from object file when disabled
    xfs: skip dirty pages in ->releasepage()
    xfs: remove __arch_pack
    xfs: kill xfs_dir2_inou_t
    xfs: kill xfs_dir2_sf_off_t
    xfs: split direct I/O and DAX path
    xfs: direct calls in the direct I/O path
    xfs: stop using generic_file_read_iter for direct I/O
    xfs: split xfs_file_read_iter into buffered and direct I/O helpers
    xfs: remove s_maxbytes enforcement in xfs_file_read_iter
    xfs: kill ioflags
    xfs: don't pass ioflags around in the ioctl path
    xfs: track and serialize in-flight async buffers against unmount
    xfs: exclude never-released buffers from buftarg I/O accounting
    xfs: don't reset b_retries to 0 on every failure
    xfs: remove extraneous buffer flag changes
    ...

    Linus Torvalds
     

22 Jul, 2016

3 commits

  • Dave Chinner
     
  • In xfs_finish_page_writeback(), we have a loop that looks like this:

    do {
        if (off < bvec->bv_offset)
            goto next_bh;
        if (off > end)
            break;
        bh->b_end_io(bh, !error);
    next_bh:
        off += bh->b_size;
    } while ((bh = bh->b_this_page) != head);

    The b_end_io function is end_buffer_async_write(), which will call
    end_page_writeback() once all the buffers have been marked as no longer
    under IO. The issue here is that the only thing currently
    protecting both the bufferhead chain and the page from being
    reclaimed is the PageWriteback state held on the page.

    While we attempt to limit the loop to just the buffers covered by
    the IO, we still read from the buffer size and follow the next
    pointer in the bufferhead chain. There is no guarantee that either
    of these are valid after the PageWriteback flag has been cleared.
    Hence, loops like this are completely unsafe, and result in
    use-after-free issues. One such problem was caught by Calvin Owens
    with KASAN:

    .....
    INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1
    free_buffer_head+0x41/0x90
    __slab_free+0x1ed/0x340
    kmem_cache_free+0x270/0x300
    free_buffer_head+0x41/0x90
    try_to_free_buffers+0x171/0x240
    xfs_vm_releasepage+0xcb/0x3b0
    try_to_release_page+0x106/0x190
    shrink_page_list+0x118e/0x1a10
    shrink_inactive_list+0x42c/0xdf0
    shrink_zone_memcg+0xa09/0xfa0
    shrink_zone+0x2c3/0xbc0
    .....
    Call Trace:
    [] dump_stack+0x68/0x94
    [] print_trailer+0x115/0x1a0
    [] object_err+0x34/0x40
    [] kasan_report_error+0x217/0x530
    [] __asan_report_load8_noabort+0x43/0x50
    [] xfs_destroy_ioend+0x3bf/0x4c0
    [] xfs_end_bio+0x154/0x220
    [] bio_endio+0x158/0x1b0
    [] blk_update_request+0x18b/0xb80
    [] scsi_end_request+0x97/0x5a0
    [] scsi_io_completion+0x438/0x1690
    [] scsi_finish_command+0x375/0x4e0
    [] scsi_softirq_done+0x280/0x340

    Where the access is occurring during IO completion, after the buffer
    has been freed by direct memory reclaim.

    Prevent use-after-free accidents in this end_io processing loop by
    pre-calculating the loop conditionals before calling bh->b_end_io().
    The loop is already limited to just the bufferheads covered by the
    IO in progress, so the offset checks are sufficient to prevent
    accessing buffers in the chain after end_page_writeback() has been
    called by the bh->b_end_io() callout.

    Yet another example of why Bufferheads Must Die.
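
    A sketch of the fixed loop: per the description above, everything the loop
    still needs is read before bh->b_end_io() can drop the last writeback
    reference (variable names are assumptions of this sketch):

    bsize = bh->b_size;            /* all buffers on the page are the same size */
    do {
        next = bh->b_this_page;    /* read before b_end_io() can free the chain */
        if (off < bvec->bv_offset)
            goto next_bh;
        if (off > end)
            break;
        bh->b_end_io(bh, !error);
    next_bh:
        off += bsize;
    } while ((bh = next) != head);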

    cc: # 4.7
    Signed-off-by: Dave Chinner
    Reported-and-Tested-by: Calvin Owens
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • XFS has had scattered reports of delalloc blocks present at
    ->releasepage() time. This results in a warning with a stack trace
    similar to the following:

    ...
    Call Trace:
    [] dump_stack+0x63/0x84
    [] warn_slowpath_common+0x97/0xe0
    [] warn_slowpath_null+0x1a/0x20
    [] xfs_vm_releasepage+0x10f/0x140
    [] ? page_mkclean_one+0xd0/0xd0
    [] ? anon_vma_prepare+0x150/0x150
    [] try_to_release_page+0x32/0x50
    [] shrink_active_list+0x3ce/0x3e0
    [] shrink_lruvec+0x687/0x7d0
    [] shrink_zone+0xdc/0x2c0
    [] kswapd+0x4f9/0x970
    [] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
    [] kthread+0xc9/0xe0
    [] ? kthread_stop+0x100/0x100
    [] ret_from_fork+0x3f/0x70
    [] ? kthread_stop+0x100/0x100

    This occurs because it is possible for shrink_active_list() to send
    pages marked dirty to ->releasepage() when certain buffer_head threshold
    conditions are met. shrink_active_list() doesn't check the page dirty
    state, apparently to handle an old ext3 corner case where in some cases
    clean pages would not have had the dirty bit cleared; thus it is up to
    the filesystem to determine how to handle the page.

    XFS currently handles the delalloc case properly, but this behavior
    makes the warning spurious. Update the XFS ->releasepage() handler to
    explicitly skip dirty pages. Retain the existing delalloc/unwritten
    checks so we continue to warn if such buffers exist on clean pages when
    they shouldn't.

    Diagnosed-by: Dave Chinner
    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     

20 Jul, 2016

2 commits


21 Jun, 2016

2 commits


08 Jun, 2016

2 commits


27 May, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "A pretty average collection of fixes, cleanups and improvements in
    this request.

    Summary:
    - fixes for mount line parsing, sparse warnings, read-only compat
    feature remount behaviour
    - allow fast path symlink lookups for inline symlinks.
    - attribute listing cleanups
    - writeback goes direct to bios rather than indirecting through
    bufferheads
    - transaction allocation cleanup
    - optimised kmem_realloc
    - added configurable error handling for metadata write errors,
    changed default error handling behaviour from "retry forever" to
    "retry until unmount then fail"
    - fixed several inode cluster writeback lookup vs reclaim race
    conditions
    - fixed inode cluster writeback checking wrong inode after lookup
    - fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
    - cleaned up inode reclaim tagging"

    * tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
    xfs: fix warning in xfs_finish_page_writeback for non-debug builds
    xfs: move reclaim tagging functions
    xfs: simplify inode reclaim tagging interfaces
    xfs: rename variables in xfs_iflush_cluster for clarity
    xfs: xfs_iflush_cluster has range issues
    xfs: mark reclaimed inodes invalid earlier
    xfs: xfs_inode_free() isn't RCU safe
    xfs: optimise xfs_iext_destroy
    xfs: skip stale inodes in xfs_iflush_cluster
    xfs: fix inode validity check in xfs_iflush_cluster
    xfs: xfs_iflush_cluster fails to abort on error
    xfs: remove xfs_fs_evict_inode()
    xfs: add "fail at unmount" error handling configuration
    xfs: add configuration handlers for specific errors
    xfs: add configuration of error failure speed
    xfs: introduce table-based init for error behaviors
    xfs: add configurable error support to metadata buffers
    xfs: introduce metadata IO error class
    xfs: configurable error behavior via sysfs
    xfs: buffer ->bi_end_io function requires irq-safe lock
    ...

    Linus Torvalds
     

20 May, 2016

2 commits


02 May, 2016

1 commit


06 Apr, 2016

4 commits

  • Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
    that returns a transaction with all the required log and block
    reservations, and allows passing transaction flags directly to avoid the
    cumbersome _xfs_trans_alloc interface.

    While we're at it we also get rid of the transaction type argument, which
    has been superfluous since we stopped supporting the non-CIL logging mode.
    The guts of it will be removed in another patch.

    [dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]
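
    A sketch of the new calling convention (the reservation chosen here is
    just an example):

    struct xfs_trans *tp;
    int              error;

    /* allocate the transaction and take log/block reservations in one call */
    error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
    if (error)
        return error;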

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • This patch implements two closely related changes: First, it embeds
    a bio in the ioend structure so that we don't have to allocate one
    separately. Second it uses the block layer bio chaining mechanism
    to chain additional bios off this first one if needed instead of
    manually accounting for multiple bio completions in the ioend
    structure. Together this removes a memory allocation per ioend and
    greatly simplifies the ioend setup and I/O completion path.
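
    A sketch of the resulting structure layout (field names are assumed from
    the description above):

    struct xfs_ioend {
        struct list_head    io_list;        /* next ioend in chain */
        unsigned int        io_type;        /* delalloc / unwritten */
        struct inode        *io_inode;      /* file being written to */
        size_t              io_size;        /* size of the extent */
        xfs_off_t           io_offset;      /* offset in the file */
        struct work_struct  io_work;        /* completion work item */
        struct xfs_trans    *io_append_trans;
        struct bio          *io_bio;        /* bio currently being built */
        struct bio          io_inline_bio;  /* embedded bio, must be last */
    };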

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Completion of an ioend requires us to walk the bufferhead list to
    end writeback on all the bufferheads. This, in turn, is needed so
    that we can end writeback on all the pages we just did IO on.

    To remove our dependency on bufferheads in writeback, we need to
    turn this around the other way - we need to walk the pages we've
    just completed IO on, and then walk the buffers attached to the
    pages and complete their IO. In doing this, we remove the
    requirement for the ioend to track bufferheads directly.

    To enable IO completion to walk all the pages we've submitted IO on,
    we need to keep the bios that we used for IO around until the ioend
    has been completed. We can do this simply by chaining the bios to
    the ioend at completion time, and then walking their pages directly
    just before destroying the ioend.
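
    A sketch of the completion-time walk this enables (simplified; chaining
    the additional bios through bi_private is an assumption of this sketch):

    for (bio = &ioend->io_inline_bio; bio; bio = next) {
        struct bio_vec  *bvec;
        int             i;

        /* the last bio's bi_private points back at the ioend */
        next = (bio == last) ? NULL : bio->bi_private;

        /* end writeback on every page the bio wrote */
        bio_for_each_segment_all(bvec, bio, i)
            xfs_finish_page_writeback(inode, bvec, error);

        bio_put(bio);
    }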

    Signed-off-by: Dave Chinner
    [hch: changed the xfs_finish_page_writeback calling convention]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Currently adding a buffer to the ioend and then building a bio from
    the buffer list are two separate operations. We don't build the bios
    and submit them until the ioend is submitted, and this places a
    fixed dependency on bufferhead chaining in the ioend.

    The first step to removing the bufferhead chaining in the ioend is
    on the IO submission side. We can build the bio directly as we add
    the buffers to the ioend chain, thereby removing the need for a
    later "buffer-to-bio" submission loop. This allows us to submit
    bios on large ioends as soon as we cannot add more data to the bio.

    These bios then get captured by the active plug, and hence will be
    dispatched as soon as either the plug overflows or we schedule away
    from the writeback context. This will reduce submission latency for
    large IOs, but will also allow more timely request queue based
    writeback blocking when the device becomes congested.
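
    A sketch of the submission-side change (helper names assumed): the bio is
    grown as buffers are added to the ioend, and submitted the moment it fills.

    /* add the buffer to the bio being built; submit and chain a new bio if full */
    while (bio_add_page(wpc->ioend->io_bio, bh->b_page, bh->b_size,
                        bh_offset(bh)) != bh->b_size)
        xfs_chain_bio(wpc->ioend, wbc, bh);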

    Signed-off-by: Dave Chinner
    [hch: various small updates]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

05 Apr, 2016

2 commits

  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using the
    script below. For some reason, coccinelle doesn't patch header files, so
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

22 Mar, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "There's quite a lot in this request, and there's some cross-over with
    ext4, dax and quota code due to the nature of the changes being made.

    As for the rest of the XFS changes, there are lots of little things
    all over the place, which add up to a lot of changes in the end.

    The major changes are that we've reduced the size of the struct
    xfs_inode by ~100 bytes (gives an inode cache footprint reduction of
    >10%), the writepage code now only does a single set of mapping tree
    lookups so it uses less CPU, delayed allocation reservations won't
    overrun under random write loads anymore, and we added compile time
    verification for on-disk structure sizes so we find out when a commit
    or platform/compiler change breaks the on disk structure as early as
    possible.

    Change summary:

    - error propagation for direct IO failures fixes for both XFS and
    ext4
    - new quota interfaces and XFS implementation for iterating all the
    quota IDs in the filesystem
    - locking fixes for real-time device extent allocation
    - reduction of duplicate information in the xfs and vfs inode, saving
    roughly 100 bytes of memory per cached inode.
    - buffer flag cleanup
    - rework of the writepage code to use the generic write clustering
    mechanisms
    - several fixes for inode flag based DAX enablement
    - rework of remount option parsing
    - compile time verification of on-disk format structure sizes
    - delayed allocation reservation overrun fixes
    - lots of little error handling fixes
    - small memory leak fixes
    - enable xfsaild freezing again"

    * tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (66 commits)
    xfs: always set rvalp in xfs_dir2_node_trim_free
    xfs: ensure committed is initialized in xfs_trans_roll
    xfs: borrow indirect blocks from freed extent when available
    xfs: refactor delalloc indlen reservation split into helper
    xfs: update freeblocks counter after extent deletion
    xfs: debug mode forced buffered write failure
    xfs: remove impossible condition
    xfs: check sizes of XFS on-disk structures at compile time
    xfs: ioends require logically contiguous file offsets
    xfs: use named array initializers for log item dumping
    xfs: fix computation of inode btree maxlevels
    xfs: reinitialise per-AG structures if geometry changes during recovery
    xfs: remove xfs_trans_get_block_res
    xfs: fix up inode32/64 (re)mount handling
    xfs: fix format specifier , should be %llx and not %llu
    xfs: sanitize remount options
    xfs: convert mount option parsing to tokens
    xfs: fix two memory leaks in xfs_attr_list.c error paths
    xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
    xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
    ...

    Linus Torvalds
     

16 Mar, 2016

2 commits

  • Now that migration doesn't clear page->mem_cgroup of live pages anymore,
    it's safe to make lock_page_memcg() and the memcg stat functions take
    pages, and spare the callers from memcg objects.

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches tag the page cache radix tree eviction entries with the
    memcg an evicted page belonged to, thus making per-cgroup LRU reclaim
    work properly and be as adaptive to new cache workingsets as global
    reclaim already is.

    This should have been part of the original thrash detection patch
    series, but was deferred due to the complexity of those patches.

    This patch (of 5):

    So far the only sites that needed to exclude charge migration to
    stabilize page->mem_cgroup have been per-cgroup page statistics, hence
    the name mem_cgroup_begin_page_stat(). But per-cgroup thrash detection
    will add another site that needs to ensure page->mem_cgroup lifetime.

    Rename these locking functions to the more generic lock_page_memcg() and
    unlock_page_memcg(). Since charge migration is a cgroup1 feature only,
    we might be able to delete it at some point, along with these now
    easy-to-identify locking sites.

    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

15 Mar, 2016

2 commits

  • Dave Chinner
     
  • Add a DEBUG mode-only sysfs knob to enable forced buffered write
    failure. An additional side effect of this mode is brute force killing
    of delayed allocation blocks in the range of the write. The latter is
    the prime motivation behind this patch, as userspace test
    infrastructure requires a reliable mechanism to create and split
    delalloc extents without causing extent conversion.

    Certain fallocate operations (i.e., zero range) were used for this in
    the past, but the implementations have changed such that delalloc
    extents are flushed and converted to real blocks, rendering the test
    useless.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Brian Foster
     

07 Mar, 2016

3 commits


28 Feb, 2016

2 commits

  • Previously calls to dax_writeback_mapping_range() for all DAX filesystems
    (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().

    dax_writeback_mapping_range() needs a struct block_device, and it used
    to get that from inode->i_sb->s_bdev. This is correct for normal inodes
    mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
    block devices and for XFS real-time files.

    Instead, call dax_writeback_mapping_range() directly from the filesystem
    ->writepages function so that it can supply us with a valid block
    device. This also fixes DAX code to properly flush caches in response
    to sync(2).
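
    A sketch of what the ->writepages hook looks like with this change
    (simplified: the regular buffered path is reduced to generic_writepages()
    here):

    STATIC int
    xfs_vm_writepages(struct address_space *mapping,
                      struct writeback_control *wbc)
    {
        xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);

        /* the filesystem knows the right bdev (real-time files, raw devices) */
        if (dax_mapping(mapping))
            return dax_writeback_mapping_range(mapping,
                    xfs_find_bdev_for_inode(mapping->host), wbc);

        return generic_writepages(mapping, wbc);
    }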

    Signed-off-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Cc: Al Viro
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • dax_clear_blocks() needs a valid struct block_device and previously it
    was using inode->i_sb->s_bdev in all cases. This is correct for normal
    inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
    DAX raw block devices and for XFS real-time devices.

    Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
    its arguments to take a bdev and a sector instead of an inode and a
    block. This better reflects what the function does, and it allows the
    filesystem and raw block device code to pass in an appropriate struct
    block_device.
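
    A sketch of a caller after the rename (the sector and size values are
    placeholders):

    /* the filesystem supplies the correct bdev and sector, not an inode/block */
    error = dax_clear_sectors(xfs_find_bdev_for_inode(VFS_I(ip)),
                              sector, size);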

    Signed-off-by: Ross Zwisler
    Suggested-by: Dan Williams
    Reviewed-by: Jan Kara
    Cc: Theodore Ts'o
    Cc: Al Viro
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

15 Feb, 2016

1 commit

  • Currently we can build a long ioend chain during ->writepages that
    gets attached to the writepage context. IO submission only then
    occurs when we finish all the writepage processing. This means we
    can have many ioends allocated and pending, and this violates the
    mempool guarantees that we need to give about forwards progress.
    i.e. we really should only have one ioend being built at a time,
    otherwise we may drain the mempool trying to allocate a new ioend
    and that blocks submission, completion and freeing of ioends that
    are already in progress.

    To prevent this situation from happening, we need to submit ioends
    for IO as soon as they are ready for dispatch rather than queuing
    them for later submission. This means the ioends have bios built
    immediately and they get queued on any plug that is currently active.
    Hence if we schedule away from writeback, the ioends that have been
    built will make forwards progress due to the plug flushing on
    context switch. This will also prevent context switches from
    creating unnecessary IO submission latency.

    We can't completely avoid having nested IO allocation - when we have
    a block size smaller than a page size, we still need to hold the
    ioend submission until after we have marked the current page dirty.
    Hence we may need multiple ioends to be held while the current page
    is completely mapped and made ready for IO dispatch. We cannot avoid
    this problem - the current code already has this ioend chaining
    within a page so we can mostly ignore that it occurs.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner