05 Apr, 2016

2 commits

  • Mostly direct substitution, with occasional adjustments or removal of
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether the
    PAGE_CACHE_* or the PAGE_* constant should be used in a particular
    case, especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too
    much breakage to be doable.

    Let's stop pretending that pages in the page cache are special. They
    are not.

    The changes are pretty straightforward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();
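
    As a purely hypothetical illustration (not part of the patch), a
    typical call site converts like this:

        /* Before: page-cache-specific macros and helpers. */
        index  = pos >> PAGE_CACHE_SHIFT;
        offset = pos & ~PAGE_CACHE_MASK;
        page_cache_get(page);
        page_cache_release(page);

        /* After: plain PAGE_* macros with get_page()/put_page(). */
        index  = pos >> PAGE_SHIFT;
        offset = pos & ~PAGE_MASK;
        get_page(page);
        put_page(page);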

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files, so I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

25 Mar, 2016

1 commit

  • Pull more nfsd updates from Bruce Fields:
    "Apologies for the previous request, which omitted the top 8 commits
    from my for-next branch (including the SCSI layout commits). Thanks
    to Trond for spotting my error!"

    This actually includes the new layout types, so here's that part of
    the pull message repeated:

    "Support for a new pnfs layout type from Christoph Hellwig. The new
    layout type is a variant of the block layout which uses SCSI features
    to offer improved fencing and device identification.

    Note this pull request also includes the client side of SCSI layout,
    with Trond's permission"

    * tag 'nfsd-4.6-1' of git://linux-nfs.org/~bfields/linux:
    nfsd: use short read as well as i_size to set eof
    nfsd: better layoutupdate bounds-checking
    nfsd: block and scsi layout drivers need to depend on CONFIG_BLOCK
    nfsd: add SCSI layout support
    nfsd: move some blocklayout code
    nfsd: add a new config option for the block layout driver
    nfs/blocklayout: add SCSI layout support
    nfs4.h: add SCSI layout definitions

    Linus Torvalds
     

22 Mar, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "There's quite a lot in this request, and there's some cross-over with
    ext4, dax and quota code due to the nature of the changes being made.

    As for the rest of the XFS changes, there are lots of little things
    all over the place, which add up to a lot of changes in the end.

    The major changes are that we've reduced the size of the struct
    xfs_inode by ~100 bytes (giving an inode cache footprint reduction of
    >10%), the writepage code now only does a single set of mapping tree
    lookups so it uses less CPU, delayed allocation reservations won't
    overrun under random write loads anymore, and we added compile-time
    verification of on-disk structure sizes so we find out as early as
    possible when a commit or platform/compiler change breaks the on-disk
    structure.

    Change summary:

    - error propagation for direct IO failures fixes for both XFS and
    ext4
    - new quota interfaces and XFS implementation for iterating all the
    quota IDs in the filesystem
    - locking fixes for real-time device extent allocation
    - reduction of duplicate information in the xfs and vfs inode, saving
    roughly 100 bytes of memory per cached inode.
    - buffer flag cleanup
    - rework of the writepage code to use the generic write clustering
    mechanisms
    - several fixes for inode flag based DAX enablement
    - rework of remount option parsing
    - compile time verification of on-disk format structure sizes
    - delayed allocation reservation overrun fixes
    - lots of little error handling fixes
    - small memory leak fixes
    - enable xfsaild freezing again"

    * tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (66 commits)
    xfs: always set rvalp in xfs_dir2_node_trim_free
    xfs: ensure committed is initialized in xfs_trans_roll
    xfs: borrow indirect blocks from freed extent when available
    xfs: refactor delalloc indlen reservation split into helper
    xfs: update freeblocks counter after extent deletion
    xfs: debug mode forced buffered write failure
    xfs: remove impossible condition
    xfs: check sizes of XFS on-disk structures at compile time
    xfs: ioends require logically contiguous file offsets
    xfs: use named array initializers for log item dumping
    xfs: fix computation of inode btree maxlevels
    xfs: reinitialise per-AG structures if geometry changes during recovery
    xfs: remove xfs_trans_get_block_res
    xfs: fix up inode32/64 (re)mount handling
    xfs: fix format specifier , should be %llx and not %llu
    xfs: sanitize remount options
    xfs: convert mount option parsing to tokens
    xfs: fix two memory leaks in xfs_attr_list.c error paths
    xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
    xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
    ...

    Linus Torvalds
     

18 Mar, 2016

2 commits

  • This is a simple extension to the block layout driver to use SCSI
    persistent reservations for access control and fencing, as well as
    SCSI VPD pages for device identification.

    For this we need to pass the nfs4_client to the proc_getdeviceinfo method
    to generate the reservation key, and add a new fence_client method
    to allow for fence actions in the layout driver.
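
    A sketch of the resulting operations table (the declarations and exact
    signatures here are illustrative assumptions, not the verbatim kernel
    definitions):

        struct super_block;
        struct nfs4_client;
        struct nfsd4_getdeviceinfo;
        struct nfs4_layout_stateid;

        struct nfsd4_layout_ops {
                /* Now receives the client so a per-client reservation
                 * key can be generated for SCSI persistent reservations. */
                int (*proc_getdeviceinfo)(struct super_block *sb,
                                          struct nfs4_client *clp,
                                          struct nfsd4_getdeviceinfo *gdp);

                /* New: fence a client off the device. */
                void (*fence_client)(struct nfs4_layout_stateid *ls);
        };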

    Signed-off-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields

    Christoph Hellwig
     
  • Split the config symbols into a generic pNFS one, which is invisible
    and gets selected by the layout drivers, and one for the block layout
    driver.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields

    Christoph Hellwig
     

16 Mar, 2016

2 commits

  • Now that migration doesn't clear page->mem_cgroup of live pages anymore,
    it's safe to make lock_page_memcg() and the memcg stat functions take
    pages, and spare the callers from memcg objects.
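
    Schematically, for a caller updating a per-memcg page statistic (the
    stat-item name below is a placeholder, not the exact kernel enum):

        /* Before: the lock functions handed back a memcg that every
         * subsequent call had to carry around. */
        memcg = lock_page_memcg(page);
        mem_cgroup_dec_page_stat(memcg, MEMCG_STAT_ITEM);
        unlock_page_memcg(memcg);

        /* After: the page alone is enough. */
        lock_page_memcg(page);
        mem_cgroup_dec_page_stat(page, MEMCG_STAT_ITEM);
        unlock_page_memcg(page);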

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches tag the page cache radix tree eviction entries with the
    memcg an evicted page belonged to, thus making per-cgroup LRU reclaim
    work properly and be as adaptive to new cache workingsets as global
    reclaim already is.

    This should have been part of the original thrash detection patch
    series, but was deferred due to the complexity of those patches.

    This patch (of 5):

    So far the only sites that needed to exclude charge migration to
    stabilize page->mem_cgroup have been per-cgroup page statistics, hence
    the name mem_cgroup_begin_page_stat(). But per-cgroup thrash detection
    will add another site that needs to ensure page->mem_cgroup lifetime.

    Rename these locking functions to the more generic lock_page_memcg()
    and unlock_page_memcg(). Since charge migration is a cgroup1 feature
    only, we might be able to delete it at some point, along with these
    now easy-to-identify locking sites.

    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

15 Mar, 2016

7 commits

  • Dave Chinner
     
  • xfs_dir2_node_trim_free can return without setting the rvalp argument
    pointer. Initialize it to 0 at the beginning of the function and
    only update it to 1 if we succeeded in trimming a freespace block.
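
    In miniature, the pattern being applied (all names here are
    illustrative, not the real function's):

        static int trim_free(struct dir_ctx *ctx, int *rvalp)
        {
                int error;

                *rvalp = 0;                     /* default: nothing trimmed */

                if (!freespace_block_empty(ctx))
                        return 0;               /* *rvalp is already valid */

                error = remove_freespace_block(ctx);
                if (!error)
                        *rvalp = 1;             /* trimmed a freespace block */
                return error;
        }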

    Reported-by: Dan Carpenter
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Carlos Maiolino
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • __xfs_trans_roll() can return without setting the
    *committed argument; this was a problem for xfs_bmap_finish():

        int committed;          /* xact committed or not */
        ...
        error = __xfs_trans_roll(tp, ip, &committed);
        if (error) {
                ...
        if (committed) {

    and we tested an uninitialized "committed" variable on the
    error path. No caller is preserving "committed" state across
    calls to __xfs_trans_roll(), so just initialize committed inside
    the function to avoid future errors like this.

    Reported-by: Dan Carpenter
    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • xfs_bmap_del_extent() handles extent removal from the in-core and
    on-disk extent lists. When removing a delalloc range, it updates the
    indirect block reservation appropriately based on the removal. It
    currently enforces that the new indirect block reservation is less than
    or equal to the original. This is normally the case in all situations
    except for in certain cases when the removed range creates a hole in a
    single delalloc extent, thus splitting a single delalloc extent in two.

    It is possible with small enough extents to split an indlen==1 extent
    into two such slightly smaller extents. This leaves one extent with 0
    indirect blocks and leads to assert failures in other areas (e.g.,
    xfs_bunmapi() if the extent happens to be removed).

    Update the indlen distribution code to steal blocks from the deleted
    extent, if necessary, to satisfy the worst case total indirect
    reservation for the new extents. This is safe as the caller does not
    update the fdblocks counters until the extent is removed. Blocks stolen
    in this manner simply remain accounted as allocated, having ownership
    transferred from the data extent to an indirect reservation.

    As a precaution, fall back to the original reservation algorithm if the
    new indlen requirement is not met and warn if we end up with extents
    without any reservation at all to detect this more easily in the future.
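
    A compact sketch of the stealing logic (assumed names; the kernel
    helper differs in detail):

        typedef unsigned long long filblks_t;

        /*
         * Distribute the original indirect reservation (ores) between two
         * new extents with worst-case needs *indlen1 and *indlen2,
         * stealing up to 'avail' blocks from the deleted extent to cover
         * any shortfall. Returns the number of blocks stolen.
         */
        static filblks_t split_indlen(filblks_t ores, filblks_t *indlen1,
                                      filblks_t *indlen2, filblks_t avail)
        {
                filblks_t len1 = *indlen1, len2 = *indlen2;
                filblks_t nres = len1 + len2;   /* worst-case requirement */
                filblks_t stolen = 0;

                if (ores < nres && avail)
                        stolen = (nres - ores < avail) ? nres - ores : avail;
                ores += stolen;

                if (ores < nres) {
                        /* Still short: scale both sides back. */
                        len1 = len1 * ores / nres;
                        len2 = ores - len1;
                }

                *indlen1 = len1;
                *indlen2 = len2;
                return stolen;
        }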

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • The delayed allocation indirect reservation splitting code is not
    sufficient in some cases where a delalloc extent is split in two. In
    preparation for enhancements to this code, refactor the current indlen
    distribution algorithm into a new helper function.

    [dchinner: rename temp, temp2 variables]

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • xfs_bunmapi() currently updates the fdblocks counter, unreserves quota,
    etc. before the extent is deleted by xfs_bmap_del_extent(). The function
    has problems dividing up the indirect reserved blocks for scenarios
    where a single delalloc extent is split in two. Particularly, there
    aren't always enough blocks reserved for multiple extents in a single
    extent reservation.

    The solution to this problem is to allow the extent removal code to
    steal from the deleted extent to meet indirect reservation
    requirements. Move the block of code in xfs_bunmapi() that updates the
    fdblocks counter to after the call to xfs_bmap_del_extent() to allow
    the codepath to update the extent record before the free blocks are
    accounted. Also, reshuffle the code slightly so the delalloc
    accounting occurs near the xfs_bmap_del_extent() call to provide
    context for the comments.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • Add a DEBUG mode-only sysfs knob to enable forced buffered write
    failure. An additional side effect of this mode is brute-force killing
    of delayed allocation blocks in the range of the write. The latter is
    the prime motivation behind this patch, as userspace test
    infrastructure requires a reliable mechanism to create and split
    delalloc extents without causing extent conversion.

    Certain fallocate operations (i.e., zero range) were used for this in
    the past, but the implementations have changed such that delalloc
    extents are flushed and converted to real blocks, rendering the test
    useless.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Brian Foster
     

12 Mar, 2016

1 commit

  • Pull xfs fixes from Dave Chinner:
    "This is a fix for a regression introduced in 4.5-rc1 by the new torn
    log write detection code. The regression only affects people moving a
    clean filesystem between machines/kernels of different architecture
    (such as changing between 32 bit and 64 bit kernels), but this is the
    recommended (and only!) safe way to migrate a filesystem between
    architectures so we really need to ensure it works.

    The changes are larger than I'd prefer right at the end of the release
    cycle, but the majority of the change is just factoring code to enable
    the detection of a clean log at the correct time to avoid this issue.

    Changes:

    - Only perform torn log write detection on dirty logs. This prevents
    failures being detected due to a clean filesystem being moved
    between machines or kernels of different architectures (e.g. 32 ->
    64 bit, BE -> LE, etc). This fixes a regression introduced by the
    torn log write detection in 4.5-rc1"

    * tag 'xfs-for-linus-4.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
    xfs: only run torn log write detection on dirty logs
    xfs: refactor in-core log state update to helper
    xfs: refactor unmount record detection into helper
    xfs: separate log head record discovery from verification

    Linus Torvalds
     

09 Mar, 2016

3 commits


07 Mar, 2016

18 commits

  • Dave Chinner
     
  • Dave Chinner
     
  • Dave Chinner
     
  • We need to create a new ioend if the current writepage call isn't
    logically contiguous with the range contained in the previous ioend.
    Hopefully writepage gets called in order of increasing file offset.
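
    The shape of the check (illustrative; the allocation helper's name is
    assumed):

        /* Start a new ioend unless this offset directly extends the
         * previous one. */
        if (!ioend || offset != ioend->io_offset + ioend->io_size)
                ioend = alloc_new_ioend(inode, offset);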

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Dave Chinner
     
  • Dave Chinner
     
  • Dave Chinner
     
  • Dave Chinner
     
  • Dave Chinner
     
  • Dave Chinner
     
  • Dave Chinner
     
  • Use named array initializers for the string arrays used to dump log
    items, rather than depending on the order being maintained correctly.
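
    For reference, a generic example of the construct (names invented for
    illustration):

        enum item_type { ITEM_EFI, ITEM_EFD, ITEM_INODE };

        static const char *item_names[] = {
                [ITEM_EFI]   = "EFI",
                [ITEM_EFD]   = "EFD",
                [ITEM_INODE] = "inode",
        };

    Each string is now bound to its enum value explicitly, so reordering
    or extending the enum can no longer silently shift the labels.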

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Commit 88740da18[1] introduced a function to compute the maximum
    height of the inode btree back in 1994. Back then, apparently, the
    freespace and inode btrees shared the same geometry; however, it has
    long since been the case that the inode and freespace btrees have
    different record and key sizes. Therefore, we must use m_inobt_mnr if
    we want a correct calculation/log reservation/etc.

    (Yes, this bug has been around for 21 years and ten months.)

    (Yes, I was in middle school when this bug was committed.)

    [1] http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=88740da18ddd9d7ba3ebaa9502fefc6ef2fd19cd
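
    The essence of the fix, with an assumed helper name around the two
    real mount-struct fields:

        /* Before (wrong): freespace btree geometry sizing the inobt. */
        mp->m_in_maxlevels = compute_maxlevels(mp->m_alloc_mnr, nrecords);

        /* After (right): the inode btree's own minrecs. */
        mp->m_in_maxlevels = compute_maxlevels(mp->m_inobt_mnr, nrecords);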

    Historical-research-by: Dave Chinner
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • If a crash occurs immediately after a filesystem grow operation, the
    updated superblock geometry is found only in the log. After we
    recover the log, the superblock is reread and re-initialised and so
    has the new geometry in memory. If the new geometry has more AGs
    than prior to the grow operation, then the new AGs will not have
    in-memory xfs_perag structures associated with them.

    This will result in an oops when the first metadata buffer from a
    new AG is looked up in the buffer cache, as the block lies within
    the new geometry but then fails to find a perag structure on lookup.
    This is easily fixed by simply re-initialising the perag structures
    after re-reading the superblock at the conclusion of the first phase
    of log recovery.

    This, however, does not fix the case of log recovery requiring
    access to metadata in the newly grown space. Fortunately for us,
    because the in-core superblock has not been updated, this will
    result in detection of access beyond the end of the filesystem
    and so recovery will fail at that point. If this proves to be
    a problem, then we can address it separately to the current
    reported issue.

    Reported-by: Alex Lyakas
    Tested-by: Alex Lyakas
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • XFS uses CRC verification over a sub-range of the head of the log to
    detect and handle torn writes. This torn log write detection currently
    runs unconditionally at mount time, regardless of whether the log is
    dirty or clean. This is problematic in cases where a filesystem might
    end up being moved across different, incompatible (i.e., opposite
    byte-endianness) architectures.

    The problem lies in the fact that log data is not necessarily written
    in an architecture-independent format. For example, certain bits of
    data are written in native endian format. Further, the size of certain
    log data structures (e.g., struct xlog_rec_header) differs depending
    on the word size of the CPU. This leads to false positive CRC
    verification errors and, ultimately, failed mounts when a cleanly
    unmounted filesystem is mounted on a system with an incompatible
    architecture, from data that was written near the head of the log.

    Update the log head/tail discovery code to run torn write detection only
    when the log is not clean. This means something other than an unmount
    record resides at the head of the log and log recovery is imminent. It
    is a requirement to run log recovery on the same type of host that had
    written the content of the dirty log and therefore CRC failures are
    legitimate corruptions in that scenario.
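
    Conceptually (helper names assumed), the discovery path becomes:

        /* Locate the head record first; only bother verifying torn
         * writes if recovery will actually run. */
        error = find_head_record(log, &head_blk, &rhead);
        if (!error && !is_unmount_record(rhead))
                error = verify_torn_head_writes(log, &head_blk);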

    Reported-by: Jan Beulich
    Tested-by: Jan Beulich
    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • Once the record at the head of the log is identified and verified, the
    in-core log state is updated based on the record. This includes
    information such as the current head block and cycle, the start block of
    the last record written to the log, the tail lsn, etc.

    Once torn write detection is conditional, this logic will need to be
    reused. Factor the code to update the in-core log data structures into a
    new helper function. This patch does not change behavior.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • Once the mount sequence has identified the head and tail blocks of the
    physical log, the record at the head of the log is located and examined
    for an unmount record to determine if the log is clean. This currently
    occurs after torn write verification of the head region of the log.

    This must ultimately be separated from torn write verification and may
    need to be called again if the log head is walked back due to a torn
    write (to determine whether the new head record is an unmount record).
    Separate this logic into a new helper function. This patch does not
    change behavior.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • The code that locates the log record at the head of the log is buried in
    the log head verification function. This is fine when torn write
    verification occurs unconditionally, but this behavior is problematic
    for filesystems that might be moved across systems with different
    architectures.

    In preparation for separating examination of the log head for unmount
    records from torn write detection, lift the record location logic out of
    the log verification function and into the caller. This patch does not
    change behavior.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     

02 Mar, 2016

3 commits

  • Just use the t_blk_res field directly instead of obfuscating the
    reference behind a macro.
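
    Before/after, assuming the wrapper had the obvious one-line form:

        /* Before: a macro hides a plain field access. */
        #define xfs_trans_get_block_res(tp)   ((tp)->t_blk_res)
        res = xfs_trans_get_block_res(tp);

        /* After: just read the field. */
        res = tp->t_blk_res;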

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • inode32/inode64 allocator behavior with respect to mount, remount
    and growfs is a little tricky.

    The inode32 mount option should only enable the inode32 allocator
    heuristics if the filesystem is large enough for 64-bit inodes to
    exist. Today, it has this behavior on the initial mount, but a
    remount with inode32 unconditionally changes the allocation
    heuristics, even for a small fs.

    Also, an inode32 mounted small filesystem should transition to the
    inode32 allocator if the filesystem is subsequently grown to a
    sufficient size. Today that does not happen.

    This patch consolidates xfs_set_inode32 and xfs_set_inode64 into a
    single new function, and moves the "is the maximum inode number big
    enough to matter" test into that function, so it doesn't rely on the
    caller to get it right - which remount did not do, previously.
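
    In outline (all names assumed for illustration, except
    XFS_MAXINUMBER_32), the consolidated helper owns the size test:

        static void set_inode_alloc(struct mount *mp, unsigned agcount)
        {
                unsigned long long maxino = max_inode_number(mp, agcount);

                /* inode32 heuristics only matter once 64-bit inode
                 * numbers can actually exist in this geometry. */
                if (mp->opt_inode32 && maxino > XFS_MAXINUMBER_32)
                        enable_inode32_allocation(mp);
                else
                        enable_inode64_allocation(mp);
        }

    Because mount, remount and growfs would all call the same helper, a
    small filesystem grown past the 32-bit inode boundary picks up the
    inode32 heuristics automatically.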

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • busyp->bno is printed with a %llu format specifier when the
    intention is to print a hexadecimal value. Trivial fix to
    use %llx instead. Found with smatch static analysis:

    fs/xfs/xfs_discard.c:229 xfs_discard_extents() warn: '0x'
    prefix is confusing together with '%llu' specifier
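
    The bug class in miniature (self-contained):

        #include <stdio.h>

        int main(void)
        {
                unsigned long long bno = 0xdeadbeefULL;

                printf("bno = 0x%llu\n", bno); /* wrong: decimal after "0x" */
                printf("bno = 0x%llx\n", bno); /* right: hexadecimal digits */
                return 0;
        }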

    Signed-off-by: Colin Ian King
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Colin Ian King