24 Jan, 2016

1 commit

  • Pull final vfs updates from Al Viro:

    - The ->i_mutex wrappers (with small prereq in lustre)

    - a fix for too early freeing of symlink bodies on shmem (they need to
    be RCU-delayed) (-stable fodder)

    - followup to dedupe stuff merged this cycle

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: abort dedupe loop if fatal signals are pending
    make sure that freeing shmem fast symlinks is RCU-delayed
    wrappers for ->i_mutex access
    lustre: remove unused declaration

    Linus Torvalds
     

23 Jan, 2016

3 commits

  • To properly support the new DAX fsync/msync infrastructure filesystems
    need to call dax_pfn_mkwrite() so that DAX can track when user pages are
    dirtied.

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
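
    As a rough userspace illustration of the wrapper pattern described
    above (inode_foo(inode) being mutex_foo(&inode->i_mutex)), the sketch
    below uses a toy spinning "struct mutex" and a stripped-down
    "struct inode". Both are stand-ins assumed for the example, not the
    kernel definitions, and inode_lock_nested() is left out because the
    toy mutex has no lockdep subclasses.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for the kernel mutex: a spinning flag, not a sleeping lock. */
struct mutex {
    atomic_bool locked;
};

static inline bool mutex_trylock(struct mutex *m)
{
    /* Succeeds only if the flag was clear before the exchange. */
    return !atomic_exchange(&m->locked, true);
}

static inline void mutex_lock(struct mutex *m)
{
    while (!mutex_trylock(m))
        ; /* spin; the real mutex would sleep instead */
}

static inline void mutex_unlock(struct mutex *m)
{
    atomic_store(&m->locked, false);
}

static inline bool mutex_is_locked(struct mutex *m)
{
    return atomic_load(&m->locked);
}

/* Only the lock is modelled; the real struct inode has many more fields. */
struct inode {
    struct mutex i_mutex;
};

/* Each wrapper forwards to the matching mutex call on ->i_mutex. */
static inline void inode_lock(struct inode *inode)
{
    mutex_lock(&inode->i_mutex);
}

static inline void inode_unlock(struct inode *inode)
{
    mutex_unlock(&inode->i_mutex);
}

static inline bool inode_trylock(struct inode *inode)
{
    return mutex_trylock(&inode->i_mutex);
}

static inline bool inode_is_locked(struct inode *inode)
{
    return mutex_is_locked(&inode->i_mutex);
}
```

    Callers that go through the wrappers never name ->i_mutex, which is
    what lets the lock type behind them change later.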
     
  • Pull more xfs updates from Dave Chinner:
    "This is the second update for XFS that I mentioned in the original
    pull request last week.

    It contains a revert for a suspend regression in 4.4 and a fix for a
    long standing log recovery issue that has been further exposed by all
    the log recovery changes made in the original 4.5 merge.

    There is one more thing in this pull request - one that I forgot to
    merge into the origin. That is, pulling the XFS_IOC_FS[GS]ETXATTR
    ioctl up to the VFS level so that other filesystems can also use it
    for modifying project quota IDs

    Summary:

    - promotion of XFS_IOC_FS[GS]ETXATTR ioctl to the vfs level so that
    it can be shared with other filesystems. The ext4 project quota
    functionality is the first target for this. The commits in this
    series have not been updated with review or final SOB tags because
    the branch they were originally published in was needed by ext4.
    Those tags are:

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Dave Chinner

    - Revert a change that is causing suspend failures.

    - Fix a use-after-free that can occur on log mount failures. Been
    around forever, but now exposed by other changes to log recovery
    made in the first 4.5 merge"

    * tag 'xfs-for-linus-4.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
    xfs: log mount failures don't wait for buffers to be released
    Revert "xfs: clear PF_NOFREEZE for xfsaild kthread"
    xfs: introduce per-inode DAX enablement
    xfs: use FS_XFLAG definitions directly
    fs: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR promotion

    Linus Torvalds
     

19 Jan, 2016

4 commits

  • Dave Chinner
     
  • Recently I've been seeing xfs/051 fail on 1k block size filesystems.
    Trying to trace the events during the test led to the problem going
    away, indicating that it was a race condition that led to this
    ASSERT failure:

    XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 156
    .....
    [] xfs_free_perag+0x87/0xb0
    [] xfs_mountfs+0x4d9/0x900
    [] xfs_fs_fill_super+0x3bf/0x4d0
    [] mount_bdev+0x180/0x1b0
    [] xfs_fs_mount+0x15/0x20
    [] mount_fs+0x38/0x170
    [] vfs_kern_mount+0x67/0x120
    [] do_mount+0x218/0xd60
    [] SyS_mount+0x8b/0xd0

    When I finally caught it with tracing enabled, I saw that AG 2 had
    an elevated reference count and a buffer was responsible for it. I
    tracked down the specific buffer, and found that it was missing the
    final reference count release that would put it back on the LRU and
    hence be found by xfs_wait_buftarg() calls in the log mount failure
    handling.

    The last four traces for the buffer before the assert were (trimmed
    for relevance)

    kworker/0:1-5259 xfs_buf_iodone: hold 2 lock 0 flags ASYNC
    kworker/0:1-5259 xfs_buf_ioerror: hold 2 lock 0 error -5
    mount-7163 xfs_buf_lock_done: hold 2 lock 0 flags ASYNC
    mount-7163 xfs_buf_unlock: hold 2 lock 1 flags ASYNC

    This is an async write that is completing, so there's nobody waiting
    for it directly. Hence we call xfs_buf_relse() once all the
    processing is complete. That does:

    static inline void xfs_buf_relse(xfs_buf_t *bp)
    {
            xfs_buf_unlock(bp);
            xfs_buf_rele(bp);
    }

    Now, it's clear that mount is waiting on the buffer lock, and that
    it has been released by xfs_buf_relse() and gained by mount. This is
    expected, because at this point the mount process is in
    xfs_buf_delwri_submit() waiting for all the IO it submitted to
    complete.

    The mount process, however, is waiting on the lock for the buffer
    because it is in xfs_buf_delwri_submit(). This waits for IO
    completion, but it doesn't wait for the buffer reference owned by
    the IO to go away. The mount process collects all the completions,
    fails the log recovery, and the higher level code then calls
    xfs_wait_buftarg() to free all the remaining buffers in the
    filesystem.

    The issue is that on unlocking the buffer, the scheduler has decided
    that the mount process has higher priority than the kworker
    thread that is running the IO completion, and so immediately
    switched contexts to the mount process from the semaphore unlock
    code, hence preventing the kworker thread from finishing the IO
    completion and releasing the IO reference to the buffer.

    Hence by the time that xfs_wait_buftarg() is run, the buffer still
    has an active reference and so isn't on the LRU list that the
    function walks to free the remaining buffers. Hence we miss that
    buffer and continue onwards to tear down the mount structures,
    at which time we find a stray reference count on the perag
    structure. On a non-debug kernel, this will be ignored and the
    structure torn down and freed. Hence when the kworker thread is then
    rescheduled and the buffer released and freed, it will access a
    freed perag structure.

    The problem here is that when the log mount fails, we still need to
    quiesce the log to ensure that the IO workqueues have returned to
    idle before we run xfs_wait_buftarg(). By synchronising the
    workqueues, we ensure that all IO completions are fully processed,
    not just to the point where buffers have been unlocked. This ensures
    we don't end up in the situation above.

    cc: # 3.18
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
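
    The window between the unlock and the release can be modelled in a
    few lines of single-threaded C. This is only a sketch: the toy_*
    names, the rele_pending flag and the reference counts are
    assumptions made for illustration, and "draining" stands in for
    synchronising the IO workqueues before the teardown walk.

```c
#include <stdbool.h>

/* Toy buffer: 'hold' counts references, loosely following the trace output. */
struct toy_buf {
    int  hold;          /* assumed: 1 = baseline reference, 2 = plus in-flight IO */
    bool locked;
    bool rele_pending;  /* completion was preempted between unlock and rele */
};

/* First half of xfs_buf_relse(): lock waiters may run from here on. */
static void toy_io_unlock(struct toy_buf *bp)
{
    bp->locked = false;
    bp->rele_pending = true;
}

/* Second half: drop the reference owned by the IO. */
static void toy_io_rele(struct toy_buf *bp)
{
    bp->hold--;
    bp->rele_pending = false;
}

/* Stand-in for synchronising the workqueues: finish preempted completions. */
static void toy_drain_completions(struct toy_buf *bp)
{
    if (bp->rele_pending)
        toy_io_rele(bp);
}

/* Stand-in for the teardown walk: a buffer with an extra reference is missed. */
static bool toy_teardown_misses(const struct toy_buf *bp)
{
    return bp->hold > 1;
}
```

    Running the teardown check right after toy_io_unlock() reports a
    missed buffer; running it after toy_drain_completions() does not.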
     
  • This reverts commit 24ba16bb3d499c49974669cd8429c3e4138ab102 as it
    prevents machines from suspending. This regression occurs when the
    xfsaild is idle on entry to suspend, and so there is no activity to
    wake it from its idle sleep and hence see that it is supposed to
    freeze. Hence the freezer times out waiting for it and suspend is
    cancelled.

    There is no obvious fix for this short of freezing the filesystem
    properly, so revert this change for now.

    cc: # 4.4
    Signed-off-by: Dave Chinner
    Acked-by: Jiri Kosina
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Dave Chinner
     

15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to the "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

14 Jan, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "There's not a lot in this - the main addition is the CRC validation of
    the entire region of the log that will be recovered, along with
    several log recovery fixes. Most of the rest is small bug fixes and
    cleanups.

    I have three bug fixes still pending, all that address recently fixed
    regressions that I will send next week after they've had some time
    in for-next.

    Summary:
    - extensive CRC validation during log recovery
    - several log recovery bug fixes
    - Various DAX support fixes
    - AGFL size calculation fix
    - various cleanups in preparation for new functionality
    - project quota ENOSPC notification via netlink
    - tracing and debug improvements"

    * tag 'xfs-for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (26 commits)
    xfs: handle dquot buffer readahead in log recovery correctly
    xfs: inode recovery readahead can race with inode buffer creation
    xfs: eliminate committed arg from xfs_bmap_finish
    xfs: bmapbt checking on debug kernels too expensive
    xfs: add tracepoints to readpage calls
    xfs: debug mode log record crc error injection
    xfs: detect and trim torn writes during log recovery
    xfs: fix recursive splice read locking with DAX
    xfs: Don't use reserved blocks for data blocks with DAX
    XFS: Use a signed return type for suffix_kstrtoint()
    libxfs: refactor short btree block verification
    libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct
    libxfs: use a convenience variable instead of open-coding the fork
    xfs: fix log ticket type printing
    libxfs: make xfs_alloc_fix_freelist non-static
    xfs: make xfs_buf_ioend_async() static
    xfs: send warning of project quota to userspace via netlink
    xfs: get mp from bma->ip in xfs_bmap code
    xfs: print name of verifier if it fails
    libxfs: Optimize the loop for xfs_bitmap_empty
    ...

    Linus Torvalds
     

13 Jan, 2016

1 commit

  • Pull misc vfs updates from Al Viro:
    "All kinds of stuff. That probably should've been 5 or 6 separate
    branches, but by the time I'd realized how large and mixed that bag
    had become it had been too close to -final to play with rebasing.

    Some fs/namei.c cleanups there, memdup_user_nul() introduction and
    switching open-coded instances, burying long-dead code, whack-a-mole
    of various kinds, several new helpers for ->llseek(), assorted
    cleanups and fixes from various people, etc.

    One piece probably deserves special mention - Neil's
    lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets
    called without ->i_mutex and tries to avoid ever taking it. That, of
    course, means that it's not useful for any directory modifications,
    but things like getting inode attributes in nfsd readdirplus are fine
    with that. I really should've asked for moratorium on lookup-related
    changes this cycle, but since I hadn't done that early enough... I
    *am* asking for that for the coming cycle, though - I'm going to try
    and get conversion of i_mutex to rwsem with ->lookup() done under lock
    taken shared.

    There will be a patch closer to the end of the window, along the lines
    of the one Linus had posted last May - mechanical conversion of
    ->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
    inode_is_locked()/inode_lock_nested(). To quote Linus back then:

    -----
    | This is an automated patch using
    |
    | sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
    | sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
    | sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[ ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
    | sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
    | sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
    |
    | with a very few manual fixups
    -----

    I'm going to send that once the ->i_mutex-affecting stuff in -next
    gets mostly merged (or when Linus says he's about to stop taking
    merges)"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    nfsd: don't hold i_mutex over userspace upcalls
    fs:affs:Replace time_t with time64_t
    fs/9p: use fscache mutex rather than spinlock
    proc: add a reschedule point in proc_readfd_common()
    logfs: constify logfs_block_ops structures
    fcntl: allow to set O_DIRECT flag on pipe
    fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
    fs: xattr: Use kvfree()
    [s390] page_to_phys() always returns a multiple of PAGE_SIZE
    nbd: use ->compat_ioctl()
    fs: use block_device name vsprintf helper
    lib/vsprintf: add %*pg format specifier
    fs: use gendisk->disk_name where possible
    poll: plug an unused argument to do_poll
    amdkfd: don't open-code memdup_user()
    cdrom: don't open-code memdup_user()
    rsxx: don't open-code memdup_user()
    mtip32xx: don't open-code memdup_user()
    [um] mconsole: don't open-code memdup_user_nul()
    [um] hostaudio: don't open-code memdup_user()
    ...

    Linus Torvalds
     

12 Jan, 2016

4 commits

  • Pull vfs xattr updates from Al Viro:
    "Andreas' xattr cleanup series.

    It's a followup to his xattr work that went in last cycle; -0.5KLoC"

    * 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    xattr handlers: Simplify list operation
    ocfs2: Replace list xattr handler operations
    nfs: Move call to security_inode_listsecurity into nfs_listxattr
    xfs: Change how listxattr generates synthetic attributes
    tmpfs: listxattr should include POSIX ACL xattrs
    tmpfs: Use xattr handler infrastructure
    btrfs: Use xattr handler infrastructure
    vfs: Distinguish between full xattr names and proper prefixes
    posix acls: Remove duplicate xattr name definitions
    gfs2: Remove gfs2_xattr_acl_chmod
    vfs: Remove vfs_xattr_cmp

    Linus Torvalds
     
  • Dave Chinner
     
  • When we do dquot readahead in log recovery, we do not use a verifier
    as the underlying buffer may not have dquots in it. e.g. the
    allocation operation hasn't yet been replayed. Hence we do not want
    to fail recovery because we detect an operation to be replayed has
    not been run yet. This problem was addressed for inodes in commit
    d891400 ("xfs: inode buffers may not be valid during recovery
    readahead") but the problem was not recognised to exist for dquots
    and their buffers as the dquot readahead did not have a verifier.

    The result of not using a verifier is that when the buffer is then
    next read to replay a dquot modification, the dquot buffer verifier
    will only be attached to the buffer if *readahead is not complete*.
    Hence we can read the buffer, replay the dquot changes and then add
    it to the delwri submission list without it having a verifier
    attached to it. This then generates warnings in xfs_buf_ioapply(),
    which catches and warns about this case.

    Fix this and make it handle the same readahead verifier error cases
    as for inode buffers by adding a new readahead verifier that has a
    write operation as well as a read operation that marks the buffer as
    not done if any corruption is detected. Also make sure we don't run
    readahead if the dquot buffer has been marked as cancelled by
    recovery.

    This will result in readahead either succeeding and the buffer
    having a valid write verifier, or readahead failing and the buffer
    state requiring the subsequent read to resubmit the IO with the new
    verifier. In either case, this will result in the buffer always
    ending up with a valid write verifier on it.

    Note: we also need to fix the inode buffer readahead error handling
    to mark the buffer with EIO. Brian noticed during review that the code
    I copied from there was wrong, so fix it at the same time. Add comments
    linking the two functions that handle readahead verifier errors
    together so we don't forget this behavioural link in future.

    cc: # 3.12 - current
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • When we do inode readahead in log recovery, we can do the
    readahead before we've replayed the icreate transaction that stamps
    the buffer with inode cores. The inode readahead verifier catches
    this and marks the buffer as !done to indicate that it doesn't yet
    contain valid inodes.

    In adding buffer error notification (i.e. setting b_error = -EIO at
    the same time as we clear the done flag) to such a readahead
    verifier failure, we can then get subsequent inode recovery failing
    with this error:

    XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32

    This occurs when readahead completion races with icreate item replay
    such as:

    inode readahead
        find buffer
        lock buffer
        submit RA io
    ....
    icreate recovery
        xfs_trans_get_buffer
            find buffer
            lock buffer

    .....

        fails verifier
        clear XBF_DONE
        set bp->b_error = -EIO
        release and unlock buffer

    icreate initialises buffer
    marks buffer as done
    adds buffer to delayed write queue
    releases buffer

    At this point, we have an initialised inode buffer that is up to
    date but has an -EIO state registered against it. When we finally
    get to recovering an inode in that buffer:

    inode item recovery
        xfs_trans_read_buffer
            find buffer
            lock buffer
            sees XBF_DONE is set, returns buffer
        sees bp->b_error is set
            fail log recovery!

    Essentially, we need xfs_trans_get_buf_map() to clear the error status of
    the buffer when doing a lookup. This function returns uninitialised
    buffers, so the buffer returned can not be in an error state and
    none of the code that uses this function expects b_error to be set
    on return. Indeed, there is an ASSERT(!bp->b_error); in the
    transaction case in xfs_trans_get_buf_map() that would have caught
    this if log recovery used transactions....

    This patch firstly changes the inode readahead failure to set -EIO
    on the buffer, and secondly changes xfs_buf_get_map() to never
    return a buffer with an error state set so this first change doesn't
    cause unexpected log recovery failures.

    cc: # 3.12 - current
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
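
    The sequence above can be replayed with a toy buffer to show why a
    "get" lookup must not hand back a stale error. This is a sketch under
    assumptions: struct toy_buf, the toy_* helpers and the with_fix
    switch are hypothetical and only mirror the two fields the message
    talks about (the done flag and b_error).

```c
#include <errno.h>
#include <stdbool.h>

struct toy_buf {
    bool done;     /* models XBF_DONE */
    int  b_error;
};

/* Readahead completion when the verifier rejects the buffer contents. */
static void toy_ra_verifier_fail(struct toy_buf *bp)
{
    bp->done = false;
    bp->b_error = -EIO;
}

/*
 * "Get" lookup: the caller is about to initialise the buffer, so with the
 * fix applied it never comes back carrying an old error.
 */
static struct toy_buf *toy_buf_get(struct toy_buf *bp, bool with_fix)
{
    if (with_fix)
        bp->b_error = 0;
    return bp;
}

/* icreate replay stamps the buffer and marks it up to date. */
static void toy_icreate_init(struct toy_buf *bp)
{
    bp->done = true;
}

/* "Read" lookup: a done buffer is returned as-is and b_error is checked. */
static int toy_buf_read(struct toy_buf *bp)
{
    if (!bp->done) {
        /* Would reissue the IO here; the toy model assumes it succeeds. */
        bp->b_error = 0;
        bp->done = true;
    }
    return bp->b_error;
}
```

    Without the fix the final read returns -EIO even though the buffer
    was initialised; with it the read returns 0.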
     

11 Jan, 2016

1 commit

  • Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the
    associated comments were replicated several times across
    the attribute code, all dealing with what to do if the
    transaction was or wasn't committed.

    And in that replicated code, an ASSERT() test of an
    uninitialized variable occurs in several locations:

    error = xfs_attr_thing(&args);
    if (!error) {
            error = xfs_bmap_finish(&args.trans, args.flist,
                                    &committed);
    }
    if (error) {
            ASSERT(committed);

    If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish,
    never set "committed", and then test it in the ASSERT.

    Fix this up by moving the committed state internal to xfs_bmap_finish,
    and add a new inode argument. If an inode is passed in, it is passed
    through to __xfs_trans_roll() and joined to the transaction there if
    the transaction was committed.

    xfs_qm_dqalloc() was a little unique in that it called bjoin rather
    than ijoin, but as Dave points out we can detect the committed state
    by checking whether (*tpp != tp).

    Addresses-Coverity-Id: 102360
    Addresses-Coverity-Id: 102361
    Addresses-Coverity-Id: 102363
    Addresses-Coverity-Id: 102364
    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Eric Sandeen
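
    The (*tpp != tp) check mentioned above can be sketched on its own.
    The names below (toy_trans, toy_bmap_finish, toy_finish_committed)
    are hypothetical; the point is only that comparing the transaction
    pointer before and after the call removes the need for a separate
    "committed" out-parameter that could be left uninitialised.

```c
#include <stdbool.h>

struct toy_trans {
    int id;
};

/* Hypothetical finish step: when it has to commit, it rolls *tpp over. */
static int toy_bmap_finish(struct toy_trans **tpp, struct toy_trans *next,
                           bool must_commit)
{
    if (must_commit)
        *tpp = next;
    return 0;
}

/* Caller side: the committed state is derived, never passed back. */
static bool toy_finish_committed(struct toy_trans **tpp,
                                 struct toy_trans *next, bool must_commit)
{
    struct toy_trans *tp = *tpp;  /* transaction we came in with */
    int error = toy_bmap_finish(tpp, next, must_commit);

    return error == 0 && *tpp != tp;
}
```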
     

08 Jan, 2016

2 commits

  • For large sparse or fragmented files, checking every single entry in
    the bmapbt on every operation is prohibitively expensive. Especially
    as such checks rarely discover problems during normal operations on
    high extent count files. Our regression tests don't tend to exercise
    files with hundreds of thousands to millions of extents, so mostly
    this isn't noticed.

    However, trying to run things like xfs_mdrestore of large filesystem
    dumps on a debug kernel quickly becomes impossible as the CPU is
    completely burnt up repeatedly walking the sparse file bmapbt that
    is generated for every allocation that is made.

    Hence, if the file has more than 10,000 extents, just don't bother
    with walking the tree to check it exhaustively. The btree code has
    checks that ensure that the newly inserted/removed/modified record
    is correctly ordered, so the entire tree walk in these cases has
    limited additional value.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • This allows us to see page cache driven readahead in action as it
    passes through XFS. This helps to understand buffered read
    throughput problems such as readahead IO sizes being too small
    for the underlying device to reach max throughput.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

07 Jan, 2016

1 commit


05 Jan, 2016

4 commits

  • Dave Chinner
     
  • Dave Chinner
     
  • XFS now uses CRC verification over a limited section of the log to
    detect torn writes prior to a crash. This is difficult to test directly
    due to the timing and hardware requirements to cause a short write.

    Add a mechanism to inject CRC errors into log records to facilitate
    testing torn write detection during log recovery. This mechanism is
    dangerous and can result in filesystem corruption. Thus, it is only
    available in DEBUG mode for testing/development purposes. Set a non-zero
    value to the following sysfs entry to enable error injection:

    /sys/fs/xfs//log/log_badcrc_factor

    Once enabled, XFS intentionally writes an invalid CRC to a log record at
    some random point in the future based on the provided frequency. The
    filesystem immediately shuts down once the record has been written to
    the physical log to prevent metadata writeback (e.g., AIL insertion)
    once the log write completes. This helps reasonably simulate a torn
    write to the log as the affected record must be safe to discard. The
    next mount after the intentional shutdown requires log recovery and
    should detect and recover from the torn write.

    Note again that this _will_ result in data loss or worse. For testing
    and development purposes only!

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • Certain types of storage, such as persistent memory, do not provide
    sector atomicity for writes. This means that if a crash occurs while XFS
    is writing log records, only part of those records might make it to the
    storage. This is problematic because log recovery uses the cycle value
    packed at the top of each log block to locate the head/tail of the log.
    This can lead to CRC verification failures during log recovery and an
    unmountable fs for a filesystem that is otherwise consistent.

    Update log recovery to incorporate log record CRC verification as part
    of the head/tail discovery process. Once the head is located via the
    traditional algorithm, run a CRC-only pass over the records up to the
    head of the log. If CRC verification fails, assume that the records are
    torn as a matter of policy and trim the head block back to the start of
    the first bad record.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
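
    A minimal sketch of the "CRC-only pass, then trim the head" policy
    described above. It is not the recovery code: the record layout and
    helper names are invented for the example, and toy_checksum() is a
    simple stand-in rather than the CRC the log actually uses.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_log_rec {
    uint32_t crc;               /* checksum stored with the record */
    const unsigned char *data;  /* record payload */
    size_t len;
};

/* Stand-in checksum, assumed for the example only. */
static uint32_t toy_checksum(const unsigned char *p, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < len; i++)
        sum = sum * 31u + p[i];
    return sum;
}

/*
 * Verify records in [start, head). Returns the new head: the index of the
 * first record whose checksum does not match, or head if all records pass.
 */
static size_t toy_trim_torn_head(const struct toy_log_rec *recs,
                                 size_t start, size_t head)
{
    for (size_t i = start; i < head; i++) {
        if (toy_checksum(recs[i].data, recs[i].len) != recs[i].crc)
            return i;  /* treat this and everything after it as torn */
    }
    return head;
}
```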
     

04 Jan, 2016

17 commits

  • Rather than just being able to turn DAX on and off via a mount
    option, some applications may only want to enable DAX for certain
    performance critical files in a filesystem.

    This patch introduces a new inode flag to enable DAX in the v3 inode
    di_flags2 field. It adds support for setting and clearing flags in
    the di_flags2 field via the XFS_IOC_FSSETXATTR ioctl, and sets the
    S_DAX inode flag appropriately when it is seen.

    When this flag is set on a directory, it acts as an "inherit flag".
    That is, inodes created in the directory will automatically inherit
    the on-disk inode DAX flag, enabling administrators to set up
    directory hierarchies that automatically use DAX. Setting this flag
    on an empty root directory will make the entire filesystem use DAX
    by default.

    Signed-off-by: Dave Chinner

    Dave Chinner
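
    The inheritance rule described above can be sketched with a toy
    inode. Everything here is assumed for illustration: the flag value,
    the struct and the helper names are hypothetical, and the model
    applies the in-memory DAX state to non-directories only.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_DIFLAG2_DAX (1ULL << 0)  /* hypothetical bit position */

struct toy_inode {
    bool     is_dir;
    uint64_t di_flags2;  /* models the on-disk flags field */
    bool     s_dax;      /* models the in-memory "use DAX" state */
};

/* Derive the in-memory state from the on-disk flag (files only, by assumption). */
static void toy_apply_flags(struct toy_inode *ip)
{
    ip->s_dax = !ip->is_dir && (ip->di_flags2 & TOY_DIFLAG2_DAX) != 0;
}

/* New inodes pick up the DAX flag from the directory they are created in. */
static struct toy_inode toy_create(const struct toy_inode *parent, bool is_dir)
{
    struct toy_inode ip = { .is_dir = is_dir, .di_flags2 = 0, .s_dax = false };

    if (parent->di_flags2 & TOY_DIFLAG2_DAX)
        ip.di_flags2 |= TOY_DIFLAG2_DAX;
    toy_apply_flags(&ip);
    return ip;
}
```

    Because subdirectories inherit the flag as well, setting it once on
    an empty root propagates to everything created below it.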
     
  • Now that the ioctls have been hoisted up to the VFS level, use
    the VFS definitions directly and remove the XFS specific definitions
    completely. Userspace is going to have to handle the change of this
    interface separately, so removing the definitions from xfs_fs.h is
    not an issue here at all.

    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Hoist the ioctl definitions for the XFS_IOC_FS[SG]ETXATTR API from
    fs/xfs/libxfs/xfs_fs.h to include/uapi/linux/fs.h so that the ioctls
    can be used by all filesystems, not just XFS. This enables
    (initially) ext4 to use the ioctl to set project IDs on inodes.

    Based-on-patch-from: Li Xi
    Signed-off-by: Dave Chinner

    Dave Chinner
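
    From userspace, the hoisted interface might be used roughly as
    below. This is a sketch that assumes kernel headers new enough to
    provide struct fsxattr and the FS_IOC_FS[GS]ETXATTR definitions in
    <linux/fs.h>; the two helper names are made up for the example, and
    whether a given filesystem accepts the call is not something the
    sketch can guarantee.

```c
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSGETXATTR, FS_IOC_FSSETXATTR */
#include <stdint.h>
#include <sys/ioctl.h>

/* Read the project ID of the open file; returns 0 on success, -1 on error. */
static int get_project_id(int fd, uint32_t *projid)
{
    struct fsxattr fsx;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        return -1;
    *projid = fsx.fsx_projid;
    return 0;
}

/* Set the project ID; read-modify-write so the other fields are preserved. */
static int set_project_id(int fd, uint32_t projid)
{
    struct fsxattr fsx;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        return -1;
    fsx.fsx_projid = projid;
    return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}
```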
     
  • Doing a splice read (generic/249) generates a lockdep splat because
    we recursively lock the inode iolock in this path:

    SyS_sendfile64
    do_sendfile
    do_splice_direct
    splice_direct_to_actor
    do_splice_to
    xfs_file_splice_read <<<<<< lock here
    default_file_splice_read
    vfs_readv
    do_readv_writev
    do_iter_readv_writev
    xfs_file_read_iter <<<<<< then here

    The issue here is that for DAX inodes we need to avoid the page
    cache path and hence simply push it into the normal read path.
    Unfortunately, we can't tell down at xfs_file_read_iter() whether we
    are being called from the splice path and hence we cannot avoid the
    locking at this layer. Hence we simply have to drop the inode
    locking at the higher splice layer for DAX.

    Signed-off-by: Dave Chinner
    Tested-by: Ross Zwisler
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Commit 1ca1915 ("xfs: Don't use unwritten extents for DAX") enabled
    the DAX allocation call to dip into the reserve pool in case it was
    converting unwritten extents rather than allocating blocks. This was
    a direct copy of the unwritten extent conversion code, but had an
    unintended side effect of allowing normal data block allocation to
    use the reserve pool. Hence normal block allocation could deplete
    the reserve pool and prevent unwritten extent conversion at ENOSPC,
    hence violating fallocate guarantees on preallocated space.

    Fix it by checking whether the incoming map from __xfs_get_blocks()
    spans an unwritten extent and only use the reserve pool if the
    allocation covers an unwritten extent.

    Signed-off-by: Dave Chinner
    Tested-by: Ross Zwisler
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • The return type "unsigned long" was used by the suffix_kstrtoint()
    function even though it will eventually return a negative error code.
    Improve this implementation detail by using the type "int" instead.

    This issue was detected by using the Coccinelle software.

    Signed-off-by: Markus Elfring
    Reviewed-by: Eric Sandeen
    Signed-off-by: Dave Chinner

    Markus Elfring
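
    The effect of the return type is easy to demonstrate in isolation.
    The two functions below are hypothetical stand-ins, not the XFS
    helper: they only show what happens to a negative error code when it
    is returned through an unsigned type.

```c
#include <errno.h>

/* Before: a negative errno squeezed through an unsigned return type. */
static unsigned long toy_parse_unsigned(void)
{
    return -EINVAL;  /* converted to a very large positive value */
}

/* After: a signed return type carries the error code unchanged. */
static int toy_parse_signed(void)
{
    return -EINVAL;
}
```

    A caller testing the unsigned variant with "< 0" would never see the
    error; the signed variant behaves as expected.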
     
  • Create xfs_btree_sblock_verify() to verify short-format btree blocks
    (i.e. the per-AG btrees with 32-bit block pointers) instead of
    open-coding them.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Because struct xfs_agfl is 36 bytes long and has a 64-bit integer
    inside it, gcc will quietly round the structure size up to the nearest
    64 bits -- in this case, 40 bytes. This results in the XFS_AGFL_SIZE
    macro returning incorrect results for v5 filesystems on 64-bit
    machines (118 items instead of 119). As a result, a 32-bit xfs_repair
    will see garbage in AGFL item 119 and complain.

    Therefore, tell gcc not to pad the structure so that the AGFL size
    calculation is correct.

    cc: # 3.10 - 4.4
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
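
    The padding effect can be reproduced with a structure of the same
    shape. The field names and the 512-byte sector size below are
    assumptions chosen to match the numbers in the message (36 bytes of
    fields including one 64-bit integer, 119 versus 118 slots); this is
    not the on-disk definition itself.

```c
#include <stddef.h>
#include <stdint.h>

/* 4 + 4 + 16 + 8 + 4 = 36 bytes of fields; may be padded up to 40. */
struct toy_agfl_padded {
    uint32_t magicnum;
    uint32_t seqno;
    uint8_t  uuid[16];
    uint64_t lsn;
    uint32_t crc;
};

/* Same fields, but the compiler is told not to add tail padding. */
struct toy_agfl_packed {
    uint32_t magicnum;
    uint32_t seqno;
    uint8_t  uuid[16];
    uint64_t lsn;
    uint32_t crc;
} __attribute__((packed));

#define TOY_SECTOR_SIZE 512u  /* assumed sector size */

/* Number of 32-bit free list slots left after the header. */
#define TOY_AGFL_SLOTS(type) \
    ((TOY_SECTOR_SIZE - sizeof(type)) / sizeof(uint32_t))
```

    On a typical 64-bit build, TOY_AGFL_SLOTS(struct toy_agfl_padded)
    comes out one slot short of the packed variant.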
     
  • Use a convenience variable instead of open-coding the inode fork.
    This isn't really needed for now, but will become important when we
    add the copy-on-write fork later.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Update the log ticket reservation type printing code to reflect
    all the types of log tickets, to avoid incorrect debug output and
    avoid running off the end of the array.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Since xfs_repair wants to use xfs_alloc_fix_freelist, remove the
    static designation. xfsprogs already has this; this simply brings
    the kernel up to date.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • There are no callers of the xfs_buf_ioend_async() function outside
    of fs/xfs/xfs_buf.c, so make it static.

    Signed-off-by: Alexander Kuleshov
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Alexander Kuleshov
     
  • Linux's quota subsystem has an ability to handle project quota. This
    commit just utilizes the ability from xfs side. dbus-monitor and
    quota_nld shipped as part of quota-tools can be used for testing.
    See the patch posting on the XFS list for details on testing.

    Signed-off-by: Masatake YAMATO
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Masatake YAMATO
     
  • In my earlier commit

    c29aad4 xfs: pass mp to XFS_WANT_CORRUPTED_GOTO

    I added some local mp variables with code which indicates that
    mp might be NULL. Coverity doesn't like this now, because the
    updated per-fs XFS_STATS macros dereference mp.

    I don't think this is actually a problem; from what I can tell,
    we cannot get to these functions with a null bma->tp, so my NULL
    check was probably pointless. Still, it's not super obvious.

    So switch this code to get mp from the inode on the xfs_bmalloca
    structure, with no conditional, because the functions are already
    using bma->ip directly.

    Addresses-Coverity-Id: 1339552
    Addresses-Coverity-Id: 1339553
    Signed-off-by: Eric Sandeen
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • This adds a name to each buf_ops structure, so that if
    a verifier fails we can print the type of verifier that
    failed it. Should be a slight debugging aid, I hope.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Eric Sandeen
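
    The idea of naming each ops structure can be sketched as follows.
    The struct layout, the toy_* names and the message format are all
    assumptions for illustration; the sketch only shows how a name field
    lets a generic failure path say which verifier rejected the buffer.

```c
#include <stdbool.h>
#include <stdio.h>

struct toy_buf {
    unsigned int magic;
};

/* Ops table with a human-readable name next to the verifier callback. */
struct toy_buf_ops {
    const char *name;
    bool (*verify_read)(const struct toy_buf *bp);
};

/* Example verifier: accepts only one magic value (illustrative). */
static bool toy_agf_verify(const struct toy_buf *bp)
{
    return bp->magic == 0x58414746u;
}

static const struct toy_buf_ops toy_agf_buf_ops = {
    .name = "toy_agf",
    .verify_read = toy_agf_verify,
};

/* Run the verifier; on failure, write a message naming it into msg. */
static bool toy_run_verifier(const struct toy_buf *bp,
                             const struct toy_buf_ops *ops,
                             char *msg, size_t len)
{
    if (ops->verify_read(bp)) {
        if (len > 0)
            msg[0] = '\0';
        return true;
    }
    snprintf(msg, len, "metadata verifier %s failed", ops->name);
    return false;
}
```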
     
  • If there is any non-zero bit in a long bitmap, the loop can exit and
    the function can return as soon as that bit is found.

    Signed-off-by: Jia He
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Jia He
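
    A minimal sketch of the early-exit form, assuming the bitmap is an
    array of unsigned words; toy_bitmap_empty() is a made-up name and not
    the kernel function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if no bit is set in any of the nwords words of map. */
static bool toy_bitmap_empty(const unsigned int *map, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        if (map[i] != 0)
            return false;  /* first non-zero word settles the answer */
    }
    return true;
}
```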
     
  • As part of the head/tail discovery process, log recovery locates the
    head block and then reverse seeks to find the start of the last active
    record in the log. This is non-trivial as the record itself could have
    wrapped around the end of the physical log. Log recovery torn write
    detection potentially needs to walk further behind the last record in
    the log, as multiple log I/Os can be in-flight at one time during a
    crash event.

    Therefore, refactor the reverse log record header search mechanism into
    a new helper that supports the ability to seek past an arbitrary number
    of log records (or until the tail is hit). Update the head/tail search
    mechanism to call the new helper, but otherwise there is no change in
    log recovery behavior.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
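
    The reverse search over a circular log can be sketched as below. It
    is a toy model under assumptions: the log is an array of block
    indices with a flag marking blocks that start a record, and the
    helper name and signature are invented for the example.

```c
#include <stdbool.h>

/*
 * Walk backwards from head over a circular log of nblocks blocks, stopping
 * after 'count' record headers or when the tail is reached. Returns the
 * number of headers found; *found_blk is set to the last one found.
 */
static int toy_rseek_rec_headers(const bool *is_hdr, int nblocks,
                                 int head, int tail, int count,
                                 int *found_blk)
{
    int found = 0;
    int blk = head;

    while (blk != tail && found < count) {
        /* Step back one block, wrapping past the start of the log. */
        blk = (blk + nblocks - 1) % nblocks;
        if (is_hdr[blk]) {
            found++;
            *found_blk = blk;
        }
    }
    return found;
}
```

    Asking for more headers than exist between the tail and the head
    simply returns the smaller count, which is the "or until the tail is
    hit" case.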