13 Jun, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "This is the bunch that sat in -next + lock_parent() fix. This is the
    minimal set; there's more pending stuff.

    In particular, I really hope to get acct.c fixes merged this cycle -
    we need that to deal sanely with delayed-mntput stuff. In the next
    pile, hopefully - that series is fairly short and localized
    (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
    iov_iter work. Most of the prereqs for ->splice_write with sane locking
    order are there and Kent's dio rewrite would also fit nicely on top of
    this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
    lock_parent: don't step on stale ->d_parent of all-but-freed one
    kill generic_file_splice_write()
    ceph: switch to iter_file_splice_write()
    shmem: switch to iter_file_splice_write()
    nfs: switch to iter_splice_write_file()
    fs/splice.c: remove unneeded exports
    ocfs2: switch to iter_file_splice_write()
    ->splice_write() via ->write_iter()
    bio_vec-backed iov_iter
    optimize copy_page_{to,from}_iter()
    bury generic_file_aio_{read,write}
    lustre: get rid of messing with iovecs
    ceph: switch to ->write_iter()
    ceph_sync_direct_write: stop poking into iov_iter guts
    ceph_sync_read: stop poking into iov_iter guts
    new helper: copy_page_from_iter()
    fuse: switch to ->write_iter()
    btrfs: switch to ->write_iter()
    ocfs2: switch to ->write_iter()
    xfs: switch to ->write_iter()
    ...

    Linus Torvalds
     

12 Jun, 2014

3 commits

  • Backmerge of dcache.c changes from mainline. It's that, or complete
    rebase...

    Conflicts:
    fs/splice.c

    Signed-off-by: Al Viro

    Al Viro
     
  • iter_file_splice_write() - a ->splice_write() instance that gathers the
    pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
    it to ->write_iter(). A bunch of simple cases converted to that...

    [AV: fixed the braino spotted by Cyrill]

    Signed-off-by: Al Viro

    Al Viro
     
  • Pull xfs updates from Dave Chinner:
    "This update contains:
    - cleanup removing unused function args
    - rework of the filestreams allocator to use dentry cache parent
    lookups
    - new on-disk free inode btree and optimised inode allocator
    - various bug fixes
    - rework of internal attribute API
    - cleanup of superblock feature bit support to remove historic cruft
    - more fixes and minor cleanups
    - added a new directory/attribute geometry abstraction
    - yet more fixes and minor cleanups"

    * tag 'xfs-for-linus-3.16-rc1' of git://oss.sgi.com/xfs/xfs: (86 commits)
    xfs: fix xfs_da_args sparse warning in xfs_readdir
    xfs: Fix rounding in xfs_alloc_fix_len()
    xfs: tone down writepage/releasepage WARN_ONs
    xfs: small cleanup in xfs_lowbit64()
    xfs: kill xfs_buf_geterror()
    xfs: xfs_readsb needs to check for magic numbers
    xfs: block allocation work needs to be kswapd aware
    xfs: remove redundant geometry information from xfs_da_state
    xfs: replace attr LBSIZE with xfs_da_geometry
    xfs: pass xfs_da_args to xfs_attr_leaf_newentsize
    xfs: use xfs_da_geometry for block size in attr code
    xfs: remove mp->m_dir_geo from directory logging
    xfs: reduce direct usage of mp->m_dir_geo
    xfs: move node entry counts to xfs_da_geometry
    xfs: convert dir/attr btree threshold to xfs_da_geometry
    xfs: convert m_dirblksize to xfs_da_geometry
    xfs: convert m_dirblkfsbs to xfs_da_geometry
    xfs: convert directory segment limits to xfs_da_geometry
    xfs: convert directory db conversion to xfs_da_geometry
    xfs: convert directory dablk conversion to xfs_da_geometry
    ...

    Linus Torvalds
     

11 Jun, 2014

1 commit

  • The kernel has no concept of capabilities with respect to inodes; inodes
    exist independently of namespaces. For example, inode_capable(inode,
    CAP_LINUX_IMMUTABLE) would be nonsense.

    This patch changes inode_capable to check for uid and gid mappings and
    renames it to capable_wrt_inode_uidgid, which should make it more
    obvious what it does.

    Fixes CVE-2014-4014.

    Cc: Theodore Ts'o
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Chinner
    Cc: stable@vger.kernel.org
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
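
The check described in this commit can be modelled in user space. The sketch below is not the kernel code: `ns_model`, `id_range` and `capable_wrt_ids_model` are names invented for illustration, and representing the capability as a single `has_cap` flag is an assumption. It only shows the shape of the rule, namely that holding a capability is not enough unless both the inode's uid and its gid are mapped in the caller's namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one mapped id range inside a user namespace. */
struct id_range {
	uint32_t first;
	uint32_t count;
};

/* Hypothetical model of the caller's namespace. */
struct ns_model {
	bool has_cap;                 /* caller holds the capability here */
	const struct id_range *uids;  /* mapped uid ranges */
	size_t nr_uids;
	const struct id_range *gids;  /* mapped gid ranges */
	size_t nr_gids;
};

static bool id_is_mapped(const struct id_range *r, size_t n, uint32_t id)
{
	for (size_t i = 0; i < n; i++)
		if (id >= r[i].first && id - r[i].first < r[i].count)
			return true;
	return false;
}

/* True only if the capability is held AND both inode ids are mapped. */
bool capable_wrt_ids_model(const struct ns_model *ns,
			   uint32_t i_uid, uint32_t i_gid)
{
	assert(ns != NULL);
	return ns->has_cap &&
	       id_is_mapped(ns->uids, ns->nr_uids, i_uid) &&
	       id_is_mapped(ns->gids, ns->nr_gids, i_gid);
}
```

In this model a caller with the capability but no mapping for the inode's owner is refused, which is the case the rename is meant to make obvious.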
     

10 Jun, 2014

3 commits


06 Jun, 2014

23 commits

  • Rounding in xfs_alloc_fix_len() is wrong. As the comment states, the
    result should be a number of the form (k*prod+mod); however, due to a
    sign mistake, the result is different. As a result, allocations on RAID
    arrays could be misaligned in some cases.

    This also seems to fix an occasional assertion failure:
    XFS_WANT_CORRUPTED_GOTO(rlen
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Jan Kara
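
As a worked example of the intended result, the sketch below trims a length down to the nearest value of the form k*prod+mod. It is a user-space illustration under assumptions: `trim_len_model` is a hypothetical name, and the log does not show which signs were wrong in the original code, so only the target arithmetic is modelled here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the largest value <= rlen of the form k*prod + mod.
 * If no such value exists (rlen is too small), rlen is returned unchanged.
 */
uint32_t trim_len_model(uint32_t rlen, uint32_t prod, uint32_t mod)
{
	assert(prod > 0 && mod < prod);

	uint32_t k = rlen % prod;

	if (k == mod)
		return rlen;              /* already of the right form */
	if (k > mod)
		return rlen - (k - mod);  /* drop the excess remainder */

	/* k < mod: step back into the previous multiple of prod */
	uint32_t step = prod - (mod - k);

	return rlen >= step ? rlen - step : rlen;
}
```

For instance, with prod = 8 and mod = 3, a length of 23 becomes 19 (2*8+3) and a length of 17 becomes 11 (1*8+3).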
     
  • I recently ran into the issue fixed by

    "xfs: kill buffers over failed write ranges properly"

    which spams the log with lots of backtraces. Make debugging any
    issues like that easier by using WARN_ON_ONCE in the writeback code.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • There are two checkpatch.pl complaints here because of the bad
    indenting and because of the assignment inside the condition.

    Signed-off-by: Dan Carpenter
    Reviewed-by: Eric Sandeen
    Signed-off-by: Dave Chinner

    Dan Carpenter
     
  • Most of the callers are just calling ASSERT(!xfs_buf_geterror())
    which means they are checking for bp->b_error == 0. If bp is null in
    this case, we will assert fail, and hence it's no different in
    result to oopsing because of a null bp. In some cases, errors have
    already been checked for or the function returning the buffer can't
    return a buffer with an error, so it's just a redundant assert.
    Either way, the assert can be removed.

    The other two non-assert callers can just test for a buffer and
    error properly.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Commit daba542 ("xfs: skip verification on initial "guess"
    superblock read") dropped the use of a verifier for the initial
    superblock read so we can probe the sector size of the filesystem
    stored in the superblock. It, however, now fails to validate that
    what was read initially is actually an XFS superblock and hence will
    fail the sector size check and return ENOSYS.

    This causes probe-based mounts to fail because it expects XFS to
    return EINVAL when it doesn't recognise the superblock format.

    cc:
    Reported-by: Plamen Petrov
    Tested-by: Plamen Petrov
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
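
The ordering this commit restores (recognise the superblock first, only then look at its fields) can be sketched as follows. This is a user-space illustration, not the XFS code: `probe_sb_model` and `SB_MAGIC_MODEL` are hypothetical names, and the sketch assumes the magic is the four bytes "XFSB" at the start of the buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SB_MAGIC_MODEL 0x58465342u  /* "XFSB" read as a big-endian value */

/* Validate the magic number before trusting any other superblock field. */
int probe_sb_model(const unsigned char *buf, size_t len)
{
	assert(buf != NULL);

	if (len < 4)
		return -EINVAL;

	uint32_t magic = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
			 ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

	if (magic != SB_MAGIC_MODEL)
		return -EINVAL;  /* unrecognised format: a probing mount can move on */

	return 0;                /* caller may now go on to check the sector size */
}
```

Returning EINVAL for an unrecognised buffer is what lets a probe-based mount try the next filesystem type instead of failing outright.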
     
  • Upon memory pressure, kswapd calls xfs_vm_writepage() from
    shrink_page_list(). This can result in delayed allocation occurring
    and that gets deferred to the allocation workqueue.

    The allocation then runs outside kswapd context, which means if it
    needs memory (and it does to demand page metadata from disk) it can
    block in shrink_inactive_list() waiting for IO congestion. These
    blocking waits are normally avoided in kswapd context, so under
    memory pressure writeback from kswapd can be arbitrarily delayed by
    memory reclaim.

    To avoid this, pass the kswapd context to the allocation being done
    by the workqueue, so that memory reclaim understands correctly that
    the work is being done for kswapd and therefore it is not blocked
    and does not delay memory reclaim.

    To avoid issues with int->char conversion of flag fields (as noticed
    in v1 of this patch) convert the flag fields in the struct
    xfs_bmalloca to bool types. pahole indicates these variables are
    still single byte variables, so no extra space is consumed by this
    change.

    cc:
    Reported-by: Tetsuo Handa
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • It's carried in state->args->geo, so there's no need to duplicate it
    and use more stack space than necessary.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • As it's only ever called from contexts where the xfs_da_args is
    present and contains all the information needed inside the args
    structure.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Rather than using the superblock value obtained through the
    xfs_mount.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • We don't pass the xfs_da_args or the geometry all the way down to
    the directory buffer logging code, hence we have to use
    mp->m_dir_geo here. Fix this to use the geometry passed via the
    xfs_da_args, and convert all the directory logging functions for
    consistency.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • There are many places in the directory code where we don't pass the
    args in and so have to extract the geometry directly from the mount
    structure. Push the args or the geometry into these leaf functions
    so that we don't need to grab it from the struct xfs_mount.

    This, in turn, brings us to the point where directory geometry is
    no longer a property of the struct xfs_mount; it is not a global
    property anymore, and hence we can start to consider per-directory
    configuration of physical geometries.

    Start by converting the xfs_dir_isblock/leaf code - pass in the
    xfs_da_args and convert the readdir code to use xfs_da_args like
    the rest of the directory code to pass information around.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • They are just simple wrappers around xfs_dir2_byte_to_db(), and
    we've already removed one usage earlier in the patch set. Kill
    the rest before we start removing the xfs_mount from conversion
    functions.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Because they aren't actually part of the on-disk format, and so
    shouldn't be in xfs_da_format.h.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • The directory code has a dependency on the struct xfs_mount to
    supply the directory block geometry. Block size, block log size,
    and other parameters are pre-calculated in the struct xfs_mount or
    accessed directly from the superblock embedded in the struct
    xfs_mount.

    Extract all of this geometry information out of the struct xfs_mount
    and superblock and place it into a new struct xfs_da_geometry
    defined by the directory code. Allocate and initialise it at mount
    time, and attach it to the struct xfs_mount so it can be passed back
    into the directory code appropriately rather than using the struct
    xfs_mount.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
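
The idea of computing the geometry once at mount time and handing it to the directory code can be sketched as below. This is a simplified user-space model: the struct and function names are hypothetical, only three fields are shown, and the sketch assumes the directory block size is derived from the filesystem block log plus a directory block log.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, reduced geometry structure owned by the directory code. */
struct da_geometry_model {
	unsigned blklog;    /* log2 of the directory block size in bytes */
	unsigned blksize;   /* directory block size in bytes */
	unsigned fsbcount;  /* filesystem blocks per directory block */
};

/* Hypothetical stand-in for the mount structure and its superblock fields. */
struct mount_model {
	uint8_t sb_blocklog;   /* log2 of the filesystem block size */
	uint8_t sb_dirblklog;  /* log2 of fs blocks per directory block */
	struct da_geometry_model *dir_geo;
};

/* Allocate and initialise the geometry once, at mount time. */
int dir_geometry_init_model(struct mount_model *mp)
{
	assert(mp != NULL && mp->sb_blocklog + mp->sb_dirblklog < 31);

	struct da_geometry_model *geo = calloc(1, sizeof(*geo));

	if (!geo)
		return -1;

	geo->blklog = mp->sb_blocklog + mp->sb_dirblklog;
	geo->blksize = 1u << geo->blklog;
	geo->fsbcount = 1u << mp->sb_dirblklog;
	mp->dir_geo = geo;
	return 0;
}

/* Leaf helper that needs only the geometry, not the whole mount. */
unsigned dir_byte_to_block_model(const struct da_geometry_model *geo,
				 uint64_t byte)
{
	return (unsigned)(byte >> geo->blklog);
}
```

Once leaf helpers take the geometry rather than the mount, nothing in them assumes there is a single global directory block size.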
     

20 May, 2014

9 commits

  • Conflicts:
    fs/xfs/xfs_inode.c

    Dave Chinner
     
  • Conflicts:
    fs/xfs/xfs_ialloc.c

    Dave Chinner
     
  • xfs_ialloc.h:102: error: expected ',' or '...' before 'delete'

    Simple parameter rename, no changes to behaviour.

    Signed-off-by: Roger Willcocks
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Roger Willcocks
     
  • Writing to a file at an offset greater than 16TB on a 32-bit system and
    then triggering page write-back via sync(1) causes a task hang.

    # block_size=4096
    # offset=$(((2**32 - 1) * $block_size))
    # xfs_io -f -c "pwrite $offset $block_size" /storage/test_file
    # sync

    INFO: task sync:2590 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    sync D c1064a28 0 2590 2097 0x00000000
    .....
    Call Trace:
    [] ? ttwu_do_wakeup+0x18/0x130
    [] ? try_to_wake_up+0x1ce/0x220
    [] ? wake_up_process+0x1f/0x40
    [] ? wake_up_worker+0x1e/0x30
    [] schedule+0x23/0x60
    [] schedule_timeout+0x18d/0x1f0
    [] ? do_raw_spin_unlock+0x4e/0x90
    [] ? __queue_delayed_work+0x91/0x150
    [] ? do_raw_spin_lock+0x3f/0x100
    [] ? do_raw_spin_unlock+0x4e/0x90
    [] wait_for_completion+0x7d/0xc0
    [] ? try_to_wake_up+0x220/0x220
    [] sync_inodes_sb+0x92/0x180
    [] sync_inodes_one_sb+0x15/0x20
    [] iterate_supers+0xb8/0xc0
    [] ? fdatawrite_one_bdev+0x20/0x20
    [] sys_sync+0x31/0x80
    [] sysenter_do_call+0x12/0x28

    This issue can be triggered via xfstests/generic/308.

    The reason is that end_index is an unsigned long, whose maximum value
    on a 32-bit platform is 2^32-1 = 4294967295. The given offset causes it
    to wrap to 0, so the following code repeats again and again until the
    task times out:

    end_index = offset >> PAGE_CACHE_SHIFT;
    last_index = (offset - 1) >> PAGE_CACHE_SHIFT;
    if (page->index >= end_index) {
        unsigned offset_into_page = offset & (PAGE_CACHE_SIZE - 1);
        /*
         * Just skip the page if it is fully outside i_size, e.g. due
         * to a truncate operation that is in progress.
         */
        if (page->index >= end_index + 1 || offset_into_page == 0) {
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            unlock_page(page);
            return 0;
        }

    To check whether a page is fully outside i_size, we can fix the code
    logic as below:
        if (page->index > end_index ||
            (page->index == end_index && offset_into_page == 0))

    Secondly, there is another similar issue when calculating the end
    offset for mapping the filesystem blocks to the file blocks for
    delalloc. With the same test as above, running umount(8) causes a
    kernel panic if CONFIG_XFS_DEBUG is enabled:

    XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || \
    ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 964

    kernel BUG at fs/xfs/xfs_message.c:108!
    invalid opcode: 0000 [#1] SMP
    task: edddc100 ti: ec6ee000 task.ti: ec6ee000
    EIP: 0060:[] EFLAGS: 00010296 CPU: 1
    EIP is at assfail+0x2b/0x30 [xfs]
    ..............
    Call Trace:
    [] xfs_fs_destroy_inode+0x74/0x120 [xfs]
    [] destroy_inode+0x31/0x50
    [] evict+0xef/0x170
    [] dispose_list+0x32/0x40
    [] evict_inodes+0xca/0xe0
    [] generic_shutdown_super+0x46/0xd0
    [] kill_block_super+0x29/0x70
    [] deactivate_locked_super+0x44/0x70
    [] deactivate_super+0x47/0x60
    [] mntput_no_expire+0xcd/0x120
    [] SyS_umount+0xa8/0x370
    [] SyS_oldumount+0x1e/0x20
    [] sysenter_do_call+0x12/0x28

    That is because end_offset evaluates to 0, for the same reason as
    above; hence the mapping and conversion of delalloc file blocks to
    filesystem blocks did not happen.

    This patch fixes both issues.

    Reported-by: Michael L. Semon
    Signed-off-by: Jie Liu
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Jie Liu
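
The page-skip test from this commit can be exercised in user space. This is only a sketch: `uint32_t` stands in for a 32-bit `unsigned long`, the function names are invented for illustration, and the scenario assumes `end_index` sits at its maximum so that `end_index + 1` wraps to 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old test: computes end_index + 1, which wraps to 0 at the 32-bit maximum. */
bool skip_page_old_model(uint32_t index, uint32_t end_index,
			 unsigned offset_into_page)
{
	assert(index >= end_index);  /* only reached for pages at or past EOF */
	return index >= (uint32_t)(end_index + 1) || offset_into_page == 0;
}

/* New test: same intent, but never adds 1 to end_index. */
bool skip_page_new_model(uint32_t index, uint32_t end_index,
			 unsigned offset_into_page)
{
	assert(index >= end_index);
	return index > end_index ||
	       (index == end_index && offset_into_page == 0);
}
```

For a page at index 2^32-1 that straddles the end of the file, the old test reports the page as fully outside i_size while the new test does not; for indices well below the maximum, the two tests agree.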
     
  • All of the verification checks of magic numbers are now done by
    verifiers, so there is no need to check them again once the buffer
    has been successfully read. If the magic number is bad, it won't
    even get to that code to verify it so it really serves no purpose at
    all anymore. Remove it.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • The addition of direct formatting of log items into the CIL
    linear buffer added alignment restrictions that the start of each
    vector needed to be 64 bit aligned. Hence padding was added in
    xlog_finish_iovec() to round up the vector length to ensure the next
    vector started with the correct alignment.

    This adds a small number of bytes to the size of
    the linear buffer that is otherwise unused. The issue is that we
    then use the linear buffer size to determine the log space used by
    the log item, and this includes the unused space. Hence when we
    account for space used by the log item, it's more than is actually
    written into the iclogs, and hence we slowly leak this space.

    This results in log hangs when reserving space, with threads getting
    stuck with these stack traces:

    Call Trace:
    [] schedule+0x29/0x70
    [] xlog_grant_head_wait+0xa2/0x1a0
    [] xlog_grant_head_check+0xbd/0x140
    [] xfs_log_reserve+0x103/0x220
    [] xfs_trans_reserve+0x2f5/0x310
    .....

    The 4 bytes is significant. Brian Foster did all the hard work in
    tracking down a reproducible leak to inode chunk allocation (it went
    away with the ikeep mount option). His rough numbers were that
    creating 50,000 inodes leaked 11 log blocks. This turns out to be
    roughly 800 inode chunks or 1600 inode cluster buffers. That
    works out at roughly 4 bytes per cluster buffer logged, and at that
    I started looking for a 4 byte leak in the buffer logging code.

    What I found was that a struct xfs_buf_log_format structure for an
    inode cluster buffer is 28 bytes in length. This gets rounded up to
    32 bytes, but the vector length remains 28 bytes. Hence the CIL
    ticket reservation is decremented by 32 bytes (via lv->lv_buf_len)
    for that vector rather than 28 bytes which are written into the log.

    The fix for this problem is to separately track the bytes used by
    the log vectors in the item and use that instead of the buffer
    length when accounting for the log space that will be used by the
    formatted log item.

    Again, thanks to Brian Foster for doing all the hard work and long
    hours to isolate this leak and make finding the bug relatively
    simple.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
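
The accounting change described in this commit can be sketched in a few lines. The names below (`log_vec_model`, `finish_iovec_model`) are hypothetical and the code is a user-space model, not the XFS implementation; it only shows the padded buffer length and the bytes actually written being tracked separately.

```c
#include <assert.h>
#include <stddef.h>

struct log_vec_model {
	size_t buf_len;  /* space consumed in the linear buffer, padding included */
	size_t bytes;    /* bytes that will actually be written to the log */
};

/* Round n up to the next multiple of 8 (64-bit alignment). */
static size_t round_up8(size_t n)
{
	return (n + 7) & ~(size_t)7;
}

/* Account for one formatted vector of len bytes. */
void finish_iovec_model(struct log_vec_model *lv, size_t len)
{
	assert(lv != NULL);
	lv->buf_len += round_up8(len);  /* keeps the next vector aligned */
	lv->bytes += len;               /* what the reservation should be charged */
}
```

A 28-byte vector advances `buf_len` by 32 but `bytes` by 28. Charging `buf_len` against the reservation therefore overstates usage by 4 bytes per vector, or 6400 bytes over 1600 such vectors.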
     
  • There is no need to dip into the reserve pool; it is used for much
    more important things. xfs_trans_reserve will never return ENOSPC
    because the hole punch is already done. If we do get ENOSPC, collapse
    range will simply fail.

    Cc: Brian Foster
    Signed-off-by: Namjae Jeon
    Signed-off-by: Ashish Sangwan
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Namjae Jeon
     
  • We reject any filesystem that is mounted with this feature bit set,
    so we don't need to check for it anywhere else. Remove the function
    for checking if the feature bit is set and any code that uses it.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jie Liu
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • If the V2 directory feature bit is not set in the superblock
    feature mask the filesystem will fail the good version check.
    Hence we don't need any other version checking on the dir2 feature
    bit in the code as the filesystem will not mount without it set.
    Remove the checking code.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner