07 Dec, 2011

2 commits

  • Apply the scheme used in log_regrant_write_log_space to wake up any other
    threads waiting for log space before the newly added one to
    log_regrant_write_log_space as well, and factor the code into readable
    helpers. For each of the queues we have add two helpers:

    - one to try to wake up all waiting threads. This helper will also be
    usable by xfs_log_move_tail once we remove the current opportunistic
    wakeups in it.
    - one to sleep on t_wait until enough log space is available, loosely
    modelled after Linux waitqueues.

    And use them to reimplement the guts of log_regrant_write_log_space and
    log_regrant_write_log_space. These two function now use one and the same
    algorithm for waiting on log space instead of subtly different ones before,
    with an option to completely unify them in the near future.

    Also move the filesystem shutdown handling to the common caller given
    that we had to touch it anyway.

    Based on hard debugging and an earlier patch from
    Chandra Seetharaman .

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Chandra Seetharaman
    Tested-by: Chandra Seetharaman
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • The i_ino field in the VFS inode is of type unsigned long and thus can't
    hold the full 64-bit inode number on 32-bit kernels. We have the full
    inode number in the XFS inode, so use that one for nfs exports. Note
    that I've also switched the 32-bit file handles types to it, just to make
    the code more consistent and copy & paste errors less likely to happen.

    Reported-by: Guoquan Yang
    Reported-by: Hank Peng
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

03 Dec, 2011

1 commit

  • When testing the new xfstests --large-fs option that does very large
    file preallocations, this assert was tripped deep in
    xfs_alloc_vextent():

    XFS: Assertion failed: args->minlen maxlen, file: fs/xfs/xfs_alloc.c, line: 2239

    The allocation was trying to allocate a zero length extent because
    the lower 32 bits of the allocation length was zero. The remaining
    length of the allocation to be done was an exact multiple of 2^32 -
    the first case I saw was at 496TB remaining to be allocated.

    This turns out to be an overflow when converting the allocation
    length (a 64 bit quantity) into the extent length to allocate (a 32
    bit quantity), and it requires the length to be allocated an exact
    multiple of 2^32 blocks to trip the assert.

    Fix it by limiting the extent lenth to allocate to MAXEXTLEN.

    Signed-off-by: Dave Chinner
    Signed-off-by: Ben Myers
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

30 Nov, 2011

2 commits

  • With Dmitry fsstress updates I've seen very reproducible crashes in
    xfs_attr_shortform_remove because xfs_attr_shortform_bytesfit claims that
    the attributes would not fit inline into the inode after removing an
    attribute. It turns out that we were operating on an inode with lots
    of delalloc extents, and thus an if_bytes values for the data fork that
    is larger than biggest possible on-disk storage for it which utterly
    confuses the code near the end of xfs_attr_shortform_bytesfit.

    Fix this by always allowing the current attribute fork, like we already
    do for the attr1 format, given that delalloc conversion will take care
    for moving either the data or attribute area out of line if it doesn't
    fit at that point - or making the point moot by merging extents at this
    point.

    Also document the function better, and clean up some loose bits.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • If we are doing synchronous inode reclaim we block the VM from making
    progress in memory reclaim. So if we encouter a flush locked inode
    promote it in the delwri list and wake up xfsbufd to write it out now.
    Without this we can get hangs of up to 30 seconds during workloads hitting
    synchronous inode reclaim.

    The scheme is copied from what we do for dquot reclaims.

    Reported-by: Simon Kirby
    Signed-off-by: Christoph Hellwig
    Tested-by: Simon Kirby
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

29 Nov, 2011

1 commit


16 Nov, 2011

1 commit

  • The doalloc arg in xfs_qm_dqattach_one() is a flag that indicates
    whether a new area to handle quota information will be allocated
    if needed. Originally, it was passed to xfs_qm_dqget(), but has
    been removed by the following commit (probably by mistake):

    commit 8e9b6e7fa4544ea8a0e030c8987b918509c8ff47
    Author: Christoph Hellwig
    Date: Sun Feb 8 21:51:42 2009 +0100

    xfs: remove the unused XFS_QMOPT_DQLOCK flag

    As the result, xfs_qm_dqget() called from xfs_qm_dqattach_one()
    never allocates the new area even if it is needed.

    This patch gives the doalloc arg to xfs_qm_dqget() in
    xfs_qm_dqattach_one() to fix this problem.

    Signed-off-by: Mitsuo Hayasaka
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Mitsuo Hayasaka
     

09 Nov, 2011

3 commits

  • Ensure ioend->io_error gets propagated back to e.g. AIO completions.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Christoph Hellwig
     
  • The log item ops aren't nessecarily the biggest exploit vector, but marking
    them const is easy enough. Also remove the unused xfs_item_ops_t typedef
    while we're at it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Alex Elder

    Christoph Hellwig
     
  • Fixes a possible memory corruption when the link is larger than
    MAXPATHLEN and XFS_DEBUG is not enabled. This also remove the
    S_ISLNK assert, since the inode mode is checked previously in
    xfs_readlink_by_handle() and via VFS.

    Updated to address concerns raised by Ben Hutchings about the loose
    attention paid to 32- vs 64-bit values, and the lack of handling a
    potentially negative pathlen value:
    - Changed type of "pathlen" to be xfs_fsize_t, to match that of
    ip->i_d.di_size
    - Added checking for a negative pathlen to the too-long pathlen
    test, and generalized the message that gets reported in that case
    to reflect the change
    As a result, if a negative pathlen were encountered, this function
    would return EFSCORRUPTED (and would fail an assertion for a debug
    build)--just as would a too-long pathlen.

    Signed-off-by: Alex Elder
    Signed-off-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig

    Carlos Maiolino
     

02 Nov, 2011

1 commit


01 Nov, 2011

2 commits

  • Standardize the style for compiler based printf format verification.
    Standardized the location of __printf too.

    Done via script and a little typing.

    $ grep -rPl --include=*.[ch] -w "__attribute__" * | \
    grep -vP "^(tools|scripts|include/linux/compiler-gcc.h)" | \
    xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s*\(\s*\(\s*format\s*\(\s*printf\s*,\s*(.+)\s*,\s*(.+)\s*\)\s*\)\s*\)/__printf($1, $2)/g ; print; }'

    [akpm@linux-foundation.org: revert arch bits]
    Signed-off-by: Joe Perches
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Direct reclaim should never writeback pages. For now, handle the
    situation and warn about it. Ultimately, this will be a BUG_ON.

    Signed-off-by: Mel Gorman
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Johannes Weiner
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

29 Oct, 2011

1 commit

  • * 'for-linus' of git://oss.sgi.com/xfs/xfs: (69 commits)
    xfs: add AIL pushing tracepoints
    xfs: put in missed fix for merge problem
    xfs: do not flush data workqueues in xfs_flush_buftarg
    xfs: remove XFS_bflush
    xfs: remove xfs_buf_target_name
    xfs: use xfs_ioerror_alert in xfs_buf_iodone_callbacks
    xfs: clean up xfs_ioerror_alert
    xfs: clean up buffer allocation
    xfs: remove buffers from the delwri list in xfs_buf_stale
    xfs: remove XFS_BUF_STALE and XFS_BUF_SUPER_STALE
    xfs: remove XFS_BUF_SET_VTYPE and XFS_BUF_SET_VTYPE_REF
    xfs: remove XFS_BUF_FINISH_IOWAIT
    xfs: remove xfs_get_buftarg_list
    xfs: fix buffer flushing during unmount
    xfs: optimize fsync on directories
    xfs: reduce the number of log forces from tail pushing
    xfs: Don't allocate new buffers on every call to _xfs_buf_find
    xfs: simplify xfs_trans_ijoin* again
    xfs: unlock the inode before log force in xfs_change_file_space
    xfs: unlock the inode before log force in xfs_fs_nfs_commit_metadata
    ...

    Linus Torvalds
     

25 Oct, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits)
    MAINTAINERS: linux-m32r is moderated for non-subscribers
    linux@lists.openrisc.net is moderated for non-subscribers
    Drop default from "DM365 codec select" choice
    parisc: Kconfig: cleanup Kernel page size default
    Kconfig: remove redundant CONFIG_ prefix on two symbols
    cris: remove arch/cris/arch-v32/lib/nand_init.S
    microblaze: add missing CONFIG_ prefixes
    h8300: drop puzzling Kconfig dependencies
    MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers
    tty: drop superfluous dependency in Kconfig
    ARM: mxc: fix Kconfig typo 'i.MX51'
    Fix file references in Kconfig files
    aic7xxx: fix Kconfig references to READMEs
    Fix file references in drivers/ide/
    thinkpad_acpi: Fix printk typo 'bluestooth'
    bcmring: drop commented out line in Kconfig
    btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888'
    doc: raw1394: Trivial typo fix
    CIFS: Don't free volume_info->UNC until we are entirely done with it.
    treewide: Correct spelling of successfully in comments
    ...

    Linus Torvalds
     
  • * 'next' of git://selinuxproject.org/~jmorris/linux-security: (95 commits)
    TOMOYO: Fix incomplete read after seek.
    Smack: allow to access /smack/access as normal user
    TOMOYO: Fix unused kernel config option.
    Smack: fix: invalid length set for the result of /smack/access
    Smack: compilation fix
    Smack: fix for /smack/access output, use string instead of byte
    Smack: domain transition protections (v3)
    Smack: Provide information for UDS getsockopt(SO_PEERCRED)
    Smack: Clean up comments
    Smack: Repair processing of fcntl
    Smack: Rule list lookup performance
    Smack: check permissions from user space (v2)
    TOMOYO: Fix quota and garbage collector.
    TOMOYO: Remove redundant tasklist_lock.
    TOMOYO: Fix domain transition failure warning.
    TOMOYO: Remove tomoyo_policy_memory_lock spinlock.
    TOMOYO: Simplify garbage collector.
    TOMOYO: Fix make namespacecheck warnings.
    target: check hex2bin result
    encrypted-keys: check hex2bin result
    ...

    Linus Torvalds
     

19 Oct, 2011

2 commits


18 Oct, 2011

1 commit


12 Oct, 2011

21 commits

  • When we call xfs_flush_buftarg (generally from sync or umount) it already
    is too late to flush the data workqueues, as I/O completion is signalled
    for them and we are thus already done with the data we would flush here.

    There are places where flushing them might be useful, but the current
    sync interface doesn't give us that opportunity.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • The calling convention that returns a pointer to a static buffer is
    fairly nasty, so just opencode it in the only caller that is left.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Use xfs_ioerror_alert instead of opencoding a very similar error
    message.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Instead of passing the block number and mount structure explicitly
    get them off the bp and fix make the argument order more natural.

    Also move it to xfs_buf.c and stop printing the device name given
    that we already get the fs name as part of xfs_alert, and we know
    what device is operates on because of the caller that gets printed,
    finally rename it to xfs_buf_ioerror_alert and pass __func__ as
    argument where it makes sense.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Change _xfs_buf_initialize to allocate the buffer directly and rename it to
    xfs_buf_alloc now that is the only buffer allocation routine. Also remove
    the xfs_buf_deallocate wrapper around the kmem_zone_free calls for buffers.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • For each call to xfs_buf_stale we call xfs_buf_delwri_dequeue either
    directly before or after it, or are guaranteed by the surrounding
    conditionals that we are never called on delwri buffers. Simply
    this situation by moving the call to xfs_buf_delwri_dequeue into
    xfs_buf_stale.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • The code is unused and under a config option that doesn't exist, remove it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • The code to flush buffers in the umount code is a bit iffy: we first
    flush all delwri buffers out, but then might be able to queue up a
    new one when logging the sb counts. On a normal shutdown that one
    would get flushed out when doing the synchronous superblock write in
    xfs_unmountfs_writesb, but we skip that one if the filesystem has
    been shut down.

    Fix this by moving the delwri list flushing until just before unmounting
    the log, and while we're at it also remove the superflous delwri list
    and buffer lru flusing for the rt and log device that can never have
    cached or delwri buffers.

    Signed-off-by: Christoph Hellwig
    Reported-by: Amit Sahrawat
    Tested-by: Amit Sahrawat
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Directories are only updated transactionally, which means fsync only
    needs to flush the log the inode is currently dirty, but not bother
    with checking for dirty data, non-transactional updates, and most
    importanly doesn't have to flush disk caches except as part of a
    transaction commit.

    While the first two optimizations can't easily be measured, the
    latter actually makes a difference when doing lots of fsync that do
    not actually have to commit the inode, e.g. because an earlier fsync
    already pushed the log far enough.

    The new xfs_dir_fsync is identical to xfs_nfs_commit_metadata except
    for the prototype, but I'm not sure creating a common helper for the
    two is worth it given how simple the functions are.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • The AIL push code will issue a log force on ever single push loop
    that it exits and has encountered pinned items. It doesn't rescan
    these pinned items until it revisits the AIL from the start. Hence
    we only need to force the log once per walk from the start of the
    AIL to the target LSN.

    This results in numbers like this:

    xs_push_ail_flush..... 1456
    xs_log_force......... 1485

    For an 8-way 50M inode create workload - almost all the log forces
    are coming from the AIL pushing code.

    Reduce the number of log forces by only forcing the log if the
    previous walk found pinned buffers. This reduces the numbers to:

    xs_push_ail_flush..... 665
    xs_log_force......... 682

    For the same test.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Stats show that for an 8-way unlink @ ~80,000 unlinks/s we are doing
    ~1 million cache hit lookups to ~3000 buffer creates. That's almost
    3 orders of magnitude more cahce hits than misses, so optimising for
    cache hits is quite important. In the cache hit case, we do not need
    to allocate a new buffer in case of a cache miss, so we are
    effectively hitting the allocator for no good reason for vast the
    majority of calls to _xfs_buf_find. 8-way create workloads are
    showing similar cache hit/miss ratios.

    The result is profiles that look like this:

    samples pcnt function DSO
    _______ _____ _______________________________ _________________

    1036.00 10.0% _xfs_buf_find [kernel.kallsyms]
    582.00 5.6% kmem_cache_alloc [kernel.kallsyms]
    519.00 5.0% __memcpy [kernel.kallsyms]
    468.00 4.5% __ticket_spin_lock [kernel.kallsyms]
    388.00 3.7% kmem_cache_free [kernel.kallsyms]
    331.00 3.2% xfs_log_commit_cil [kernel.kallsyms]

    Further, there is a fair bit of work involved in initialising a new
    buffer once a cache miss has occurred and we currently do that under
    the rbtree spinlock. That increases spinlock hold time on what are
    heavily used trees.

    To fix this, remove the initialisation of the buffer from
    _xfs_buf_find() and only allocate the new buffer once we've had a
    cache miss. Initialise the buffer immediately after allocating it in
    xfs_buf_get, too, so that is it ready for insert if we get another
    cache miss after allocation. This minimises lock hold time and
    avoids unnecessary allocator churn. The resulting profiles look
    like:

    samples pcnt function DSO
    _______ _____ ___________________________ _________________

    8111.00 9.1% _xfs_buf_find [kernel.kallsyms]
    4380.00 4.9% __memcpy [kernel.kallsyms]
    4341.00 4.8% __ticket_spin_lock [kernel.kallsyms]
    3401.00 3.8% kmem_cache_alloc [kernel.kallsyms]
    2856.00 3.2% xfs_log_commit_cil [kernel.kallsyms]
    2625.00 2.9% __kmalloc [kernel.kallsyms]
    2380.00 2.7% kfree [kernel.kallsyms]
    2016.00 2.3% kmem_cache_free [kernel.kallsyms]

    Showing a significant reduction in time spent doing allocation and
    freeing from slabs (kmem_cache_alloc and kmem_cache_free).

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • There is no reason to keep a reference to the inode even if we unlock
    it during transaction commit because we never drop a reference between
    the ijoin and commit. Also use this fact to merge xfs_trans_ijoin_ref
    back into xfs_trans_ijoin - the third argument decides if an unlock
    is needed now.

    I'm actually starting to wonder if allowing inodes to be unlocked
    at transaction commit really is worth the effort. The only real
    benefit is that they can be unlocked earlier when commiting a
    synchronous transactions, but that could be solved by doing the
    log force manually after the unlock, too.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Let the transaction commit unlock the inode before it potentially causes
    a synchronous log force.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Only read the LSN we need to push to with the ilock held, and then release
    it before we do the log force to improve concurrency.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Only read the LSN we need to push to with the ilock held, and then release
    it before we do the log force to improve concurrency.

    This also removes the only direct caller of _xfs_trans_commit, thus
    allowing it to be merged into the plain xfs_trans_commit again.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • XFS_TRANS_SWAPEXT is a transaction type, not a flag for xfs_trans_commit, so
    don't pass it in xfs_swap_extents.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • In xfs_ioc_trim it is possible that computing the last allocation group
    to discard might overflow for big start & len values, because the result
    might be bigger then xfs_agnumber_t which is 32 bit long. Fix this by not
    allowing the start and end block of the range to be beyond the end of the
    file system.

    Note that if the start is beyond the end of the file system we have to
    return -EINVAL, but in the "end" case we have to truncate it to the fs
    size.

    Also introduce "end" variable, rather than using start+len which which
    might be more confusing to get right as this bug shows.

    Signed-off-by: Lukas Czerner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Lukas Czerner