28 Jan, 2011

8 commits

  • When filling in the middle of a previous delayed allocation in
    xfs_bmap_add_extent_delay_real, set br_startblock of the new delayed
    extent on the right to nullstartblock instead of 0 before inserting
    the extent into the ifork (xfs_iext_insert), rather than setting
    br_startblock afterward.

    Adding the extent to the ifork with br_startblock=0 can lead to
    the extent being copied into the btree by xfs_bmap_extent_to_btree
    if we happen to convert from extents format to btree format before
    updating br_startblock with the correct value. The unexpected
    addition of this delayed extent to the btree can cause a subsequent
    XFS_WANT_CORRUPTED_GOTO filesystem shutdown in several
    xfs_bmap_add_extent_delay_real cases where we are converting a
    delayed extent to a real one and unexpectedly find an extent already
    inserted. For example:

    911     case BMAP_LEFT_FILLING:
    912             /*
    913              * Filling in the first part of a previous delayed allocation.
    914              * The left neighbor is not contiguous.
    915              */
    916             trace_xfs_bmap_pre_update(ip, idx, state, _THIS_IP_);
    917             xfs_bmbt_set_startoff(ep, new_endoff);
    918             temp = PREV.br_blockcount - new->br_blockcount;
    919             xfs_bmbt_set_blockcount(ep, temp);
    920             xfs_iext_insert(ip, idx, 1, new, state);
    921             ip->i_df.if_lastex = idx;
    922             ip->i_d.di_nextents++;
    923             if (cur == NULL)
    924                     rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
    925             else {
    926                     rval = XFS_ILOG_CORE;
    927                     if ((error = xfs_bmbt_lookup_eq(cur, new->br_startoff,
    928                                     new->br_startblock, new->br_blockcount,
    929                                     &i)))
    930                             goto done;
    931                     XFS_WANT_CORRUPTED_GOTO(i == 0, done);

    With the bogus extent in the btree we shut down the filesystem at
    line 931. The conversion from extents to btree format happens when
    the number of extents in the inode increases above
    ip->i_df.if_ext_max. xfs_bmap_extent_to_btree copies extents from
    the ifork into the btree, ignoring all delalloc extents, which are
    denoted by br_startblock holding a nullstartblock value.
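
    A minimal sketch of the fixed ordering (the variable and indlen
    names here are illustrative, not the exact ones used in
    xfs_bmap_add_extent_delay_real):

        /* mark the new delayed extent before it becomes visible */
        right.br_startblock = nullstartblock(indlen);
        xfs_iext_insert(ip, idx, 1, &right, state);
        /*
         * An extents-to-btree conversion triggered by the insert now
         * sees isnullstartblock() and correctly skips this extent.
         */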

    SGI-PV: 1013221

    Signed-off-by: Ben Myers
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    bpm@sgi.com
     
  • Commit 368e136 ("xfs: remove duplicate code from dquot reclaim") fails
    to unlock the dquot freelist when the number of loop restarts is
    exceeded in xfs_qm_dqreclaim_one(). This causes hangs in memory
    reclaim.

    Rework the loop control logic into an unwind stack that all the
    different cases jump into. This means there is only one set of code
    that processes the loop exit criteria, and simplifies the unlocking
    of all the items from different points in the loop. It also fixes a
    double increment of the restart counter from the qi_dqlist_lock
    case.
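
    The shape of the rework, as an illustrative fragment (lock and
    label names are invented; the point is that every failure path
    jumps into a single unwind stack):

        if (!trylock_a())
                goto restart;
        if (!trylock_b())
                goto out_a;
        if (!trylock_c())
                goto out_b;

        do_reclaim_work();              /* all three locks held */
        unlock_c();
        out_b:
        unlock_b();
        out_a:
        unlock_a();
        restart:
        if (++restarts > MAX_RESTARTS)  /* incremented in one place, */
                return -EAGAIN;         /* with everything unlocked  */
        /* otherwise loop back and retry */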

    Reported-by: Malcolm Scott
    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • Failure to commit a transaction into the CIL is not handled
    correctly. This currently can only happen when racing with a
    shutdown and requires an explicit shutdown check, so it is rare and
    can be avoided. Remove the shutdown check and make the CIL commit a
    void function to indicate it will always succeed, thereby removing
    the incorrectly handled failure case.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • The extent size hint can be set to larger than an AG. This means
    that the alignment process can push the range to be allocated
    outside the bounds of the AG, resulting in assert failures or
    corrupted bmbt records. Similarly, if the extsize is larger than the
    maximum extent size supported, the alignment process will produce
    extents that are too large to fit into the bmbt records, resulting
    in a different type of assert/corruption failure.

    Fix this by limiting extsize at the time it is set: first to be
    less than MAXEXTLEN, then to a maximum of half the size of the
    AGs in the filesystem for non-realtime inodes. Realtime inodes do
    not allocate out of AGs, so they do not have to be restricted by
    the size of AGs.
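
    A sketch of the checks at the time the hint is set (the superblock
    field names are approximate; this is not the verbatim patch):

        /* bmbt records cannot hold extents longer than MAXEXTLEN */
        if (extsize > MAXEXTLEN)
                return -EINVAL;

        /* non-realtime inodes allocate from AGs: cap at half an AG */
        if (!XFS_IS_REALTIME_INODE(ip) &&
            extsize > mp->m_sb.sb_agblocks / 2)
                return -EINVAL;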

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • When doing delayed allocation, if the allocation size is for a
    maximally sized extent, extent size alignment can push it over this
    limit. This results in an assert failure in xfs_bmbt_set_allf() as
    the extent length is too large to fit in the extent record.

    Fix this by allowing for the space that extent size alignment
    requires (up to 2 * (extsize - 1) blocks, as we have to handle both
    head and tail alignment) when limiting the maximum size of the
    extent.
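
    The headroom calculation, sketched with illustrative names:

        /*
         * Worst case, alignment grows the extent by (extsize - 1)
         * blocks at the head and again at the tail, so leave room
         * for both when clamping to MAXEXTLEN.
         */
        if (extsize && alen > MAXEXTLEN - 2 * (extsize - 1))
                alen = MAXEXTLEN - 2 * (extsize - 1);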

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • Delayed allocation extents can be larger than AGs, so when trying to
    convert a large range we may scan every AG inside
    xfs_bmap_alloc_nullfb() looking for an AG that can hold an
    allocation larger than any AG can provide. We should stop when we
    find the first AG with the maximum possible allocation size. This
    causes excessive CPU usage when there are lots of AGs.

    The same problem occurs when doing preallocation of a range larger
    than an AG.

    Fix the problem by limiting real allocation lengths to the maximum
    that an AG can support. This means if we have empty AGs, we'll stop
    the search at the first of them. If there are no empty AGs, we'll
    still scan them all, but that is a different problem....
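
    The clamp, sketched (sb_agblocks is the AG size in blocks; the
    real patch also has to account for per-AG space that can never be
    allocated):

        /* no single allocation can exceed what one AG provides */
        if (len > mp->m_sb.sb_agblocks)
                len = mp->m_sb.sb_agblocks;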

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • rounddown_power_of_2() returns an undefined result when passed a
    value of zero. The speculative delayed allocation code is doing this
    when the inode is zero length. Hence occasionally the preallocation
    is much, much larger than necessary (e.g. 8GB for a 270 _byte_
    file). Ensure we never pass a zero value to this function so the
    result of preallocation is always the desired size.
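
    The guard, sketched (rounddown_pow_of_two() is the kernel helper
    the message refers to; alloc_blocks is an illustrative name):

        /* rounddown_pow_of_two(0) is undefined, so skip zero sizes */
        if (alloc_blocks)
                alloc_blocks = rounddown_pow_of_two(alloc_blocks);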

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • After test 139, kmemleak shows:

    unreferenced object 0xffff880078b405d8 (size 400):
    comm "xfs_io", pid 4904, jiffies 4294909383 (age 1186.728s)
    hex dump (first 32 bytes):
    60 c1 17 79 00 88 ff ff 60 c1 17 79 00 88 ff ff `..y....`..y....
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x2d/0x60
    [] kmem_cache_alloc+0x13f/0x2b0
    [] kmem_zone_alloc+0x77/0xf0
    [] kmem_zone_zalloc+0x1e/0x50
    [] xfs_efi_init+0x4b/0xb0
    [] xfs_trans_get_efi+0x58/0x90
    [] xfs_bmap_finish+0x8b/0x1d0
    [] xfs_itruncate_finish+0x2c4/0x5d0
    [] xfs_setattr+0x8df/0xa70
    [] xfs_vn_setattr+0x1b/0x20
    [] notify_change+0x170/0x2e0
    [] do_truncate+0x66/0xa0
    [] sys_ftruncate+0xdb/0xe0
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    The cause of the leak is that the "remove" parameter of IOP_UNPIN()
    is never set when a CIL push is aborted. This means that the EFI
    item is never freed if it was in the push being cancelled. The
    problem is specific to delayed logging, but has uncovered a couple
    of problems with the handling of IOP_UNPIN(remove).

    Firstly, we cannot safely call xfs_trans_del_item() from IOP_UNPIN()
    in the CIL commit failure path or the iclog write failure path,
    because for delayed logging we have no transaction context. Hence we
    must only call xfs_trans_del_item() if the log item being unpinned
    has an active log item descriptor.

    Secondly, xfs_trans_uncommit() does not handle log item descriptor
    freeing during the traversal of log items on a transaction. It can
    reference a freed log item descriptor when unpinning an EFI item.
    Hence it needs to use a safe list traversal method to allow items to
    be removed from the transaction during IOP_UNPIN().
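
    The traversal fix, sketched (descriptor and field names are
    approximate, not the verbatim patch):

        struct xfs_log_item_desc *lidp, *next;

        /*
         * The _safe variant is required: IOP_UNPIN(remove) may free
         * the current descriptor, so the next entry is saved first.
         */
        list_for_each_entry_safe(lidp, next, &tp->t_items, lid_trans)
                lidp->lid_item->li_ops->iop_unpin(lidp->lid_item, 1);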

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder

    Dave Chinner
     

27 Jan, 2011

1 commit

  • The kmemleak detector shows this after test 139:

    unreferenced object 0xffff880079b88bb0 (size 264):
    comm "xfs_io", pid 4904, jiffies 4294909382 (age 276.824s)
    hex dump (first 32 bytes):
    00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
    ff ff ff ff ff ff ff ff 48 7b c9 82 ff ff ff ff ........H{......
    backtrace:
    [] kmemleak_alloc+0x2d/0x60
    [] kmem_cache_alloc+0x13f/0x2b0
    [] kmem_zone_alloc+0x77/0xf0
    [] kmem_zone_zalloc+0x1e/0x50
    [] xlog_ticket_alloc+0x34/0x170
    [] xlog_cil_push+0xa4/0x3f0
    [] xlog_cil_force_lsn+0x15a/0x160
    [] _xfs_log_force_lsn+0x75/0x2d0
    [] _xfs_trans_commit+0x2bd/0x2f0
    [] xfs_iomap_write_allocate+0x1ad/0x350
    [] xfs_map_blocks+0x21f/0x370
    [] xfs_vm_writepage+0x1c7/0x550
    [] __writepage+0x1a/0x50
    [] write_cache_pages+0x1c2/0x4c0
    [] generic_writepages+0x27/0x30
    [] xfs_vm_writepages+0x5d/0x80

    By inspection, the leak occurs when xlog_write() returns an error
    and we jump to the abort path without dropping the reference on the
    active ticket.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     

18 Jan, 2011

1 commit

  • On platforms that call panic() inside their BUG() macro (m68k/sun3, and
    all platforms that don't set HAVE_ARCH_BUG), compilation fails with:

    | fs/xfs/support/debug.c: In function ‘xfs_cmn_err’:
    | fs/xfs/support/debug.c:92: error: called object ‘panic’ is not a function

    as the local variable "panic" conflicts with the "panic()" function.
    Rename the local variable to resolve this.

    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     

17 Jan, 2011

2 commits

  • Currently all filesystems except XFS implement fallocate asynchronously,
    while XFS forces a commit. Both of these are suboptimal - in case of O_SYNC
    I/O we really want our allocation on disk, especially for the !KEEP_SIZE
    case where we actually grow the file with user-visible zeroes. On the
    other hand always committing the transaction is a bad idea for fast-path
    uses of fallocate like for example in recent Samba versions. Given
    that block allocation is a data plane operation anyway, change it from
    an inode operation to a file operation so that we have the file structure
    available, which lets us check for O_SYNC.

    This also includes moving the code around for a few of the filesystems,
    and removing the now-unneeded S_ISDIR checks, given that we only wire
    up fallocate for regular files.
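
    A sketch of what having the file structure enables (the helpers
    called inside the function are hypothetical):

        static long xfs_file_fallocate(struct file *file, int mode,
                                       loff_t offset, loff_t len)
        {
                int error;

                error = allocate_file_blocks(file, mode, offset, len);

                /* with a struct file we can honour O_SYNC semantics */
                if (!error && (file->f_flags & O_SYNC))
                        error = commit_allocation_sync(file);
                return error;
        }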

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Instead of various home grown checks that might need updates for new
    flags just check for any bit outside the mask of the features supported
    by the filesystem. This makes the check future proof for any newly
    added flag.
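
    The mask-based check, sketched (the supported-set macro is
    illustrative):

        #define FALLOC_FL_SUPPORTED \
                (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)

        /* any unknown bit, present or future, fails the same way */
        if (mode & ~FALLOC_FL_SUPPORTED)
                return -EOPNOTSUPP;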

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

15 Jan, 2011

1 commit

  • * 'for-linus' of git://oss.sgi.com/xfs/xfs:
    xfs: prevent NMI timeouts in cmn_err
    xfs: Add log level to assertion printk
    xfs: fix an assignment within an ASSERT()
    xfs: fix error handling for synchronous writes
    xfs: add FITRIM support
    xfs: ensure log covering transactions are synchronous
    xfs: serialise unaligned direct IOs
    xfs: factor common write setup code
    xfs: split buffered IO write path from xfs_file_aio_write
    xfs: split direct IO write path from xfs_file_aio_write
    xfs: introduce xfs_rw_lock() helpers for locking the inode
    xfs: factor post-write newsize updates
    xfs: factor common post-write isize handling code
    xfs: ensure sync write errors are returned

    Linus Torvalds
     

14 Jan, 2011

3 commits

  • * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
    block: ensure that completion error gets properly traced
    blktrace: add missing probe argument to block_bio_complete
    block cfq: don't use atomic_t for cfq_group
    block cfq: don't use atomic_t for cfq_queue
    block: trace event block fix unassigned field
    block: add internal hd part table references
    block: fix accounting bug on cross partition merges
    kref: add kref_test_and_get
    bio-integrity: mark kintegrityd_wq highpri and CPU intensive
    block: make kblockd_workqueue smarter
    Revert "sd: implement sd_check_events()"
    block: Clean up exit_io_context() source code.
    Fix compile warnings due to missing removal of a 'ret' variable
    fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
    block: convert !IS_ERR(p) && p to !IS_ERR_OR_NULL(p)
    cfq-iosched: don't check cfqg in choose_service_tree()
    fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
    cdrom: export cdrom_check_events()
    sd: implement sd_check_events()
    sr: implement sr_check_events()
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits)
    fs: add documentation on fallocate hole punching
    Gfs2: fail if we try to use hole punch
    Btrfs: fail if we try to use hole punch
    Ext4: fail if we try to use hole punch
    Ocfs2: handle hole punching via fallocate properly
    XFS: handle hole punching via fallocate properly
    fs: add hole punching to fallocate
    vfs: pass struct file to do_truncate on O_TRUNC opens (try #2)
    fix signedness mess in rw_verify_area() on 64bit architectures
    fs: fix kernel-doc for dcache::prepend_path
    fs: fix kernel-doc for dcache::d_validate
    sanitize ecryptfs ->mount()
    switch afs
    move internal-only parts of ncpfs headers to fs/ncpfs
    switch ncpfs
    switch 9p
    pass default dentry_operations to mount_pseudo()
    switch hostfs
    switch affs
    switch configfs
    ...

    Linus Torvalds
     
  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
    Documentation/trace/events.txt: Remove obsolete sched_signal_send.
    writeback: fix global_dirty_limits comment runtime -> real-time
    ppc: fix comment typo singal -> signal
    drivers: fix comment typo diable -> disable.
    m68k: fix comment typo diable -> disable.
    wireless: comment typo fix diable -> disable.
    media: comment typo fix diable -> disable.
    remove doc for obsolete dynamic-printk kernel-parameter
    remove extraneous 'is' from Documentation/iostats.txt
    Fix spelling milisec -> ms in snd_ps3 module parameter description
    Fix spelling mistakes in comments
    Revert conflicting V4L changes
    i7core_edac: fix typos in comments
    mm/rmap.c: fix comment
    sound, ca0106: Fix assignment to 'channel'.
    hrtimer: fix a typo in comment
    init/Kconfig: fix typo
    anon_inodes: fix wrong function name in comment
    fix comment typos concerning "consistent"
    poll: fix a typo in comment
    ...

    Fix up trivial conflicts in:
    - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
    - fs/ext4/ext4.h

    Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.

    Linus Torvalds
     

13 Jan, 2011

1 commit

  • This patch simply allows XFS to handle the hole punching flag in fallocate
    properly. I've tested this with a little program that does a bunch of random
    hole punching with FL_KEEP_SIZE and without it to make sure it does the right
    thing. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     

12 Jan, 2011

7 commits

  • We currently have a global error message buffer in cmn_err that is
    protected by a spin lock that disables interrupts. Recently there
    have been reports of NMI timeouts occurring when the console is
    being flooded by SCSI error reports due to cmn_err() getting stuck
    trying to print to the console while holding this lock (i.e. with
    interrupts disabled). The NMI watchdog is seeing this CPU as
    non-responding and so is triggering a panic. While the trigger for
    the reported case is SCSI errors, pretty much anything that spams
    the kernel log could cause this to occur.

    Realistically the only reason that we have the intermediate message
    buffer is to prepend the correct kernel log level prefix to the log
    message. The only reason we have the lock is to protect the global
    message buffer, and the only reason the message buffer is global is
    to keep it off the stack. Hence if we can avoid needing a global
    message buffer we avoid needing the lock, and we can do this with a
    small amount of cleanup and some preprocessor tricks:

    1. clean up xfs_cmn_err() panic mask functionality to avoid
    needing debug code in xfs_cmn_err()
    2. remove the couple of "!" message prefixes that still exist that
    the existing cmn_err() code steps over.
    3. redefine CE_* levels directly to KERN_*
    4. redefine cmn_err() and friends to use printk() directly
    via variable argument length macros.

    By doing this, we can completely remove the cmn_err() code and the
    lock that is causing the problems, and rely solely on printk()
    serialisation to ensure that we don't get garbled messages.
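
    A simplified sketch of steps 3 and 4 (the actual patch carries
    more levels and the xfs_fs_cmn_err() variants):

        #define CE_NOTE         KERN_NOTICE
        #define CE_WARN         KERN_WARNING
        #define CE_ALERT        KERN_ALERT
        #define CE_PANIC        KERN_EMERG

        /* no buffer, no lock: printk() does its own serialisation */
        #define cmn_err(lvl, fmt, args...) \
                do { printk(lvl fmt "\n", ## args); } while (0)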

    A series of followup patches is really needed to clean up all the
    cmn_err() calls and related messages properly, but that results in a
    series that is not easily back portable to enterprise kernels. Hence
    this initial fix is only to address the direct problem in the lowest
    impact way possible.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • I received a ppc64 bug report involving xfs but the assertion was
    filtered out by the console log level. Use KERN_CRIT to ensure it
    makes it out.

    Signed-off-by: Anton Blanchard
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Anton Blanchard
     
  • In fs/xfs/xfs_trans.c::xfs_trans_unreserve_and_mod_sb() at the out:
    label we have this:
    ASSERT(error = 0);
    I believe a comparison was intended, not an assignment. If I'm
    right, the patch below fixes that up.

    Signed-off-by: Jesper Juhl
    Signed-off-by: Alex Elder

    Jesper Juhl
     
  • If we get an IO error on a synchronous superblock write, we attach an
    error release function to it so that when the last reference goes away
    the release function is called and the buffer is invalidated and
    unlocked. The buffer is left locked until the release function is
    called so that other concurrent users of the buffer will be locked out
    until the buffer error is fully processed.

    Unfortunately, for the superblock buffer the filesystem itself holds a
    reference to the buffer which prevents the reference count from
    dropping to zero and the release function being called. As a result,
    once an IO error occurs on a sync write, the buffer will never be
    unlocked and all future attempts to lock the buffer will hang.

    To make matters worse, this problem is not unique to such buffers;
    if there is a concurrent _xfs_buf_find() running, the lookup will grab
    a reference to the buffer and then wait on the buffer lock, preventing
    the reference count from ever falling to zero and hence unlocking the
    buffer.

    As such, the whole b_relse function implementation is broken because it
    cannot rely on the buffer reference count falling to zero to unlock the
    errored buffer. The synchronous write error path is the only path that
    uses this callback - it is used to ensure that the synchronous waiter
    gets the buffer error before the error state is cleared from the buffer
    by the release function.

    Given that the only synchronous buffer writes now go through xfs_bwrite
    and the error path in question can only occur for a write of a dirty,
    logged buffer, we can move most of the b_relse processing to happen
    inline in xfs_buf_iodone_callbacks, just like a normal I/O completion.
    In addition to that we make sure the error is not cleared in
    xfs_buf_iodone_callbacks, so that xfs_bwrite can reliably check it.
    Given that xfs_bwrite keeps the buffer locked until it has waited for
    it and checked the error, this allows us to reliably propagate the
    error to the caller and to make sure that the buffer is reliably
    unlocked.

    Given that xfs_buf_iodone_callbacks was the only instance of the
    b_relse callback we can remove it entirely.

    Based on earlier patches by Dave Chinner and Ajeet Yadav.

    Signed-off-by: Christoph Hellwig
    Reported-by: Ajeet Yadav
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Allow manual discards from userspace using the FITRIM ioctl. This is not
    intended to be run during normal workloads, as the freespace btree walks
    can cause large performance degradation.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • To ensure the log is covered and the filesystem idles correctly, we
    need to ensure that dummy transactions hit the disk and do not stay
    pinned in memory. If the superblock is pinned in memory, it can't
    be flushed so the log covering cannot make progress. The result is
    dependent on timing - more often than not we continue to issue a
    log covering transaction every 36s rather than idling after ~90s.

    Fix this by making the log covering transaction synchronous. To
    avoid additional log force from xfssyncd, make the log covering
    transaction take the place of the existing log force in the xfssyncd
    background sync process.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • We need to obtain the i_mutex, i_iolock and i_ilock during the read
    and write paths. Add a set of wrapper functions to neatly
    encapsulate the lock ordering and shared/exclusive semantics to make
    the locking easier to follow and get right.

    Note that this changes some of the exclusive locking serialisation in
    that serialisation will occur against the i_mutex instead of the
    XFS_IOLOCK_EXCL. This does not change any behaviour, and it is
    arguably more efficient to use the mutex for such serialisation than
    the rw_sem.
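
    One of the wrappers looks roughly like this (treat it as a sketch
    rather than the verbatim patch):

        static inline void
        xfs_rw_ilock(struct xfs_inode *ip, int type)
        {
                /* i_mutex nests outside the XFS inode locks */
                if (type & XFS_IOLOCK_EXCL)
                        mutex_lock(&VFS_I(ip)->i_mutex);
                xfs_ilock(ip, type);
        }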

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

11 Jan, 2011

8 commits

  • This merge pulls the XFS master branch into the latest Linus master.
    This results in a merge conflict whose best fix is not obvious.
    I manually fixed the conflict, in "fs/xfs/xfs_iget.c".

    Dave Chinner had done work that resulted in RCU freeing of inodes
    separate from what Nick Piggin had done, and their results differed
    slightly in xfs_inode_free(). The fix updates Nick's call_rcu()
    with the use of VFS_I(), while incorporating needed updates to some
    XFS inode fields implemented in Dave's series. Dave's RCU callback
    function has also been removed.

    Signed-off-by: Alex Elder

    Alex Elder
     
  • When two concurrent unaligned, non-overlapping direct IOs are issued
    to the same block, the direct IO layer will race to zero the block.
    The result is that one of the concurrent IOs will overwrite data
    written by the other IO with zeros. This is demonstrated by the
    xfsqa test 240.

    To avoid this problem, serialise all unaligned direct IOs to an
    inode with a big hammer. We need a big hammer approach as we need to
    serialise AIO as well, so we can't just block writes on locks.
    Hence, the big hammer is calling xfs_ioend_wait() while holding out
    other unaligned direct IOs from starting.

    We don't bother trying to serialise aligned vs unaligned IOs, as
    they are overlapping IO and the result of concurrent overlapping IOs
    is undefined - the result of either IO is a valid result, so we let
    them race. Hence we only penalise unaligned IO, which already has a
    major overhead compared to aligned IO, so this isn't a major problem.
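
    The detection and the hammer, sketched (m_blockmask is the
    filesystem block mask; the surrounding locking is elided):

        /* unaligned if the IO starts or ends inside a block */
        int unaligned_io = (pos & mp->m_blockmask) ||
                           (count & mp->m_blockmask);

        if (unaligned_io) {
                /* drain all in-flight ioends before proceeding */
                xfs_ioend_wait(ip);
        }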

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • The buffered IO and direct IO write paths share a common set of
    checks and limiting code prior to issuing the write. Factor that
    into a common helper function.

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • Complete the split of the different write IO paths by splitting the
    buffered IO write path out of xfs_file_aio_write(). This makes the
    different mechanisms of the write paths easier to follow.

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • The current xfs_file_aio_write code is a mess of locking shenanigans
    to handle the different locking requirements of buffered and direct
    IO. Start to clean this up by disentangling the direct IO path from
    the mess.

    This also removes the failed direct IO fallback path to buffered IO.
    XFS handles all direct IO cases without needing to fall back to
    buffered IO, so we can safely remove this unused path. This greatly
    simplifies the logic and locking needed in the write path.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • xfs_file_aio_write() only returns the error from synchronous
    flushing of the data and inode if error == 0. At the point where
    error is being checked, it is guaranteed to be > 0. Therefore any
    errors returned by the data or fsync flush will never be returned.
    Fix the checks so we overwrite the current return value only when
    a flush error really occurred.
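
    The corrected pattern, sketched with illustrative names (XFS uses
    positive errnos internally, hence the negation):

        error2 = flush_data_and_inode(ip);      /* hypothetical helper */
        if (error2)
                ret = -error2;  /* don't mask it with the byte count */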

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

08 Jan, 2011

1 commit

  • * 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits)
    usb: don't use flush_scheduled_work()
    speedtch: don't abuse struct delayed_work
    media/video: don't use flush_scheduled_work()
    media/video: explicitly flush request_module work
    ioc4: use static work_struct for ioc4_load_modules()
    init: don't call flush_scheduled_work() from do_initcalls()
    s390: don't use flush_scheduled_work()
    rtc: don't use flush_scheduled_work()
    mmc: update workqueue usages
    mfd: update workqueue usages
    dvb: don't use flush_scheduled_work()
    leds-wm8350: don't use flush_scheduled_work()
    mISDN: don't use flush_scheduled_work()
    macintosh/ams: don't use flush_scheduled_work()
    vmwgfx: don't use flush_scheduled_work()
    tpm: don't use flush_scheduled_work()
    sonypi: don't use flush_scheduled_work()
    hvsi: don't use flush_scheduled_work()
    xen: don't use flush_scheduled_work()
    gdrom: don't use flush_scheduled_work()
    ...

    Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c
    as per Tejun.

    Linus Torvalds
     

07 Jan, 2011

3 commits

  • This simple implementation just checks for no ACLs on the inode, and
    if so, then the rcu-walk may proceed, otherwise fail it.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Signed-off-by: Nick Piggin

    Nick Piggin
     
  • RCU free the struct inode. This will allow:

    - Subsequent store-free path walking patch. The inode must be
      consulted for permissions when walking, so an RCU inode reference
      is a must.
    - sb_inode_list_lock to be moved inside i_lock because sb list
      walkers who want to take i_lock no longer need to take
      sb_inode_list_lock to walk the list in the first place. This will
      simplify and optimize locking.
    - Could remove some nested trylock loops in dcache code.
    - Could potentially simplify things a bit in VM land. Do not need
      to take the page lock to follow page->mapping.
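
    The freeing pattern this enables, sketched (close in spirit to the
    fs/inode.c change, but simplified):

        static void i_callback(struct rcu_head *head)
        {
                struct inode *inode =
                        container_of(head, struct inode, i_rcu);
                kmem_cache_free(inode_cachep, inode);
        }

        static void destroy_inode(struct inode *inode)
        {
                /* walkers under rcu_read_lock() remain safe until then */
                call_rcu(&inode->i_rcu, i_callback);
        }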

    The downside of this is the performance cost of using RCU. In a simple
    creat/unlink microbenchmark, performance drops by about 10% due to the
    inability to reuse cache-hot slab objects. As iterations increase and
    RCU freeing starts kicking over, this increases to about 20%.

    In cases where inode lifetimes are longer (i.e. many inodes may be
    allocated during the average life span of a single inode), a lot of
    this cache reuse is not applicable, so the regression caused by this
    patch is smaller.

    The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
    however this adds some complexity to list walking and store-free path walking,
    so I prefer to implement this at a later date, if it is shown to be a win in
    real situations. I haven't found a regression in any non-micro benchmark so I
    doubt it will be a problem.

    Signed-off-by: Nick Piggin

    Nick Piggin
     

23 Dec, 2010

1 commit


21 Dec, 2010

2 commits

  • The only thing that the grant lock remains to protect is the grant head
    manipulations when adding or removing space from the log. These calculations
    are already based on atomic variables, so we can already update them safely
    without locks. However, the grant head manipulations require atomic
    multi-step calculations to be executed, which the algorithms currently
    don't allow.

    To make these multi-step calculations atomic, convert the algorithms to
    compare-and-exchange loops on the atomic variables. That is, we sample the old
    value, perform the calculation and use atomic64_cmpxchg() to attempt to update
    the head with the new value. If the head has not changed since we sampled it,
    it will succeed and we are done. Otherwise, we rerun the calculation again from
    a new sample of the head.
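
    One update loop, sketched (the head and the calculation helper are
    illustrative):

        int64_t old, new;

        do {
                old = atomic64_read(&log->l_grant_write_head);
                new = grant_head_add(old, bytes);   /* hypothetical */
                /* retry if another CPU moved the head under us */
        } while (atomic64_cmpxchg(&log->l_grant_write_head,
                                  old, new) != old);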

    This allows us to remove the grant lock from around all the grant head
    space manipulations, which effectively removes the grant lock from the
    log. Hence we can remove the grant lock completely at this point.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • The log grant ticket wait queues are currently protected by the log
    grant lock. However, the queues are functionally independent from
    each other, and operations on them only require serialisation
    against other queue operations now that all of the other log
    variables they use are atomic values.

    Hence, we can make them independent of the grant lock by introducing
    new locks just to protect the list operations. Because the lists
    are independent, we can use a lock per list and ensure that reserve
    and write head queuing do not contend.

    To ensure forced shutdowns work correctly in conjunction with the
    new fast paths, ensure that we check whether the log has been shut
    down in the grant functions once we hold the relevant spin locks but
    before we go to sleep. This is needed to co-ordinate correctly with
    the wakeups that are issued on the ticket queues so we don't leave
    any processes sleeping on the queues during a shutdown.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

15 Dec, 2010

1 commit

  • cancel_rearming_delayed_work[queue]() has been superseded by
    cancel_delayed_work_sync() quite some time ago. Convert all the
    in-kernel users. The conversions are completely equivalent and
    trivial.

    Signed-off-by: Tejun Heo
    Acked-by: "David S. Miller"
    Acked-by: Greg Kroah-Hartman
    Acked-by: Evgeniy Polyakov
    Cc: Jeff Garzik
    Cc: Benjamin Herrenschmidt
    Cc: Mauro Carvalho Chehab
    Cc: netdev@vger.kernel.org
    Cc: Anton Vorontsov
    Cc: David Woodhouse
    Cc: "J. Bruce Fields"
    Cc: Neil Brown
    Cc: Alex Elder
    Cc: xfs-masters@oss.sgi.com
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Andrew Morton
    Cc: netfilter-devel@vger.kernel.org
    Cc: Trond Myklebust
    Cc: linux-nfs@vger.kernel.org

    Tejun Heo