26 Jan, 2012

1 commit


18 Jan, 2012

8 commits

  • With all the size field updates out of the way xfs_file_aio_write can
    be further simplified by pushing all iolock handling into
    xfs_file_dio_aio_write and xfs_file_buffered_aio_write and using
    the generic generic_write_sync helper for synchronous writes.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • While xfs_iunlock is fine with 0 lockflags the calling conventions are much
    cleaner if xfs_file_aio_write_checks never returns without the iolock held.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Now that we use the VFS i_size field throughout XFS there is no need for the
    i_new_size field any more given that the VFS i_size field gets updated
    in ->write_end before unlocking the page, and thus is always uptodate when
    writeback could see a page. Removing i_new_size also has the advantage that
    we will never have to trim back di_size during a failed buffered write,
    given that it never gets updated past i_size.

    Note that currently the generic direct I/O code only updates i_size after
    calling our end_io handler, which requires a small workaround to make
    sure di_size actually makes it to disk. I hope to fix this properly in
    the generic code.

    A downside is that we lose the support for parallel non-overlapping O_DIRECT
    appending writes that recently was added. I don't think keeping the complex
    and fragile i_new_size infrastructure for this is a good tradeoff - if we
    really care about parallel appending writers we should investigate turning
    the iolock into a range lock, which would also allow for parallel
    non-overlapping buffered writers.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • There is no fundamental need to keep an in-memory inode size copy in the XFS
    inode. We already have the on-disk value in the dinode, and the separate
    in-memory copy that we need for regular files only in the XFS inode.

    Remove the xfs_inode i_size field and change the XFS_ISIZE macro to use the
    VFS inode i_size field for regular files. Switch code that was directly
    accessing the i_size field in the xfs_inode to XFS_ISIZE, or in cases where
    we are limited to regular files direct access of the VFS inode i_size field.

    This also allows dropping some fairly complicated code in the write path
    which dealt with keeping the xfs_inode i_size uptodate with the VFS i_size
    that is getting updated inside ->write_end.

    Note that we do not bother resetting the VFS i_size when truncating a file
    that gets freed to zero as there is no point in doing so because the VFS inode
    is no longer in use at this point. Just relax the assert in xfs_ifree to
    only check the on-disk size instead.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Replace i_pin_wait, which is only used during synchronous inode flushing
    with a bit waitqueue. This trades off a much smaller inode against
    slightly slower wakeup performance, and saves 12 (32-bit) or 20 (64-bit)
    bytes in the XFS inode.

    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • We almost never block on i_flock, the exception is synchronous inode
    flushing. Instead of bloating the inode with a 16/24-byte completion
    that we abuse as a semaphore just implement it as a bitlock that uses
    a bit waitqueue for the rare sleeping path. This primarily is a
    tradeoff between a much smaller inode and a faster non-blocking
    path vs faster wakeups, and we are much better off with the former.

    A small downside is that we will lose lockdep checking for i_flock, but
    given that it's always taken inside the ilock that should be acceptable.

    Note that for example the inode writeback locking is implemented in a
    very similar way.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • To be used for bit wakeup i_flags needs to be an unsigned long or we'll
    run into trouble on big endian systems. Because of the 1-byte i_update
    field right after it this actually causes a fairly large size increase
    on its own (4 or 8 bytes), but that increase will be more than offset
    by the next two patches.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • We spent a lot of effort to maintain this field, but it always equals to the
    fork size divided by the constant size of an extent. The prime use of it is
    to assert that the two stay in sync. Just divide the fork size by the extent
    size in the few places that we actually use it and remove the overhead
    of maintaining it. Also introduce a few helpers to consolidate the places
    where we actually care about the value.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

14 Jan, 2012

3 commits

  • .. and the just as dead bhv_desc forward declaration while we're at it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Replace the nasty if, else if, elseif condition with more natural C flow
    that expressed the logic we want here better.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • This wrapper isn't overly useful, not to say rather confusing.

    Around the call to xfs_itruncate_extents it does:

    - add tracing
    - add a few asserts in debug builds
    - conditionally update the inode size in two places
    - log the inode

    Both the tracing and the inode logging can be moved to xfs_itruncate_extents
    as they are useful for the attribute fork as well - in fact the attr code
    already does an equivalent xfs_trans_log_inode call just after calling
    xfs_itruncate_extents. The conditional size updates are a mess, and there
    was no reason to do them in two places anyway, as the first one was
    conditional on the inode having extents - but without extents we
    xfs_itruncate_extents would be a no-op and the placement wouldn't matter
    anyway. Instead move the size assignments and the asserts that make sense
    to the callers that want it.

    As a side effect of this clean up xfs_setattr_size by introducing variables
    for the old and new inode size, and moving the size updates into a common
    place.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

10 Jan, 2012

1 commit


09 Jan, 2012

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (53 commits)
    Kconfig: acpi: Fix typo in comment.
    misc latin1 to utf8 conversions
    devres: Fix a typo in devm_kfree comment
    btrfs: free-space-cache.c: remove extra semicolon.
    fat: Spelling s/obsolate/obsolete/g
    SCSI, pmcraid: Fix spelling error in a pmcraid_err() call
    tools/power turbostat: update fields in manpage
    mac80211: drop spelling fix
    types.h: fix comment spelling for 'architectures'
    typo fixes: aera -> area, exntension -> extension
    devices.txt: Fix typo of 'VMware'.
    sis900: Fix enum typo 'sis900_rx_bufer_status'
    decompress_bunzip2: remove invalid vi modeline
    treewide: Fix comment and string typo 'bufer'
    hyper-v: Update MAINTAINERS
    treewide: Fix typos in various parts of the kernel, and fix some comments.
    clockevents: drop unknown Kconfig symbol GENERIC_CLOCKEVENTS_MIGR
    gpio: Kconfig: drop unknown symbol 'CS5535_GPIO'
    leds: Kconfig: Fix typo 'D2NET_V2'
    sound: Kconfig: drop unknown symbol ARCH_CLPS7500
    ...

    Fix up trivial conflicts in arch/powerpc/platforms/40x/Kconfig (some new
    kconfig additions, close to removed commented-out old ones)

    Linus Torvalds
     
  • * 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (76 commits)
    PM / Hibernate: Implement compat_ioctl for /dev/snapshot
    PM / Freezer: fix return value of freezable_schedule_timeout_killable()
    PM / shmobile: Allow the A4R domain to be turned off at run time
    PM / input / touchscreen: Make st1232 use device PM QoS constraints
    PM / QoS: Introduce dev_pm_qos_add_ancestor_request()
    PM / shmobile: Remove the stay_on flag from SH7372's PM domains
    PM / shmobile: Don't include SH7372's INTCS in syscore suspend/resume
    PM / shmobile: Add support for the sh7372 A4S power domain / sleep mode
    PM: Drop generic_subsys_pm_ops
    PM / Sleep: Remove forward-only callbacks from AMBA bus type
    PM / Sleep: Remove forward-only callbacks from platform bus type
    PM: Run the driver callback directly if the subsystem one is not there
    PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
    PM/Devfreq: Add Exynos4-bus device DVFS driver for Exynos4210/4212/4412.
    PM / Sleep: Merge internal functions in generic_ops.c
    PM / Sleep: Simplify generic system suspend callbacks
    PM / Hibernate: Remove deprecated hibernation snapshot ioctls
    PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
    ARM: S3C64XX: Implement basic power domain support
    PM / shmobile: Use common always on power domain governor
    ...

    Fix up trivial conflict in fs/xfs/xfs_buf.c due to removal of unused
    XBT_FORCE_SLEEP bit

    Linus Torvalds
     
  • * 'for-linus' of git://oss.sgi.com/xfs/xfs: (22 commits)
    xfs: mark the xfssyncd workqueue as non-reentrant
    xfs: simplify xfs_qm_detach_gdquots
    xfs: fix acl count validation in xfs_acl_from_disk()
    xfs: remove unused XBT_FORCE_SLEEP bit
    xfs: remove XFS_QMOPT_DQSUSER
    xfs: kill xfs_qm_idtodq
    xfs: merge xfs_qm_dqinit_core into the only caller
    xfs: add a xfs_dqhold helper
    xfs: simplify xfs_qm_dqattach_grouphint
    xfs: nest qm_dqfrlist_lock inside the dquot qlock
    xfs: flatten the dquot lock ordering
    xfs: implement lazy removal for the dquot freelist
    xfs: remove XFS_DQ_INACTIVE
    xfs: cleanup xfs_qm_dqlookup
    xfs: cleanup dquot locking helpers
    xfs: remove the sync_mode argument to xfs_qm_dqflush_all
    xfs: remove xfs_qm_sync
    xfs: make sure to really flush all dquots in xfs_qm_quotacheck
    xfs: untangle SYNC_WAIT and SYNC_TRYLOCK meanings for xfs_qm_dqflush
    xfs: remove the lid_size field in struct log_item_desc
    ...

    Fix up trivial conflict in fs/xfs/xfs_sync.c

    Linus Torvalds
     

07 Jan, 2012

1 commit


04 Jan, 2012

8 commits


24 Dec, 2011

2 commits

  • Since Linux 2.6.36 the writeback code has introduces various measures for
    live lock prevention during sync(). Unfortunately some of these are
    actively harmful for the XFS model, where the inode gets marked dirty for
    metadata from the data I/O handler.

    The older_than_this checks that are now more strictly enforced since

    writeback: avoid livelocking WB_SYNC_ALL writeback

    by only calling into __writeback_inodes_sb and thus only sampling the
    current cut off time once. But on a slow enough devices the previous
    asynchronous sync pass might not have fully completed yet, and thus XFS
    might mark metadata dirty only after that sampling of the cut off time for
    the blocking pass already happened. I have not myself reproduced this
    myself on a real system, but by introducing artificial delay into the
    XFS I/O completion workqueues it can be reproduced easily.

    Fix this by iterating over all XFS inodes in ->sync_fs and log all that
    are dirty. This might log inode that only got redirtied after the
    previous pass, but given how cheap delayed logging of inodes is it
    isn't a major concern for performance.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Tested-by: Mark Tinguely
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • If the writeback code writes back an inode because it has expired we currently
    use the non-blockin ->write_inode path. This means any inode that is pinned
    is skipped. With delayed logging and a workload that has very little log
    traffic otherwise it is very likely that an inode that gets constantly
    written to is always pinned, and thus we keep refusing to write it. The VM
    writeback code at that point redirties it and doesn't try to write it again
    for another 30 seconds. This means under certain scenarious time based
    metadata writeback never happens.

    Fix this by calling into xfs_log_inode for kupdate in addition to data
    integrity syncs, and thus transfer the inode to the log ASAP.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Tested-by: Mark Tinguely
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

22 Dec, 2011

1 commit

  • * master: (848 commits)
    SELinux: Fix RCU deref check warning in sel_netport_insert()
    binary_sysctl(): fix memory leak
    mm/vmalloc.c: remove static declaration of va from __get_vm_area_node
    ipmi_watchdog: restore settings when BMC reset
    oom: fix integer overflow of points in oom_badness
    memcg: keep root group unchanged if creation fails
    nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
    nilfs2: unbreak compat ioctl
    cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
    evm: prevent racing during tfm allocation
    evm: key must be set once during initialization
    mmc: vub300: fix type of firmware_rom_wait_states module parameter
    Revert "mmc: enable runtime PM by default"
    mmc: sdhci: remove "state" argument from sdhci_suspend_host
    x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT
    IB/qib: Correct sense on freectxts increment and decrement
    RDMA/cma: Verify private data length
    cgroups: fix a css_set not found bug in cgroup_attach_proc
    oprofile: Fix uninitialized memory access when writing to writing to oprofilefs
    Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
    ...

    Conflicts:
    kernel/cgroup_freezer.c

    Rafael J. Wysocki
     

20 Dec, 2011

1 commit

  • On a system with lots of memory pressure that is stuck on synchronous inode
    reclaim the workqueue code will run one instance of the inode reclaim work
    item on every CPU. which is not what we want. Make sure to mark the
    xfssyncd workqueue as non-reentrant to make sure there only is one instace
    of each running globally. Also stop using special paramater for the
    workqueue; now that we guarantee each fs has only running one of each works
    at a time there is no need to artificially lower max_active and compensate
    for that by setting the WQ_CPU_INTENSIVE flag.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

17 Dec, 2011

3 commits


16 Dec, 2011

5 commits


15 Dec, 2011

2 commits

  • Allow xfs_qm_dqput to work without trylock loops by nesting the freelist lock
    inside the dquot qlock. In turn that requires trylocks in the reclaim path
    instead, but given it's a classic tradeoff between fast and slow path, and
    we follow the model of the inode and dentry caches.

    Document our new lock order now that it has settled.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Introduce a new XFS_DQ_FREEING flag that tells lookup and mplist walks
    to skip a dquot that is beeing freed, and use this avoid the trylock
    on the hash and mplist locks in xfs_qm_dqreclaim_one. Also simplify
    xfs_dqpurge by moving the inodes to a dispose list after marking them
    XFS_DQ_FREEING and avoid the locker ordering constraints.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

14 Dec, 2011

1 commit

  • Do not remove dquots from the freelist when we grab a reference to them in
    xfs_qm_dqlookup, but leave them on the freelist util scanning notices that
    they have a reference. This speeds up the lookup fastpath, and greatly
    simplifies the lock ordering constraints. Note that the same scheme is
    used by the VFS inode and dentry caches.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig