21 Jul, 2011

30 commits

  • Btrfs needs to be able to control how filemap_write_and_wait_range() is called
    in fsync to make it less of a painful operation, so push taking i_mutex and
    the call to filemap_write_and_wait() down into the ->fsync() handlers. Some
    file systems, ext3 and ocfs2 among them, can seemingly drop taking i_mutex
    altogether. For correctness' sake I just pushed everything down in all cases
    to keep the current behavior the same for everybody; each individual fs
    maintainer can then make up their mind about what to do from there. A sketch
    of the resulting handler shape follows below.
    Thanks,

    Acked-by: Jan Kara
    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
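
    A minimal sketch of the pushed-down shape, assuming the range-based
    handler prototype and a hypothetical myfs_sync_inode() helper for the
    fs-specific work:

        static int myfs_fsync(struct file *file, loff_t start, loff_t end,
                              int datasync)
        {
                struct inode *inode = file->f_mapping->host;
                int ret;

                /* write-out is now under filesystem control */
                ret = filemap_write_and_wait_range(file->f_mapping, start, end);
                if (ret)
                        return ret;

                /* i_mutex is taken here instead of by the VFS caller */
                mutex_lock(&inode->i_mutex);
                ret = myfs_sync_inode(inode, datasync); /* hypothetical */
                mutex_unlock(&inode->i_mutex);
                return ret;
        }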
     
  • Fix up a few ->llseek() implementations that won't deal with SEEK_HOLE/SEEK_DATA
    properly. Make them future-proof so that if we ever add new options they will
    return -EINVAL, as in the sketch below. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
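
    A sketch of the future-proof pattern, handling the whence values the
    code understands and rejecting anything unknown:

        static loff_t example_llseek(struct file *file, loff_t offset, int whence)
        {
                switch (whence) {
                case SEEK_SET:
                case SEEK_CUR:
                case SEEK_END:
                        return generic_file_llseek(file, offset, whence);
                default:
                        /* SEEK_HOLE/SEEK_DATA and any future whence values */
                        return -EINVAL;
                }
        }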
     
  • This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases
    we just return -EINVAL, in others we do the normal generic thing, and in others
    we're simply making sure that the proper due diligence is done. For example
    in NFS/CIFS we need to make sure the file size is updated properly for the
    SEEK_HOLE and SEEK_DATA cases, but since they call the generic llseek code
    themselves that is all we have to do. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     
  • Since Ext4 has its own lseek we need to make sure it handles
    SEEK_HOLE/SEEK_DATA. For now just do the same thing that is done in the generic
    case, somebody else can come along and make it do fancy things later. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     
  • In order to handle SEEK_HOLE/SEEK_DATA we need to implement our own llseek.
    Basically for the normal SEEK_*'s we will just defer to the generic helper, and
    for SEEK_HOLE/SEEK_DATA we will use our fiemap helper to figure out the nearest
    hole or data (the dispatch is sketched below). Currently this helper doesn't
    check for delalloc bytes for prealloc space, so for now treat prealloc as data
    until that is fixed. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
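
    A rough sketch of that dispatch; find_desired_extent() stands in for the
    fiemap-based helper and is not spelled out here:

        static loff_t btrfs_llseek_sketch(struct file *file, loff_t offset,
                                          int whence)
        {
                struct inode *inode = file->f_mapping->host;
                int ret;

                switch (whence) {
                case SEEK_SET:
                case SEEK_CUR:
                case SEEK_END:
                        /* plain seeks defer to the generic helper */
                        return generic_file_llseek(file, offset, whence);
                case SEEK_HOLE:
                case SEEK_DATA:
                        /* walk extents to find the nearest hole or data */
                        ret = find_desired_extent(inode, &offset, whence);
                        if (ret)
                                return ret;
                        if (offset > inode->i_sb->s_maxbytes)
                                return -EINVAL;
                        file->f_pos = offset;
                        return offset;
                default:
                        return -EINVAL;
                }
        }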
     
  • This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags. Turns out
    using fiemap in things like cp causes more problems than it solves, so let's try
    to give userspace an interface that doesn't suck. We need to match Solaris
    here, and the definitions are:

      o If whence is SEEK_HOLE, the offset of the start of the
        next hole greater than or equal to the supplied offset
        is returned. The definition of a hole is provided near
        the end of the DESCRIPTION.

      o If whence is SEEK_DATA, the file pointer is set to the
        start of the next non-hole file region greater than or
        equal to the supplied offset.

    So in the generic case the entire file is data and there is a virtual hole at
    the end. That means we will just return i_size for SEEK_HOLE and will return
    the same offset for SEEK_DATA. This is how Solaris does it, so we have to do it
    the same way (see the sketch below).

    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
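
    The generic fallback then reduces to something like this, assuming
    offsets at or beyond EOF fail with -ENXIO as on Solaris:

        /* inside ->llseek(), after the usual SEEK_SET/CUR/END handling */
        switch (whence) {
        case SEEK_DATA:
                /* the whole file is data, so any in-range offset is data */
                if (offset >= i_size_read(inode))
                        return -ENXIO;
                break;
        case SEEK_HOLE:
                /* one virtual hole at EOF: report i_size */
                if (offset >= i_size_read(inode))
                        return -ENXIO;
                offset = i_size_read(inode);
                break;
        }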
     
  • Change the default reiserfs mount option to barrier=flush. Based on a patch
    from Jeff Mahoney in the SuSE tree.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • This patch turns on barriers by default for ext3. mount -o barrier=0
    will turn them off. Based on a patch from Chris Mason in the SuSE tree.

    Signed-off-by: Chris Mason
    Signed-off-by: Christoph Hellwig
    Acked-by: Eric Sandeen
    Acked-by: Jan Kara
    Acked-by: Jeff Mahoney
    Acked-by: Ted Ts'o
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • we can find the superblock more easily, TYVM...

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • its first argument is const char * and it's really not modified...

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Moving the event counter into the dynamically allocated 'struct seq_file'
    allows poll() support without each user needing to allocate its own tracking
    structure (see the sketch below).

    All current users are switched over to use the new counter.

    Requested-by: Andrew Morton akpm@linux-foundation.org
    Acked-by: NeilBrown
    Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi
    Signed-off-by: Kay Sievers
    Signed-off-by: Al Viro

    Kay Sievers
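
    A sketch of the pattern, modelled on /proc/mounts; the event source
    (example_event/example_wait) is hypothetical, while poll_event is the
    new counter in struct seq_file:

        /* some event source: 'example_event' is bumped and the queue
           woken whenever the underlying data changes */
        static DECLARE_WAIT_QUEUE_HEAD(example_wait);
        static int example_event;

        static unsigned int example_poll(struct file *file, poll_table *pt)
        {
                struct seq_file *m = file->private_data;
                unsigned int res = POLLIN | POLLRDNORM;

                poll_wait(file, &example_wait, pt);
                if (m->poll_event != example_event) {
                        /* remember what this reader has seen */
                        m->poll_event = example_event;
                        res |= POLLERR | POLLPRI;
                }
                return res;
        }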
     
  • For filesystems that delay their end_io processing we should keep our
    i_dio_count until the processing is done. Enable this by moving
    the inode_dio_done call to the end_io handler if one exists (sketched
    below). Note that
    the actual move to the workqueue for ext4 and XFS is not done in
    this patch yet, but left to the filesystem maintainers. At least
    for XFS it's not needed yet either as XFS has an internal equivalent
    to i_dio_count.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
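
    A sketch of the end_io side (names hypothetical); the point is that
    inode_dio_done() now runs after the deferred completion work instead of
    at bio completion:

        static void myfs_end_io(struct kiocb *iocb, loff_t offset,
                                ssize_t size, void *private, int ret,
                                bool is_async)
        {
                struct inode *inode = iocb->ki_filp->f_mapping->host;

                /* deferred completion work, e.g. unwritten extent
                   conversion, would go here */

                /* drop the count taken when the direct I/O started */
                inode_dio_done(inode);
        }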
     
  • Simple filesystems always pass inode->i_sb->s_bdev as the block device
    argument, and never need an end_io handler. Let's simplify things for
    them and for my grepping activity by dropping these arguments. The
    only thing not falling into that scheme is ext4, which passes an
    end_io handler without needing special flags (yet), but given how
    messy the direct I/O code there is, using __blockdev_direct_IO
    in one instead of two out of three cases isn't going to make a large
    difference anyway. The simplified calling convention is sketched below.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
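
    For a simple filesystem the resulting ->direct_IO is roughly the
    following, with myfs_get_block standing in for the fs's get_block
    routine:

        static ssize_t myfs_direct_IO(int rw, struct kiocb *iocb,
                                      const struct iovec *iov, loff_t offset,
                                      unsigned long nr_segs)
        {
                struct inode *inode = iocb->ki_filp->f_mapping->host;

                /* no bdev argument (it is derived from the inode) and
                   no end_io handler needed */
                return blockdev_direct_IO(rw, iocb, inode, iov, offset,
                                          nr_segs, myfs_get_block);
        }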
     
  • Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
    This allows these filesystems to also protect truncate against direct I/O
    requests by using common code. Right now the only non-DIO_LOCKING filesystem
    that appears to do so is XFS, which uses an opencoded variant of the
    i_dio_count scheme.

    Behaviour doesn't change for filesystems never calling inode_dio_wait.
    For ext4 behaviour changes when using the dioread_nolock option, which
    previously was missing any protection between truncate and direct I/O reads.
    For ocfs2 the handcrafted i_dio_count manipulations are replaced with
    the common code now enabled.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Let filesystems handle waiting for direct I/O requests themselves instead
    of doing it beforehand. This means filesystem-specific locks to prevent
    new dio references from appearing can be held. This is important to allow
    generalizing i_dio_count to non-DIO_LOCKING filesystems.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Now that the last user is gone these can be removed.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • i_alloc_sem is a rather special rw_semaphore. It's the last one that may
    be released by a non-owner, and its write side is always mirrored by
    real exclusion. Its intended use is to wait for all pending direct I/O
    requests to finish before starting a truncate.

    Replace it with a hand-grown construct (sketched below):

    - exclusion for truncates is already guaranteed by i_mutex, so it can
      simply fall away
    - the reader side is replaced by an i_dio_count member in struct inode
      that counts the number of pending direct I/O requests. Truncate can't
      proceed as long as it's non-zero
    - when i_dio_count reaches zero we wake up a pending truncate using
      wake_up_bit on a new bit in i_flags
    - new references to i_dio_count can't appear while we are waiting for
      it to read zero because the direct I/O count always needs i_mutex
      (or an equivalent like XFS's i_iolock) for starting a new operation.

    This scheme is much simpler, and saves the space of a spinlock_t and a
    struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
    system).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
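
    A condensed sketch of the construct; helper and bit names are
    illustrative (mainline calls the helpers inode_dio_wait() and
    inode_dio_done()), and the wakeup bit lives in i_state here:

        static void dio_start(struct inode *inode)
        {
                /* callers hold i_mutex or an equivalent, so new counts
                   can't race with a waiting truncate */
                atomic_inc(&inode->i_dio_count);
        }

        static void dio_done(struct inode *inode)
        {
                if (atomic_dec_and_test(&inode->i_dio_count))
                        wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
        }

        /* truncate side: wait for all pending direct I/O to drain */
        static void dio_wait(struct inode *inode)
        {
                wait_queue_head_t *wq = bit_waitqueue(&inode->i_state,
                                                      __I_DIO_WAKEUP);
                DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP);

                while (atomic_read(&inode->i_dio_count)) {
                        prepare_to_wait(wq, &q.wait, TASK_UNINTERRUPTIBLE);
                        if (atomic_read(&inode->i_dio_count))
                                schedule();
                        finish_wait(wq, &q.wait);
                }
        }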
     
  • Reject zero-sized reads as soon as we know our I/O length, and don't
    bother with locks or allocations that might have to be cleaned up
    otherwise.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Rewrite ext4_page_mkwrite() to use the __block_page_mkwrite() helper (a
    skeletal sketch follows below). This removes the need to use i_alloc_sem to
    avoid races with truncate, which seems to be the wrong locking order according
    to the lock ordering documented in mm/rmap.c. Also, calling
    ext4_da_write_begin() as the old code did seems to be problematic because we
    can decide to flush delay-allocated blocks, which will acquire the s_umount
    semaphore - again creating an unpleasant lock dependency if not directly a
    deadlock.

    Also add a check for frozen filesystem so that we don't busyloop in page fault
    when the filesystem is frozen.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
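
    A skeletal sketch of the new shape; the delalloc and journalling details
    are elided, and the frozen-fs check shown assumes the old-style
    vfs_check_frozen() interface of this era:

        static int ext4_page_mkwrite_sketch(struct vm_area_struct *vma,
                                            struct vm_fault *vmf)
        {
                struct inode *inode = vma->vm_file->f_mapping->host;
                int ret;

                /* don't busyloop faulting against a frozen filesystem */
                vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);

                /* locks the page, verifies it is still mapped, and runs
                   get_block on it */
                ret = __block_page_mkwrite(vma, vmf, ext4_get_block);
                return block_page_mkwrite_return(ret);
        }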
     
  • Add a new rw_semaphore to protect bmap against truncate (see the sketch
    below). Previously i_alloc_sem was abused for this, but it's going away in
    this series.

    Note that we can't simply use i_mutex, given that the swapon code
    calls ->bmap under it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
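
    The pattern, with hypothetical myfs names since the commit text doesn't
    name the filesystem:

        static sector_t myfs_bmap(struct address_space *mapping,
                                  sector_t block)
        {
                struct myfs_inode_info *ei = MYFS_I(mapping->host);
                sector_t ret;

                /* readers are bmap, the writer is truncate */
                down_read(&ei->truncate_lock);
                ret = generic_block_bmap(mapping, block, myfs_get_block);
                up_read(&ei->truncate_lock);
                return ret;
        }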
     
  • The flags parameter went away in
    d749519b444db985e40b897f73ce1898b11f997e

    Signed-off-by: Tobias Klauser
    Signed-off-by: Al Viro

    Tobias Klauser
     
  • The forward declaration of struct file_operations is
    added to avoid compilation warnings.

    Signed-off-by: Tomasz Stanislawski
    Signed-off-by: Al Viro

    Tomasz Stanislawski
     
  • Convert the inode reclaim shrinker to use the new per-sb shrinker
    operations. This allows much bigger reclaim batches to be used, and
    allows the XFS inode cache to be shrunk in proportion with the VFS
    dentry and inode caches. This avoids the problem of the VFS caches
    being shrunk significantly before the XFS inode cache is shrunk,
    resulting in imbalances in the caches during reclaim.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Now that the per-sb shrinker is responsible for shrinking 2 or more
    caches, increase the batch size to keep economies of scale for
    shrinking each cache. Increase the shrinker batch size to 1024
    objects.

    To allow for a large increase in batch size, add a conditional
    reschedule to prune_icache_sb() so that we don't hold the LRU spin
    lock for too long (see the sketch below). This mirrors the behaviour of
    __shrink_dcache_sb(), and allows us to increase the batch size
    without needing to worry about problems caused by long lock hold
    times.

    To ensure that filesystems using the per-sb shrinker callouts don't
    cause problems, document that the object freeing method must
    reschedule appropriately inside loops.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
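
    The reschedule point amounts to something like this inside the LRU walk,
    using the per-sb lock name from earlier in the series:

        for (nr_scanned = nr_to_scan; nr_scanned >= 0; nr_scanned--) {
                /* ... isolate and dispose of one unused inode ... */

                /* with 1024-object batches, drop the lock and yield
                   if someone else needs the CPU */
                cond_resched_lock(&sb->s_inode_lru_lock);
        }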
     
  • Now we have a per-superblock shrinker implementation, we can add a
    filesystem specific callout to it to allow filesystem internal
    caches to be shrunk by the superblock shrinker.

    Rather than perpetuate the multipurpose shrinker callback API (i.e.
    nr_to_scan == 0 meaning "tell me how many objects are freeable in the
    cache"), two operations will be added (sketched below). The first will
    return the number of objects that are freeable, and the second is the
    actual shrinker call.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
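
    The callouts land in super_operations; a sketch of a filesystem wiring
    them up, with the myfs_* implementations left as stubs:

        /* how many fs-internal objects could be freed right now */
        static int myfs_nr_cached_objects(struct super_block *sb);

        /* try to free up to 'nr' of those objects */
        static void myfs_free_cached_objects(struct super_block *sb, int nr);

        static const struct super_operations myfs_super_ops = {
                /* ... the usual methods ... */
                .nr_cached_objects      = myfs_nr_cached_objects,
                .free_cached_objects    = myfs_free_cached_objects,
        };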
     
  • Now that we have per-sb shrinkers with a lifecycle that is a subset
    of the superblock lifecycle and can reliably detect a filesystem
    being unmounted, there is no longer any race condition for the
    iprune_sem to protect against. Hence we can remove it.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • With context based shrinkers, we can implement a per-superblock
    shrinker that shrinks the caches attached to the superblock. We
    currently have global shrinkers for the inode and dentry caches that
    split up into per-superblock operations via a coarse proportioning
    method that does not batch very well. The global shrinkers also
    have a dependency - dentries pin inodes - so we have to be very
    careful about how we register the global shrinkers so that the
    implicit call order is always correct.

    With a per-sb shrinker callout, we can encode this dependency
    directly into the per-sb shrinker, hence avoiding the need for
    strictly ordering shrinker registrations. We also have no need for
    any proportioning code, because the shrinker subsystem already provides
    this functionality across all shrinkers. Allowing the shrinker to
    operate on a single superblock at a time means that we do fewer
    superblock list traversals and less locking, and reclaim should batch more
    effectively. This should result in less CPU overhead for reclaim and
    potentially faster reclaim of items from each filesystem.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     

20 Jul, 2011

10 commits

  • The per-sb shrinker has the same requirement as the writeback
    threads of ensuring that the superblock is usable and pinned for the
    time it takes to run the work. Both need to take a passive reference
    to the sb, take a read lock on the s_umount lock and then only
    continue if an unmount is not in progress.

    pin_sb_for_writeback() does exactly this, so move it to fs/super.c,
    rename it to grab_super_passive() and export it via
    fs/internal.h so that all the VFS code can use it (sketched below).

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
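
    Its logic, roughly (put_super() drops the passive reference again):

        bool grab_super_passive(struct super_block *sb)
        {
                spin_lock(&sb_lock);
                if (list_empty(&sb->s_instances)) {     /* being torn down */
                        spin_unlock(&sb_lock);
                        return false;
                }
                sb->s_count++;                          /* passive reference */
                spin_unlock(&sb_lock);

                if (down_read_trylock(&sb->s_umount)) {
                        if (sb->s_root)                 /* not unmounting */
                                return true;
                        up_read(&sb->s_umount);
                }

                put_super(sb);
                return false;
        }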
     
  • With the inode LRUs moving to per-sb structures, there is no longer
    a need for a global inode_lru_lock. The locking can be made more
    fine-grained by moving to a per-sb LRU lock, isolating the LRU
    operations of different filesystems completely from each other.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • The inode unused list is currently a global LRU. This does not match
    the other global filesystem cache - the dentry cache - which uses
    per-superblock LRU lists. Hence we have related filesystem object
    types using different LRU reclamation schemes.

    To enable a per-superblock filesystem cache shrinker, both of these
    caches need to have per-sb unused object LRU lists. Hence this patch
    converts the global inode LRU to per-sb LRUs.

    The patch only does rudimentary per-sb proportioning in the shrinker
    infrastructure, as this gets removed when the per-sb shrinker
    callouts are introduced later on.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Before we split up the inode_lru_lock, the unused inode counter
    needs to be made independent of the global inode_lru_lock. Convert
    it to per-cpu counters to do this.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
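
    The per-cpu shape, roughly (field and helper names are illustrative):

        static DEFINE_PER_CPU(unsigned long, nr_unused);

        static void unused_inc(void)    /* inode added to the LRU */
        {
                this_cpu_inc(nr_unused);
        }

        static void unused_dec(void)    /* inode removed from the LRU */
        {
                this_cpu_dec(nr_unused);
        }

        /* slow path: sum all CPUs for the global total */
        static long get_nr_inodes_unused(void)
        {
                long sum = 0;
                int cpu;

                for_each_possible_cpu(cpu)
                        sum += per_cpu(nr_unused, cpu);
                return sum < 0 ? 0 : sum;
        }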
     
  • For shrinkers that have their own cond_resched* calls, having
    shrink_slab break the work down into small batches is not
    particularly efficient. Add a custom batch size field to the struct
    shrinker so that shrinkers can use a larger batch size if they
    desire.

    A value of zero (uninitialised) means "use the default", so
    behaviour is unchanged by this patch.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
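
    Opting in is just a matter of setting the new field at registration time
    (shrinker callback body elided):

        static int example_shrink(struct shrinker *s, struct shrink_control *sc);

        static struct shrinker example_shrinker = {
                .shrink = example_shrink,
                .seeks  = DEFAULT_SEEKS,
                .batch  = 1024,         /* 0 means "use the default" */
        };

        static int __init example_init(void)
        {
                register_shrinker(&example_shrinker);
                return 0;
        }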
     
  • When a shrinker returns -1 to shrink_slab() to indicate it cannot do
    any work given the current memory reclaim requirements, it adds the
    entire total_scan count to shrinker->nr. The idea behind this is that
    when the shrinker is next called and can do work, it will do the work
    of the previously aborted shrinker call as well.

    However, if a filesystem is doing lots of allocation with GFP_NOFS
    set, then we get many, many more aborts from the shrinkers than we
    do successful calls. The result is that shrinker->nr winds up to
    its maximum permissible value (twice the current cache size) and
    then when the next shrinker call that can do work is issued, it
    has enough scan count built up to free the entire cache twice over.

    This manifests itself in the cache going from full to empty in a
    matter of seconds, even when only a small part of the cache is
    needed to be emptied to free sufficient memory.

    Under metadata intensive workloads on ext4 and XFS, I'm seeing the
    VFS caches increase memory consumption up to 75% of memory (no page
    cache pressure) over a period of 30-60s, and then the shrinker
    empties them down to zero in the space of 2-3s. This cycle repeats
    over and over again, with the shrinker completely trashing the inode
    and dentry caches every minute or so the workload continues.

    This behaviour was made obvious by the shrink_slab tracepoints added
    earlier in the series, and made worse by the patch that corrected
    the concurrent accounting of shrinker->nr.

    To avoid this problem, stop repeated small increments of the total
    scan value from winding shrinker->nr up to a value that can cause
    the entire cache to be freed. We still need to allow it to wind up,
    so use the delta as the "large scan" threshold check - if the delta
    is more than a quarter of the entire cache size, then it is a large
    scan and allowed to cause lots of windup because we clearly need
    to free lots of memory (see the sketch below).

    If it isn't a large scan then limit the total scan to half the size
    of the cache so that windup never increases to consume the whole
    cache. Reducing the total scan limit further does not allow enough
    wind-up to maintain the current levels of performance, whilst a
    higher threshold does not prevent the windup from freeing the entire
    cache under sustained workloads.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
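
    In shrink_slab() terms the fix reduces to a clamp like this, where delta
    is this call's scan count, max_pass the current cache size and
    total_scan the accumulated work:

        /* large scans may wind up freely: we clearly need the memory */
        if (delta < max_pass / 4)
                /* small scan: cap the windup at half the cache */
                total_scan = min(total_scan, max_pass / 2);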
     
  • shrink_slab() allows shrinkers to be called in parallel so the
    struct shrinker can be updated concurrently. It does not provide any
    exclusion for such updates, so we can get the shrinker->nr value
    increasing or decreasing incorrectly.

    As a result, when a shrinker repeatedly returns a value of -1 (e.g.
    a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
    sometimes updating with the scan count that wasn't used, sometimes
    losing it altogether. Worse is when a shrinker does work and that
    update is lost due to racy updates, which means the shrinker will do
    the work again!

    Fix this by making the total_scan calculations independent of
    shrinker->nr, and making the shrinker->nr updates atomic w.r.t.
    other updates via cmpxchg loops (see the sketch below).

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
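
    The update pattern is roughly: claim the accumulated scan count
    atomically, then return any unused work the same way:

        long nr, new_nr;
        long unused;    /* scan count left over after this call's work */

        /* take ownership of the accumulated scan count */
        do {
                nr = shrinker->nr;
        } while (cmpxchg(&shrinker->nr, nr, 0) != nr);

        /* ... add this call's delta, scan objects, track what's left ... */

        /* hand back the unused count without losing concurrent additions */
        do {
                nr = shrinker->nr;
                new_nr = nr + unused;
        } while (cmpxchg(&shrinker->nr, nr, new_nr) != nr);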
     
  • It is impossible to understand what the shrinkers are actually doing
    without instrumenting the code, so add some tracepoints to allow
    insight to be gained.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • ... and simplify the living hell out of callers

    Signed-off-by: Al Viro

    Al Viro
     
  • d_splice_alias(NULL, dentry) is equivalent to d_add(dentry, NULL),
    so no need for that if (inode) ... in there (or ERR_PTR(0), for that
    matter); see the before/after sketch below.

    Signed-off-by: Al Viro

    Al Viro
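
    A before/after sketch of a typical ->lookup() tail:

        /* before: callers special-cased the negative result */
        if (inode)
                return d_splice_alias(inode, dentry);
        d_add(dentry, NULL);
        return NULL;

        /* after: d_splice_alias(NULL, dentry) does the d_add itself */
        return d_splice_alias(inode, dentry);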