21 Jul, 2011

30 commits

  • Btrfs needs to be able to control how filemap_write_and_wait_range() is called
    in fsync to make it less of a painful operation, so push taking i_mutex and
    the call to filemap_write_and_wait() down into the ->fsync() handlers. Some
    file systems, ext3 and ocfs2 among them, can seemingly drop taking i_mutex
    altogether. For correctness' sake I just pushed everything down in all cases
    to keep the current behavior the same for everybody; each individual fs
    maintainer can then make up their mind about what to do from there. A sketch
    of the resulting handler shape follows below.
    Thanks,

    Acked-by: Jan Kara
    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
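
    A minimal sketch of the pushed-down shape, assuming the range-based
    handler prototype and a hypothetical myfs_sync_inode() helper for the
    fs-specific work:

        static int myfs_fsync(struct file *file, loff_t start, loff_t end,
                              int datasync)
        {
                struct inode *inode = file->f_mapping->host;
                int ret;

                /* write-out is now under filesystem control */
                ret = filemap_write_and_wait_range(file->f_mapping, start, end);
                if (ret)
                        return ret;

                /* i_mutex is taken here instead of by the VFS caller */
                mutex_lock(&inode->i_mutex);
                ret = myfs_sync_inode(inode, datasync); /* hypothetical */
                mutex_unlock(&inode->i_mutex);
                return ret;
        }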
     
  • Fix up a few ->llseek() implementations that won't deal with SEEK_HOLE/SEEK_DATA
    properly. Make them future-proof so that if we ever add new options they will
    return -EINVAL, as in the sketch below. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
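
    A sketch of the future-proof pattern, handling the whence values the
    code understands and rejecting anything unknown:

        static loff_t example_llseek(struct file *file, loff_t offset, int whence)
        {
                switch (whence) {
                case SEEK_SET:
                case SEEK_CUR:
                case SEEK_END:
                        return generic_file_llseek(file, offset, whence);
                default:
                        /* SEEK_HOLE/SEEK_DATA and any future whence values */
                        return -EINVAL;
                }
        }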
     
  • This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases
    we just return -EINVAL, in others we do the normal generic thing, and in others
    we're simply making sure that the proper due diligence is done. For example
    in NFS/CIFS we need to make sure the file size is updated properly for the
    SEEK_HOLE and SEEK_DATA cases, but since they call the generic llseek code
    themselves that is all we have to do. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     
  • Since Ext4 has its own lseek we need to make sure it handles
    SEEK_HOLE/SEEK_DATA. For now just do the same thing that is done in the generic
    case, somebody else can come along and make it do fancy things later. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     
  • In order to handle SEEK_HOLE/SEEK_DATA we need to implement our own llseek.
    Basically for the normal SEEK_*'s we will just defer to the generic helper, and
    for SEEK_HOLE/SEEK_DATA we will use our fiemap helper to figure out the nearest
    hole or data (the dispatch is sketched below). Currently this helper doesn't
    check for delalloc bytes for prealloc space, so for now treat prealloc as data
    until that is fixed. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
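
    A rough sketch of that dispatch; find_desired_extent() stands in for the
    fiemap-based helper and is not spelled out here:

        static loff_t btrfs_llseek_sketch(struct file *file, loff_t offset,
                                          int whence)
        {
                struct inode *inode = file->f_mapping->host;
                int ret;

                switch (whence) {
                case SEEK_SET:
                case SEEK_CUR:
                case SEEK_END:
                        /* plain seeks defer to the generic helper */
                        return generic_file_llseek(file, offset, whence);
                case SEEK_HOLE:
                case SEEK_DATA:
                        /* walk extents to find the nearest hole or data */
                        ret = find_desired_extent(inode, &offset, whence);
                        if (ret)
                                return ret;
                        if (offset > inode->i_sb->s_maxbytes)
                                return -EINVAL;
                        file->f_pos = offset;
                        return offset;
                default:
                        return -EINVAL;
                }
        }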
     
  • This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags. Turns out
    using fiemap in things like cp causes more problems than it solves, so let's try
    to give userspace an interface that doesn't suck. We need to match Solaris
    here, and the definitions are:

      o If whence is SEEK_HOLE, the offset of the start of the
        next hole greater than or equal to the supplied offset
        is returned. The definition of a hole is provided near
        the end of the DESCRIPTION.

      o If whence is SEEK_DATA, the file pointer is set to the
        start of the next non-hole file region greater than or
        equal to the supplied offset.

    So in the generic case the entire file is data and there is a virtual hole at
    the end. That means we will just return i_size for SEEK_HOLE and will return
    the same offset for SEEK_DATA. This is how Solaris does it, so we have to do it
    the same way (see the sketch below).

    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
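
    The generic fallback then reduces to something like this, assuming
    offsets at or beyond EOF fail with -ENXIO as on Solaris:

        /* inside ->llseek(), after the usual SEEK_SET/CUR/END handling */
        switch (whence) {
        case SEEK_DATA:
                /* the whole file is data, so any in-range offset is data */
                if (offset >= i_size_read(inode))
                        return -ENXIO;
                break;
        case SEEK_HOLE:
                /* one virtual hole at EOF: report i_size */
                if (offset >= i_size_read(inode))
                        return -ENXIO;
                offset = i_size_read(inode);
                break;
        }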
     
  • Change the default reiserfs mount option to barrier=flush. Based on a patch
    from Jeff Mahoney in the SuSE tree.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • This patch turns on barriers by default for ext3. mount -o barrier=0
    will turn them off. Based on a patch from Chris Mason in the SuSE tree.

    Signed-off-by: Chris Mason
    Signed-off-by: Christoph Hellwig
    Acked-by: Eric Sandeen
    Acked-by: Jan Kara
    Acked-by: Jeff Mahoney
    Acked-by: Ted Ts'o
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • we can find the superblock more easily, TYVM...

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • its first argument is const char * and it's really not modified...

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Moving the event counter into the dynamically allocated 'struct seq_file'
    allows poll() support without each user needing to allocate its own tracking
    structure (see the sketch below).

    All current users are switched over to use the new counter.

    Requested-by: Andrew Morton akpm@linux-foundation.org
    Acked-by: NeilBrown
    Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi
    Signed-off-by: Kay Sievers
    Signed-off-by: Al Viro

    Kay Sievers
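
    A sketch of the pattern, modelled on /proc/mounts; the event source
    (example_event/example_wait) is hypothetical, while poll_event is the
    new counter in struct seq_file:

        /* some event source: 'example_event' is bumped and the queue
           woken whenever the underlying data changes */
        static DECLARE_WAIT_QUEUE_HEAD(example_wait);
        static int example_event;

        static unsigned int example_poll(struct file *file, poll_table *pt)
        {
                struct seq_file *m = file->private_data;
                unsigned int res = POLLIN | POLLRDNORM;

                poll_wait(file, &example_wait, pt);
                if (m->poll_event != example_event) {
                        /* remember what this reader has seen */
                        m->poll_event = example_event;
                        res |= POLLERR | POLLPRI;
                }
                return res;
        }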
     
  • For filesystems that delay their end_io processing we should keep our
    i_dio_count until the processing is done. Enable this by moving
    the inode_dio_done call to the end_io handler if one exists (sketched
    below). Note that
    the actual move to the workqueue for ext4 and XFS is not done in
    this patch yet, but left to the filesystem maintainers. At least
    for XFS it's not needed yet either as XFS has an internal equivalent
    to i_dio_count.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
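
    A sketch of the end_io side (names hypothetical); the point is that
    inode_dio_done() now runs after the deferred completion work instead of
    at bio completion:

        static void myfs_end_io(struct kiocb *iocb, loff_t offset,
                                ssize_t size, void *private, int ret,
                                bool is_async)
        {
                struct inode *inode = iocb->ki_filp->f_mapping->host;

                /* deferred completion work, e.g. unwritten extent
                   conversion, would go here */

                /* drop the count taken when the direct I/O started */
                inode_dio_done(inode);
        }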
     
  • Simple filesystems always pass inode->i_sb->s_bdev as the block device
    argument, and never need an end_io handler. Let's simplify things for
    them and for my grepping activity by dropping these arguments. The
    only thing not falling into that scheme is ext4, which passes an
    end_io handler without needing special flags (yet), but given how
    messy the direct I/O code there is, using __blockdev_direct_IO
    in one instead of two out of three cases isn't going to make a large
    difference anyway. The simplified calling convention is sketched below.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
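
    For a simple filesystem the resulting ->direct_IO is roughly the
    following, with myfs_get_block standing in for the fs's get_block
    routine:

        static ssize_t myfs_direct_IO(int rw, struct kiocb *iocb,
                                      const struct iovec *iov, loff_t offset,
                                      unsigned long nr_segs)
        {
                struct inode *inode = iocb->ki_filp->f_mapping->host;

                /* no bdev argument (it is derived from the inode) and
                   no end_io handler needed */
                return blockdev_direct_IO(rw, iocb, inode, iov, offset,
                                          nr_segs, myfs_get_block);
        }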
     
  • Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
    This allows these filesystems to also protect truncate against direct I/O
    requests by using common code. Right now the only non-DIO_LOCKING filesystem
    that appears to do so is XFS, which uses an opencoded variant of the
    i_dio_count scheme.

    Behaviour doesn't change for filesystems never calling inode_dio_wait.
    For ext4 behaviour changes when using the dioread_nolock option, which
    previously was missing any protection between truncate and direct I/O reads.
    For ocfs2 the handcrafted i_dio_count manipulations are replaced with
    the common code now enabled.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Let filesystems handle waiting for direct I/O requests themselves instead
    of doing it beforehand. This means filesystem-specific locks to prevent
    new dio references from appearing can be held. This is important to allow
    generalizing i_dio_count to non-DIO_LOCKING filesystems.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Now that the last user is gone these can be removed.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • i_alloc_sem is a rather special rw_semaphore. It's the last one that may
    be released by a non-owner, and its write side is always mirrored by
    real exclusion. Its intended use is to wait for all pending direct I/O
    requests to finish before starting a truncate.

    Replace it with a hand-grown construct (sketched below):

    - exclusion for truncates is already guaranteed by i_mutex, so it can
      simply fall away
    - the reader side is replaced by an i_dio_count member in struct inode
      that counts the number of pending direct I/O requests. Truncate can't
      proceed as long as it's non-zero
    - when i_dio_count reaches zero we wake up a pending truncate using
      wake_up_bit on a new bit in i_flags
    - new references to i_dio_count can't appear while we are waiting for
      it to read zero because the direct I/O count always needs i_mutex
      (or an equivalent like XFS's i_iolock) for starting a new operation.

    This scheme is much simpler, and saves the space of a spinlock_t and a
    struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
    system).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
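
    A condensed sketch of the construct; helper and bit names are
    illustrative (mainline calls the helpers inode_dio_wait() and
    inode_dio_done()), and the wakeup bit lives in i_state here:

        static void dio_start(struct inode *inode)
        {
                /* callers hold i_mutex or an equivalent, so new counts
                   can't race with a waiting truncate */
                atomic_inc(&inode->i_dio_count);
        }

        static void dio_done(struct inode *inode)
        {
                if (atomic_dec_and_test(&inode->i_dio_count))
                        wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
        }

        /* truncate side: wait for all pending direct I/O to drain */
        static void dio_wait(struct inode *inode)
        {
                wait_queue_head_t *wq = bit_waitqueue(&inode->i_state,
                                                      __I_DIO_WAKEUP);
                DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP);

                while (atomic_read(&inode->i_dio_count)) {
                        prepare_to_wait(wq, &q.wait, TASK_UNINTERRUPTIBLE);
                        if (atomic_read(&inode->i_dio_count))
                                schedule();
                        finish_wait(wq, &q.wait);
                }
        }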
     
  • Reject zero-sized reads as soon as we know our I/O length, and don't
    bother with locks or allocations that might have to be cleaned up
    otherwise.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Rewrite ext4_page_mkwrite() to use the __block_page_mkwrite() helper (a
    skeletal sketch follows below). This removes the need to use i_alloc_sem to
    avoid races with truncate, which seems to be the wrong locking order according
    to the lock ordering documented in mm/rmap.c. Also, calling
    ext4_da_write_begin() as the old code did seems to be problematic because we
    can decide to flush delay-allocated blocks, which will acquire the s_umount
    semaphore - again creating an unpleasant lock dependency if not directly a
    deadlock.

    Also add a check for frozen filesystem so that we don't busyloop in page fault
    when the filesystem is frozen.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
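
    A skeletal sketch of the new shape; the delalloc and journalling details
    are elided, and the frozen-fs check shown assumes the old-style
    vfs_check_frozen() interface of this era:

        static int ext4_page_mkwrite_sketch(struct vm_area_struct *vma,
                                            struct vm_fault *vmf)
        {
                struct inode *inode = vma->vm_file->f_mapping->host;
                int ret;

                /* don't busyloop faulting against a frozen filesystem */
                vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);

                /* locks the page, verifies it is still mapped, and runs
                   get_block on it */
                ret = __block_page_mkwrite(vma, vmf, ext4_get_block);
                return block_page_mkwrite_return(ret);
        }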
     
  • Add a new rw_semaphore to protect bmap against truncate (see the sketch
    below). Previously i_alloc_sem was abused for this, but it's going away in
    this series.

    Note that we can't simply use i_mutex, given that the swapon code
    calls ->bmap under it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
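
    The pattern, with hypothetical myfs names since the commit text doesn't
    name the filesystem:

        static sector_t myfs_bmap(struct address_space *mapping,
                                  sector_t block)
        {
                struct myfs_inode_info *ei = MYFS_I(mapping->host);
                sector_t ret;

                /* readers are bmap, the writer is truncate */
                down_read(&ei->truncate_lock);
                ret = generic_block_bmap(mapping, block, myfs_get_block);
                up_read(&ei->truncate_lock);
                return ret;
        }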
     
  • The flags parameter went away in
    d749519b444db985e40b897f73ce1898b11f997e

    Signed-off-by: Tobias Klauser
    Signed-off-by: Al Viro

    Tobias Klauser
     
  • The forward declaration of struct file_operations is
    added to avoid compilation warnings.

    Signed-off-by: Tomasz Stanislawski
    Signed-off-by: Al Viro

    Tomasz Stanislawski
     
  • Convert the inode reclaim shrinker to use the new per-sb shrinker
    operations. This allows much bigger reclaim batches to be used, and
    allows the XFS inode cache to be shrunk in proportion with the VFS
    dentry and inode caches. This avoids the problem of the VFS caches
    being shrunk significantly before the XFS inode cache is shrunk,
    resulting in imbalances in the caches during reclaim.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Now that the per-sb shrinker is responsible for shrinking 2 or more
    caches, increase the batch size to keep economies of scale for
    shrinking each cache. Increase the shrinker batch size to 1024
    objects.

    To allow for a large increase in batch size, add a conditional
    reschedule to prune_icache_sb() so that we don't hold the LRU spin
    lock for too long (see the sketch below). This mirrors the behaviour of
    __shrink_dcache_sb(), and allows us to increase the batch size
    without needing to worry about problems caused by long lock hold
    times.

    To ensure that filesystems using the per-sb shrinker callouts don't
    cause problems, document that the object freeing method must
    reschedule appropriately inside loops.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
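
    The reschedule point amounts to something like this inside the LRU walk,
    using the per-sb lock name from earlier in the series:

        for (nr_scanned = nr_to_scan; nr_scanned >= 0; nr_scanned--) {
                /* ... isolate and dispose of one unused inode ... */

                /* with 1024-object batches, drop the lock and yield
                   if someone else needs the CPU */
                cond_resched_lock(&sb->s_inode_lru_lock);
        }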
     
  • Now we have a per-superblock shrinker implementation, we can add a
    filesystem specific callout to it to allow filesystem internal
    caches to be shrunk by the superblock shrinker.

    Rather than perpetuate the multipurpose shrinker callback API (i.e.
    nr_to_scan == 0 meaning "tell me how many objects are freeable in the
    cache"), two operations will be added (sketched below). The first will
    return the number of objects that are freeable, and the second is the
    actual shrinker call.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
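
    The callouts land in super_operations; a sketch of a filesystem wiring
    them up, with the myfs_* implementations left as stubs:

        /* how many fs-internal objects could be freed right now */
        static int myfs_nr_cached_objects(struct super_block *sb);

        /* try to free up to 'nr' of those objects */
        static void myfs_free_cached_objects(struct super_block *sb, int nr);

        static const struct super_operations myfs_super_ops = {
                /* ... the usual methods ... */
                .nr_cached_objects      = myfs_nr_cached_objects,
                .free_cached_objects    = myfs_free_cached_objects,
        };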
     
  • Now that we have per-sb shrinkers with a lifecycle that is a subset
    of the superblock lifecycle and can reliably detect a filesystem
    being unmounted, there is no longer any race condition for the
    iprune_sem to protect against. Hence we can remove it.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • With context based shrinkers, we can implement a per-superblock
    shrinker that shrinks the caches attached to the superblock. We
    currently have global shrinkers for the inode and dentry caches that
    split up into per-superblock operations via a coarse proportioning
    method that does not batch very well. The global shrinkers also
    have a dependency - dentries pin inodes - so we have to be very
    careful about how we register the global shrinkers so that the
    implicit call order is always correct.

    With a per-sb shrinker callout, we can encode this dependency
    directly into the per-sb shrinker, hence avoiding the need for
    strictly ordering shrinker registrations. We also have no need for
    any proportioning code, because the shrinker subsystem already provides
    this functionality across all shrinkers. Allowing the shrinker to
    operate on a single superblock at a time means that we do fewer
    superblock list traversals and less locking, and reclaim should batch more
    effectively. This should result in less CPU overhead for reclaim and
    potentially faster reclaim of items from each filesystem.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     

20 Jul, 2011

10 commits

  • The per-sb shrinker has the same requirement as the writeback
    threads of ensuring that the superblock is usable and pinned for the
    time it takes to run the work. Both need to take a passive reference
    to the sb, take a read lock on the s_umount lock and then only
    continue if an unmount is not in progress.

    pin_sb_for_writeback() does exactly this, so move it to fs/super.c,
    rename it to grab_super_passive() and export it via
    fs/internal.h so that all the VFS code can use it (sketched below).

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
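
    Its logic, roughly (put_super() drops the passive reference again):

        bool grab_super_passive(struct super_block *sb)
        {
                spin_lock(&sb_lock);
                if (list_empty(&sb->s_instances)) {     /* being torn down */
                        spin_unlock(&sb_lock);
                        return false;
                }
                sb->s_count++;                          /* passive reference */
                spin_unlock(&sb_lock);

                if (down_read_trylock(&sb->s_umount)) {
                        if (sb->s_root)                 /* not unmounting */
                                return true;
                        up_read(&sb->s_umount);
                }

                put_super(sb);
                return false;
        }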
     
  • With the inode LRUs moving to per-sb structures, there is no longer
    a need for a global inode_lru_lock. The locking can be made more
    fine-grained by moving to a per-sb LRU lock, isolating the LRU
    operations of different filesystems completely from each other.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • The inode unused list is currently a global LRU. This does not match
    the other global filesystem cache - the dentry cache - which uses
    per-superblock LRU lists. Hence we have related filesystem object
    types using different LRU reclamation schemes.

    To enable a per-superblock filesystem cache shrinker, both of these
    caches need to have per-sb unused object LRU lists. Hence this patch
    converts the global inode LRU to per-sb LRUs.

    The patch only does rudimentary per-sb proportioning in the shrinker
    infrastructure, as this gets removed when the per-sb shrinker
    callouts are introduced later on.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Before we split up the inode_lru_lock, the unused inode counter
    needs to be made independent of the global inode_lru_lock. Convert
    it to per-cpu counters to do this.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
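
    The per-cpu shape, roughly (field and helper names are illustrative):

        static DEFINE_PER_CPU(unsigned long, nr_unused);

        static void unused_inc(void)    /* inode added to the LRU */
        {
                this_cpu_inc(nr_unused);
        }

        static void unused_dec(void)    /* inode removed from the LRU */
        {
                this_cpu_dec(nr_unused);
        }

        /* slow path: sum all CPUs for the global total */
        static long get_nr_inodes_unused(void)
        {
                long sum = 0;
                int cpu;

                for_each_possible_cpu(cpu)
                        sum += per_cpu(nr_unused, cpu);
                return sum < 0 ? 0 : sum;
        }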
     
  • For shrinkers that have their own cond_resched* calls, having
    shrink_slab break the work down into small batches is not
    particularly efficient. Add a custom batch size field to the struct
    shrinker so that shrinkers can use a larger batch size if they
    desire.

    A value of zero (uninitialised) means "use the default", so
    behaviour is unchanged by this patch.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
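
    Opting in is just a matter of setting the new field at registration time
    (shrinker callback body elided):

        static int example_shrink(struct shrinker *s, struct shrink_control *sc);

        static struct shrinker example_shrinker = {
                .shrink = example_shrink,
                .seeks  = DEFAULT_SEEKS,
                .batch  = 1024,         /* 0 means "use the default" */
        };

        static int __init example_init(void)
        {
                register_shrinker(&example_shrinker);
                return 0;
        }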
     
  • When a shrinker returns -1 to shrink_slab() to indicate it cannot do
    any work given the current memory reclaim requirements, it adds the
    entire total_scan count to shrinker->nr. The idea behind this is that
    when the shrinker is next called and can do work, it will do the work
    of the previously aborted shrinker call as well.

    However, if a filesystem is doing lots of allocation with GFP_NOFS
    set, then we get many, many more aborts from the shrinkers than we
    do successful calls. The result is that shrinker->nr winds up to
    its maximum permissible value (twice the current cache size) and
    then when the next shrinker call that can do work is issued, it
    has enough scan count built up to free the entire cache twice over.

    This manifests itself in the cache going from full to empty in a
    matter of seconds, even when only a small part of the cache is
    needed to be emptied to free sufficient memory.

    Under metadata intensive workloads on ext4 and XFS, I'm seeing the
    VFS caches increase memory consumption up to 75% of memory (no page
    cache pressure) over a period of 30-60s, and then the shrinker
    empties them down to zero in the space of 2-3s. This cycle repeats
    over and over again, with the shrinker completely trashing the inode
    and dentry caches every minute or so the workload continues.

    This behaviour was made obvious by the shrink_slab tracepoints added
    earlier in the series, and made worse by the patch that corrected
    the concurrent accounting of shrinker->nr.

    To avoid this problem, stop repeated small increments of the total
    scan value from winding shrinker->nr up to a value that can cause
    the entire cache to be freed. We still need to allow it to wind up,
    so use the delta as the "large scan" threshold check - if the delta
    is more than a quarter of the entire cache size, then it is a large
    scan and allowed to cause lots of windup because we clearly need
    to free lots of memory (see the sketch below).

    If it isn't a large scan then limit the total scan to half the size
    of the cache so that windup never increases to consume the whole
    cache. Reducing the total scan limit further does not allow enough
    wind-up to maintain the current levels of performance, whilst a
    higher threshold does not prevent the windup from freeing the entire
    cache under sustained workloads.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
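
    In shrink_slab() terms the fix reduces to a clamp like this, where delta
    is this call's scan count, max_pass the current cache size and
    total_scan the accumulated work:

        /* large scans may wind up freely: we clearly need the memory */
        if (delta < max_pass / 4)
                /* small scan: cap the windup at half the cache */
                total_scan = min(total_scan, max_pass / 2);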
     
  • shrink_slab() allows shrinkers to be called in parallel so the
    struct shrinker can be updated concurrently. It does not provide any
    exclusion for such updates, so we can get the shrinker->nr value
    increasing or decreasing incorrectly.

    As a result, when a shrinker repeatedly returns a value of -1 (e.g.
    a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
    sometimes updating with the scan count that wasn't used, sometimes
    losing it altogether. Worse is when a shrinker does work and that
    update is lost due to racy updates, which means the shrinker will do
    the work again!

    Fix this by making the total_scan calculations independent of
    shrinker->nr, and making the shrinker->nr updates atomic w.r.t.
    other updates via cmpxchg loops (see the sketch below).

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
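
    The update pattern is roughly: claim the accumulated scan count
    atomically, then return any unused work the same way:

        long nr, new_nr;
        long unused;    /* scan count left over after this call's work */

        /* take ownership of the accumulated scan count */
        do {
                nr = shrinker->nr;
        } while (cmpxchg(&shrinker->nr, nr, 0) != nr);

        /* ... add this call's delta, scan objects, track what's left ... */

        /* hand back the unused count without losing concurrent additions */
        do {
                nr = shrinker->nr;
                new_nr = nr + unused;
        } while (cmpxchg(&shrinker->nr, nr, new_nr) != nr);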
     
  • It is impossible to understand what the shrinkers are actually doing
    without instrumenting the code, so add some tracepoints to allow
    insight to be gained.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • ... and simplify the living hell out of callers

    Signed-off-by: Al Viro

    Al Viro
     
  • d_splice_alias(NULL, dentry) is equivalent to d_add(dentry, NULL),
    so no need for that if (inode) ... in there (or ERR_PTR(0), for that
    matter); see the before/after sketch below.

    Signed-off-by: Al Viro

    Al Viro
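
    A before/after sketch of a typical ->lookup() tail:

        /* before: callers special-cased the negative result */
        if (inode)
                return d_splice_alias(inode, dentry);
        d_add(dentry, NULL);
        return NULL;

        /* after: d_splice_alias(NULL, dentry) does the d_add itself */
        return d_splice_alias(inode, dentry);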