Eric Lee / smarc-fsl-linux-kernel

22 May, 2010

17 commits

51ee049e7 vfs: add lockdep annotation to s_vfs_rename_key for ecryptfs

> =============================================
> [ INFO: possible recursive locking detected ]
> 2.6.31-2-generic #14~rbd3
> ---------------------------------------------
> firefox-3.5/4162 is trying to acquire lock:
> (&s->s_vfs_rename_mutex){+.+.+.}, at: [] lock_rename+0x41/0xf0
>
> but task is already holding lock:
> (&s->s_vfs_rename_mutex){+.+.+.}, at: [] lock_rename+0x41/0xf0
>
> other info that might help us debug this:
> 3 locks held by firefox-3.5/4162:
> #0: (&s->s_vfs_rename_mutex){+.+.+.}, at: [] lock_rename+0x41/0xf0
> #1: (&sb->s_type->i_mutex_key#11/1){+.+.+.}, at: [] lock_rename+0x6a/0xf0
> #2: (&sb->s_type->i_mutex_key#11/2){+.+.+.}, at: [] lock_rename+0x7f/0xf0
>
> stack backtrace:
> Pid: 4162, comm: firefox-3.5 Tainted: G C 2.6.31-2-generic #14~rbd3
> Call Trace:
> [] print_deadlock_bug+0xf4/0x100
> [] validate_chain+0x4c6/0x750
> [] __lock_acquire+0x237/0x430
> [] lock_acquire+0xa5/0x150
> [] ? lock_rename+0x41/0xf0
> [] __mutex_lock_common+0x4d/0x3d0
> [] ? lock_rename+0x41/0xf0
> [] ? lock_rename+0x41/0xf0
> [] ? ecryptfs_rename+0x99/0x170
> [] mutex_lock_nested+0x46/0x60
> [] lock_rename+0x41/0xf0
> [] ecryptfs_rename+0xca/0x170
> [] vfs_rename_dir+0x13e/0x160
> [] vfs_rename+0xee/0x290
> [] ? __lookup_hash+0x102/0x160
> [] sys_renameat+0x252/0x280
> [] ? cp_new_stat+0xe4/0x100
> [] ? sysret_check+0x2e/0x69
> [] ? trace_hardirqs_on_caller+0x14d/0x190
> [] sys_rename+0x1b/0x20
> [] system_call_fastpath+0x16/0x1b

The trace above is totally reproducible by doing a cross-directory
rename on an ecryptfs directory.

The issue seems to be that sys_renameat() does lock_rename() then calls
into the filesystem; if the filesystem is ecryptfs, then
ecryptfs_rename() again does lock_rename() on the lower filesystem, and
lockdep can't tell that the two s_vfs_rename_mutexes are different. It
seems an annotation like the following is sufficient to fix this (it
does get rid of the lockdep trace in my simple tests); however I would
like to make sure I'm not misunderstanding the locking, hence the CC
list...
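
As a rough illustration of the reasoning above, here is a toy userspace model (not lockdep itself, and not the patch). It assumes a checker that tracks lock classes rather than lock instances, so two mutexes of the same class look recursive when nested, while a separate key per filesystem type keeps them apart. All names (`toy_key`, `toy_lock`, `toy_acquire`, `nested_rename`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a class-based lock checker; every name here is hypothetical. */
#define MAX_HELD 8

struct toy_key { int unused; };          /* identity only: the address is the class */

struct toy_lock {
    const struct toy_key *class_key;     /* locks sharing a key share a class */
};

struct toy_task {
    const struct toy_key *held[MAX_HELD];
    size_t nheld;
};

/* Returns 0 on success, -1 if a lock of the same class is already held. */
int toy_acquire(struct toy_task *t, const struct toy_lock *l)
{
    for (size_t i = 0; i < t->nheld; i++)
        if (t->held[i] == l->class_key)
            return -1;                   /* "possible recursive locking detected" */
    assert(t->nheld < MAX_HELD);
    t->held[t->nheld++] = l->class_key;
    return 0;
}

/* One shared key models every superblock's rename mutex living in one class. */
struct toy_key shared_key;
/* One key per filesystem type models the annotated state. */
struct toy_key ecryptfs_key, lower_fs_key;

/* Take the upper (ecryptfs) rename lock, then the lower filesystem's. */
int nested_rename(const struct toy_key *upper, const struct toy_key *lower)
{
    struct toy_task task = { .nheld = 0 };
    struct toy_lock upper_lock = { upper }, lower_lock = { lower };

    if (toy_acquire(&task, &upper_lock))
        return -1;
    return toy_acquire(&task, &lower_lock);
}
```

With the shared key the nested acquisition is flagged; with one key per filesystem type it passes, which is the effect the annotation is meant to have.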

Signed-off-by: Roland Dreier
Cc: Tyler Hicks
Cc: Dustin Kirkland
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro

Roland Dreier
2010-05-22 06:31:22 +0800
18e9e5104 Introduce freeze_super and thaw_super for the fsfreeze ioctl

Currently the way we do freezing is by passing sb->s_bdev to freeze_bdev and then
letting it do all the work. But freezing is more of an fs thing, and doesn't
really have much to do with the bdev at all, all the work gets done with the
super. In btrfs we do not populate s_bdev, since we can have multiple bdev's
for one fs and setting s_bdev makes removing devices from a pool kind of tricky.
This means that freezing a btrfs filesystem fails, which leads to corruption
with things like tux-on-ice, which use the fsfreeze mechanism. So instead of
populating sb->s_bdev with a random bdev in our pool, I've broken the actual fs
freezing stuff into freeze_super and thaw_super. These just take the
super_block that we're freezing and do the appropriate work. It's basically
just copied and pasted from freeze_bdev. I've then converted freeze_bdev over to
use the new super helpers. I've tested this with ext4 and btrfs and verified
everything continues to work the same as before.

The only new gotcha is multiple calls to the fsfreeze ioctl will return EBUSY if
the fs is already frozen. I thought this was a better solution than adding a
freeze counter to the super_block, but if everybody hates this idea I'm open to
suggestions. Thanks,
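
A minimal userspace sketch of the "EBUSY instead of a freeze counter" behaviour described above. The struct and function names are hypothetical stand-ins, and returning -EINVAL when thawing a filesystem that is not frozen is an assumption of this sketch rather than something the message states.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the superblock's freeze state. */
struct toy_super {
    bool frozen;
};

/* Freeze once; a second freeze reports -EBUSY instead of counting. */
int toy_freeze_super(struct toy_super *sb)
{
    assert(sb != NULL);
    if (sb->frozen)
        return -EBUSY;
    sb->frozen = true;   /* a real helper would sync and call into the fs here */
    return 0;
}

/* Thaw a frozen filesystem; -EINVAL for a non-frozen one is an assumption. */
int toy_thaw_super(struct toy_super *sb)
{
    assert(sb != NULL);
    if (!sb->frozen)
        return -EINVAL;
    sb->frozen = false;
    return 0;
}
```

The trade-off is that nested freezers must coordinate themselves: without a counter, the first thaw unfreezes the filesystem for everyone.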

Signed-off-by: Josef Bacik
Signed-off-by: Al Viro

Josef Bacik
2010-05-22 06:31:18 +0800
e1e46bf18 Trim includes in fs/super.c

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:17 +0800
d3f214730 Move grabbing s_umount to callers of grab_super()

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:17 +0800
7ed1ee611 Take statfs variants to fs/statfs.c

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:17 +0800
01a05b337 new helper: iterate_supers()

... and switch the simple "loop over superblocks and do something"
loops to it.
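
A minimal sketch of the callback pattern such a helper implies, in plain userspace C. The list layout, the `has_root` liveness flag, and all names are assumptions made for illustration; the kernel helper also deals with locking and reference counting, which this omits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature superblock on a singly linked list. */
struct mini_sb {
    struct mini_sb *next;
    int has_root;      /* stands in for "still mounted" */
    int synced;
};

/* Call f(sb, arg) for every live superblock; dead ones are skipped. */
void mini_iterate_supers(struct mini_sb *head,
                         void (*f)(struct mini_sb *, void *), void *arg)
{
    assert(f != NULL);
    for (struct mini_sb *sb = head; sb != NULL; sb = sb->next)
        if (sb->has_root)
            f(sb, arg);
}

/* Example callback: mark the superblock synced and count the visit. */
void mini_sync_one(struct mini_sb *sb, void *arg)
{
    sb->synced = 1;
    ++*(int *)arg;
}
```

Each "loop over superblocks and do something" site then shrinks to one call with a small callback.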

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:16 +0800
35cf7ba0b Bury __put_super_and_need_restart()

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:16 +0800
df40c01a9 In get_super() and user_get_super() restarts are unconditional

If superblock had been still alive, we would've returned it...

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:16 +0800
1494583de fix get_active_super()/umount() race

This one needs restarts...

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:15 +0800
e7fe0585c fix do_emergency_remount()/umount() races

We need list_for_each_entry_safe() here. The original didn't even
have restart logic, so it blew up if you raced with umount().
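
For readers unfamiliar with the idiom, the idea behind a "safe" list walk is to save the next pointer before the loop body runs, so that the body may remove the current entry. The following is a userspace sketch on a hand-rolled singly linked list with hypothetical names, not the kernel's list API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct node {
    struct node *next;
    int value;
};

/* Push a new node on the front of the list; this sketch aborts on OOM. */
struct node *push(struct node *head, int value)
{
    struct node *n = malloc(sizeof(*n));

    assert(n != NULL);
    n->next = head;
    n->value = value;
    return n;
}

/* Remove and free every node with an odd value; returns the new head. */
struct node *drop_odd(struct node *head)
{
    struct node **link = &head;
    struct node *cur, *next;

    for (cur = head; cur != NULL; cur = next) {
        next = cur->next;          /* saved before cur may be freed */
        if (cur->value % 2 != 0) {
            *link = next;          /* unlink cur */
            free(cur);
        } else {
            link = &cur->next;
        }
    }
    return head;
}
```

Saving `next` only protects against the body removing the current entry; entries removed concurrently by someone else still need locking or reference counts.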

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:15 +0800
6754af646 Convert simple loops over superblocks to list_for_each_entry_safe

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:15 +0800
8edd64bd6 get rid of restarts in sync_filesystems()

At the same time we can kill s_need_restart and local mutex in there.
__put_super() made public for a while; will be gone later.

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:15 +0800
551de6f34 Leave superblocks on s_list until the end

We used to remove from s_list and s_instances at the same
time. So let's *not* do the former and skip superblocks
that have empty s_instances in the loops over s_list.

The next step, of course, will be to get rid of the rescan logic
in those loops.

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:14 +0800
1712ac8fd Saner locking around deactivate_super()

Make sure that s_umount is acquired *before* we drop the final
active reference; we still have the fast path (atomic_dec_unless)
and we have gotten rid of the window between the moment when
s_active hits zero and s_umount is acquired. Which simplifies
the living hell out of grab_super() and inotify pin_to_kill()
stuff.

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:14 +0800
b20bd1a5e get rid of S_BIAS

use atomic_inc_not_zero(&sb->s_active) instead of playing games with
checking ->s_count > S_BIAS
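
The semantics of an "increment unless zero" operation can be sketched in userspace with C11 atomics. This is an illustrative model with a hypothetical name (`toy_inc_not_zero`), not the kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Take a reference only if the count is still nonzero (object still alive). */
bool toy_inc_not_zero(atomic_int *count)
{
    int old;

    assert(count != NULL);
    old = atomic_load(count);
    while (old != 0) {
        /* On failure, old is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(count, &old, old + 1))
            return true;
    }
    return false;   /* count already hit zero: too late to grab a reference */
}
```

The check and the increment happen as one atomic step, so no bias constant is needed to tell a live counter from a dying one.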

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:14 +0800
389b8be6e get rid of open-coded grab_super() in get_active_super()

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:14 +0800
a135aa2cd remove incorrect comment in do_emergency_remount

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2010-05-22 06:31:12 +0800

30 Apr, 2010

1 commit

5477d0fac fs: fs/super.c needs to include backing-dev.h for !CONFIG_BLOCK

When CONFIG_BLOCK is set, fs/super.c ends up getting backing-dev.h
included indirectly. But for !CONFIG_BLOCK, it isn't so lucky. The proper
thing to do is to include backing-dev.h directly from the file it's used
from, so do that.

Signed-off-by: Jens Axboe

Jens Axboe
2010-04-30 02:33:35 +0800

25 Apr, 2010

1 commit

5129a469a Catch filesystems lacking s_bdi

noop_backing_dev_info is used only as a flag to mark filesystems that
don't have any backing store, like tmpfs, procfs, spufs, etc.

Signed-off-by: Joern Engel

Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes
to the noop_backing_dev_info is not legal and will not result in
them being flushed, but we already catch this condition in
__mark_inode_dirty() when checking for a registered bdi.

Signed-off-by: Jens Axboe

Jörn Engel
2010-04-25 14:54:42 +0800

04 Mar, 2010

2 commits

8089352a1 Mirror MS_KERNMOUNT in ->mnt_flags

Signed-off-by: Al Viro

Al Viro
2010-03-04 03:08:00 +0800
d208bbdda fs: improve remount,ro vs buffercache coherency

Invalidate sb->s_bdev on remount,ro.

Fixes a problem reported by Jorge Boncompte who is seeing corruption
trying to snapshot a minix filesystem image. Some filesystems modify
their metadata via a path other than the bdev buffer cache (eg. they may
use a private linear mapping for their metadata, or implement directories
in pagecache, etc). Also, file data modifications usually go to the bdev
via their own mappings.

These updates are not coherent with buffercache IO (eg. via /dev/bdev)
and never have been. However there could be a reasonable expectation that
after a mount -oremount,ro operation then the buffercache should
subsequently be coherent with previous filesystem modifications.

So invalidate the bdev mappings on a remount,ro operation to provide a
coherency point.

The problem was exposed when we switched the old rd to brd because old rd
didn't really function like a normal block device and updates to rd via
mappings other than the buffercache would still end up going into its
buffercache. But the same problem has always affected other "normal"
block devices, including loop.

[akpm@linux-foundation.org: repair comment layout]
Reported-by: "Jorge Boncompte [DTI2]"
Tested-by: "Jorge Boncompte [DTI2]"
Signed-off-by: Nick Piggin
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro

Nick Piggin
2010-03-04 02:00:20 +0800

24 Dec, 2009

1 commit

9329d1bea vfs: get_sb_single() - do not pass options twice

Filesystem code usually destroys the option buffer while
parsing it. This leads to errors when the same buffer is
passed twice. In case we fill a new superblock do not call
remount.
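
To make the failure mode concrete, here is a small userspace sketch of an in-place option parser of the kind described. The function name and the option string are hypothetical; real filesystems each have their own parsers.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count comma-separated options, splitting in place the way many
 * parsers do: each ',' is overwritten with '\0'.
 */
int toy_parse_options(char *opts)
{
    int count;
    char *p = opts;

    assert(opts != NULL);
    if (*p == '\0')
        return 0;
    count = 1;
    for (; *p != '\0'; p++) {
        if (*p == ',') {
            *p = '\0';     /* destroys the original string */
            count++;
        }
    }
    return count;
}
```

Parsing the same buffer a second time only sees the first option, because the separators have been replaced by terminators. That is why handing an already-parsed buffer to a remount call goes wrong.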

This is needed to quiet a warning that the debugfs code
causes every boot.

Cc: Miklos Szeredi
Signed-off-by: Kay Sievers
Signed-off-by: Greg Kroah-Hartman

Kay Sievers
2009-12-24 03:23:43 +0800

24 Sep, 2009

3 commits

4504230a7 freeze_bdev: grab active reference to frozen superblocks

Currently we hold s_umount while a filesystem is frozen, even though we
might return to userspace and unlock it from a different process. Instead
grab an active reference to keep the file system busy and add an explicit
check for frozen filesystems in remount and reject the remount instead
of blocking on s_umount.

Add a new get_active_super helper to super.c for use by freeze_bdev that
grabs an active reference to a superblock from a given block device.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2009-09-24 19:47:41 +0800
4fadd7bb2 freeze_bdev: kill bd_mount_sem

Now that we have the freeze count there is not much reason for bd_mount_sem
anymore. The actual freeze/thaw operations are serialized using the
bd_fsfreeze_mutex, and the only other place we take bd_mount_sem is
get_sb_bdev which tries to prevent mounting a filesystem while the block
device is frozen. Instead, add a check for bd_fsfreeze_count and
return -EBUSY if a filesystem is frozen. While that is a change in user
visible behaviour a failing mount is much better for this case rather
than having the mount process stuck uninterruptible for a long time.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2009-09-24 19:47:39 +0800
42cb56ae2 vfs: change sb->s_maxbytes to a loff_t

sb->s_maxbytes is supposed to indicate the maximum size of a file that can
exist on the filesystem. It's declared as an unsigned long long.

Even if a filesystem has no inherent limit that prevents it from using
every bit in that unsigned long long, it's still problematic to set it to
anything larger than MAX_LFS_FILESIZE. There are places in the kernel
that cast s_maxbytes to a signed value. If it's set too large then this
cast makes it a negative number and generally breaks the comparison.

Change s_maxbytes to be loff_t instead. That should help eliminate the
temptation to set it too large by making it a signed value.
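
The broken comparison can be reproduced in a few lines of userspace C. This sketch uses a hypothetical check function and assumes a typical two's-complement compiler, where converting an out-of-range unsigned value to a signed type wraps around.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Models a check that casts an unsigned limit to a signed offset type. */
bool toy_offset_allowed(long long pos, unsigned long long maxbytes)
{
    long long limit;

    assert(pos >= 0);
    /*
     * Out-of-range conversion is implementation-defined before C23;
     * common compilers wrap, so ULLONG_MAX becomes -1 here.
     */
    limit = (long long)maxbytes;
    return pos <= limit;
}
```

With a sane limit the check passes, but a filesystem advertising "no limit" as an all-ones value ends up rejecting every offset.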

Also, add a warning for a couple of releases to help catch filesystems that
set s_maxbytes too large. Eventually we can either convert this to a
BUG() or just remove it in the hope that no one will get it wrong now
that it's a signed value.

Signed-off-by: Jeff Layton
Cc: Johannes Weiner
Cc: Christoph Hellwig
Cc: Al Viro
Cc: Robert Love
Cc: Mandeep Singh Baines
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro

Jeff Layton
2009-09-24 19:47:33 +0800

22 Sep, 2009

1 commit

b87221de6 const: mark remaining super_operations const

Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2009-09-22 22:17:24 +0800

16 Sep, 2009

1 commit

32a88aa1b fs: Assign bdi in super_block

We do this automatically in get_sb_bdev() from the set_bdev_super()
callback. Filesystems that have their own private backing_dev_info
must assign that in ->fill_super().

Note that ->s_bdi assignment is required for proper writeback!

Acked-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Jens Axboe
2009-09-16 21:18:51 +0800

11 Sep, 2009

2 commits

03ba3782e writeback: switch to per-bdi threads for flushing data

This gets rid of pdflush for bdi writeout and kupdated style cleaning.
pdflush writeout suffers from lack of locality and also requires more
threads to handle the same workload, since it has to work in a
non-blocking fashion against each queue. This also introduces lumpy
behaviour and potential request starvation, since pdflush can be starved
for queue access if others are accessing it. A sample ffsb workload that
does random writes to files is about 8% faster here on a simple SATA drive
during the benchmark phase. File layout also seems a LOT more smooth in
vmstat:

r b swpd   free buff  cache si so bi     bo  in  cs us sy id wa
0 1    0 608848 2652 375372  0  0  0  71024 604  24  1 10 48 42
0 1    0 549644 2712 433736  0  0  0  60692 505  27  1  8 48 44
1 0    0 476928 2784 505192  0  0  4  29540 553  24  0  9 53 37
0 1    0 457972 2808 524008  0  0  0  54876 331  16  0  4 38 58
0 1    0 366128 2928 614284  0  0  4  92168 710  58  0 13 53 34
0 1    0 295092 3000 684140  0  0  0  62924 572  23  0  9 53 37
0 1    0 236592 3064 741704  0  0  4  58256 523  17  0  8 48 44
0 1    0 165608 3132 811464  0  0  0  57460 560  21  0  8 54 38
0 1    0 102952 3200 873164  0  0  4  74748 540  29  1 10 48 41
0 1    0  48604 3252 926472  0  0  0  53248 469  29  0  7 47 45

where vanilla tends to fluctuate a lot in the creation phase:

r b swpd   free buff  cache si so bi     bo  in  cs us sy id wa
1 1    0 678716 5792 303380  0  0  0  74064 565  50  1 11 52 36
1 0    0 662488 5864 319396  0  0  4    352 302 329  0  2 47 51
0 1    0 599312 5924 381468  0  0  0  78164 516  55  0  9 51 40
0 1    0 519952 6008 459516  0  0  4  78156 622  56  1 11 52 37
1 1    0 436640 6092 541632  0  0  0  82244 622  54  0 11 48 41
0 1    0 436640 6092 541660  0  0  0      8 152  39  0  0 51 49
0 1    0 332224 6200 644252  0  0  4 102800 728  46  1 13 49 36
1 0    0 274492 6260 701056  0  0  4  12328 459  49  0  7 50 43
0 1    0 211220 6324 763356  0  0  0 106940 515  37  1 10 51 39
1 0    0 160412 6376 813468  0  0  0   8224 415  43  0  6 49 45
1 1    0  85980 6452 886556  0  0  4 113516 575  39  1 11 54 34
0 2    0  85968 6452 886620  0  0  0   1640 158 211  0  0 46 54

A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
SSD based writeback test on XFS performs over 20% better as well, with
the throughput being very stable around 1GB/sec, where pdflush only
manages 750MB/sec and fluctuates wildly while doing so. Random buffered
writes to many files behave a lot better as well, as do random mmap'ed
writes.

A separate thread is added to sync the super blocks. In the long term,
adding sync_supers_bdi() functionality could get rid of this thread again.

Signed-off-by: Jens Axboe

Jens Axboe
2009-09-11 15:20:25 +0800
66f3b8e2e writeback: move dirty inodes from super_block to backing_dev_info

This is a first step at introducing per-bdi flusher threads. We should
have no change in behaviour, although sb_has_dirty_inodes() is now
ridiculously expensive, as there's no easy way to answer that question.
Not a huge problem, since it'll be deleted in subsequent patches.

Signed-off-by: Jens Axboe

Jens Axboe
2009-09-11 15:20:25 +0800

24 Jun, 2009

2 commits

f21f62208 ... and the same for vfsmount id/mount group id

Signed-off-by: Al Viro

Al Viro
2009-06-24 20:15:26 +0800
c63e09ecc Make allocation of anon devices cheaper

Standard trick - add a new variable (start) such that
for each n < start n is known to be busy. Allocation can
skip checking everything in [0..start) and if it returns
n, we can set start to n + 1. Freeing below start sets
start to what we'd just freed.

Of course, it still sucks if we do something like
free 0
allocate
allocate
in a loop - still O(n^2) time. However, on saner loads it
improves the things a lot and the entire thing is not worth
the trouble of switching to something with better worst-case
behaviour.
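
The trick described above can be sketched over a plain bitmap in userspace. The kernel allocator uses its own ID infrastructure, so the array, the size limit, and the names below are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_IDS 64

bool toy_busy[TOY_MAX_IDS];
int toy_start;   /* invariant: every id < toy_start is busy */

/* Return the lowest free id, or -1 if none is left. */
int toy_alloc_id(void)
{
    for (int n = toy_start; n < TOY_MAX_IDS; n++) {
        if (!toy_busy[n]) {
            toy_busy[n] = true;
            toy_start = n + 1;   /* everything up to n is now known busy */
            return n;
        }
    }
    return -1;
}

/* Release an id; lower the hint if the freed id lies below it. */
void toy_free_id(int n)
{
    assert(n >= 0 && n < TOY_MAX_IDS && toy_busy[n]);
    toy_busy[n] = false;
    if (n < toy_start)
        toy_start = n;
}
```

The "free 0, allocate, allocate" loop mentioned above stays slow in this sketch too: each free resets the hint to 0 and the second allocation must scan past the whole busy prefix again.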

Signed-off-by: Al Viro

Al Viro
2009-06-24 20:15:25 +0800

17 Jun, 2009

1 commit

b0895513f remove unlock_kernel() left accidentally

commit 337eb00a2c3a421999c39c94ce7e33545ee8baa7
Push BKL down into ->remount_fs()
and
commit 4aa98cf768b6f2ea4b204620d949a665959214f6
Push BKL down into do_remount_sb()

were incorrectly merged.
The former removes one pair of lock/unlock_kernel(), but the latter adds
several unlock_kernel() calls. As a result, a few unlock_kernel() calls were left behind.

Signed-off-by: J. R. Okajima
Signed-off-by: Al Viro

J. R. Okajima
2009-06-17 12:36:35 +0800

12 Jun, 2009

8 commits

337eb00a2 Push BKL down into ->remount_fs()

[xfs, btrfs, capifs, shmem don't need BKL, exempt]

Signed-off-by: Alessio Igor Bogani
Signed-off-by: Al Viro

Alessio Igor Bogani
2009-06-12 09:36:11 +0800
ebc1ac164 ->write_super lock_super pushdown

Push down lock_super into ->write_super instances and remove it from the
caller.

The following filesystems don't need ->s_lock in ->write_super and are skipped:

* bfs, nilfs2 - no other uses of s_lock and have internal locks in
->write_super
* ext2 - uses BKL in ext2_write_super and has internal calls without s_lock
* reiserfs - no other uses of s_lock as has reiserfs_write_lock (BKL) in
->write_super
* xfs - no other uses of s_lock and uses internal lock (buffer lock on
superblock buffer) to serialize ->write_super. Also xfs_fs_write_super
is superfluous and will go away in the next merge window

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2009-06-12 09:36:09 +0800
4aa98cf76 Push BKL down into do_remount_sb()

[folded fix from Jiri Slaby]

Signed-off-by: Al Viro

Al Viro
2009-06-12 09:36:08 +0800
bbd6851a3 Push lock_super() into the ->remount_fs() of filesystems that care about it

Note that since we can't run into contention between remount_fs and write_super
(due to exclusion on s_umount), we have to care only about filesystems that
touch lock_super() on their own. Out of those ext3, ext4, hpfs, sysv and ufs
do need it; fat doesn't since its ->remount_fs() only accesses assign-once
data (basically, it's "we have no atime on directories and only have atime on
files for vfat; force nodiratime and possibly noatime into *flags").

[folded a build fix from hch]

Signed-off-by: Al Viro

Al Viro
2009-06-12 09:36:08 +0800
6cfd01484 push BKL down into ->put_super

Move BKL into ->put_super from the only caller. A couple of
filesystems had trivial enough ->put_super (only kfree and NULLing of
s_fs_info + stuff in there) to not get any locking: coda, cramfs, efs,
hugetlbfs, omfs, qnx4, shmem; all others got the full treatment. Most
of them probably don't need it, but I'd rather sort that out individually.
Preferably after all the other BKL pushdowns in that area.

[AV: original used to move lock_super() down as well; these changes are
removed since we don't do lock_super() at all in generic_shutdown_super()
now]
[AV: fuse, btrfs and xfs are known to need no damn BKL, exempt]

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2009-06-12 09:36:07 +0800
a9e220f83 No need to do lock_super() for exclusion in generic_shutdown_super()

We can't run into contention on it. All other callers of lock_super()
either hold s_umount (and we have it exclusive) or hold an active
reference to superblock in question, which prevents the call of
generic_shutdown_super() while the reference is held. So we can
replace lock_super(s) with get_fs_excl() in generic_shutdown_super()
(and corresponding change for unlock_super(), of course).

Since ext4 expects s_lock held for its put_super, take lock_super()
into it. The rest of filesystems do not care at all.

Signed-off-by: Al Viro

Al Viro
2009-06-12 09:36:07 +0800
443b94baa Make sure that all callers of remount hold s_umount exclusive

Signed-off-by: Al Viro

Al Viro
2009-06-12 09:36:07 +0800
e50047533 cleanup sync_supers

Merge the write_super helper into sync_super and move the check for
->write_super earlier so that we can avoid grabbing a reference to
a superblock that doesn't have it.

While we're at it also add a little comment documenting sync_supers.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2009-06-12 09:36:06 +0800