Eric Lee / linux-smarc-t335x-v3.2

25 Mar, 2011

1 commit

6c5103890 Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block

* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
Documentation/iostats.txt: bit-size reference etc.
cfq-iosched: removing unnecessary think time checking
cfq-iosched: Don't clear queue stats when preempt.
blk-throttle: Reset group slice when limits are changed
blk-cgroup: Only give unaccounted_time under debug
cfq-iosched: Don't set active queue in preempt
block: fix non-atomic access to genhd inflight structures
block: attempt to merge with existing requests on plug flush
block: NULL dereference on error path in __blkdev_get()
cfq-iosched: Don't update group weights when on service tree
fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
block: Require subsystems to explicitly allocate bio_set integrity mempool
jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
fs: make fsync_buffers_list() plug
mm: make generic_writepages() use plugging
blk-cgroup: Add unaccounted time to timeslice_used.
block: fixup plugging stubs for !CONFIG_BLOCK
block: remove obsolete comments for blkdev_issue_zeroout.
blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
...

Fix up conflicts in fs/{aio.c,super.c}

Linus Torvalds
2011-03-25 01:16:26 +0800

18 Mar, 2011

1 commit

9d412a43c vfs: split off vfsmount-related parts of vfs_kern_mount()

new function: mount_fs(). Does all work done by vfs_kern_mount()
except the allocation and filling of vfsmount; returns root dentry
or ERR_PTR().

vfs_kern_mount() switched to using it and taken to fs/namespace.c,
along with its wrappers.

alloc_vfsmnt()/free_vfsmnt() made static.

functions in namespace.c slightly reordered.

Signed-off-by: Al Viro

Al Viro
2011-03-18 10:10:41 +0800
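
The "root dentry or ERR_PTR()" return convention used by mount_fs() can be sketched in userspace as follows. This is only a model: the three helpers mirror the kernel's ERR_PTR()/PTR_ERR()/IS_ERR() in spirit, while struct dentry here, demo_root, demo_mount() and demo_do_mount() are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace models of the kernel helpers: an errno is encoded in the
 * top MAX_ERRNO values of the pointer range, so one return value can
 * carry either a valid pointer or an error code. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for the real structure. */
struct dentry { const char *name; };

static struct dentry demo_root = { "/" };

/* Hypothetical ->mount()-style callback: returns the root of the
 * subtree to be mounted, or ERR_PTR(error) on failure. */
static struct dentry *demo_mount(int fail)
{
    if (fail)
        return ERR_PTR(-ENOMEM);
    return &demo_root;
}

/* Hypothetical caller: unpacks the result into an errno plus a root. */
static int demo_do_mount(int fail, struct dentry **rootp)
{
    struct dentry *root = demo_mount(fail);

    if (IS_ERR(root))
        return (int)PTR_ERR(root);
    assert(root != NULL);
    *rootp = root;
    return 0;
}
```

The caller never needs a separate out-parameter for the error code, which is what lets the callback stop dealing with the vfsmount altogether.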

17 Mar, 2011

2 commits

95f28604a fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away

We don't have proper reference counting for this yet, so we run into
cases where the device is pulled and we OOPS on flushing the fs data.
This happens even though the dirty inodes have already been
migrated to the default_backing_dev_info.

Reported-by: Torsten Hilbrich
Tested-by: Torsten Hilbrich
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Jens Axboe
2011-03-17 18:13:12 +0800
1a102ff92 vfs: bury ->get_sb()

This is an ex-parrot.

Signed-off-by: Al Viro

Al Viro
2011-03-17 04:48:06 +0800

12 Feb, 2011

1 commit

d863b50ab vfs: call rcu_barrier after ->kill_sb()

In commit fa0d7e3de6d6 ("fs: icache RCU free inodes"), we use rcu free
inode instead of freeing the inode directly. It causes a crash when we
rmmod immediately after we umount the volume[1].

So we need to call rcu_barrier after we kill_sb so that the inode is
freed before we do rmmod. The idea is inspired by Aneesh Kumar.
rcu_barrier will wait for all callbacks to end before proceeding. The
original patch was done by Tao Ma, but synchronize_rcu() is not enough
here.

1. http://marc.info/?l=linux-fsdevel&m=129680863330185&w=2

Tested-by: Tao Ma
Signed-off-by: Boaz Harrosh
Cc: Nick Piggin
Cc: Al Viro
Cc: Chris Mason
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Boaz Harrosh
2011-02-12 08:12:19 +0800

17 Jan, 2011

1 commit

f03c65993 sanitize vfsmount refcounting changes

Instead of splitting refcount between (per-cpu) mnt_count
and (SMP-only) mnt_longrefs, make all references contribute
to mnt_count again and keep track of how many are longterm
ones.

Accounting rules for longterm count:
* 1 for each fs_struct.root.mnt
* 1 for each fs_struct.pwd.mnt
* 1 for having non-NULL ->mnt_ns
* decrement to 0 happens only under vfsmount lock exclusive

That allows nice common case for mntput() - since we can't drop the
final reference until after mnt_longterm has reached 0 due to the rules
above, mntput() can grab vfsmount lock shared and check mnt_longterm.
If it turns out to be non-zero (which is the common case), we know
that this is not the final mntput() and can just blindly decrement
percpu mnt_count. Otherwise we grab vfsmount lock exclusive and
do usual decrement-and-check of percpu mnt_count.

For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
namespace.c uses the latter in places where we don't already hold
vfsmount lock exclusive and opencodes a few remaining spots where
we need to manipulate mnt_longterm.

Note that we mostly revert the code outside of fs/namespace.c back
to what we used to have; in particular, normal code doesn't need
to care about two kinds of references, etc. And we get to keep
the optimization Nick's variant had bought us...

Signed-off-by: Al Viro

Al Viro
2011-01-17 02:47:07 +0800
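
The accounting rules above can be illustrated with a single-threaded userspace sketch. It is a model under stated assumptions, not the kernel code: every demo_* name is hypothetical, the per-CPU counters are a plain array, and the shared/exclusive vfsmount lock is only indicated in comments.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NR_CPUS 4

/* Hypothetical model of a mount with distributed reference counts. */
struct demo_mount {
    int count[DEMO_NR_CPUS]; /* per-CPU deltas; one slot may go negative */
    int longterm;            /* number of long-term references */
    bool freed;              /* set once the final reference is dropped */
};

/* The true refcount is only known by summing every CPU's delta. */
static int demo_sum(const struct demo_mount *m)
{
    int sum = 0;

    for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
        sum += m->count[cpu];
    return sum;
}

static void demo_mntget(struct demo_mount *m, int cpu)
{
    m->count[cpu]++;
}

static void demo_make_longterm(struct demo_mount *m)
{
    m->longterm++;
}

static void demo_make_shortterm(struct demo_mount *m)
{
    assert(m->longterm > 0);
    m->longterm--; /* kernel: done under the vfsmount lock, exclusive */
}

/* Returns true when the cheap path was taken. */
static bool demo_mntput(struct demo_mount *m, int cpu)
{
    /* kernel: vfsmount lock held shared while checking longterm */
    if (m->longterm > 0) {
        /* Cannot be the final reference: just decrement locally. */
        m->count[cpu]--;
        return true;
    }
    /* kernel: vfsmount lock held exclusive for decrement-and-check */
    m->count[cpu]--;
    if (demo_sum(m) == 0)
        m->freed = true;
    return false;
}
```

A reference taken on one CPU may be dropped on another, which is why individual slots can be negative and only the sum is meaningful.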

14 Jan, 2011

1 commit

275220f0f Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block

* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...

Linus Torvalds
2011-01-14 02:45:01 +0800

07 Jan, 2011

2 commits

b3e19d924 fs: scale mntget/mntput

The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
which often go to the same mount point.

The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.

We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.

- check the global sum once every interval (this will delay zero detection
for some interval, so it's probably a showstopper for vfsmounts).

- keep a local count and only take the global sum when local reaches 0 (this
is difficult for vfsmounts, because we can't hold preempt off for the life of
a reference, so a counter would need to be per-thread or tied strongly to a
particular CPU which requires more locking).

- keep a local difference of increments and decrements, which allows us to sum
the total difference and hence find the refcount when summing all CPUs. Then,
keep a single integer "long" refcount for slow and long lasting references,
and only take the global sum of local counters when the long refcount is 0.

This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.

This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.

This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:33 +0800
ceb5bdc2d fs: dcache per-bucket dcache hash locking

We can turn the dcache hash locking from a global dcache_hash_lock into
per-bucket locking.

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:31 +0800
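
The general idea of per-bucket locking can be sketched in userspace as below. This is a minimal illustration, assuming one spinlock per bucket built on C11 atomic_flag; it does not claim to match how the kernel embeds the lock in the hash chain, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BUCKETS 8

struct demo_entry {
    const char *name;
    struct demo_entry *next;
};

/* Each bucket carries its own lock instead of sharing a global one. */
struct demo_bucket {
    atomic_flag lock;
    struct demo_entry *head;
};

static struct demo_bucket demo_table[DEMO_BUCKETS];

static void demo_init(void)
{
    for (int i = 0; i < DEMO_BUCKETS; i++) {
        atomic_flag_clear(&demo_table[i].lock);
        demo_table[i].head = NULL;
    }
}

static unsigned demo_hash(const char *s)
{
    unsigned h = 5381;

    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % DEMO_BUCKETS;
}

static void demo_lock(struct demo_bucket *b)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
        ;
}

static void demo_unlock(struct demo_bucket *b)
{
    atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

/* Only the target bucket is locked; other buckets stay available. */
static void demo_insert(struct demo_entry *e)
{
    struct demo_bucket *b = &demo_table[demo_hash(e->name)];

    assert(e->name != NULL);
    demo_lock(b);
    e->next = b->head;
    b->head = e;
    demo_unlock(b);
}

static struct demo_entry *demo_lookup(const char *name)
{
    struct demo_bucket *b = &demo_table[demo_hash(name)];
    struct demo_entry *e;

    demo_lock(b);
    for (e = b->head; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            break;
    demo_unlock(b);
    return e;
}
```

Two threads contend only when their names hash to the same bucket, which is the scalability gain over a single global lock.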

13 Nov, 2010

2 commits

d4d776299 block: clean up blkdev_get() wrappers and their users

After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.

All users are converted. Most conversions are mechanical and don't
introduce any behavior difference. There are several exceptions.

* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
reason to OR it explicitly on blkdev_put().

* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
sb->s_mode.

* With the above changes, sb->s_mode now always should contain
FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
errors.

The new blkdev_get_*() functions come with proper docbook comments.
While at it, add function description to blkdev_get() too.

Signed-off-by: Tejun Heo
Cc: Philipp Reisner
Cc: Neil Brown
Cc: Mike Snitzer
Cc: Joern Engel
Cc: Chris Mason
Cc: Jan Kara
Cc: "Theodore Ts'o"
Cc: KONISHI Ryusuke
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro

Tejun Heo
2010-11-13 18:55:18 +0800
e525fd89d block: make blkdev_get/put() handle exclusive access

Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.

* blkdev_get/put() are the primary open and close functions.

* bd_claim/release() deal with exclusive open.

* open/close_bdev_exclusive() are combination of open and claim and
the other way around, respectively.

* bd_link/unlink_disk_holder() to create and remove holder/slave
symlinks.

* open_by_devnum() wraps bdget() + blkdev_get().

The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open is for another
exclusive access. Reorganize the interface such that,

* blkdev_get() is extended to include exclusive access management.
@holder argument is added and, if @FMODE_EXCL is specified, it will
gain exclusive access atomically w.r.t. other exclusive accesses.

* blkdev_put() is similarly extended. It now takes @mode argument and
if @FMODE_EXCL is set, it releases an exclusive access. Also, when
the last exclusive claim is released, the holder/slave symlinks are
removed automatically.

* bd_claim/release() and close_bdev_exclusive() are no longer
necessary and either made static or removed.

* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
is no longer necessary and removed.

* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
and blkdev_get(). It also has an unexpected extra bdev_read_only()
test which probably should be moved into blkdev_get().

* open_by_devnum() is modified to take @holder argument and pass it to
blkdev_get().

Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should). This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.

open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features. Well, let's leave them for another day.

Most conversions are straight-forward. drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.

Signed-off-by: Tejun Heo
Acked-by: Neil Brown
Acked-by: Ryusuke Konishi
Acked-by: Mike Snitzer
Acked-by: Philipp Reisner
Cc: Peter Osterlund
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Jan Kara
Cc: Andrew Morton
Cc: Andreas Dilger
Cc: "Theodore Ts'o"
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Alex Elder
Cc: Christoph Hellwig
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen
Cc: Scott Branden
Cc: Chris Mason
Cc: Steven Whitehouse
Cc: Dave Kleikamp
Cc: Joern Engel
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro

Tejun Heo
2010-11-13 18:55:17 +0800

29 Oct, 2010

5 commits

ceefda693 switch get_sb_ns() users

Signed-off-by: Al Viro

Al Viro
2010-10-29 16:17:03 +0800
3c26ff6e4 convert get_sb_nodev() users

Signed-off-by: Al Viro

Al Viro
2010-10-29 16:16:31 +0800
fc14f2fef convert get_sb_single() users

Signed-off-by: Al Viro

Al Viro
2010-10-29 16:16:28 +0800
152a08366 new helper: mount_bdev()

... and switch of the obvious get_sb_bdev() users to ->mount()

Signed-off-by: Al Viro

Al Viro
2010-10-29 16:16:13 +0800
c96e41e92 beginning of transition: ->mount()

eventual replacement for ->get_sb() - does *not* get vfsmount,
return ERR_PTR(error) or root of subtree to be mounted.

Signed-off-by: Al Viro

Al Viro
2010-10-29 16:15:06 +0800

26 Oct, 2010

1 commit

63997e98a split invalidate_inodes()

Pull removal of fsnotify marks into generic_shutdown_super().
Split umount-time work into a new function - evict_inodes().
Make sure that invalidate_inodes() will be able to cope with
I_FREEING once we change locking in iput().

Signed-off-by: Al Viro

Al Viro
2010-10-26 09:27:18 +0800

18 Aug, 2010

1 commit

6416ccb78 fs: scale files_lock

Improve scalability of files_lock by adding per-cpu, per-sb files lists,
protected with an lglock. The lglock provides fast access to the per-cpu lists
to add and remove files. It also provides a snapshot of all the per-cpu lists
(although this is very slow).

One difficulty with this approach is that a file can be removed from the list
by another CPU. We must track which per-cpu list the file is on with a new
variable in the file struct (packed into a hole on 64-bit archs). Scalability
could suffer if files are frequently removed from different cpu's list.

However loads with frequent removal of files imply short interval between
adding and removing the files, and the scheduler attempts to avoid moving
processes too far away. Also, even in the case of cross-CPU removal, the
hardware has much more opportunity to parallelise cacheline transfers with N
cachelines than with 1.

A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
degenerates to contending on a single lock, which is no worse than before. When
more than one CPU are allocating files, even if they are always freed by
different CPUs, there will be more parallelism than the single-lock case.

Testing results:

On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
to remove the file, the number of times it is removed by the same CPU that
added it, and the number of times it is removed by the same node that added it.

Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)

So a file is removed from the same CPU it was added by over 90% of the time.
It remains within the same node 95% of the time.

Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.

             throughput
2.6.34-rc2   24.5
+patch       24.9

             us      sys     idle    IO wait (in %)
2.6.34-rc2   51.25   28.25   17.25   3.25
+patch       53.75   18.5    19      8.75

So significantly less CPU time spent in kernel code, higher idle time and
slightly higher throughput.

Single threaded performance difference was within the noise of microbenchmarks.
That is not to say penalty does not exist, the code is larger and more memory
accesses required so it will be slightly slower.

Cc: linux-kernel@vger.kernel.org
Cc: Tim Chen
Cc: Andi Kleen
Signed-off-by: Nick Piggin
Signed-off-by: Al Viro

Nick Piggin
2010-08-18 20:35:48 +0800
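
The bookkeeping described in this commit, where each file remembers which per-CPU list it sits on so that any CPU can remove it, can be modelled in userspace as below. This is a rough single-threaded sketch: the lists are reduced to counters, the lglock is only indicated in comments, and every demo_* name is hypothetical. The removals/cpu_hits counters mimic the "locks" and "cpu-hits" statistics quoted in the testing results.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NR_CPUS 4

/* Hypothetical file: records the CPU whose list it was added to. */
struct demo_file {
    int list_cpu;
    bool listed;
};

/* Hypothetical per-superblock state: one list length per CPU. */
struct demo_sb_files {
    int nr[DEMO_NR_CPUS];
    long removals; /* times a list lock would be taken to remove a file */
    long cpu_hits; /* removals done by the CPU that added the file */
};

static void demo_file_add(struct demo_sb_files *s, struct demo_file *f,
                          int this_cpu)
{
    /* kernel: only this CPU's list lock is needed here */
    f->list_cpu = this_cpu;
    f->listed = true;
    s->nr[this_cpu]++;
}

static void demo_file_del(struct demo_sb_files *s, struct demo_file *f,
                          int this_cpu)
{
    assert(f->listed);
    /* kernel: lock the list of f->list_cpu, which may differ from
     * this_cpu; that cross-CPU case is the one that costs cachelines */
    s->nr[f->list_cpu]--;
    s->removals++;
    if (f->list_cpu == this_cpu)
        s->cpu_hits++;
    f->listed = false;
}
```

With the boot-time figures above, cpu_hits / removals would be 23174 / 25049, or about 92.5%.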

10 Aug, 2010

3 commits

dca332528 no need for list_for_each_entry_safe()/resetting with superblock list

just delay __put_super() a bit

Signed-off-by: Al Viro

Al Viro
2010-08-10 04:49:02 +0800
7a4dec538 Fix sget() race with failing mount

If sget() finds a matching superblock being set up, it'll
grab an active reference to it and grab s_umount. That's
fine - we'll wait for completion of foofs_get_sb() that way.
However, if said foofs_get_sb() fails we'll end up holding
the halfway-created superblock. deactivate_locked_super()
called by foofs_get_sb() will just unlock the sucker since
we are holding another active reference to it.

What we need is a way to tell if superblock has been successfully
set up. Unfortunately, neither ->s_root nor the check for
MS_ACTIVE quite fit. Cheap and easy way, suitable for backport:
new flag set by the (only) caller of ->get_sb(). If that flag
isn't present by the time sget() grabbed s_umount on preexisting
superblock it has found, it's seeing a stillborn and should
just bury it with deactivate_locked_super() (and repeat the search).

Longer term we want to set that flag in ->get_sb() instances (and
check for it to distinguish between "sget() found us a live sb"
and "sget() has allocated an sb, we need to set it up" in there,
instead of checking ->s_root as we do now).

Signed-off-by: Al Viro
Cc: stable@kernel.org

Al Viro
2010-08-10 04:49:01 +0800
4f331f01b vfs: don't hold s_umount over close_bdev_exclusive() call

Fix an obscure AB-BA deadlock in get_sb_bdev().

When a superblock is mounted more than once get_sb_bdev() calls
close_bdev_exclusive() to drop the extra bdev reference while holding
s_umount. However, sb->s_umount nests inside bd_mutex during
__invalidate_device() and close_bdev_exclusive() acquires bd_mutex during
blkdev_put(); thus creating an AB-BA deadlock.

This condition doesn't trigger frequently. For this condition to be
visible to lockdep, the filesystem must occupy the whole device (as
__invalidate_device() only grabs bd_mutex for the whole device), the FS
must be mounted more than once and partition rescan should be issued while
the FS is still mounted.

Fix it by dropping s_umount over close_bdev_exclusive().

Signed-off-by: Tejun Heo
Reported-by: Ciprian Docan
Cc: Al Viro
Acked-by: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro

Tejun Heo
2010-08-10 04:48:59 +0800

30 Jun, 2010

1 commit

57439f878 fs: fix superblock iteration race

list_for_each_entry_safe is not suitable to protect against concurrent
modification of the list. 6754af6 introduced a race in sb walking.

list_for_each_entry can use the trick of pinning the current entry in
the list before we drop and retake the lock because it subsequently
follows cur->next. However list_for_each_entry_safe saves n=cur->next
for following before entering the loop body, so when the lock is
dropped, n may be deleted.

Signed-off-by: Nick Piggin
Cc: Christoph Hellwig
Cc: John Stultz
Cc: Frank Mayhar
Cc: Al Viro
Signed-off-by: Linus Torvalds

npiggin@suse.de
2010-06-30 01:38:22 +0800
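
The difference between the two iteration styles can be reproduced in a small userspace model. This is a sketch with hypothetical demo_* names on a singly linked list: the loop body stands in for the window where the lock is dropped, and a "dead" flag stands in for a freed entry so the effect can be observed without undefined behaviour.

```c
#include <assert.h>
#include <stddef.h>

struct demo_sb {
    int id;
    int dead; /* set when the entry has been removed from the list */
    struct demo_sb *next;
};

/* Models a concurrent removal of the entry that follows cur. */
static void demo_unlink_next(struct demo_sb *cur)
{
    struct demo_sb *victim;

    assert(cur != NULL);
    victim = cur->next;
    if (victim) {
        cur->next = victim->next;
        victim->next = NULL;
        victim->dead = 1;
    }
}

/* Loop body: while "the lock is dropped" on entry 1, its successor
 * disappears from the list. */
static void demo_body(struct demo_sb *sb)
{
    if (sb->id == 1)
        demo_unlink_next(sb);
}

/* list_for_each_entry style: cur->next is read after the body ran,
 * so the walk sees the updated list. Returns dead entries visited. */
static int demo_walk_plain(struct demo_sb *head)
{
    int dead_visits = 0;

    for (struct demo_sb *cur = head; cur; cur = cur->next) {
        dead_visits += cur->dead;
        demo_body(cur);
    }
    return dead_visits;
}

/* list_for_each_entry_safe style: n is saved before the body runs,
 * so it may point at an entry that was removed in the meantime. */
static int demo_walk_safe(struct demo_sb *head)
{
    int dead_visits = 0;
    struct demo_sb *cur = head;
    struct demo_sb *n = cur ? cur->next : NULL;

    while (cur) {
        dead_visits += cur->dead;
        demo_body(cur);
        cur = n;
        n = cur ? cur->next : NULL;
    }
    return dead_visits;
}
```

In the kernel the stale entry could already be freed, so the second walk would be a use-after-free rather than a harmless visit.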

31 May, 2010

1 commit

d28619f15 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
quota: Convert quota statistics to generic percpu_counter
ext3 uses rb_node = NULL; to zero rb_root.
quota: Fixup dquot_transfer
reiserfs: Fix resuming of quotas on remount read-write
pohmelfs: Remove dead quota code
ufs: Remove dead quota code
udf: Remove dead quota code
quota: rename default quotactl methods to dquot_
quota: explicitly set ->dq_op and ->s_qcop
quota: drop remount argument to ->quota_on and ->quota_off
quota: move unmount handling into the filesystem
quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers
quota: move remount handling into the filesystem
ocfs2: Fix use after free on remount read-only

Fix up conflicts in fs/ext4/super.c and fs/ufs/file.c

Linus Torvalds
2010-05-31 00:11:11 +0800

28 May, 2010

1 commit

7000d3c42 fs/super: fix kernel-doc warning

Fix fs/super.c kernel-doc warning and function notation:
Warning(fs/super.c:957): No description found for parameter 'sb'

Signed-off-by: Randy Dunlap
Cc: Alexander Viro
Signed-off-by: Al Viro

Randy Dunlap
2010-05-28 10:06:23 +0800

24 May, 2010

3 commits

123e9caf1 quota: explicitly set ->dq_op and ->s_qcop

Only set the quota operation vectors if the filesystem actually supports
quota instead of doing it for all filesystems in alloc_super().

[Jan Kara: Export dquot_operations and vfs_quotactl_ops]

Signed-off-by: Christoph Hellwig
Signed-off-by: Jan Kara

Christoph Hellwig
2010-05-24 20:10:17 +0800
e0ccfd959 quota: move unmount handling into the filesystem

Currently the VFS calls into the quotactl interface for unmounting
filesystems. This means filesystems with their own quota handling
can't easily distinguish between user-space originating quotaoff
and an unmount. Instead move the responsibility of the unmount handling
into the filesystem to be consistent with all other dquot handling.

Note that we do call dquot_disable a lot later now, e.g. after
a sync_filesystem. But this is fine as the quota code does all its
writes via blockdev's mapping and that is synced even later.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jan Kara

Christoph Hellwig
2010-05-24 20:09:12 +0800
c79d967de quota: move remount handling into the filesystem

Currently do_remount_sb calls into the dquot code to tell it about going
from rw to ro and ro to rw. Move this code into the filesystem to
not depend on the dquot code in the VFS - note ocfs2 already ignores
these calls and handles remount by itself. This gets rid of overloading
the quotactl calls and allows to unify the VFS and XFS codepaths in
that area later.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jan Kara

Christoph Hellwig
2010-05-24 20:06:39 +0800

22 May, 2010

13 commits

51ee049e7 vfs: add lockdep annotation to s_vfs_rename_key for ecryptfs

> =============================================
> [ INFO: possible recursive locking detected ]
> 2.6.31-2-generic #14~rbd3
> ---------------------------------------------
> firefox-3.5/4162 is trying to acquire lock:
> (&s->s_vfs_rename_mutex){+.+.+.}, at: [] lock_rename+0x41/0xf0
>
> but task is already holding lock:
> (&s->s_vfs_rename_mutex){+.+.+.}, at: [] lock_rename+0x41/0xf0
>
> other info that might help us debug this:
> 3 locks held by firefox-3.5/4162:
> #0: (&s->s_vfs_rename_mutex){+.+.+.}, at: [] lock_rename+0x41/0xf0
> #1: (&sb->s_type->i_mutex_key#11/1){+.+.+.}, at: [] lock_rename+0x6a/0xf0
> #2: (&sb->s_type->i_mutex_key#11/2){+.+.+.}, at: [] lock_rename+0x7f/0xf0
>
> stack backtrace:
> Pid: 4162, comm: firefox-3.5 Tainted: G C 2.6.31-2-generic #14~rbd3
> Call Trace:
> [] print_deadlock_bug+0xf4/0x100
> [] validate_chain+0x4c6/0x750
> [] __lock_acquire+0x237/0x430
> [] lock_acquire+0xa5/0x150
> [] ? lock_rename+0x41/0xf0
> [] __mutex_lock_common+0x4d/0x3d0
> [] ? lock_rename+0x41/0xf0
> [] ? lock_rename+0x41/0xf0
> [] ? ecryptfs_rename+0x99/0x170
> [] mutex_lock_nested+0x46/0x60
> [] lock_rename+0x41/0xf0
> [] ecryptfs_rename+0xca/0x170
> [] vfs_rename_dir+0x13e/0x160
> [] vfs_rename+0xee/0x290
> [] ? __lookup_hash+0x102/0x160
> [] sys_renameat+0x252/0x280
> [] ? cp_new_stat+0xe4/0x100
> [] ? sysret_check+0x2e/0x69
> [] ? trace_hardirqs_on_caller+0x14d/0x190
> [] sys_rename+0x1b/0x20
> [] system_call_fastpath+0x16/0x1b

The trace above is totally reproducible by doing a cross-directory
rename on an ecryptfs directory.

The issue seems to be that sys_renameat() does lock_rename() then calls
into the filesystem; if the filesystem is ecryptfs, then
ecryptfs_rename() again does lock_rename() on the lower filesystem, and
lockdep can't tell that the two s_vfs_rename_mutexes are different. It
seems an annotation like the following is sufficient to fix this (it
does get rid of the lockdep trace in my simple tests); however I would
like to make sure I'm not misunderstanding the locking, hence the CC
list...

Signed-off-by: Roland Dreier
Cc: Tyler Hicks
Cc: Dustin Kirkland
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro

Roland Dreier
2010-05-22 06:31:22 +0800
18e9e5104 Introduce freeze_super and thaw_super for the fsfreeze ioctl

Currently the way we do freezing is by passing sb->s_bdev to freeze_bdev and then
letting it do all the work. But freezing is more of an fs thing, and doesn't
really have much to do with the bdev at all, all the work gets done with the
super. In btrfs we do not populate s_bdev, since we can have multiple bdev's
for one fs and setting s_bdev makes removing devices from a pool kind of tricky.
This means that freezing a btrfs filesystem fails, which causes us to corrupt
with things like tux-on-ice which use the fsfreeze mechanism. So instead of
populating sb->s_bdev with a random bdev in our pool, I've broken the actual fs
freezing stuff into freeze_super and thaw_super. These just take the
super_block that we're freezing and do the appropriate work. It's basically
just copy and pasted from freeze_bdev. I've then converted freeze_bdev over to
use the new super helpers. I've tested this with ext4 and btrfs and verified
everything continues to work the same as before.

The only new gotcha is multiple calls to the fsfreeze ioctl will return EBUSY if
the fs is already frozen. I thought this was a better solution than adding a
freeze counter to the super_block, but if everybody hates this idea I'm open to
suggestions. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Al Viro

Josef Bacik
2010-05-22 06:31:18 +0800
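
The "second freeze returns EBUSY" behaviour described above can be modelled with a small state machine. This is a userspace sketch only: demo_super and both functions are hypothetical, the real work (syncing, calling into the filesystem) is reduced to a comment, and the -EINVAL result for thawing an unfrozen superblock is an assumption of this sketch rather than something the commit message states.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the superblock's freeze state. */
struct demo_super {
    bool frozen;
};

static int demo_freeze_super(struct demo_super *sb)
{
    assert(sb != NULL);
    if (sb->frozen)
        return -EBUSY; /* no freeze counter: a second freeze is refused */
    /* real code would sync and call the filesystem's freeze hook here */
    sb->frozen = true;
    return 0;
}

static int demo_thaw_super(struct demo_super *sb)
{
    assert(sb != NULL);
    if (!sb->frozen)
        return -EINVAL; /* assumption: thawing an unfrozen fs is an error */
    sb->frozen = false;
    return 0;
}
```

Because there is no nesting count, a single thaw always undoes the freeze, regardless of how many freeze attempts were refused in between.
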
e1e46bf18 Trim includes in fs/super.c

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:17 +0800
d3f214730 Move grabbing s_umount to callers of grab_super()

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:17 +0800
7ed1ee611 Take statfs variants to fs/statfs.c

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:17 +0800
01a05b337 new helper: iterate_supers()

... and switch the simple "loop over superblocks and do something"
loops to it.

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:16 +0800
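
The shape of a "loop over superblocks and do something" helper can be sketched in userspace as a callback-driven walk. This is a hedged illustration: the demo_* names and the singly linked list are hypothetical, locking and reference pinning are omitted, and the hashed flag models the rule from 551de6f34 that entries with empty s_instances are skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_sb {
    int id;
    bool hashed; /* models "s_instances is not empty" */
    struct demo_sb *next;
};

/* Call f(sb, arg) for every live superblock on the list. */
static void demo_iterate_supers(struct demo_sb *head,
                                void (*f)(struct demo_sb *, void *),
                                void *arg)
{
    assert(f != NULL);
    for (struct demo_sb *sb = head; sb; sb = sb->next) {
        if (!sb->hashed)
            continue; /* still on the list, but already going away */
        f(sb, arg);
    }
}

/* Example callback: counts the superblocks it is shown. */
static void demo_count(struct demo_sb *sb, void *arg)
{
    (void)sb;
    ++*(int *)arg;
}
```

Centralising the walk means callers supply only the per-superblock action and cannot get the skipping rule wrong.
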
35cf7ba0b Bury __put_super_and_need_restart()

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:16 +0800
df40c01a9 In get_super() and user_get_super() restarts are unconditional

If superblock had been still alive, we would've returned it...

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:16 +0800
1494583de fix get_active_super()/umount() race

This one needs restarts...

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:15 +0800
e7fe0585c fix do_emergency_remount()/umount() races

need list_for_each_entry_safe() here. Original didn't even
have restart logics, so if you race with umount() it blew up.

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:15 +0800
6754af646 Convert simple loops over superblocks to list_for_each_entry_safe

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:15 +0800
8edd64bd6 get rid of restarts in sync_filesystems()

At the same time we can kill s_need_restart and local mutex in there.
__put_super() made public for a while; will be gone later.

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:15 +0800
551de6f34 Leave superblocks on s_list until the end

We used to remove from s_list and s_instances at the same
time. So let's *not* do the former and skip superblocks
that have empty s_instances in the loops over s_list.

The next step, of course, will be to get rid of rescan logics
in those loops.

Signed-off-by: Al Viro

Al Viro
2010-05-22 06:31:14 +0800