Doug / smarc-fsl-linux-kernel | Embedian Git Server

10 Jan, 2011

1 commit

39191628e fs: fix namei.c kernel-doc notation

Fix new kernel-doc notation warnings in fs/namei.c and spell
ECHILD correctly.

Warning(fs/namei.c:218): No description found for parameter 'flags'
Warning(fs/namei.c:425): Excess function parameter 'Returns' description in 'nameidata_drop_rcu'
Warning(fs/namei.c:478): Excess function parameter 'Returns' description in 'nameidata_dentry_drop_rcu'
Warning(fs/namei.c:540): Excess function parameter 'Returns' description in 'nameidata_drop_rcu_last'

Signed-off-by: Randy Dunlap
Cc: Nick Piggin
Signed-off-by: Linus Torvalds

Randy Dunlap
2011-01-10 23:38:53 +0800

07 Jan, 2011

9 commits

b3e19d924 fs: scale mntget/mntput

The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
and these lookups often go to the same mount point.

The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.

We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.

- check the global sum once every interval (this will delay zero detection
for some interval, so it's probably a showstopper for vfsmounts).

- keep a local count and only take the global sum when local reaches 0 (this
is difficult for vfsmounts, because we can't hold preempt off for the life of
a reference, so a counter would need to be per-thread or tied strongly to a
particular CPU which requires more locking).

- keep a local difference of increments and decrements, which allows us to sum
the total difference and hence find the refcount when summing all CPUs. Then,
keep a single integer "long" refcount for slow and long lasting references,
and only take the global sum of local counters when the long refcount is 0.

This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.

This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.

This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.
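
The third scheme above can be modelled in user space roughly as follows. This is a single-threaded sketch under stated assumptions, not the code from the patch: `struct mount_ref`, `NR_CPUS_MODEL` and the `ref_*` helpers are hypothetical names, and the locking that the real mntput needs is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 4 /* assumption: fixed CPU count for the sketch */

/* Hypothetical model of a mount's reference state. */
struct mount_ref {
    long longrefs;             /* long-lived refs: attached mount, root, cwd */
    long local[NR_CPUS_MODEL]; /* per-CPU increments minus decrements */
};

/* Sum of the per-CPU differences; only the total is meaningful. */
static long ref_sum(const struct mount_ref *m)
{
    long sum = 0;
    for (int i = 0; i < NR_CPUS_MODEL; i++)
        sum += m->local[i];
    return sum;
}

/* Short reference: a plain increment on the caller's CPU slot. */
static void ref_get(struct mount_ref *m, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    m->local[cpu]++;
}

/*
 * Drop a short reference, possibly on a different CPU than the get, so
 * a single slot may go negative. Returns true if the object is now
 * unreferenced. The global sum is skipped while a long ref exists.
 */
static bool ref_put(struct mount_ref *m, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    m->local[cpu]--;
    if (m->longrefs != 0)
        return false;
    return ref_sum(m) == 0;
}

/* Long reference: a single shared counter. */
static void ref_get_long(struct mount_ref *m)
{
    m->longrefs++;
}

/* Dropping the last long reference is the other place the sum is taken. */
static bool ref_put_long(struct mount_ref *m)
{
    if (--m->longrefs != 0)
        return false;
    return ref_sum(m) == 0;
}
```

In this model a get on one CPU followed by a put on another leaves slots at +1 and -1, which is why only the sum across all slots says anything about the refcount.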

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:33 +0800
b74c79e99 fs: provide rcu-walk aware permission i_ops

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:29 +0800
34286d666 fs: rcu-walk aware d_revalidate method

Require filesystems be aware of .d_revalidate being called in rcu-walk
mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
-ECHILD from all implementations.
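
A minimal sketch of what such a push-down might look like in one filesystem's method. The struct types, the function name and the `LOOKUP_RCU` value are stand-ins chosen for illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>

#define LOOKUP_RCU 0x0040 /* assumption: value chosen for the sketch */

/* Hypothetical stand-ins for struct nameidata and struct dentry. */
struct nameidata_model { unsigned int flags; };
struct dentry_model { int valid; };

/*
 * Hypothetical filesystem ->d_revalidate. Returns 1 if the dentry is
 * still valid, 0 if it must be invalidated, and -ECHILD when called in
 * rcu-walk mode so that the caller retries in ref-walk mode.
 */
static int demo_d_revalidate(struct dentry_model *dentry,
                             struct nameidata_model *nd)
{
    assert(dentry && nd);
    if (nd->flags & LOOKUP_RCU)
        return -ECHILD;
    return dentry->valid;
}
```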

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:29 +0800
fb045adb9 fs: dcache reduce branches in lookup path

Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.

Patched with:

git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
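
The idea behind d_set_d_op() can be sketched as below. This is an illustrative model only: the struct layouts, the flag names and their values are assumptions, and `needs_revalidate()` is a hypothetical helper showing how the lookup path can test a flag instead of dereferencing d_op.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed flag bits, one per operation the lookup path cares about. */
#define DCACHE_OP_HASH       0x1000u
#define DCACHE_OP_COMPARE    0x2000u
#define DCACHE_OP_REVALIDATE 0x4000u
#define DCACHE_OP_DELETE     0x8000u

/* Simplified stand-in for the real operations table. */
struct dentry_operations_model {
    int (*d_revalidate)(void *dentry);
    int (*d_hash)(void *dentry);
    int (*d_compare)(void *dentry);
    int (*d_delete)(void *dentry);
};

struct dentry_model {
    unsigned int d_flags;
    const struct dentry_operations_model *d_op;
};

/* Record the ops pointer and cache which methods exist in d_flags. */
static void d_set_d_op(struct dentry_model *dentry,
                       const struct dentry_operations_model *op)
{
    assert(dentry != NULL);
    dentry->d_op = op;
    if (!op)
        return;
    if (op->d_hash)
        dentry->d_flags |= DCACHE_OP_HASH;
    if (op->d_compare)
        dentry->d_flags |= DCACHE_OP_COMPARE;
    if (op->d_revalidate)
        dentry->d_flags |= DCACHE_OP_REVALIDATE;
    if (op->d_delete)
        dentry->d_flags |= DCACHE_OP_DELETE;
}

/* Fast-path test: one flag check, no load of d_op or of the method. */
static int needs_revalidate(const struct dentry_model *dentry)
{
    return (dentry->d_flags & DCACHE_OP_REVALIDATE) != 0;
}

/* Example method so the sketch can be exercised. */
static int demo_revalidate(void *dentry)
{
    (void)dentry;
    return 1;
}
```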

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:28 +0800
c28cc3646 fs: fs_struct use seqlock

Use a seqlock in the fs_struct to enable us to take an atomic copy of the
complete cwd and root paths. Use this in the RCU lookup path to avoid a
thread-shared spinlock in RCU lookup operations.

Multi-threaded apps may now perform path lookups with scalability matching
multi-process apps. Operations such as stat(2) become very scalable for
multi-threaded workloads.
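
The read side of a sequence lock can be sketched in portable C11 as follows. This is a simplified model, not the kernel's seqcount API: `struct fs_model`, `struct path_pair` and the `fs_*` helpers are hypothetical, a single writer is assumed, and the plain copy of `paths` stands in for what the kernel does with its own barrier primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-ins for the root and cwd paths held in fs_struct. */
struct path_pair { int root; int pwd; };

struct fs_model {
    atomic_uint seq;        /* even: stable, odd: writer in progress */
    struct path_pair paths;
};

/* Wait for an even sequence number and return it. */
static unsigned int fs_read_begin(struct fs_model *fs)
{
    unsigned int s;
    while ((s = atomic_load_explicit(&fs->seq, memory_order_acquire)) & 1u)
        ; /* a writer is mid-update; spin */
    return s;
}

/* True if a writer ran since fs_read_begin(), so the copy must be redone. */
static bool fs_read_retry(struct fs_model *fs, unsigned int start)
{
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&fs->seq, memory_order_relaxed) != start;
}

/* Writer: bump to odd, update both paths, bump back to even. */
static void fs_set_paths(struct fs_model *fs, int root, int pwd)
{
    unsigned int s = atomic_fetch_add_explicit(&fs->seq, 1,
                                               memory_order_acq_rel);
    assert((s & 1u) == 0); /* single writer assumed */
    fs->paths.root = root;
    fs->paths.pwd = pwd;
    atomic_fetch_add_explicit(&fs->seq, 1, memory_order_release);
}

/* Reader: take a consistent copy of both paths without any lock. */
static struct path_pair fs_snapshot(struct fs_model *fs)
{
    struct path_pair copy;
    unsigned int s;
    do {
        s = fs_read_begin(fs);
        copy = fs->paths;
    } while (fs_read_retry(fs, s));
    return copy;
}
```

Readers never write to shared memory in this scheme, which is what removes the thread-shared spinlock from the lookup path.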

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:27 +0800
31e6b01f4 fs: rcu-walk for path lookup

Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.

This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.

The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
refcounts are not required for persistence. Also we are free to perform mount
lookups, and to assume dentry mount points and mount roots are stable up and
down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
so we can load this tuple atomically, and also check whether any of its
members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
sequence after the child is found in case anything changed in the parent
during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.

When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we cannot drop rcu-walk gracefully at the current point in the
lookup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.

Aside from the final dentry, there are other situations that may be encountered
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we cannot drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).

The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links

In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.

Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.
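
The -ECHILD retry described above amounts to a two-pass lookup at the top level. The sketch below models only that control flow: `lookup_with_fallback()`, the `walk_fn` type, `demo_walk()` and the `LOOKUP_RCU` value are hypothetical, and the demo walker simply pretends that any path containing "symlink" cannot be resolved in rcu-walk mode.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define LOOKUP_RCU 0x0040 /* assumption: value chosen for the sketch */

/* Hypothetical path walker: returns 0 on success or a negative errno. */
typedef int (*walk_fn)(const char *path, unsigned int flags);

/* Try the lockless rcu-walk first; redo the whole lookup on -ECHILD. */
static int lookup_with_fallback(walk_fn walk, const char *path,
                                unsigned int flags)
{
    int err;

    assert(walk != NULL);
    err = walk(path, flags | LOOKUP_RCU);
    if (err == -ECHILD)
        err = walk(path, flags); /* ref-walk: takes refcounts and locks */
    return err;
}

/* Demo walker: counts calls and bails out of rcu-walk on "symlink". */
static int demo_calls;

static int demo_walk(const char *path, unsigned int flags)
{
    demo_calls++;
    if ((flags & LOOKUP_RCU) && strstr(path, "symlink"))
        return -ECHILD;
    return 0;
}
```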

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:27 +0800
b5c84bf6f fs: dcache remove dcache_lock

dcache_lock no longer protects anything. Remove it.

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:23 +0800
b7ab39f63 fs: dcache scale dentry refcount

Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.
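
A rough user-space model of why a lock-protected count helps: while d_lock is held, a count observed as 0 cannot change, so teardown can proceed safely. The types and helpers here are hypothetical, and a C11 atomic_flag spinlock stands in for the kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical dentry with a plain int count guarded by a spinlock. */
struct dentry_model {
    atomic_flag d_lock;
    int d_count; /* non-atomic; only touched with d_lock held */
};

static void dentry_lock(struct dentry_model *d)
{
    while (atomic_flag_test_and_set_explicit(&d->d_lock,
                                             memory_order_acquire))
        ; /* spin */
}

static void dentry_unlock(struct dentry_model *d)
{
    atomic_flag_clear_explicit(&d->d_lock, memory_order_release);
}

static void dget_model(struct dentry_model *d)
{
    dentry_lock(d);
    d->d_count++;
    dentry_unlock(d);
}

static void dput_model(struct dentry_model *d)
{
    dentry_lock(d);
    assert(d->d_count > 0);
    d->d_count--;
    dentry_unlock(d);
}

/*
 * Reclaim only if unreferenced. Because the check and the teardown
 * happen under the same lock, nobody can take a reference in between.
 */
static bool try_reclaim(struct dentry_model *d)
{
    bool unused;

    dentry_lock(d);
    unused = (d->d_count == 0);
    /* a real implementation would unhash and free the dentry here */
    dentry_unlock(d);
    return unused;
}
```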

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:21 +0800
b1e6a015a fs: change d_hash for rcu-walk

Change d_hash so it may be called from lock-free RCU lookups. See similar
patch for d_compare for details.

For in-tree filesystems, this is just a mechanical change.

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:20 +0800

08 Dec, 2010

1 commit

b1085ba80 fanotify: if set by user unset FMODE_NONOTIFY before fsnotify_perm() is called

Unsetting FMODE_NONOTIFY in fsnotify_open() is too late, since fsnotify_perm()
is called before. If FMODE_NONOTIFY is set fsnotify_perm() will skip permission
checks, so a user can still disable permission checks by setting this flag
in an open() call.
This patch corrects this by unsetting the flag before fsnotify_perm is called.

Signed-off-by: Lino Sanfilippo
Signed-off-by: Eric Paris

Lino Sanfilippo
2010-12-08 05:14:21 +0800

29 Oct, 2010

1 commit

d893f1bc2 fix open/umount race

nameidata_to_filp() drops nd->path or transfers it to opened
file. In the former case it's a Bad Idea(tm) to do mnt_drop_write()
on nd->path.mnt, since we might race with umount and vfsmount in
question might be gone already.

Fix: don't drop it, then... IOW, have nameidata_to_filp() grab nd->path
in case it transfers it to file and do path_drop() in callers. After
they are through with accessing nd->path...

Reported-by: Miklos Szeredi
Signed-off-by: Al Viro

Al Viro
2010-10-29 16:14:56 +0800

26 Oct, 2010

2 commits

7de9c6ee3 new helper: ihold()

Clones an existing reference to inode; caller must already hold one.
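
As an illustration of that contract, a user-space model might look like this. `struct inode_model` and `ihold_model()` are hypothetical; the assertion encodes the stated precondition that the caller already holds a reference.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical inode carrying only its reference count. */
struct inode_model {
    atomic_int i_count;
};

/*
 * Clone a reference the caller already holds. The count must have been
 * at least 1 before the call, so it is at least 2 afterwards.
 * Returns the new count.
 */
static int ihold_model(struct inode_model *inode)
{
    int now = atomic_fetch_add(&inode->i_count, 1) + 1;

    assert(now >= 2);
    return now;
}
```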

Signed-off-by: Al Viro

Al Viro
2010-10-26 09:26:11 +0800
81fca4440 fs: move permission check back into __lookup_hash

The caller that didn't need it is gone.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2010-10-26 09:18:19 +0800

18 Aug, 2010

4 commits

99b7db7b8 fs: brlock vfsmount_lock

Use a brlock for the vfsmount lock. It must be taken for write whenever
modifying the mount hash or associated fields, and may be taken for read when
performing mount hash lookups.
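
The read/write asymmetry of a brlock ("big reader" lock) can be modelled as below. This is a sketch under assumptions, not the kernel implementation: `struct brlock_model`, `NR_CPUS_MODEL` and the helper names are hypothetical, and each per-CPU lock is a simple C11 atomic flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 4 /* assumption: fixed CPU count for the sketch */

/* One lock slot per CPU. */
struct brlock_model {
    atomic_bool held[NR_CPUS_MODEL];
};

static bool slot_trylock(atomic_bool *slot)
{
    bool expected = false;
    return atomic_compare_exchange_strong(slot, &expected, true);
}

static void slot_lock(atomic_bool *slot)
{
    while (!slot_trylock(slot))
        ; /* spin */
}

static void slot_unlock(atomic_bool *slot)
{
    atomic_store(slot, false);
}

/* Readers touch only their own CPU's slot: cheap and cache-local. */
static void br_read_lock(struct brlock_model *l, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    slot_lock(&l->held[cpu]);
}

static bool br_read_trylock(struct brlock_model *l, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    return slot_trylock(&l->held[cpu]);
}

static void br_read_unlock(struct brlock_model *l, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    slot_unlock(&l->held[cpu]);
}

/* Writers take every slot in a fixed order, excluding all readers. */
static void br_write_lock(struct brlock_model *l)
{
    for (int i = 0; i < NR_CPUS_MODEL; i++)
        slot_lock(&l->held[i]);
}

static void br_write_unlock(struct brlock_model *l)
{
    for (int i = 0; i < NR_CPUS_MODEL; i++)
        slot_unlock(&l->held[i]);
}
```

The write side costs one lock acquisition per CPU, which is consistent with the slower umount slowpath reported in this commit.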

A new lock is added for the mnt-id allocator, so it doesn't need to take
the heavy vfsmount write-lock.

The number of atomics should remain the same for fastpath rlock cases, though
code would be slightly slower due to per-cpu access. Scalability is not likely to be
much improved in common cases yet, due to other locks (ie. dcache_lock) getting
in the way. However path lookups crossing mountpoints should be one case where
scalability is improved (currently requiring the global lock).

The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
Altix system (high latency to remote nodes), a simple umount microbenchmark
(mount --bind mnt mnt2 ; umount mnt2, looped 1000 times) took 6.8s before this
patch and 7.1s afterwards, about 5% slower.

Cc: Al Viro
Signed-off-by: Nick Piggin
Signed-off-by: Al Viro

Nick Piggin
2010-08-18 20:35:48 +0800
b04f784e5 fs: remove extra lookup in __lookup_hash

Optimize lookup for create operations, where finding no dentry is often the
common case. In cases where it is not, such as unlink, the added overhead
is much smaller than the removed.

Also, move comments about __d_lookup raciness to the __d_lookup call site.
d_lookup is intuitive; __d_lookup is what needs commenting. So in that same
vein, add kerneldoc comments to __d_lookup and clean up some of the comments:

- We are interested in how the RCU lookup works here, particularly with
renames. Make that explicit, and point to the document where it is explained
in more detail.
- RCU is pretty standard now, and macros make implementations pretty mindless.
If we want to know about RCU barrier details, we look in RCU code.
- Delete some boring legacy comments because we don't care much about how the
code used to work, more about the interesting parts of how it works now. So
comments about lazy LRU may be interesting, but would better be done in the
LRU or refcount management code.

Signed-off-by: Nick Piggin
Signed-off-by: Al Viro

Nick Piggin
2010-08-18 20:35:47 +0800
baa038907 fs: dentry allocation consolidation

There are 2 duplicate copies of code in dentry allocation in path lookup.
Consolidate them into a single function.

Signed-off-by: Nick Piggin
Signed-off-by: Al Viro

Nick Piggin
2010-08-18 20:35:45 +0800
2e2e88ea8 fs: fix do_lookup false negative

In do_lookup, if we initially find no dentry, we take the directory i_mutex and
re-check the lookup. If we find a dentry there, then we revalidate it if
needed. However if that revalidate asks for the dentry to be invalidated, we
return -ENOENT from do_lookup. What should happen instead is an attempt to
allocate and lookup a new dentry.

This is probably not noticed because it is rare. It is only reached if a
concurrent create races in first (in which case, the dentry probably won't be
invalidated anyway), or if the racy __d_lookup has failed due to a
false-negative (which is very rare).

Fix this by removing code and have it use the normal reval path.

Signed-off-by: Nick Piggin
Signed-off-by: Al Viro

Nick Piggin
2010-08-18 20:35:45 +0800

11 Aug, 2010

2 commits

f7ad3c6be vfs: add helpers to get root and pwd

Add three helpers that retrieve a refcounted copy of the root and cwd
from the supplied fs_struct.

get_fs_root()
get_fs_pwd()
get_fs_root_and_pwd()

Signed-off-by: Miklos Szeredi
Signed-off-by: Al Viro

Miklos Szeredi
2010-08-11 12:28:20 +0800
8c8946f50 Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify

* 'for-linus' of git://git.infradead.org/users/eparis/notify: (132 commits)
fanotify: use both marks when possible
fsnotify: pass both the vfsmount mark and inode mark
fsnotify: walk the inode and vfsmount lists simultaneously
fsnotify: rework ignored mark flushing
fsnotify: remove global fsnotify groups lists
fsnotify: remove group->mask
fsnotify: remove the global masks
fsnotify: cleanup should_send_event
fanotify: use the mark in handler functions
audit: use the mark in handler functions
dnotify: use the mark in handler functions
inotify: use the mark in handler functions
fsnotify: send fsnotify_mark to groups in event handling functions
fsnotify: Exchange list heads instead of moving elements
fsnotify: srcu to protect read side of inode and vfsmount locks
fsnotify: use an explicit flag to indicate fsnotify_destroy_mark has been called
fsnotify: use _rcu functions for mark list traversal
fsnotify: place marks on object in order of group memory address
vfs/fsnotify: fsnotify_close can delay the final work in fput
fsnotify: store struct file not struct path
...

Fix up trivial delete/modify conflict in fs/notify/inotify/inotify.c.

Linus Torvalds
2010-08-11 02:39:13 +0800

02 Aug, 2010

2 commits

d09ca7397 security: make LSMs explicitly mask off permissions

SELinux needs to pass the MAY_ACCESS flag so it can handle auditing
correctly. Presently the masking of MAY_* flags is done in the VFS. In
order to allow LSMs to decide what flags they care about and what flags
they don't, just pass them all and have each LSM mask off what it doesn't
need. This patch should contain no functional changes to either the VFS or
any LSM.
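
The shift in responsibility can be illustrated with two hypothetical LSM helpers that each mask the full set of flags themselves. The `MAY_*` values below are assumptions for the sketch, and neither function is taken from a real security module.

```c
#include <assert.h>

/* Assumed values for the sketch. */
#define MAY_EXEC   0x01
#define MAY_WRITE  0x02
#define MAY_READ   0x04
#define MAY_APPEND 0x08
#define MAY_ACCESS 0x10
#define MAY_OPEN   0x20

/* Hypothetical LSM that only understands the basic permission bits. */
static int simple_lsm_mask(int mask)
{
    return mask & (MAY_READ | MAY_WRITE | MAY_EXEC | MAY_APPEND);
}

/* Hypothetical LSM that also keeps MAY_ACCESS for its audit logic. */
static int auditing_lsm_mask(int mask)
{
    return mask & (MAY_READ | MAY_WRITE | MAY_EXEC | MAY_APPEND |
                   MAY_ACCESS);
}

/* Usage: the VFS passes every flag and each LSM keeps what it needs. */
static void demo_masks(void)
{
    int from_vfs = MAY_READ | MAY_ACCESS | MAY_OPEN;

    assert(simple_lsm_mask(from_vfs) == MAY_READ);
    assert(auditing_lsm_mask(from_vfs) == (MAY_READ | MAY_ACCESS));
}
```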

Signed-off-by: Eric Paris
Acked-by: Stephen D. Smalley
Signed-off-by: James Morris

Eric Paris
2010-08-02 13:35:07 +0800
ea0d3ab23 LSM: Remove unused arguments from security_path_truncate().

When commit be6d3e56a6b9b3a4ee44a0685e39e595073c6f0d "introduce new LSM hooks
where vfsmount is available." was proposed, regarding security_path_truncate(),
only "struct file *" argument (which AppArmor wanted to use) was removed.
But length and time_attrs arguments are not used by TOMOYO nor AppArmor.
Thus, let's remove these arguments.

Signed-off-by: Tetsuo Handa
Acked-by: Nick Piggin
Signed-off-by: James Morris

Tetsuo Handa
2010-08-02 13:33:40 +0800

28 Jul, 2010

1 commit

59b0df211 fsnotify: use unsigned char * for dentry->d_name.name

fsnotify was using char * when it passed around the d_name.name string
internally but it is actually an unsigned char *. This patch switches
fsnotify to use unsigned and should silence some pointer signedness warnings
which have popped out of xfs. I do not add -Wpointer-sign to the fsnotify
code as there are still issues with kstrdup and strlen which would pop
out needless warnings.

Signed-off-by: Eric Paris

Eric Paris
2010-07-28 21:59:01 +0800

28 May, 2010

1 commit

176306f59 VFS: fix recent breakage of FS_REVAL_DOT

Commit 1f36f774b22a0ceb7dd33eca626746c81a97b6a5 broke FS_REVAL_DOT semantics.

In particular, before this patch, the command
ls -l
in an NFS mounted directory would always check if the directory on the server
had changed and if so would flush and refill the pagecache for the dir.
After this patch, the same "ls -l" will repeatedly return stale data until
the cached attributes for the directory time out.

The following patch fixes this by ensuring the d_revalidate is called by
do_last when "." is being looked-up.
link_path_walk has already called d_revalidate, but in that case LOOKUP_OPEN
is not set so nfs_lookup_verify_inode chooses not to do any validation.

The following patch restores the original behaviour.

Cc: stable@kernel.org
Signed-off-by: NeilBrown
Signed-off-by: Al Viro

Neil Brown
2010-05-28 10:03:06 +0800

22 May, 2010

1 commit

9a2296832 namei.c : update mnt when it needed

Update the mnt of the path when it is not equal to the new one.

Signed-off-by: Huang Shijie
Signed-off-by: Al Viro

Huang Shijie
2010-05-22 06:31:22 +0800

15 May, 2010

1 commit

d83c49f3e Fix the regression created by "set S_DEAD on unlink()..." commit

1) i_flags simply doesn't work for mount/unlink race prevention;
we may have many links to file and rm on one of those obviously
shouldn't prevent bind on top of another later on. To fix it
the right way we need to mark _dentry_ as unsuitable for mounting
upon; new flag (DCACHE_CANT_MOUNT) is protected by d_flags and
i_mutex on the inode in question. Set it (with dont_mount(dentry))
in unlink/rmdir/etc., check (with cant_mount(dentry)) in places
in namespace.c that used to check for S_DEAD. Setting S_DEAD
is still needed in places where we used to set it (for directories
getting killed), since we rely on it for readdir/rmdir race
prevention.

2) rename()/mount() protection has another bogosity - we unhash
the target before we'd checked that it's not a mountpoint. Fixed.

3) ancient bogosity in pivot_root() - we locked i_mutex on the
right directory, but checked S_DEAD on the different (and wrong)
one. Noticed and fixed.

Signed-off-by: Al Viro

Al Viro
2010-05-15 19:16:33 +0800

13 May, 2010

1 commit

002baeecf vfs: Fix O_NOFOLLOW behavior for paths with trailing slashes

According to specification

mkdir d; ln -s d a; open("a/", O_NOFOLLOW | O_RDONLY)

should return success but currently it returns ELOOP. This is a
regression caused by path lookup cleanup patch series.

Fix the code to ignore O_NOFOLLOW in case the provided path has trailing
slashes.

Cc: Andrew Morton
Cc: Al Viro
Reported-by: Marius Tolzmann
Acked-by: Miklos Szeredi
Signed-off-by: Jan Kara
Signed-off-by: Linus Torvalds

Jan Kara
2010-05-13 23:46:04 +0800

27 Mar, 2010

1 commit

3e297b613 Restore LOOKUP_DIRECTORY hint handling in final lookup on open()

Lose want_dir argument, while we are at it - since now
nd->flags & LOOKUP_DIRECTORY is equivalent to it.

Signed-off-by: Al Viro

Al Viro
2010-03-27 00:41:05 +0800

08 Mar, 2010

1 commit

318ae2edc Merge branch 'for-next' into for-linus

Conflicts:
Documentation/filesystems/proc.txt
arch/arm/mach-u300/include/mach/debug-macro.S
drivers/net/qlge/qlge_ethtool.c
drivers/net/qlge/qlge_main.c
drivers/net/typhoon.c

Jiri Kosina
2010-03-08 23:55:37 +0800

07 Mar, 2010

1 commit

781b16775 Fix a dumb typo - use of & instead of &&

We managed to lose O_DIRECTORY testing due to a stupid typo in commit
1f36f774b2 ("Switch !O_CREAT case to use of do_last()")
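
Why this class of typo silently disables a flag test can be shown with a small stand-alone example. It is not the kernel expression from the commit: `DEMO_DIRECTORY` and both helpers are hypothetical. Since `!ptr` evaluates to 0 or 1, bitwise-ANDing it with a flag whose bit 0 is clear always yields 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_DIRECTORY 0x10000u /* hypothetical flag; bit 0 is clear */

/* Buggy: (flags & DEMO_DIRECTORY) & (0 or 1) is always 0. */
static bool reject_buggy(unsigned int flags, const void *lookup_op)
{
    return flags & DEMO_DIRECTORY & !lookup_op;
}

/* Fixed: logical AND of "flag is set" and "no lookup method". */
static bool reject_fixed(unsigned int flags, const void *lookup_op)
{
    return (flags & DEMO_DIRECTORY) && !lookup_op;
}

/* Usage: a directory was requested but the object has no lookup op. */
static void demo_typo(void)
{
    assert(!reject_buggy(DEMO_DIRECTORY, NULL)); /* check is lost */
    assert(reject_fixed(DEMO_DIRECTORY, NULL));  /* check works */
}
```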

Reported-by: Walter Sheets
Signed-off-by: Al Viro
Signed-off-by: Linus Torvalds

Al Viro
2010-03-07 02:54:48 +0800

06 Mar, 2010

1 commit

e213e26ab Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits)
quota: stop using QUOTA_OK / NO_QUOTA
dquot: cleanup dquot initialize routine
dquot: move dquot initialization responsibility into the filesystem
dquot: cleanup dquot drop routine
dquot: move dquot drop responsibility into the filesystem
dquot: cleanup dquot transfer routine
dquot: move dquot transfer responsibility into the filesystem
dquot: cleanup inode allocation / freeing routines
dquot: cleanup space allocation / freeing routines
ext3: add writepage sanity checks
ext3: Truncate allocated blocks if direct IO write fails to update i_size
quota: Properly invalidate caches even for filesystems with blocksize < pagesize
quota: generalize quota transfer interface
quota: sb_quota state flags cleanup
jbd: Delay discarding buffers in journal_unmap_buffer
ext3: quota_write cross block boundary behaviour
quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
quota: split out compat_sys_quotactl support from quota.c
quota: split out netlink notification support from quota.c
quota: remove invalid optimization from quota_sync_all
...

Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c

Linus Torvalds
2010-03-06 05:20:53 +0800

05 Mar, 2010

9 commits

1f36f774b Switch !O_CREAT case to use of do_last()

... and now we have all intents crap well localized

Signed-off-by: Al Viro

Al Viro
2010-03-05 22:22:25 +0800
def4af30c Get rid of symlink body copying

Now that nd->last stays around until ->put_link() is called, we can
just postpone that ->put_link() in do_filp_open() a bit and don't
bother with copying.

Signed-off-by: Al Viro

Al Viro
2010-03-05 22:01:40 +0800
3866248e5 Finish pulling of -ESTALE handling to upper level in do_filp_open()

Don't bother with path_walk() (and its retry loop); link_path_walk()
will do it.

Signed-off-by: Al Viro

Al Viro
2010-03-05 22:01:38 +0800
806b681cb Turn do_link spaghetty into a normal loop

Signed-off-by: Al Viro

Al Viro
2010-03-05 22:01:36 +0800
10fa8e62f Unify exits in O_CREAT handling

Signed-off-by: Al Viro

Al Viro
2010-03-05 22:01:35 +0800
9e67f3616 Kill is_link argument of do_last()

We set it to 1 iff we return NULL

Signed-off-by: Al Viro

Al Viro
2010-03-05 22:01:33 +0800
67ee3ad21 Pull handling of LAST_BIND into do_last(), clean up ok: part in do_filp_open()

Note that in case of !O_CREAT we know that nd.root has already been given up

Signed-off-by: Al Viro

Al Viro
2010-03-05 22:01:31 +0800
4296e2cbf Leave mangled flag only for setting nd.intent.open.flag

Nothing else uses it anymore

Signed-off-by: Al Viro

Al Viro
2010-03-05 22:01:29 +0800
5b369df82 Get rid of passing mangled flag to do_last()

Signed-off-by: Al Viro

Al Viro
2010-03-05 22:01:27 +0800