Doug / smarc-fsl-linux-kernel | Embedian Git Server

21 Dec, 2012

2 commits

b911a6bde vfs: d_obtain_alias() needs to use "/" as default name. ... Browse Code »

NFS appears to use d_obtain_alias() to create the root dentry rather than
d_make_root. This can cause 'prepend_path()' to complain that the root
has a weird name if an NFS filesystem is lazily unmounted. e.g. if
"/mnt" is an NFS mount then

{ cd /mnt; umount -l /mnt ; ls -l /proc/self/cwd; }

will cause a WARN message like
WARNING: at /home/git/linux/fs/dcache.c:2624 prepend_path+0x1d7/0x1e0()
...
Root dentry has weird name <>

to appear in kernel logs.

So change d_obtain_alias() to use "/" rather than "" as the anonymous
name.

Signed-off-by: NeilBrown
Cc: Trond Myklebust
Cc: Al Viro
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro

NeilBrown
2012-12-21 07:49:10 +0800
39e3c9553 vfs: remove DCACHE_NEED_LOOKUP ... Browse Code »

The code that relied on that flag was ripped out of btrfs quite some
time ago, and never added back. Josef indicated that he was going to
take a different approach to the problem in btrfs, and that we
could just eliminate this flag.

Cc: Josef Bacik
Signed-off-by: Jeff Layton
Signed-off-by: Al Viro

Jeff Layton
2012-12-21 02:57:36 +0800

03 Oct, 2012

1 commit

aab174f0d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs update from Al Viro:

- big one - consolidation of descriptor-related logics; almost all of
that is moved to fs/file.c

(BTW, I'm seriously tempted to rename the result to fd.c. As it is,
we have a situation when file_table.c is about handling of struct
file and file.c is about handling of descriptor tables; the reasons
are historical - file_table.c used to be about a static array of
struct file we used to have way back).

A lot of stray ends got cleaned up and converted to saner primitives,
disgusting mess in android/binder.c is still disgusting, but at least
doesn't poke so much in descriptor table guts anymore. A bunch of
relatively minor races got fixed in process, plus an ext4 struct file
leak.

- related thing - fget_light() partially unuglified; see fdget() in
there (and yes, it generates the code as good as we used to have).

- also related - bits of Cyrill's procfs stuff that got entangled into
that work; _not_ all of it, just the initial move to fs/proc/fd.c and
switch of fdinfo to seq_file.

- Alex's fs/coredump.c spiltoff - the same story, had been easier to
take that commit than mess with conflicts. The rest is a separate
pile, this was just a mechanical code movement.

- a few misc patches all over the place. Not all for this cycle,
there'll be more (and quite a few currently sit in akpm's tree)."

Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
MAX_LFS_FILESIZE should be a loff_t
compat: fs: Generic compat_sys_sendfile implementation
fs: push rcu_barrier() from deactivate_locked_super() to filesystems
btrfs: reada_extent doesn't need kref for refcount
coredump: move core dump functionality into its own file
coredump: prevent double-free on an error path in core dumper
usb/gadget: fix misannotations
fcntl: fix misannotations
ceph: don't abuse d_delete() on failure exits
hypfs: ->d_parent is never NULL or negative
vfs: delete surplus inode NULL check
switch simple cases of fget_light to fdget
new helpers: fdget()/fdput()
switch o2hb_region_dev_write() to fget_light()
proc_map_files_readdir(): don't bother with grabbing files
make get_file() return its argument
vhost_set_vring(): turn pollstart/pollstop into bool
switch prctl_set_mm_exe_file() to fget_light()
switch xfs_find_handle() to fget_light()
switch xfs_swapext() to fget_light()
...

Linus Torvalds
2012-10-03 11:25:04 +0800

30 Sep, 2012

1 commit

8110e16d4 vfs: dcache: fix deadlock in tree traversal ... Browse Code »

IBM reported a deadlock in select_parent(). This was found to be caused
by taking rename_lock when already locked when restarting the tree
traversal.

There are two cases when the traversal needs to be restarted:

1) concurrent d_move(); this can only happen when not already locked,
since taking rename_lock protects against concurrent d_move().

2) racing with final d_put() on child just at the moment of ascending
to parent; rename_lock doesn't protect against this rare race, so it
can happen when already locked.

Because of case 2, we need to be able to handle restarting the traversal
when rename_lock is already held. This patch fixes all three callers of
try_to_ascend().

IBM reported that the deadlock is gone with this patch.

[ I rewrote the patch to be smaller and just do the "goto again" if the
lock was already held, but credit goes to Miklos for the real work.
- Linus ]

Signed-off-by: Miklos Szeredi
Cc: Al Viro
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

Miklos Szeredi
2012-09-30 08:41:40 +0800

28 Sep, 2012

1 commit

fd5179094 trivial select_parent documentation fix ... Browse Code »

"Search list for X" sounds like you're trying to find X on a list.

Signed-off-by: J. Bruce Fields
Signed-off-by: Linus Torvalds

J. Bruce Fields
2012-09-28 06:43:08 +0800

27 Sep, 2012

1 commit

1fe0c0230 vfs: delete surplus inode NULL check ... Browse Code »

Each iteration of d_delete we reload inode from dentry->d_inode and
then call S_ISDIR(inode-i_mode), so inode cannot possibly be NULL
shortly afterwards unless something went horribly wrong.

Signed-off-by: Alan Cox
Signed-off-by: Al Viro

Alan Cox
2012-09-27 10:20:19 +0800

19 Sep, 2012

1 commit

b161dfa69 vfs: dcache: use DCACHE_DENTRY_KILLED instead of DCACHE_DISCONNECTED in d_kill() ... Browse Code »

IBM reported a soft lockup after applying the fix for the rename_lock
deadlock. Commit c83ce989cb5f ("VFS: Fix the nfs sillyrename regression
in kernel 2.6.38") was found to be the culprit.

The nfs sillyrename fix used DCACHE_DISCONNECTED to indicate that the
dentry was killed. This flag can be set on non-killed dentries too,
which results in infinite retries when trying to traverse the dentry
tree.

This patch introduces a separate flag: DCACHE_DENTRY_KILLED, which is
only set in d_kill() and makes try_to_ascend() test only this flag.

IBM reported successful test results with this patch.

Signed-off-by: Miklos Szeredi
Cc: Trond Myklebust
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

Miklos Szeredi
2012-09-19 02:23:51 +0800

14 Jul, 2012

3 commits

ee3efa91e __d_unalias() should refuse to move mountpoints ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-07-14 20:35:15 +0800
b3d9b7a3c vfs: switch i_dentry/d_alias to hlist ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-07-14 20:32:55 +0800
f7a99c5b7 get rid of ->mnt_longterm ... Browse Code »

it's enough to set ->mnt_ns of internal vfsmounts to something
distinct from all struct mnt_namespace out there; then we can
just use the check for ->mnt_ns != NULL in the fast path of
mntput_no_expire()

Signed-off-by: Al Viro

Al Viro
2012-07-14 20:32:47 +0800

09 Jun, 2012

1 commit

32ba9c3fc Revert "vfs: stop d_splice_alias creating directory aliases" ... Browse Code »

This reverts commit 7732a557b1342c6e6966efb5f07effcf99f56167 (and commit
3f50fff4dace23d3cfeb195d5cd4ee813cee68b7, which was a follow-up
cleanup).

We're chasing an elusive bug that Dave Jones can apparently reproduce
using his system call fuzzer tool, and that looks like some kind of
locking ordering problem on the directory i_mutex chain. Our i_mutex
locking is rather complex, and depends on the topological ordering of
the directories, which is why we have been very wary of splicing
directory entries around.

Of course, we really don't want to ever see aliased unconnected
directories anyway, so none of this should ever happen, but this revert
aims to basically get us back to a known older state.

Bruce points to some of the previous discussion at

http://marc.info/?i=

and in particular a long post from Neil:

http://marc.info/?i=

It should be noted that it's possible that Dave's problems come from
other changes altohgether, including possibly just the fact that Dave
constantly is teachning his fuzzer new tricks. So what appears to be a
new bug could in fact be an old one that just gets newly triggered, but
reverting these patches as "still under heavy discussion" is the right
thing regardless.

Requested-by: Al Viro
Acked-by: J. Bruce Fields
Signed-off-by: Linus Torvalds

Linus Torvalds
2012-06-09 01:34:03 +0800

31 May, 2012

2 commits

3f50fff4d vfs: remove unused __d_splice_alias argument ... Browse Code »

Nobody sets want_disconn any more.

Reported-by: Peng Tao
Signed-off-by: J. Bruce Fields
Signed-off-by: Al Viro

J. Bruce Fields
2012-05-31 09:04:54 +0800
7732a557b vfs: stop d_splice_alias creating directory aliases ... Browse Code »

A directory should never have more than one dentry pointing to it.

But d_splice_alias() will add one if it finds a directory with an
already-existing non-DISCONNECTED dentry.

I can't find an obvious reproducer, but I also can't see what prevents
d_splice_alias() from encountering such a case.

It therefore seems safest to allow d_splice_alias to use any dentry it
finds.

(Prior to the removal of dentry_unhash() from vfs_rmdir(), around v3.0,
this could cause an nfsd deadlock like this:

- Somebody attempts to remove a non-empty directory.
- The dentry_unhash() in vfs_rmdir() unhashes the dentry
pointing to the non-empty directory.
- ->rmdir() then fails with -ENOTEMPTY
- Before the vfs_rmdir() caller reaches dput(), an nfsd process
in rename looks up the directory by filehandle; at the end of
that lookup, this dentry is found by d_alloc_anon(), and a
reference is taken on it, preventing dput() from removing it.
- A regular lookup of the directory calls d_splice_alias(),
finds only an unhashed (not a DISCONNECTED) dentry, and
insteads adds a new one, so the directory now has two
dentries.
- The nfsd process in rename, which was previously looking up
the source directory of the rename, now looks up the target
directory (which is the same), and gets the dentry newly
created by the previous lookup.
- The rename, seeing two different dentries, assumes this is a
cross-directory rename and attempts to take the i_mutex on the
directory twice.

That reproducer no longer exists, but I don't think there was anything
fundamentally incorrect about the vfs_rmdir() behavior there, so I think
the real fault was here in d_splice_alias().)

Signed-off-by: J. Bruce Fields
Signed-off-by: Al Viro

J. Bruce Fields
2012-05-31 09:04:54 +0800

30 May, 2012

1 commit

962830df3 brlocks/lglocks: API cleanups ... Browse Code »

lglocks and brlocks are currently generated with some complicated macros
in lglock.h. But there's no reason to not just use common utility
functions and put all the data into a common data structure.

In preparation, this patch changes the API to look more like normal
function calls with pointers, not magic macros.

The patch is rather large because I move over all users in one go to keep
it bisectable. This impacts the VFS somewhat in terms of lines changed.
But no actual behaviour change.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen
Cc: Al Viro
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Rusty Russell
Signed-off-by: Al Viro

Andi Kleen
2012-05-30 11:28:41 +0800

24 May, 2012

1 commit

31fe62b95 mm: add a low limit to alloc_large_system_hash ... Browse Code »

UDP stack needs a minimum hash size value for proper operation and also
uses alloc_large_system_hash() for proper NUMA distribution of its hash
tables and automatic sizing depending on available system memory.

On some low memory situations, udp_table_init() must ignore the
alloc_large_system_hash() result and reallocs a bigger memory area.

As we cannot easily free old hash table, we leak it and kmemleak can
issue a warning.

This patch adds a low limit parameter to alloc_large_system_hash() to
solve this problem.

We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
allocation.

Reported-by: Mark Asselstine
Reported-by: Tim Bird
Signed-off-by: Eric Dumazet
Cc: Paul Gortmaker
Signed-off-by: David S. Miller

Tim Bird
2012-05-24 12:28:21 +0800

22 May, 2012

2 commits

2e321806b Revert "vfs: remove unnecessary d_unhashed() check from __d_lookup_rcu" ... Browse Code »

This reverts commit 8c01a529b861ba97c7d78368e6a5d4d42e946f75.

It turns out the d_unhashed() check isn't unnecessary after all: while
it's true that unhashing will increment the sequence numbers, that does
not necessarily invalidate the RCU lookup, because it might have seen
the dentry pointer (before it got unhashed), but by the time it loaded
the sequence number, it could have seen the *new* sequence number (after
it got unhashed).

End result: we might look up an unhashed dentry that is about to be
freed, with the sequence number never indicating anything bad about it.
So checking that the dentry is still hashed (*after* reading the sequence
number) is indeed the proper fix, and was never unnecessary.

Reported-by: Dave Jones
Signed-off-by: Linus Torvalds

Linus Torvalds
2012-05-22 09:48:10 +0800
6326c71fd vfs: be even more careful about dentry RCU name lookups ... Browse Code »

Miklos Szeredi points out that we need to also worry about memory
odering when doing the dentry name comparison asynchronously with RCU.

In particular, doing a rename can do a memcpy() of one dentry name over
another, and we want to make sure that any unlocked reader will always
see the proper terminating NUL character, so that it won't ever run off
the allocation.

Rather than having to be extra careful with the name copy or at lookup
time for each character, this resolves the issue by making sure that all
names that are inlined in the dentry always have a NUL character at the
end of the name allocation. If we do that at dentry allocation time, we
know that no future name copy will ever change that final NUL to
anything else, so there are no memory ordering issues.

So even if a concurrent rename ends up overwriting the NUL character
that terminates the original name, we always know that there is one
final NUL at the end, and there is no worry about the lockless RCU
lookup traversing the name too far.

The out-of-line allocations are never copied over, so we can just make
sure that we write the name (with terminating NULL) and do a write
barrier before we expose the name to anything else by setting it in the
dentry.

Reported-by: Miklos Szeredi
Cc: Al Viro
Cc: Nick Piggin
Signed-off-by: Linus Torvalds

Linus Torvalds
2012-05-22 07:14:04 +0800

11 May, 2012

4 commits

26fe57502 vfs: make it possible to access the dentry hash/len as one 64-bit entry ... Browse Code »

This allows comparing hash and len in one operation on 64-bit
architectures. Right now only __d_lookup_rcu() takes advantage of this,
since that is the case we care most about.

The use of anonymous struct/unions hides the alternate 64-bit approach
from most users, the exception being a few cases where we initialize a
'struct qstr' with a static initializer. This makes the problematic
cases use a new QSTR_INIT() helper function for that (but initializing
just the name pointer with a "{ .name = xyzzy }" initializer remains
valid, as does just copying another qstr structure).

Signed-off-by: Linus Torvalds

Linus Torvalds
2012-05-11 10:54:35 +0800
ee983e896 vfs: move dentry name length comparison from dentry_cmp() into callers ... Browse Code »

All callers do want to check the dentry length, but some of them can
check the length and the hash together, so doing it in dentry_cmp() can
be counter-productive.

Signed-off-by: Linus Torvalds

Linus Torvalds
2012-05-11 10:54:35 +0800
94753db5e vfs: do the careful dentry name access for all dentry_cmp cases ... Browse Code »

Commit 12f8ad4b0533 ("vfs: clean up __d_lookup_rcu() and dentry_cmp()
interfaces") did the careful ACCESS_ONCE() of the dentry name only for
the word-at-a-time case, even though the issue is generic.

Admittedly I don't really see gcc ever reloading the value in the middle
of the loop, so the ACCESS_ONCE() protects us from a fairly theoretical
issue. But better safe than sorry.

Also, this consolidates the common parts of the word-at-a-time and
bytewise logic, which includes checking the length. We'll be changing
that later.

Signed-off-by: Linus Torvalds

Linus Torvalds
2012-05-11 10:54:09 +0800
8c01a529b vfs: remove unnecessary d_unhashed() check from __d_lookup_rcu ... Browse Code »

The check for d_unhashed() is not strictly incorrect, but at the same
time it is also not sensible. The actual dentry removal from the dentry
hash chains is totally asynchronous to the __d_lookup_rcu() logic, and
we depend on __d_drop() updating the sequence number to invalidate any
lookup of an unhashed dentry.

So checking d_unhashed() is not incorrect, but it's not useful either:
the code has to work correctly even without it. So just remove it.

Signed-off-by: Linus Torvalds

Linus Torvalds
2012-05-11 10:52:35 +0800

05 May, 2012

1 commit

12f8ad4b0 vfs: clean up __d_lookup_rcu() and dentry_cmp() interfaces ... Browse Code »

The calling conventions for __d_lookup_rcu() and dentry_cmp() are
annoying in different ways, and there is actually one single underlying
reason for both of the annoyances.

The fundamental reason is that we do the returned dentry sequence number
check inside __d_lookup_rcu() instead of doing it in the caller. This
results in two annoyances:

- __d_lookup_rcu() now not only needs to return the dentry and the
sequence number that goes along with the lookup, it also needs to
return the inode pointer that was validated by that sequence number
check.

- and because we did the sequence number check early (to validate the
name pointer and length) we also couldn't just pass the dentry itself
to dentry_cmp(), we had to pass the counted string that contained the
name.

So that sequence number decision caused two separate ugly calling
conventions.

Both of these problems would be solved if we just did the sequence
number check in the caller instead. There's only one caller, and that
caller already has to do the sequence number check for the parent
anyway, so just do that.

That allows us to stop returning the dentry->d_inode in that in-out
argument (pointer-to-pointer-to-inode), so we can make the inode
argument just a regular input inode pointer. The caller can just load
the inode from dentry->d_inode, and then do the sequence number check
after that to make sure that it's synchronized with the name we looked
up.

And it allows us to just pass in the dentry to dentry_cmp(), which is
what all the callers really wanted. Sure, dentry_cmp() has to be a bit
careful about the dentry (which is not stable during RCU lookup), but
that's actually very simple.

And now that dentry_cmp() can clearly see that the first string argument
is a dentry, we can use the direct word access for that, instead of the
careful unaligned zero-padding. The dentry name is always properly
aligned, since it is a single path component that is either embedded
into the dentry itself, or was allocated with kmalloc() (see __d_alloc).

Finally, this also uninlines the nasty slow-case for dentry comparisons:
that one *does* need to do a sequence number check, since it will call
in to the low-level filesystems, and we want to give those a stable
inode pointer and path component length/start arguments. Doing an extra
sequence check for that slow case is not a problem, though.

Signed-off-by: Linus Torvalds

Linus Torvalds
2012-05-05 09:21:14 +0800

04 May, 2012

1 commit

e419b4cc5 vfs: make word-at-a-time accesses handle a non-existing page ... Browse Code »

It turns out that there are more cases than CONFIG_DEBUG_PAGEALLOC that
can have holes in the kernel address space: it seems to happen easily
with Xen, and it looks like the AMD gart64 code will also punch holes
dynamically.

Actually hitting that case is still very unlikely, so just do the
access, and take an exception and fix it up for the very unlikely case
of it being a page-crosser with no next page.

And hey, this abstraction might even help other architectures that have
other issues with unaligned word accesses than the possible missing next
page. IOW, this could do the byte order magic too.

Peter Anvin fixed a thinko in the shifting for the exception case.

Reported-and-tested-by: Jana Saout
Cc: Peter Anvin
Signed-off-by: Linus Torvalds

Linus Torvalds
2012-05-04 05:01:40 +0800

29 Mar, 2012

1 commit

b18dafc86 vfs: fix d_ancestor() case in d_materialize_unique ... Browse Code »

In d_materialise_unique() there are 3 subcases to the 'aliased dentry'
case; in two subcases the inode i_lock is properly released but this
does not occur in the -ELOOP subcase.

This seems to have been introduced by commit 1836750115f2 ("fix loop
checks in d_materialise_unique()").

Signed-off-by: Michel Lespinasse
Cc: stable@vger.kernel.org # v3.0+
[ Added a comment, and moved the unlock to where we generate the -ELOOP,
which seems to be more natural.

You probably can't actually trigger this without a buggy network file
server - d_materialize_unique() is for finding aliases on non-local
filesystems, and the d_ancestor() case is for a hardlinked directory
loop.

But we should be robust in the case of such buggy servers anyway. ]
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-03-29 00:54:34 +0800

25 Mar, 2012

1 commit

11bcb3284 Merge tag 'module-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux ... Browse Code »

Pull cleanup of fs/ and lib/ users of module.h from Paul Gortmaker:
"Fix up files in fs/ and lib/ dirs to only use module.h if they really
need it.

These are trivial in scope vs the work done previously. We now have
things where any few remaining cleanups can be farmed out to arch or
subsystem maintainers, and I have done so when possible. What is
remaining here represents the bits that don't clearly lie within a
single arch/subsystem boundary, like the fs dir and the lib dir.

Some duplicate includes arising from overlapping fixes from
independent subsystem maintainer submissions are also quashed."

Fix up trivial conflicts due to clashes with other include file cleanups
(including some due to the previous bug.h cleanup pull).

* tag 'module-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
lib: reduce the use of module.h wherever possible
fs: reduce the use of module.h wherever possible
includecheck: delete any duplicate instances of module.h

Linus Torvalds
2012-03-25 01:24:31 +0800

23 Mar, 2012

1 commit

1f1e6e523 fs: fix kernel-doc warnings in dcache.c ... Browse Code »

Fix kernel-doc warnings in fs/dcache.c:

Warning(fs/dcache.c:1743): No description found for parameter 'seqp'
Warning(fs/dcache.c:1743): Excess function parameter 'seq' description in '__d_lookup_rcu'

Signed-off-by: Randy Dunlap
Signed-off-by: Linus Torvalds

Randy Dunlap
2012-03-23 06:49:18 +0800

22 Mar, 2012

1 commit

e2a0883e4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs pile 1 from Al Viro:
"This is _not_ all; in particular, Miklos' and Jan's stuff is not there
yet."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
ext4: initialization of ext4_li_mtx needs to be done earlier
debugfs-related mode_t whack-a-mole
hfsplus: add an ioctl to bless files
hfsplus: change finder_info to u32
hfsplus: initialise userflags
qnx4: new helper - try_extent()
qnx4: get rid of qnx4_bread/qnx4_getblk
take removal of PF_FORKNOEXEC to flush_old_exec()
trim includes in inode.c
um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it
um: embed ->stub_pages[] into mmu_context
gadgetfs: list_for_each_safe() misuse
ocfs2: fix leaks on failure exits in module_init
ecryptfs: make register_filesystem() the last potential failure exit
ntfs: forgets to unregister sysctls on register_filesystem() failure
logfs: missing cleanup on register_filesystem() failure
jfs: mising cleanup on register_filesystem() failure
make configfs_pin_fs() return root dentry on success
configfs: configfs_create_dir() has parent dentry in dentry->d_parent
configfs: sanitize configfs_create()
...

Linus Torvalds
2012-03-22 04:36:41 +0800

21 Mar, 2012

1 commit

32991ab30 vfs: d_alloc_root() gone ... Browse Code »

all callers converted to d_make_root() by now

Signed-off-by: Al Viro

Al Viro
2012-03-21 09:29:37 +0800

20 Mar, 2012

2 commits

b0e37d7ac Merge branch 'dcache-word-accesses' ... Browse Code »

* branch 'dcache-word-accesses':
vfs: use 'unsigned long' accesses for dcache name comparison and hashing

This does the name hashing and lookup using word-sized accesses when
that is efficient, namely on x86 (although any little-endian machine
with good unaligned accesses would do).

It does very much depend on little-endian logic, but it's a very hot
couple of functions under some real loads, and this patch improves the
performance of __d_lookup_rcu() and link_path_walk() by up to about 30%.
Giving a 10% improvement on some very pathname-heavy benchmarks.

Because we do make unaligned accesses past the filename, the
optimization is disabled when CONFIG_DEBUG_PAGEALLOC is active, and we
effectively depend on the fact that on x86 we don't really ever have the
last page of usable RAM followed immediately by any IO memory (due to
ACPI tables, BIOS buffer areas etc).

Some of the bit operations we do are a bit "subtle". It's commented,
but you do need to really think about the code. Or just consider it
black magic.

Thanks to people on G+ for some of the optimized bit tricks.

Linus Torvalds
2012-03-20 07:37:28 +0800
6d7d1a0dc vfs: get rid of batshit-insane pointless dentry hash calculations ... Browse Code »

For some odd historical reason, the final mixing round for the dentry
cache hash table lookup had an insane "xor with big constant" logic. In
two places.

The big constant that is being xor'ed is GOLDEN_RATIO_PRIME, which is a
fairly random-looking number that is designed to be *multiplied* with so
that the bits get spread out over a whole long-word.

But xor'ing with it is insane. It doesn't really even change the hash -
it really only shifts the hash around in the hash table. To make
matters worse, the insane big constant is different on 32-bit and 64-bit
builds, even though the name hash bits we use are always 32-bit (and the
bits from the pointer we mix in effectively are too).

It's all total voodoo programming, in other words.

Now, some testing and analysis of the hash chains shows that the rest of
the hash function seems to be fairly good. It does pick the right bits
of the parent dentry pointer, for example, and while it's generally a
bad idea to use an xor to mix down the upper bits (because if there is a
repeating pattern, the xor can cause "destructive interference"), it
seems to not have been a disaster.

For example, replacing the hash with the normal "hash_long()" code (that
uses the GOLDEN_RATIO_PRIME constant correctly, btw) actually just makes
the hash worse. The hand-picked hash knew which bits of the pointer had
the highest entropy, and hash_long() ends up mixing bits less optimally
at least in some trivial tests.

So the hash function overall seems fine, it just has that really odd
"shift result around by a constant xor".

So get rid of the silly xor, and replace the down-mixing of the bits
with an add instead of an xor that tends to not have the same kind of
destructive interference issues. Some stats on the resulting hash
chains shows that they look statistically identical before and after,
but the code is simpler and no longer makes you go "WTF?".

Also, the incoming hash really is just "unsigned int", not a long, and
there's no real point to worry about the high 26 bits of the dentry
pointer for the 64-bit case, because they are all going to be identical
anyway.

So also change the hashing to be done in the more natural 'unsigned int'
that is the real size of the actual hashed data anyway.

Signed-off-by: Linus Torvalds

Linus Torvalds
2012-03-20 07:19:53 +0800

09 Mar, 2012

1 commit

bfcfaa77b vfs: use 'unsigned long' accesses for dcache name comparison and hashing ... Browse Code »

Ok, this is hacky, and only works on little-endian machines with goo
unaligned handling. And even then only with CONFIG_DEBUG_PAGEALLOC
disabled, since it can access up to 7 bytes after the pathname.

But it runs like a bat out of hell.

Signed-off-by: Linus Torvalds

Linus Torvalds
2012-03-09 10:08:44 +0800

05 Mar, 2012

1 commit

5483f18e9 vfs: move dentry_cmp from <linux/dcache.h> to fs/dcache.c ... Browse Code »

It's only used inside fs/dcache.c, and we're going to play games with it
for the word-at-a-time patches. This time we really don't even want to
export it, because it really is an internal function to fs/dcache.c, and
has been since it was introduced.

Having it in that extremely hot header file (it's included in pretty
much everything, thanks to ) is a disaster for testing
different versions, and is utterly pointless.

We really should have some kind of header file diet thing, where we
figure out which parts of header files are really better off private and
only result in more expensive compiles.

Signed-off-by: Linus Torvalds

Linus Torvalds
2012-03-05 07:51:42 +0800

03 Mar, 2012

1 commit

8966be903 vfs: trivial __d_lookup_rcu() cleanups ... Browse Code »

These don't change any semantics, but they clean up the code a bit and
mark some arguments appropriately 'const'.

They came up as I was doing the word-at-a-time dcache name accessor
code, and cleaning this up now allows me to send out a smaller relevant
interesting patch for the experimental stuff.

Signed-off-by: Linus Torvalds

Linus Torvalds
2012-03-03 06:23:30 +0800

29 Feb, 2012

1 commit

630d9c472 fs: reduce the use of module.h wherever possible ... Browse Code »

For files only using THIS_MODULE and/or EXPORT_SYMBOL, map
them onto including export.h -- or if the file isn't even
using those, then just delete the include. Fix up any implicit
include dependencies that were being masked by module.h along
the way.

Signed-off-by: Paul Gortmaker

Paul Gortmaker
2012-02-29 08:31:58 +0800

14 Feb, 2012

1 commit

074b85175 vfs: fix panic in __d_lookup() with high dentry hashtable counts ... Browse Code »

When the number of dentry cache hash table entries gets too high
(2147483648 entries), as happens by default on a 16TB system, use of a
signed integer in the dcache_init() initialization loop prevents the
dentry_hashtable from getting initialized, causing a panic in
__d_lookup(). Fix this in dcache_init() and similar areas.

Signed-off-by: Dimitri Sivanich
Acked-by: David S. Miller
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro

Dimitri Sivanich
2012-02-14 09:45:38 +0800

14 Jan, 2012

1 commit

1a52bb0b6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client ... Browse Code »

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: ensure prealloc_blob is in place when removing xattr
rbd: initialize snap_rwsem in rbd_add()
ceph: enable/disable dentry complete flags via mount option
vfs: export symbol d_find_any_alias()
ceph: always initialize the dentry in open_root_dentry()
libceph: remove useless return value for osd_client __send_request()
ceph: avoid iput() while holding spinlock in ceph_dir_fsync
ceph: avoid useless dget/dput in encode_fh
ceph: dereference pointer after checking for NULL
crush: fix force for non-root TAKE
ceph: remove unnecessary d_fsdata conditional checks
ceph: Use kmemdup rather than duplicating its implementation

Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs
always initialize the dentry in open_root_dentry)

Linus Torvalds
2012-01-14 02:29:21 +0800

13 Jan, 2012

1 commit

46f72b349 vfs: export symbol d_find_any_alias() ... Browse Code »

Ceph needs this.

Reviewed-by: Christoph Hellwig
Signed-off-by: Sage Weil

Sage Weil
2012-01-13 03:00:28 +0800

11 Jan, 2012

1 commit

eaf5f9073 fix shrink_dcache_parent() livelock ... Browse Code »

Two (or more) concurrent calls of shrink_dcache_parent() on the same dentry may
cause shrink_dcache_parent() to loop forever.

Here's what appears to happen:

1 - CPU0: select_parent(P) finds C and puts it on dispose list, returns 1

2 - CPU1: select_parent(P) locks P->d_lock

3 - CPU0: shrink_dentry_list() locks C->d_lock
dentry_kill(C) tries to lock P->d_lock but fails, unlocks C->d_lock

4 - CPU1: select_parent(P) locks C->d_lock,
moves C from dispose list being processed on CPU0 to the new
dispose list, returns 1

5 - CPU0: shrink_dentry_list() finds dispose list empty, returns

6 - Goto 2 with CPU0 and CPU1 switched

Basically select_parent() steals the dentry from shrink_dentry_list() and thinks
it found a new one, causing shrink_dentry_list() to think it's making progress
and loop over and over.

One way to trigger this is to make udev calls stat() on the sysfs file while it
is going away.

Having a file in /lib/udev/rules.d/ with only this one rule seems to the trick:

ATTR{vendor}=="0x8086", ATTR{device}=="0x10ca", ENV{PCI_SLOT_NAME}="%k", ENV{MATCHADDR}="$attr{address}", RUN+="/bin/true"

Then execute the following loop:

while true; do
echo -bond0 > /sys/class/net/bonding_masters
echo +bond0 > /sys/class/net/bonding_masters
echo -bond1 > /sys/class/net/bonding_masters
echo +bond1 > /sys/class/net/bonding_masters
done

One fix would be to check all callers and prevent concurrent calls to
shrink_dcache_parent(). But I think a better solution is to stop the
stealing behavior.

This patch adds a new dentry flag that is set when the dentry is added to the
dispose list. The flag is cleared in dentry_lru_del() in case the dentry gets a
new reference just before being pruned.

If the dentry has this flag, select_parent() will skip it and let
shrink_dentry_list() retry pruning it. With select_parent() skipping those
dentries there will not be the appearance of progress (new dentries found) when
there is none, hence shrink_dcache_parent() will not loop forever.

Set the flag is also set in prune_dcache_sb() for consistency as suggested by
Linus.

Signed-off-by: Miklos Szeredi
CC: stable@vger.kernel.org
Signed-off-by: Al Viro

Miklos Szeredi
2012-01-11 02:06:32 +0800

10 Jan, 2012

2 commits

adc0e91ab vfs: new helper - d_make_root() ... Browse Code »

d_alloc_root() with iput() in case of allocation failure...

Signed-off-by: Al Viro

Al Viro
2012-01-10 08:23:45 +0800
b48f03b31 dcache: use a dispose list in select_parent ... Browse Code »

select_parent currently abuses the dentry cache LRU to provide
cleanup features for child dentries that need to be freed. It moves
them to the tail of the LRU, then tells shrink_dcache_parent() to
calls __shrink_dcache_sb to unconditionally move them to a dispose
list (as DCACHE_REFERENCED is ignored). __shrink_dcache_sb() has to
relock the dentries to move them off the LRU onto the dispose list,
but otherwise does not touch the dentries that select_parent() moved
to the tail of the LRU. It then passses the dispose list to
shrink_dentry_list() which tries to free the dentries.

IOWs, the use of __shrink_dcache_sb() is superfluous - we can build
exactly the same list of dentries for disposal directly in
select_parent() and call shrink_dentry_list() instead of calling
__shrink_dcache_sb() to do that. This means that we avoid long holds
on the lru lock walking the LRU moving dentries to the dispose list
We also avoid the need to relock each dentry just to move it off the
LRU, reducing the numebr of times we lock each dentry to dispose of
them in shrink_dcache_parent() from 3 to 2 times.

Further, we remove one of the two callers of __shrink_dcache_sb().
This also means that __shrink_dcache_sb can be moved into back into
prune_dcache_sb() and we no longer have to handle referenced
dentries conditionally, simplifying the code.

Signed-off-by: Dave Chinner
Signed-off-by: Linus Torvalds
Signed-off-by: Al Viro

Dave Chinner
2012-01-10 08:22:52 +0800