15 Nov, 2014
1 commit
-
commit 69a91c237ab0ebe4e9fdeaf6d0090c85275594ec upstream.
The man page for open(2) indicates that when O_CREAT is specified, the
'mode' argument applies only to future accesses to the file:Note that this mode applies only to future accesses of the newly
created file; the open() call that creates a read-only file
may well return a read/write file descriptor.The man page for open(2) implies that 'mode' is treated identically by
O_CREAT and O_TMPFILE.O_TMPFILE, however, behaves differently:
int fd = open("/tmp", O_TMPFILE | O_RDWR, 0);
assert(fd == -1);
assert(errno == EACCES);int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);
assert(fd > 0);For O_CREAT, do_last() sets acc_mode to MAY_OPEN only:
if (*opened & FILE_CREATED) {
/* Don't check for write permission, don't truncate */
open_flag &= ~O_TRUNC;
will_truncate = false;
acc_mode = MAY_OPEN;
path_to_nameidata(path, nd);
goto finish_open_created;
}But for O_TMPFILE, do_tmpfile() passes the full op->acc_mode to
may_open().This patch lines up the behavior of O_TMPFILE with O_CREAT. After the
inode is created, may_open() is called with acc_mode = MAY_OPEN, in
do_tmpfile().A different, but related glibc bug revealed the discrepancy:
https://sourceware.org/bugzilla/show_bug.cgi?id=17523The glibc lazily loads the 'mode' argument of open() and openat() using
va_arg() only if O_CREAT is present in 'flags' (to support both the 2
argument and the 3 argument forms of open; same idea for openat()).
However, the glibc ignores the 'mode' argument if O_TMPFILE is in
'flags'.On x86_64, for open(), it magically works anyway, as 'mode' is in
RDX when entering open(), and is still in RDX on SYSCALL, which is where
the kernel looks for the 3rd argument of a syscall.But openat() is not quite so lucky: 'mode' is in RCX when entering the
glibc wrapper for openat(), while the kernel looks for the 4th argument
of a syscall in R10. Indeed, the syscall calling convention differs from
the regular calling convention in this respect on x86_64. So the kernel
sees mode = 0 when trying to use glibc openat() with O_TMPFILE, and
fails with EACCES.Signed-off-by: Eric Rannaud
Acked-by: Andy Lutomirski
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
06 Oct, 2014
1 commit
-
commit 7bd88377d482e1eae3c5329b12e33cfd664fa6a9 upstream.
return the value instead, and have path_init() do the assignment. Broken by
"vfs: Fix absolute RCU path walk failures due to uninitialized seq number",
which was Cc-stable with 2.6.38+ as destination. This one should go where
it went.To avoid dummy value returned in case when root is already set (it would do
no harm, actually, since the only caller that doesn't ignore the return value
is guaranteed to have nd->root *not* set, but it's more obvious that way),
lift the check into callers. And do the same to set_root(), to keep them
in sync.Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman
18 Sep, 2014
2 commits
-
commit 99d263d4c5b2f541dfacb5391e22e8c91ea982a6 upstream.
Josef Bacik found a performance regression between 3.2 and 3.10 and
narrowed it down to commit bfcfaa77bdf0 ("vfs: use 'unsigned long'
accesses for dcache name comparison and hashing"). He reports:"The test case is essentially
for (i = 0; i < 1000000; i++)
mkdir("a$i");On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
dir/sec with 3.10. This is because we spend waaaaay more time in
__d_lookup on 3.10 than in 3.2.The new hashing function for strings is suboptimal for <
sizeof(unsigned long) string names (and hell even > sizeof(unsigned
long) string names that I've tested). I broke out the old hashing
function and the new one into a userspace helper to get real numbers
and this is what I'm getting:Old hash table had 1000000 entries, 0 dupes, 0 max dupes
New hash table had 12628 entries, 987372 dupes, 900 max dupes
We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hashMy test does the hash, and then does the d_hash into a integer pointer
array the same size as the dentry hash table on my system, and then
just increments the value at the address we got to see how many
entries we overlap with.As you can see the old hash function ended up with all 1 million
entries in their own bucket, whereas the new one they are only
distributed among ~12.5k buckets, which is why we're using so much
more CPU in __d_lookup".The reason for this hash regression is two-fold:
- On 64-bit architectures the down-mixing of the original 64-bit
word-at-a-time hash into the final 32-bit hash value is very
simplistic and suboptimal, and just adds the two 32-bit parts
together.In particular, because there is no bit shuffling and the mixing
boundary is also a byte boundary, similar character patterns in the
low and high word easily end up just canceling each other out.- the old byte-at-a-time hash mixed each byte into the final hash as it
hashed the path component name, resulting in the low bits of the hash
generally being a good source of hash data. That is not true for the
word-at-a-time case, and the hash data is distributed among all the
bits.The fix is the same in both cases: do a better job of mixing the bits up
and using as much of the hash data as possible. We already have the
"hash_32|64()" functions to do that.Reported-by: Josef Bacik
Cc: Al Viro
Cc: Christoph Hellwig
Cc: Chris Mason
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman -
commit 44b1d53043c482225196e8a9cd9f35163a1b3336 upstream.
Add d_is_dir(dentry) helper which is analogous to S_ISDIR().
To avoid confusion, rename d_is_directory() to d_can_lookup().
Signed-off-by: Miklos Szeredi
Reviewed-by: J. Bruce Fields
Signed-off-by: Greg Kroah-Hartman
01 Aug, 2014
1 commit
-
commit 295dc39d941dc2ae53d5c170365af4c9d5c16212 upstream.
Currently umount on symlink blocks following umount:
/vz is separate mount
# ls /vz/ -al | grep test
drwxr-xr-x. 2 root root 4096 Jul 19 01:14 testdir
lrwxrwxrwx. 1 root root 11 Jul 19 01:16 testlink -> /vz/testdir
# umount -l /vz/testlink
umount: /vz/testlink: not mounted (expected)# lsof /vz
# umount /vz
umount: /vz: device is busy. (unexpected)In this case mountpoint_last() gets an extra refcount on path->mnt
Signed-off-by: Vasily Averin
Acked-by: Ian Kent
Acked-by: Jeff Layton
Signed-off-by: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman
17 Jun, 2014
1 commit
-
commit 23adbe12ef7d3d4195e80800ab36b37bee28cd03 upstream.
The kernel has no concept of capabilities with respect to inodes; inodes
exist independently of namespaces. For example, inode_capable(inode,
CAP_LINUX_IMMUTABLE) would be nonsense.This patch changes inode_capable to check for uid and gid mappings and
renames it to capable_wrt_inode_uidgid, which should make it more
obvious what it does.Fixes CVE-2014-4014.
Cc: Theodore Ts'o
Cc: Serge Hallyn
Cc: "Eric W. Biederman"
Cc: Dave Chinner
Signed-off-by: Andy Lutomirski
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
08 Jun, 2014
1 commit
-
commit 22213318af7ae265bc6cd8aef2febbc2d69a2440 upstream.
in non-lazy walk we need to be careful about dentry switching from
negative to positive - both ->d_flags and ->d_inode are updated,
and in some places we might see only one store. The cases where
dentry has been obtained by dcache lookup with ->i_mutex held on
parent are safe - ->d_lock and ->i_mutex provide all the barriers
we need. However, there are several places where we run into
trouble:
* do_last() fetches ->d_inode, then checks ->d_flags and
assumes that inode won't be NULL unless d_is_negative() is true.
Race with e.g. creat() - we might have fetched the old value of
->d_inode (still NULL) and new value of ->d_flags (already not
DCACHE_MISS_TYPE). Lin Ming has observed and reported the resulting
oops.
* a bunch of places checks ->d_inode for being non-NULL,
then checks ->d_flags for "is it a symlink". Race with symlink(2)
in case if our CPU sees ->d_inode update first - we see non-NULL
there, but ->d_flags still contains DCACHE_MISS_TYPE instead of
DCACHE_SYMLINK_TYPE. Result: false negative on "should we follow
link here?", with subsequent unpleasantness.Reported-and-tested-by: Lin Ming
Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman
23 Mar, 2014
1 commit
-
We can get false negative from __lookup_mnt() if an unrelated vfsmount
gets moved. In that case legitimize_mnt() is guaranteed to fail,
and we will fall back to non-RCU walk... unless we end up running
into a hard error on a filesystem object we wouldn't have reached
if not for that false negative. IOW, delaying that check until
the end of pathname resolution is wrong - we should recheck right
after we attempt to cross the mountpoint. We don't need to recheck
unless we see d_mountpoint() being true - in that case even if
we have just raced with mount/umount, we can simply go on as if
we'd come at the moment when the sucker wasn't a mountpoint; if we
run into a hard error as the result, it was a legitimate outcome.
__lookup_mnt() returning NULL is different in that respect, since
it might've happened due to operation on completely unrelated
mountpoint.Cc: stable@vger.kernel.org
Signed-off-by: Al Viro
10 Mar, 2014
1 commit
-
Our write() system call has always been atomic in the sense that you get
the expected thread-safe contiguous write, but we haven't actually
guaranteed that concurrent writes are serialized wrt f_pos accesses, so
threads (or processes) that share a file descriptor and use "write()"
concurrently would quite likely overwrite each others data.This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:
"2.9.7 Thread Interactions with Regular File Operations
All of the following functions shall be atomic with respect to each
other in the effects specified in POSIX.1-2008 when they operate on
regular files or symbolic links: [...]"and one of the effects is the file position update.
This unprotected file position behavior is not new behavior, and nobody
has ever cared. Until now. Yongzhi Pan reported unexpected behavior to
Michael Kerrisk that was due to this.This resolves the issue with a f_pos-specific lock that is taken by
read/write/lseek on file descriptors that may be shared across threads
or processes.Reported-by: Yongzhi Pan
Reported-by: Michael Kerrisk
Cc: Al Viro
Signed-off-by: Linus Torvalds
Signed-off-by: Al Viro
06 Feb, 2014
1 commit
-
This changes 'do_execve()' to get the executable name as a 'struct
filename', and to free it when it is done. This is what the normal
users want, and it simplifies and streamlines their error handling.The controlled lifetime of the executable name also fixes a
use-after-free problem with the trace_sched_process_exec tracepoint: the
lifetime of the passed-in string for kernel users was not at all
obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize
the pathname allocation lifetime with the execve() having finished,
which in turn meant that the trace point that happened after
mm_release() of the old process VM ended up using already free'd memory.To solve the kernel string lifetime issue, this simply introduces
"getname_kernel()" that works like the normal user-space getname()
function, except with the source coming from kernel memory.As Oleg points out, this also means that we could drop the tcomm[] array
from 'struct linux_binprm', since the pathname lifetime now covers
setup_new_exec(). That would be a separate cleanup.Reported-by: Igor Zhbanov
Tested-by: Steven Rostedt
Cc: Oleg Nesterov
Cc: Al Viro
Signed-off-by: Linus Torvalds
01 Feb, 2014
2 commits
-
Recent changes to retry on ESTALE in linkat
(commit 442e31ca5a49e398351b2954b51f578353fdf210)
introduced a mountpoint reference leak and a small memory
leak in case a filesystem link operation returns ESTALE
which is pretty normal for distributed filesystems like
lustre, nfs and so on.
Free old_path in such a case.[AV: there was another missing path_put() nearby - on the previous
goto retry]Signed-off-by: Oleg Drokin:
Signed-off-by: Al Viro -
Leaving getname() exported when putname() isn't is a bad idea.
Signed-off-by: Jeff Layton
Signed-off-by: Al Viro
26 Jan, 2014
1 commit
-
Factor out the code to get an ACL either from the inode or disk from
check_acl, so that it can be used elsewhere later on.Signed-off-by: Christoph Hellwig
Reviewed-by: Jan Kara
Signed-off-by: Al Viro
13 Dec, 2013
1 commit
-
When explicitly hashing the end of a string with the word-at-a-time
interface, we have to be careful which end of the word we pick up.On big-endian CPUs, the upper-bits will contain the data we're after, so
ensure we generate our masks accordingly (and avoid hashing whatever
random junk may have been sitting after the string).This patch adds a new dcache helper, bytemask_from_count, which creates
a mask appropriate for the CPU endianness.Cc: Al Viro
Signed-off-by: Will Deacon
Signed-off-by: Linus Torvalds
29 Nov, 2013
1 commit
-
Failure to grab reference to parent dentry should go through the
same cleanup as nd->seq mismatch. As it is, we might end up with
caller thinking it needs to path_put() nd->root, with obvious
nasty results once we'd hit that bug enough times to drive the
refcount of root dentry all the way to zero...Signed-off-by: Al Viro
22 Nov, 2013
1 commit
-
Pull audit updates from Eric Paris:
"Nothing amazing. Formatting, small bug fixes, couple of fixes where
we didn't get records due to some old VFS changes, and a change to how
we collect execve info..."Fixed conflict in fs/exec.c as per Eric and linux-next.
* git://git.infradead.org/users/eparis/audit: (28 commits)
audit: fix type of sessionid in audit_set_loginuid()
audit: call audit_bprm() only once to add AUDIT_EXECVE information
audit: move audit_aux_data_execve contents into audit_context union
audit: remove unused envc member of audit_aux_data_execve
audit: Kill the unused struct audit_aux_data_capset
audit: do not reject all AUDIT_INODE filter types
audit: suppress stock memalloc failure warnings since already managed
audit: log the audit_names record type
audit: add child record before the create to handle case where create fails
audit: use given values in tty_audit enable api
audit: use nlmsg_len() to get message payload length
audit: use memset instead of trying to initialize field by field
audit: fix info leak in AUDIT_GET requests
audit: update AUDIT_INODE filter rule to comparator function
audit: audit feature to set loginuid immutable
audit: audit feature to only allow unsetting the loginuid
audit: allow unsetting the loginuid (with priv)
audit: remove CONFIG_AUDIT_LOGINUID_IMMUTABLE
audit: loginuid functions coding style
selinux: apply selinux checks on new audit message types
...
13 Nov, 2013
1 commit
-
Pull vfs updates from Al Viro:
"All kinds of stuff this time around; some more notable parts:- RCU'd vfsmounts handling
- new primitives for coredump handling
- files_lock is gone
- Bruce's delegations handling series
- exportfs fixesplus misc stuff all over the place"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
ecryptfs: ->f_op is never NULL
locks: break delegations on any attribute modification
locks: break delegations on link
locks: break delegations on rename
locks: helper functions for delegation breaking
locks: break delegations on unlink
namei: minor vfs_unlink cleanup
locks: implement delegations
locks: introduce new FL_DELEG lock flag
vfs: take i_mutex on renamed file
vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
vfs: don't use PARENT/CHILD lock classes for non-directories
vfs: pull ext4's double-i_mutex-locking into common code
exportfs: fix quadratic behavior in filehandle lookup
exportfs: better variable name
exportfs: move most of reconnect_path to helper function
exportfs: eliminate unused "noprogress" counter
exportfs: stop retrying once we race with rename/remove
exportfs: clear DISCONNECTED on all parents sooner
exportfs: more detailed comment for path_reconnect
...
09 Nov, 2013
10 commits
-
Cc: Tyler Hicks
Cc: Dustin Kirkland
Acked-by: Jeff Layton
Signed-off-by: J. Bruce Fields
Signed-off-by: Al Viro -
Cc: David Howells
Acked-by: Jeff Layton
Signed-off-by: J. Bruce Fields
Signed-off-by: Al Viro -
We'll need the same logic for rename and link.
Acked-by: Jeff Layton
Signed-off-by: J. Bruce Fields
Signed-off-by: Al Viro -
We need to break delegations on any operation that changes the set of
links pointing to an inode. Start with unlink.Such operations also hold the i_mutex on a parent directory. Breaking a
delegation may require waiting for a timeout (by default 90 seconds) in
the case of a unresponsive NFS client. To avoid blocking all directory
operations, we therefore drop locks before waiting for the delegation.
The logic then looks like:acquire locks
...
test for delegation; if found:
take reference on inode
release locks
wait for delegation break
drop reference on inode
retryIt is possible this could never terminate. (Even if we take precautions
to prevent another delegation being acquired on the same inode, we could
get a different inode on each retry.) But this seems very unlikely.The initial test for a delegation happens after the lock on the target
inode is acquired, but the directory inode may have been acquired
further up the call stack. We therefore add a "struct inode **"
argument to any intervening functions, which we use to pass the inode
back up to the caller in the case it needs a delegation synchronously
broken.Cc: David Howells
Cc: Tyler Hicks
Cc: Dustin Kirkland
Acked-by: Jeff Layton
Signed-off-by: J. Bruce Fields
Signed-off-by: Al Viro -
We'll be using dentry->d_inode in one more place.
Acked-by: Jeff Layton
Signed-off-by: J. Bruce Fields
Signed-off-by: Al Viro -
A read delegation is used by NFSv4 as a guarantee that a client can
perform local read opens without informing the server.The open operation takes the last component of the pathname as an
argument, thus is also a lookup operation, and giving the client the
above guarantee means informing the client before we allow anything that
would change the set of names pointing to the inode.Therefore, we need to break delegations on rename, link, and unlink.
We also need to prevent new delegations from being acquired while one of
these operations is in progress.We could add some completely new locking for that purpose, but it's
simpler to use the i_mutex, since that's already taken by all the
operations we care about.The single exception is rename. So, modify rename to take the i_mutex
on the file that is being renamed.Also fix up lockdep and Documentation/filesystems/directory-locking to
reflect the change.Acked-by: Jeff Layton
Signed-off-by: J. Bruce Fields
Signed-off-by: Al Viro -
The DCACHE_NEED_LOOKUP case referred to here was removed with
39e3c9553f34381a1b664c27b0c696a266a5735e "vfs: remove
DCACHE_NEED_LOOKUP".There are only four real_lookup() callers and all of them pass in an
unhashed dentry just returned from d_alloc.Signed-off-by: J. Bruce Fields
Signed-off-by: Al Viro -
Put a type field into struct dentry::d_flags to indicate if the dentry is one
of the following types that relate particularly to pathwalk:Miss (negative dentry)
Directory
"Automount" directory (defective - no i_op->lookup())
Symlink
Other (regular, socket, fifo, device)The type field is set to one of the first five types on a dentry by calls to
__d_instantiate() and d_obtain_alias() from information in the inode (if one is
given).The type is cleared by dentry_unlink_inode() when it reconstitutes an existing
dentry as a negative dentry.Accessors provided are:
d_set_type(dentry, type)
d_is_directory(dentry)
d_is_autodir(dentry)
d_is_symlink(dentry)
d_is_file(dentry)
d_is_negative(dentry)
d_is_positive(dentry)A bunch of checks in pathname resolution switched to those.
Signed-off-by: David Howells
Signed-off-by: Al Viro -
those have become aliases for rcu_read_{lock,unlock}()
Signed-off-by: Al Viro
-
* RCU-delayed freeing of vfsmounts
* vfsmount_lock replaced with a seqlock (mount_lock)
* sequence number from mount_lock is stored in nameidata->m_seq and
used when we exit RCU mode
* new vfsmount flag - MNT_SYNC_UMOUNT. Set by umount_tree() when its
caller knows that vfsmount will have no surviving references.
* synchronize_rcu() done between unlocking namespace_sem in namespace_unlock()
and doing pending mntput().
* new helper: legitimize_mnt(mnt, seq). Checks the mount_lock sequence
number against seq, then grabs reference to mnt. Then it rechecks mount_lock
again to close the race and either returns success or drops the reference it
has acquired. The subtle point is that in case of MNT_SYNC_UMOUNT we can
simply decrement the refcount and sod off - aforementioned synchronize_rcu()
makes sure that final mntput() won't come until we leave RCU mode. We need
that, since we don't want to end up with some lazy pathwalk racing with
umount() and stealing the final mntput() from it - caller of umount() may
expect it to return only once the fs is shut down and we don't want to break
that. In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do
full-blown mntput() in case of mount_lock sequence number mismatch happening
just as we'd grabbed the reference, but in those cases we won't be stealing
the final mntput() from anything that would care.
* mntput_no_expire() doesn't lock anything on the fast path now. Incidentally,
SMP and UP cases are handled the same way - no ifdefs there.
* normal pathname resolution does *not* do any writes to mount_lock. It does,
of course, bump the refcounts of vfsmount and dentry in the very end, but that's
it.Signed-off-by: Al Viro
06 Nov, 2013
1 commit
-
Historically, when a syscall that creates a dentry fails, you get an audit
record that looks something like this (when trying to create a file named
"new" in "/tmp/tmp.SxiLnCcv63"):type=PATH msg=audit(1366128956.279:965): item=0 name="/tmp/tmp.SxiLnCcv63/new" inode=2138308 dev=fd:02 mode=040700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmp_t:s15:c0.c1023
This record makes no sense since it's associating the inode information for
"/tmp/tmp.SxiLnCcv63" with the path "/tmp/tmp.SxiLnCcv63/new". The recent
patch I posted to fix the audit_inode call in do_last fixes this, by making it
look more like this:type=PATH msg=audit(1366128765.989:13875): item=0 name="/tmp/tmp.DJ1O8V3e4f/" inode=141 dev=fd:02 mode=040700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmp_t:s15:c0.c1023
While this is more correct, if the creation of the file fails, then we
have no record of the filename that the user tried to create.This patch adds a call to audit_inode_child to may_create. This creates
an AUDIT_TYPE_CHILD_CREATE record that will sit in place until the
create succeeds. When and if the create does succeed, then this record
will be updated with the correct inode info from the create.This fixes what was broken in commit bfcec708.
Commit 79f6530c should also be backported to stable v3.7+.Signed-off-by: Jeff Layton
Signed-off-by: Eric Paris
Signed-off-by: Richard Guy Briggs
Signed-off-by: Eric Paris
25 Oct, 2013
1 commit
-
Instead of passing the direction as argument (and checking it on every
step through the hash chain), just have separate __lookup_mnt() and
__lookup_mnt_last(). And use the standard iterators...Signed-off-by: Al Viro
22 Oct, 2013
1 commit
-
Add @path parameter to fix kernel-doc warning.
Also fix a spello/typo.Warning(fs/namei.c:2304): No description found for parameter 'path'
Signed-off-by: Randy Dunlap
Signed-off-by: Linus Torvalds
18 Sep, 2013
1 commit
-
Signed-off-by: Al Viro
17 Sep, 2013
1 commit
-
If O_CREAT|O_EXCL are passed to open, then we know that either
- the file is successfully created, or
- the operation fails in some way.So previously we set FILE_CREATED before calling ->atomic_open() so the
filesystem doesn't have to. This, however, led to bugs in the
implementation that went unnoticed when the filesystem didn't check for
existence, yet returned success. To prevent this kind of bug, require
filesystems to always explicitly set FILE_CREATED on O_CREAT|O_EXCL and
verify this in the VFS.Also added a couple more verifications for the result of atomic_open():
- Warn if filesystem set FILE_CREATED despite the lack of O_CREAT.
- Warn if filesystem set FILE_CREATED but gave a negative dentry.Signed-off-by: Miklos Szeredi
Signed-off-by: Al Viro
11 Sep, 2013
5 commits
-
Signed-off-by: Dave Jones
Signed-off-by: Al Viro -
Signed-off-by: Al Viro
-
For a long time no filesystem has been using vfs_follow_link, and as seen
by recent filesystem submissions any new use is accidental as well.Remove vfs_follow_link, document the replacement in
Documentation/filesystems/porting and also rename __vfs_follow_link
to match its only caller better.Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro -
Pull vfs pile 3 (of many) from Al Viro:
"Waiman's conversion of d_path() and bits related to it,
kern_path_mountpoint(), several cleanups and fixes (exportfs
one is -stable fodder, IMO).There definitely will be more... ;-/"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
split read_seqretry_or_unlock(), convert d_walk() to resulting primitives
dcache: Translating dentry into pathname without taking rename_lock
autofs4 - fix device ioctl mount lookup
introduce kern_path_mountpoint()
rename user_path_umountat() to user_path_mountpoint_at()
take unlazy_walk() into umount_lookup_last()
Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c
prune_super(): sb->s_op is never NULL
exportfs: don't assume that ->iterate() won't feed us too long entries
afs: get rid of redundant ->d_name.len checks -
When I moved the RCU walk termination into unlazy_walk(), I didn't copy
quite all of it: for the successful RCU termination we properly add the
necessary reference counts to our temporary copy of the root path, but
for the failure case we need to make sure that any temporary root path
information is cleared out (since it does _not_ have the proper
reference counts from the RCU lookup).We could clean up this mess by just always dropping the temporary root
information, but Al points out that that would mean that a single lookup
through symlinks could see multiple different root entries if it races
with another thread doing chroot. Not that I think we should really
care (we had that before too, back before we had a copy of the root path
in the nameidata).Al says he has a cunning plan. In the meantime, this is the minimal fix
for the problem, even if it's not all that pretty.Reported-by: Mace Moneta
Acked-by: Al Viro
Signed-off-by: Linus Torvalds
09 Sep, 2013
3 commits
-
This is the fix that the last two commits indirectly led up to - making
sure that we don't call dput() in a bad context on the dentries we've
looked up in RCU mode after the sequence count validation fails.This basically expands d_rcu_to_refcount() into the callers, and then
fixes the callers to delay the dput() in the failure case until _after_
we've dropped all locks and are no longer in an RCU-locked region.The case of 'complete_walk()' was trivial, since its failure case did
the unlock_rcu_walk() directly after the call to d_rcu_to_refcount(),
and as such that is just a pure expansion of the function with a trivial
movement of the resulting dput() to after 'unlock_rcu_walk()'.In contrast, the unlazy_walk() case was much more complicated, because
not only does convert two different dentries from RCU to be reference
counted, but it used to not call unlock_rcu_walk() at all, and instead
just returned an error and let the caller clean everything up in
"terminate_walk()".Happily, one of the dentries in question (called "parent" inside
unlazy_walk()) is the dentry of "nd->path", which terminate_walk() wants
a refcount to anyway for the non-RCU case.So what the new and improved unlazy_walk() does is to first turn that
dentry into a refcounted one, and once that is set up, the error cases
can continue to use the terminate_walk() helper for cleanup, but for the
non-RCU case. Which makes it possible to drop out of RCU mode if we
actually hit the sequence number failure case.Acked-by: Al Viro
Signed-off-by: Linus Torvalds -
Signed-off-by: Al Viro
-
... and move the extern from linux/namei.h to fs/internal.h,
along with that of vfs_path_lookup().Signed-off-by: Al Viro