Eric Lee / smarc-fsl-linux-kernel

01 Oct, 2016

1 commit

d29216842 mnt: Add a per mount namespace limit on the number of mounts ... Browse Code »

CAI Qian pointed out that the semantics
of shared subtrees make it possible to create an exponentially
increasing number of mounts in a mount namespace.

mkdir /tmp/1 /tmp/2
mount --make-rshared /
for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

Will create create 2^20 or 1048576 mounts, which is a practical problem
as some people have managed to hit this by accident.

As such CVE-2016-6213 was assigned.

Ian Kent described the situation for autofs users
as follows:

> The number of mounts for direct mount maps is usually not very large because of
> the way they are implemented, large direct mount maps can have performance
> problems. There can be anywhere from a few (likely case a few hundred) to less
> than 10000, plus mounts that have been triggered and not yet expired.
>
> Indirect mounts have one autofs mount at the root plus the number of mounts that
> have been triggered and not yet expired.
>
> The number of autofs indirect map entries can range from a few to the common
> case of several thousand and in rare cases up to between 30000 and 50000. I've
> not heard of people with maps larger than 50000 entries.
>
> The larger the number of map entries the greater the possibility for a large
> number of active mounts so it's not hard to expect cases of a 1000 or somewhat
> more active mounts.

So I am setting the default number of mounts allowed per mount
namespace at 100,000. This is more than enough for any use case I
know of, but small enough to quickly stop an exponential increase
in mounts. Which should be perfect to catch misconfigurations and
malfunctioning programs.

For anyone who needs a higher limit this can be changed by writing
to the new /proc/sys/fs/mount-max sysctl.

Tested-by: CAI Qian
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-10-01 01:46:48 +0800

24 Jun, 2016

1 commit

380cf5ba6 fs: Treat foreign mounts as nosuid ... Browse Code »

If a process gets access to a mount from a different user
namespace, that process should not be able to take advantage of
setuid files or selinux entrypoints from that filesystem. Prevent
this by treating mounts from other mount namespaces and those not
owned by current_user_ns() or an ancestor as nosuid.

This will make it safer to allow more complex filesystems to be
mounted in non-root user namespaces.

This does not remove the need for MNT_LOCK_NOSUID. The setuid,
setgid, and file capability bits can no longer be abused if code in
a user namespace were to clear nosuid on an untrusted filesystem,
but this patch, by itself, is insufficient to protect the system
from abuse of files that, when execed, would increase MAC privilege.

As a more concrete explanation, any task that can manipulate a
vfsmount associated with a given user namespace already has
capabilities in that namespace and all of its descendents. If they
can cause a malicious setuid, setgid, or file-caps executable to
appear in that mount, then that executable will only allow them to
elevate privileges in exactly the set of namespaces in which they
are already privileges.

On the other hand, if they can cause a malicious executable to
appear with a dangerous MAC label, running it could change the
caller's security context in a way that should not have been
possible, even inside the namespace in which the task is confined.

As a hardening measure, this would have made CVE-2014-5207 much
more difficult to exploit.

Signed-off-by: Andy Lutomirski
Signed-off-by: Seth Forshee
Acked-by: James Morris
Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman

Andy Lutomirski
2016-06-24 23:40:41 +0800

18 Apr, 2015

1 commit

8f502d5b9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull usernamespace mount fixes from Eric Biederman:
"Way back in October Andrey Vagin reported that umount(MNT_DETACH)
could be used to defeat MNT_LOCKED. As I worked to fix this I
discovered that combined with mount propagation and an appropriate
selection of shared subtrees a reference to a directory on an
unmounted filesystem is not necessary.

That MNT_DETACH is allowed in user namespace in a form that can break
MNT_LOCKED comes from my early misunderstanding what MNT_DETACH does.

To avoid breaking existing userspace the conflict between MNT_DETACH
and MNT_LOCKED is fixed by leaving mounts that are locked to their
parents in the mount hash table until the last reference goes away.

While investigating this issue I also found an issue with
__detach_mounts. The code was unnecessarily and incorrectly
triggering mount propagation. Resulting in too many mounts going away
when a directory is deleted, and too many cpu cycles are burned while
doing that.

Looking some more I realized that __detach_mounts by only keeping
mounts connected that were MNT_LOCKED it had the potential to still
leak information so I tweaked the code to keep everything locked
together that possibly could be.

This code was almost ready last cycle but Al invented fs_pin which
slightly simplifies this code but required rewrites and retesting, and
I have not been in top form for a while so it took me a while to get
all of that done. Similiarly this pull request is late because I have
been feeling absolutely miserable all week.

The issue of being able to escape a bind mount has not yet been
addressed, as the fixes are not yet mature"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
mnt: Update detach_mounts to leave mounts connected
mnt: Fix the error check in __detach_mounts
mnt: Honor MNT_LOCKED when detaching mounts
fs_pin: Allow for the possibility that m_list or s_list go unused.
mnt: Factor umount_mnt from umount_tree
mnt: Factor out unhash_mnt from detach_mnt and umount_tree
mnt: Fail collect_mounts when applied to unmounted mounts
mnt: Don't propagate unmounts to locked mounts
mnt: On an unmount propagate clearing of MNT_LOCKED
mnt: Delay removal from the mount hash.
mnt: Add MNT_UMOUNT flag
mnt: In umount_tree reuse mnt_list instead of mnt_hash
mnt: Don't propagate umounts in __detach_mounts
mnt: Improve the umount_tree flags
mnt: Use hlist_move_list in namespace_unlock

Linus Torvalds
2015-04-18 23:20:31 +0800

16 Apr, 2015

1 commit

e6e20a7a5 init: export name_to_dev_t and mark name argument as const ... Browse Code »

DM will switch its device lookup code to using name_to_dev_t() so it
must be exported. Also, the @name argument should be marked const.

Signed-off-by: Dan Ehrenberg
Signed-off-by: Mike Snitzer

Dan Ehrenberg
2015-04-16 00:10:18 +0800

03 Apr, 2015

1 commit

590ce4bcb mnt: Add MNT_UMOUNT flag ... Browse Code »

In some instances it is necessary to know if the the unmounting
process has begun on a mount. Add MNT_UMOUNT to make that reliably
testable.

This fix gets used in fixing locked mounts in MNT_DETACH

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2015-04-03 09:34:18 +0800

24 Oct, 2014

1 commit

c771d683a vfs: introduce clone_private_mount() ... Browse Code »

Overlayfs needs a private clone of the mount, so create a function for
this and export to modules.

Signed-off-by: Miklos Szeredi

Miklos Szeredi
2014-10-24 06:14:36 +0800

12 Aug, 2014

1 commit

f6f993328 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs updates from Al Viro:
"Stuff in here:

- acct.c fixes and general rework of mnt_pin mechanism. That allows
to go for delayed-mntput stuff, which will permit mntput() on deep
stack without worrying about stack overflows - fs shutdown will
happen on shallow stack. IOW, we can do Eric's umount-on-rmdir
series without introducing tons of stack overflows on new mntput()
call chains it introduces.
- Bruce's d_splice_alias() patches
- more Miklos' rename() stuff.
- a couple of regression fixes (stable fodder, in the end of branch)
and a fix for API idiocy in iov_iter.c.

There definitely will be another pile, maybe even two. I'd like to
get Eric's series in this time, but even if we miss it, it'll go right
in the beginning of for-next in the next cycle - the tricky part of
prereqs is in this pile"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
fix copy_tree() regression
__generic_file_write_iter(): fix handling of sync error after DIO
switch iov_iter_get_pages() to passing maximal number of pages
fs: mark __d_obtain_alias static
dcache: d_splice_alias should detect loops
exportfs: update Exporting documentation
dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
dcache: remove unused d_find_alias parameter
dcache: d_obtain_alias callers don't all want DISCONNECTED
dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
dcache: d_splice_alias mustn't create directory aliases
dcache: close d_move race in d_splice_alias
dcache: move d_splice_alias
namei: trivial fix to vfs_rename_dir comment
VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
cifs: support RENAME_NOREPLACE
hostfs: support rename flags
shmem: support RENAME_EXCHANGE
shmem: support RENAME_NOREPLACE
btrfs: add RENAME_NOREPLACE
...

Linus Torvalds
2014-08-12 02:44:11 +0800

08 Aug, 2014

1 commit

3064c3563 death to mnt_pinned ... Browse Code »

Rather than playing silly buggers with vfsmount refcounts, just have
acct_on() ask fs/namespace.c for internal clone of file->f_path.mnt
and replace it with said clone. Then attach the pin to original
vfsmount. Voila - the clone will be alive until the file gets closed,
making sure that underlying superblock remains active, etc., and
we can drop the original vfsmount, so that it's not kept busy.
If the file lives until the final mntput of the original vfsmount,
we'll notice that there's an fs_pin (one in bsd_acct_struct that
holds that file) and mnt_pin_kill() will take it out. Since
->kill() is synchronous, we won't proceed past that point until
these files are closed (and private clones of our vfsmount are
gone), so we get the same ordering warranties we used to get.

mnt_pin()/mnt_unpin()/->mnt_pinned is gone now, and good riddance -
it never became usable outside of kernel/acct.c (and racy wrt
umount even there).

Signed-off-by: Al Viro

Al Viro
2014-08-08 02:40:09 +0800

01 Aug, 2014

2 commits

9566d6742 mnt: Correct permission checks in do_remount ... Browse Code »

While invesgiating the issue where in "mount --bind -oremount,ro ..."
would result in later "mount --bind -oremount,rw" succeeding even if
the mount started off locked I realized that there are several
additional mount flags that should be locked and are not.

In particular MNT_NOSUID, MNT_NODEV, MNT_NOEXEC, and the atime
flags in addition to MNT_READONLY should all be locked. These
flags are all per superblock, can all be changed with MS_BIND,
and should not be changable if set by a more privileged user.

The following additions to the current logic are added in this patch.
- nosuid may not be clearable by a less privileged user.
- nodev may not be clearable by a less privielged user.
- noexec may not be clearable by a less privileged user.
- atime flags may not be changeable by a less privileged user.

The logic with atime is that always setting atime on access is a
global policy and backup software and auditing software could break if
atime bits are not updated (when they are configured to be updated),
and serious performance degradation could result (DOS attack) if atime
updates happen when they have been explicitly disabled. Therefore an
unprivileged user should not be able to mess with the atime bits set
by a more privileged user.

The additional restrictions are implemented with the addition of
MNT_LOCK_NOSUID, MNT_LOCK_NODEV, MNT_LOCK_NOEXEC, and MNT_LOCK_ATIME
mnt flags.

Taken together these changes and the fixes for MNT_LOCK_READONLY
should make it safe for an unprivileged user to create a user
namespace and to call "mount --bind -o remount,... ..." without
the danger of mount flags being changed maliciously.

Cc: stable@vger.kernel.org
Acked-by: Serge E. Hallyn
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2014-08-01 08:12:34 +0800
a6138db81 mnt: Only change user settable mount flags in remount ... Browse Code »

Kenton Varda discovered that by remounting a
read-only bind mount read-only in a user namespace the
MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user
to the remount a read-only mount read-write.

Correct this by replacing the mask of mount flags to preserve
with a mask of mount flags that may be changed, and preserve
all others. This ensures that any future bugs with this mask and
remount will fail in an easy to detect way where new mount flags
simply won't change.

Cc: stable@vger.kernel.org
Acked-by: Serge E. Hallyn
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2014-08-01 08:11:54 +0800

02 Apr, 2014

1 commit

f2ebb3a92 smarter propagate_mnt() ... Browse Code »

The current mainline has copies propagated to *all* nodes, then
tears down the copies we made for nodes that do not contain
counterparts of the desired mountpoint. That sets the right
propagation graph for the copies (at teardown time we move
the slaves of removed node to a surviving peer or directly
to master), but we end up paying a fairly steep price in
useless allocations. It's fairly easy to create a situation
where N calls of mount(2) create exactly N bindings, with
O(N^2) vfsmounts allocated and freed in process.

Fortunately, it is possible to avoid those allocations/freeings.
The trick is to create copies in the right order and find which
one would've eventually become a master with the current algorithm.
It turns out to be possible in O(nodes getting propagation) time
and with no extra allocations at all.

One part is that we need to make sure that eventual master will be
created before its slaves, so we need to walk the propagation
tree in a different order - by peer groups. And iterate through
the peers before dealing with the next group.

Another thing is finding the (earlier) copy that will be a master
of one we are about to create; to do that we are (temporary) marking
the masters of mountpoints we are attaching the copies to.

Either we are in a peer of the last mountpoint we'd dealt with,
or we have the following situation: we are attaching to mountpoint M,
the last copy S_0 had been attached to M_0 and there are sequences
S_0...S_n, M_0...M_n such that S_{i+1} is a master of S_{i},
S_{i} mounted on M{i} and we need to create a slave of the first S_{k}
such that M is getting propagation from M_{k}. It means that the master
of M_{k} will be among the sequence of masters of M. On the
other hand, the nearest marked node in that sequence will either
be the master of M_{k} or the master of M_{k-1} (the latter -
in the case if M_{k-1} is a slave of something M gets propagation
from, but in a wrong peer group).

So we go through the sequence of masters of M until we find
a marked one (P). Let N be the one before it. Then we go through
the sequence of masters of S_0 until we find one (say, S) mounted
on a node D that has P as master and check if D is a peer of N.
If it is, S will be the master of new copy, if not - the master of S
will be.

That's it for the hard part; the rest is fairly simple. Iterator
is in next_group(), handling of one prospective mountpoint is
propagate_one().

It seems to survive all tests and gives a noticably better performance
than the current mainline for setups that are seriously using shared
subtrees.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro

Al Viro
2014-04-02 11:19:08 +0800

09 Nov, 2013

1 commit

48a066e72 RCU'd vfsmounts ... Browse Code »

* RCU-delayed freeing of vfsmounts
* vfsmount_lock replaced with a seqlock (mount_lock)
* sequence number from mount_lock is stored in nameidata->m_seq and
used when we exit RCU mode
* new vfsmount flag - MNT_SYNC_UMOUNT. Set by umount_tree() when its
caller knows that vfsmount will have no surviving references.
* synchronize_rcu() done between unlocking namespace_sem in namespace_unlock()
and doing pending mntput().
* new helper: legitimize_mnt(mnt, seq). Checks the mount_lock sequence
number against seq, then grabs reference to mnt. Then it rechecks mount_lock
again to close the race and either returns success or drops the reference it
has acquired. The subtle point is that in case of MNT_SYNC_UMOUNT we can
simply decrement the refcount and sod off - aforementioned synchronize_rcu()
makes sure that final mntput() won't come until we leave RCU mode. We need
that, since we don't want to end up with some lazy pathwalk racing with
umount() and stealing the final mntput() from it - caller of umount() may
expect it to return only once the fs is shut down and we don't want to break
that. In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do
full-blown mntput() in case of mount_lock sequence number mismatch happening
just as we'd grabbed the reference, but in those cases we won't be stealing
the final mntput() from anything that would care.
* mntput_no_expire() doesn't lock anything on the fast path now. Incidentally,
SMP and UP cases are handled the same way - no ifdefs there.
* normal pathname resolution does *not* do any writes to mount_lock. It does,
of course, bump the refcounts of vfsmount and dentry in the very end, but that's
it.

Signed-off-by: Al Viro

Al Viro
2013-11-09 13:16:19 +0800

25 Jul, 2013

1 commit

5ff9d8a65 vfs: Lock in place mounts from more privileged users ... Browse Code »

When creating a less privileged mount namespace or propogating mounts
from a more privileged to a less privileged mount namespace lock the
submounts so they may not be unmounted individually in the child mount
namespace revealing what is under them.

This enforces the reasonable expectation that it is not possible to
see under a mount point. Most of the time mounts are on empty
directories and revealing that does not matter, however I have seen an
occassionaly sloppy configuration where there were interesting things
concealed under a mount point that probably should not be revealed.

Expirable submounts are not locked because they will eventually
unmount automatically so whatever is under them already needs
to be safe for unprivileged users to access.

From a practical standpoint these restrictions do not appear to be
significant for unprivileged users of the mount namespace. Recursive
bind mounts and pivot_root continues to work, and mounts that are
created in a mount namespace may be unmounted there. All of which
means that the common idiom of keeping a directory of interesting
files and using pivot_root to throw everything else away continues to
work just fine.

Acked-by: Serge Hallyn
Acked-by: Andy Lutomirski
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-07-25 00:14:46 +0800

27 Mar, 2013

1 commit

90563b198 vfs: Add a mount flag to lock read only bind mounts ... Browse Code »

When a read-only bind mount is copied from mount namespace in a higher
privileged user namespace to a mount namespace in a lesser privileged
user namespace, it should not be possible to remove the the read-only
restriction.

Add a MNT_LOCK_READONLY mount flag to indicate that a mount must
remain read-only.

CC: stable@vger.kernel.org
Acked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-03-27 22:50:04 +0800

04 Jan, 2012

15 commits

c63181e6b vfs: move fsnotify junk to struct mount ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:57:12 +0800
52ba1621d vfs: move mnt_devname ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:57:11 +0800
1a4eeaf2a vfs: move mnt_list to struct mount ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:57:11 +0800
863d684f9 vfs: move the rest of int fields to struct mount ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:57:10 +0800
15169fe78 vfs: mnt_id/mnt_group_id moved ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:57:10 +0800
143c8c91c vfs: mnt_ns moved to struct mount ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:57:09 +0800
6776db3d3 vfs: take mnt_share/mnt_slave/mnt_slave_list and mnt_expire to struct mount ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:57:08 +0800
d10e8def0 vfs: take mnt_master to struct mount ... Browse Code »

make IS_MNT_SLAVE take struct mount * at the same time

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:57:08 +0800
6b41d536f vfs: take mnt_child/mnt_mounts to struct mount ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:57:06 +0800
68e8a9fea vfs: all counters taken to struct mount ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:57:06 +0800
a73324da7 vfs: move mnt_mountpoint to struct mount ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:57:05 +0800
3376f34ff vfs: mnt_parent moved to struct mount ... Browse Code »

the second victim...

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:57:04 +0800
1b8e5564b vfs: the first spoils - mnt_hash moved ... Browse Code »

taken out of struct vfsmount into struct mount

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:57:02 +0800
2a79f17e4 vfs: mnt_drop_write_file() ... Browse Code »

new helper (wrapper around mnt_drop_write()) to be used in pair with
mnt_want_write_file().

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:52:40 +0800
79e801a90 vfs: make do_kern_mount() static ... Browse Code »

the only user outside of fs/namespace.c has died

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:52:39 +0800

27 Jul, 2011

1 commit

60063497a atomic: use <linux/atomic.h> ... Browse Code »

This allows us to move duplicated code in
(atomic_inc_not_zero() for now) to

Signed-off-by: Arun Sharma
Reviewed-by: Eric Dumazet
Cc: Ingo Molnar
Cc: David Miller
Cc: Eric Dumazet
Acked-by: Mike Frysinger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Arun Sharma
2011-07-27 07:49:47 +0800

17 Jan, 2011

1 commit

f03c65993 sanitize vfsmount refcounting changes ... Browse Code »

Instead of splitting refcount between (per-cpu) mnt_count
and (SMP-only) mnt_longrefs, make all references contribute
to mnt_count again and keep track of how many are longterm
ones.

Accounting rules for longterm count:
* 1 for each fs_struct.root.mnt
* 1 for each fs_struct.pwd.mnt
* 1 for having non-NULL ->mnt_ns
* decrement to 0 happens only under vfsmount lock exclusive

That allows nice common case for mntput() - since we can't drop the
final reference until after mnt_longterm has reached 0 due to the rules
above, mntput() can grab vfsmount lock shared and check mnt_longterm.
If it turns out to be non-zero (which is the common case), we know
that this is not the final mntput() and can just blindly decrement
percpu mnt_count. Otherwise we grab vfsmount lock exclusive and
do usual decrement-and-check of percpu mnt_count.

For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
namespace.c uses the latter in places where we don't already hold
vfsmount lock exclusive and opencodes a few remaining spots where
we need to manipulate mnt_longterm.

Note that we mostly revert the code outside of fs/namespace.c back
to what we used to have; in particular, normal code doesn't need
to care about two kinds of references, etc. And we get to keep
the optimization Nick's variant had bought us...

Signed-off-by: Al Viro

Al Viro
2011-01-17 02:47:07 +0800

16 Jan, 2011

1 commit

ea5b778a8 Unexport do_add_mount() and add in follow_automount(), not ->d_automount() ... Browse Code »

Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
added rather than calling do_add_mount() itself. follow_automount() will then
do the addition.

This slightly complicates things as ->d_automount() normally wants to add the
new vfsmount to an expiration list and start an expiration timer. The problem
with that is that the vfsmount will be deleted if it has a refcount of 1 and
the timer will not repeat if the expiration list is empty.

To this end, we require the vfsmount to be returned from d_automount() with a
refcount of (at least) 2. One of these refs will be dropped unconditionally.
In addition, follow_automount() must get a 3rd ref around the call to
do_add_mount() lest it eat a ref and return an error, leaving the mount we
have open to being expired as we would otherwise have only 1 ref on it.

d_automount() should also add the the vfsmount to the expiration list (by
calling mnt_set_expiry()) and start the expiration timer before returning, if
this mechanism is to be used. The vfsmount will be unlinked from the
expiration list by follow_automount() if do_add_mount() fails.

This patch also fixes the call to do_add_mount() for AFS to propagate the mount
flags from the parent vfsmount.

Signed-off-by: David Howells
Signed-off-by: Al Viro

David Howells
2011-01-16 09:07:48 +0800

07 Jan, 2011

1 commit

b3e19d924 fs: scale mntget/mntput ... Browse Code »

The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
which often go to the same mount point.

The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.

We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.

- check the global sum once every interval (this will delay zero detection
for some interval, so it's probably a showstopper for vfsmounts).

- keep a local count and only taking the global sum when local reaches 0 (this
is difficult for vfsmounts, because we can't hold preempt off for the life of
a reference, so a counter would need to be per-thread or tied strongly to a
particular CPU which requires more locking).

- keep a local difference of increments and decrements, which allows us to sum
the total difference and hence find the refcount when summing all CPUs. Then,
keep a single integer "long" refcount for slow and long lasting references,
and only take the global sum of local counters when the long refcount is 0.

This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.

This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.

This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:33 +0800

11 Aug, 2010

1 commit

532490f0a vfs: remove unused MNT_STRICTATIME ... Browse Code »

Commit d0adde574b8487ef30f69e2d08bba769e4be513f added MNT_STRICTATIME
but it isn't actually used (MS_STRICTATIME clears MNT_RELATIME and
MNT_NOATIME rather than setting any mount flag).

Signed-off-by: Miklos Szeredi
Signed-off-by: Al Viro

Miklos Szeredi
2010-08-11 12:29:47 +0800

28 Jul, 2010

1 commit

2504c5d63 fsnotify/vfsmount: add fsnotify fields to struct vfsmount ... Browse Code »

This patch adds the list and mask fields needed to support vfsmount marks.
These are the same fields fsnotify needs on an inode. They are not used,
just declared and we note where the cleanup hook should be (the function is
not yet defined)

Signed-off-by: Andreas Gruenbacher
Signed-off-by: Eric Paris

Andreas Gruenbacher
2010-07-28 21:58:57 +0800

05 Mar, 2010

1 commit

0f2cc4ecd Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 ... Browse Code »

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
init: Open /dev/console from rootfs
mqueue: fix typo "failues" -> "failures"
mqueue: only set error codes if they are really necessary
mqueue: simplify do_open() error handling
mqueue: apply mathematics distributivity on mq_bytes calculation
mqueue: remove unneeded info->messages initialization
mqueue: fix mq_open() file descriptor leak on user-space processes
fix race in d_splice_alias()
set S_DEAD on unlink() and non-directory rename() victims
vfs: add NOFOLLOW flag to umount(2)
get rid of ->mnt_parent in tomoyo/realpath
hppfs can use existing proc_mnt, no need for do_kern_mount() in there
Mirror MS_KERNMOUNT in ->mnt_flags
get rid of useless vfsmount_lock use in put_mnt_ns()
Take vfsmount_lock to fs/internal.h
get rid of insanity with namespace roots in tomoyo
take check for new events in namespace (guts of mounts_poll()) to namespace.c
Don't mess with generic_permission() under ->d_lock in hpfs
sanitize const/signedness for udf
nilfs: sanitize const/signedness in dealing with ->d_name.name
...

Fix up fairly trivial (famous last words...) conflicts in
drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c

Linus Torvalds
2010-03-05 00:15:33 +0800

04 Mar, 2010

3 commits

8089352a1 Mirror MS_KERNMOUNT in ->mnt_flags ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2010-03-04 03:08:00 +0800
47cd813f2 Take vfsmount_lock to fs/internal.h ... Browse Code »

no more users left outside of fs/*.c (and very few outside of
fs/namespace.c, actually)

Signed-off-by: Al Viro

Al Viro
2010-03-04 03:07:59 +0800
495d6c9c6 VFS: Clean up shared mount flag propagation ... Browse Code »

The handling of mount flags in set_mnt_shared() got a little tangled
up during previous cleanups, with the following problems:

* MNT_PNODE_MASK is defined as a literal constant when it should be a
bitwise xor of other MNT_* flags
* set_mnt_shared() clears and then sets MNT_SHARED (part of MNT_PNODE_MASK)
* MNT_PNODE_MASK could use a comment in mount.h
* MNT_PNODE_MASK is a terrible name, change to MNT_SHARED_MASK

This patch fixes these problems.

Signed-off-by: Al Viro

Valerie Aurora
2010-03-04 03:07:55 +0800

17 Feb, 2010

1 commit

003cb608a percpu: add __percpu sparse annotations to fs ... Browse Code »

Add __percpu sparse annotations to fs.

These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors. This patch doesn't affect normal builds.

Signed-off-by: Tejun Heo
Cc: "Theodore Ts'o"
Cc: Trond Myklebust
Cc: Alex Elder
Cc: Christoph Hellwig
Cc: Alexander Viro

Tejun Heo
2010-02-17 10:17:38 +0800