Eric Lee / smarc-fsl-linux-kernel

21 Nov, 2018

3 commits

f3f529741 mount: Prevent MNT_DETACH from disconnecting locked mounts ... Browse Code »

commit 9c8e0a1b683525464a2abe9fb4b54404a50ed2b4 upstream.

Timothy Baldwin wrote:
> As per mount_namespaces(7) unprivileged users should not be able to look under mount points:
>
> Mounts that come as a single unit from more privileged mount are locked
> together and may not be separated in a less privileged mount namespace.
>
> However they can:
>
> 1. Create a mount namespace.
> 2. In the mount namespace open a file descriptor to the parent of a mount point.
> 3. Destroy the mount namespace.
> 4. Use the file descriptor to look under the mount point.
>
> I have reproduced this with Linux 4.16.18 and Linux 4.18-rc8.
>
> The setup:
>
> $ sudo sysctl kernel.unprivileged_userns_clone=1
> kernel.unprivileged_userns_clone = 1
> $ mkdir -p A/B/Secret
> $ sudo mount -t tmpfs hide A/B
>
>
> "Secret" is indeed hidden as expected:
>
> $ ls -lR A
> A:
> total 0
> drwxrwxrwt 2 root root 40 Feb 12 21:08 B
>
> A/B:
> total 0
>
>
> The attack revealing "Secret":
>
> $ unshare -Umr sh -c "exec unshare -m ls -lR /proc/self/fd/4/ 4
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2018-11-21 16:24:14 +0800
c4585861f mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts ... Browse Code »

commit df7342b240185d58d3d9665c0bbf0a0f5570ec29 upstream.

Jonathan Calmels from NVIDIA reported that he's able to bypass the
mount visibility security check in place in the Linux kernel by using
a combination of the unbindable property along with the private mount
propagation option to allow a unprivileged user to see a path which
was purposefully hidden by the root user.

Reproducer:
# Hide a path to all users using a tmpfs
root@castiana:~# mount -t tmpfs tmpfs /sys/devices/
root@castiana:~#

# As an unprivileged user, unshare user namespace and mount namespace
stgraber@castiana:~$ unshare -U -m -r

# Confirm the path is still not accessible
root@castiana:~# ls /sys/devices/

# Make /sys recursively unbindable and private
root@castiana:~# mount --make-runbindable /sys
root@castiana:~# mount --make-private /sys

# Recursively bind-mount the rest of /sys over to /mnnt
root@castiana:~# mount --rbind /sys/ /mnt

# Access our hidden /sys/device as an unprivileged user
root@castiana:~# ls /mnt/devices/
breakpoint cpu cstate_core cstate_pkg i915 intel_pt isa kprobe
LNXSYSTM:00 msr pci0000:00 platform pnp0 power software system
tracepoint uncore_arb uncore_cbox_0 uncore_cbox_1 uprobe virtual

Solve this by teaching copy_tree to fail if a mount turns out to be
both unbindable and locked.

Cc: stable@vger.kernel.org
Fixes: 5ff9d8a65ce8 ("vfs: Lock in place mounts from more privileged users")
Reported-by: Jonathan Calmels
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2018-11-21 16:24:13 +0800
5e64ee878 mount: Retest MNT_LOCKED in do_umount ... Browse Code »

commit 25d202ed820ee347edec0bf3bf553544556bf64b upstream.

It was recently pointed out that the one instance of testing MNT_LOCKED
outside of the namespace_sem is in ksys_umount.

Fix that by adding a test inside of do_umount with namespace_sem and
the mount_lock held. As it helps to fail fails the existing test is
maintained with an additional comment pointing out that it may be racy
because the locks are not held.

Cc: stable@vger.kernel.org
Reported-by: Al Viro
Fixes: 5ff9d8a65ce8 ("vfs: Lock in place mounts from more privileged users")
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2018-11-21 16:24:13 +0800

20 Oct, 2018

1 commit

f473bf921 Revert "vfs: fix freeze protection in mnt_want_write_file() for overlayfs" ... Browse Code »

This reverts commit 4f4374a9bd25b333971e6f2656b642d29e2efe7b which was
commit a6795a585929d94ca3e931bc8518f8deb8bbe627 upstream.

Turns out this causes problems and was to fix a patch only in the 4.19
and newer tree.

Reported-by: Amir Goldstein
Cc: Miklos Szeredi
Cc: Christoph Hellwig
Cc: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Greg Kroah-Hartman
2018-10-20 15:48:52 +0800

26 Sep, 2018

1 commit

4f4374a9b vfs: fix freeze protection in mnt_want_write_file() for overlayfs ... Browse Code »

[ Upstream commit a6795a585929d94ca3e931bc8518f8deb8bbe627 ]

The underlying real file used by overlayfs still contains the overlay path.
This results in mnt_want_write_file() calls by the filesystem getting
freeze protection on the wrong inode (the overlayfs one instead of the real
one).

Fix by using file_inode(file)->i_sb instead of file->f_path.mnt->mnt_sb.

Reported-by: Amir Goldstein
Signed-off-by: Miklos Szeredi
Reviewed-by: Christoph Hellwig
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Miklos Szeredi
2018-09-26 14:38:09 +0800

16 Aug, 2018

2 commits

e5751c844 fix __legitimize_mnt()/mntput() race ... Browse Code »

commit 119e1ef80ecfe0d1deb6378d4ab41f5b71519de1 upstream.

__legitimize_mnt() has two problems - one is that in case of success
the check of mount_lock is not ordered wrt preceding increment of
refcount, making it possible to have successful __legitimize_mnt()
on one CPU just before the otherwise final mntpu() on another,
with __legitimize_mnt() not seeing mntput() taking the lock and
mntput() not seeing the increment done by __legitimize_mnt().
Solved by a pair of barriers.

Another is that failure of __legitimize_mnt() on the second
read_seqretry() leaves us with reference that'll need to be
dropped by caller; however, if that races with final mntput()
we can end up with caller dropping rcu_read_lock() and doing
mntput() to release that reference - with the first mntput()
having freed the damn thing just as rcu_read_lock() had been
dropped. Solution: in "do mntput() yourself" failure case
grab mount_lock, check if MNT_DOOMED has been set by racing
final mntput() that has missed our increment and if it has -
undo the increment and treat that as "failure, caller doesn't
need to drop anything" case.

It's not easy to hit - the final mntput() has to come right
after the first read_seqretry() in __legitimize_mnt() *and*
manage to miss the increment done by __legitimize_mnt() before
the second read_seqretry() in there. The things that are almost
impossible to hit on bare hardware are not impossible on SMP
KVM, though...

Reported-by: Oleg Nesterov
Fixes: 48a066e72d97 ("RCU'd vsfmounts")
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman

Al Viro
2018-08-16 00:12:48 +0800
d9d46a226 fix mntput/mntput race ... Browse Code »

commit 9ea0a46ca2c318fcc449c1e6b62a7230a17888f1 upstream.

mntput_no_expire() does the calculation of total refcount under mount_lock;
unfortunately, the decrement (as well as all increments) are done outside
of it, leading to false positives in the "are we dropping the last reference"
test. Consider the following situation:
* mnt is a lazy-umounted mount, kept alive by two opened files. One
of those files gets closed. Total refcount of mnt is 2. On CPU 42
mntput(mnt) (called from __fput()) drops one reference, decrementing component
* After it has looked at component #0, the process on CPU 0 does
mntget(), incrementing component #0, gets preempted and gets to run again -
on CPU 69. There it does mntput(), which drops the reference (component #69)
and proceeds to spin on mount_lock.
* On CPU 42 our first mntput() finishes counting. It observes the
decrement of component #69, but not the increment of component #0. As the
result, the total it gets is not 1 as it should've been - it's 0. At which
point we decide that vfsmount needs to be killed and proceed to free it and
shut the filesystem down. However, there's still another opened file
on that filesystem, with reference to (now freed) vfsmount, etc. and we are
screwed.

It's not a wide race, but it can be reproduced with artificial slowdown of
the mnt_get_count() loop, and it should be easier to hit on SMP KVM setups.

Fix consists of moving the refcount decrement under mount_lock; the tricky
part is that we want (and can) keep the fast case (i.e. mount that still
has non-NULL ->mnt_ns) entirely out of mount_lock. All places that zero
mnt->mnt_ns are dropping some reference to mnt and they call synchronize_rcu()
before that mntput(). IOW, if mntput() observes (under rcu_read_lock())
a non-NULL ->mnt_ns, it is guaranteed that there is another reference yet to
be dropped.

Reported-by: Jann Horn
Tested-by: Jann Horn
Fixes: 48a066e72d97 ("RCU'd vsfmounts")
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman

Al Viro
2018-08-16 00:12:48 +0800

21 Jun, 2018

1 commit

6bf379f26 vfs: Undo an overly zealous MS_RDONLY -> SB_RDONLY conversion ... Browse Code »

[ Upstream commit a9e5b73288cf1595ac2e05cf1acd1924ceea05fa ]

In do_mount() when the MS_* flags are being converted to MNT_* flags,
MS_RDONLY got accidentally convered to SB_RDONLY.

Undo this change.

Fixes: e462ec50cb5f ("VFS: Differentiate mount flags (MS_*) from internal superblock flags")
Signed-off-by: David Howells
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

David Howells
2018-06-21 03:02:45 +0800

24 Apr, 2018

1 commit

d4d49cb1c Don't leak MNT_INTERNAL away from internal mounts ... Browse Code »

commit 16a34adb9392b2fe4195267475ab5b472e55292c upstream.

We want it only for the stuff created by SB_KERNMOUNT mounts, *not* for
their copies. As it is, creating a deep stack of bindings of /proc/*/ns/*
somewhere in a new namespace and exiting yields a stack overflow.

Cc: stable@kernel.org
Reported-by: Alexander Aring
Bisected-by: Kirill Tkhai
Tested-by: Kirill Tkhai
Tested-by: Alexander Aring
Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman

Al Viro
2018-04-24 15:36:39 +0800

04 Feb, 2018

1 commit

bdf19237e VFS: Handle lazytime in do_mount() ... Browse Code »

commit d7ee946942bdd12394809305e3df05aa4c8b7b8f upstream.

Since commit e462ec50cb5fa ("VFS: Differentiate mount flags (MS_*) from
internal superblock flags") the lazytime mount option doesn't get passed
on anymore.

Fix the issue by handling the option in do_mount().

Reviewed-by: Lukas Czerner
Signed-off-by: Markus Trippelsdorf
Signed-off-by: Al Viro
Cc: Holger Hoffstätte
Signed-off-by: Greg Kroah-Hartman

Markus Trippelsdorf
2018-02-04 00:38:53 +0800

17 Oct, 2017

1 commit

917086ff2 vfs: fix mounting a filesystem with i_version ... Browse Code »

The mount i_version flag is not enabled in the new sb_flags. This patch
adds the missing SB_I_VERSION flag.

Fixes: e462ec5 "VFS: Differentiate mount flags (MS_*) from internal
superblock flags"
Signed-off-by: Mimi Zohar
Signed-off-by: Al Viro

Mimi Zohar
2017-10-17 14:22:07 +0800

05 Oct, 2017

1 commit

954c736f8 ovl: fix may_write_real() for overlayfs directories ... Browse Code »

Overlayfs directory file_inode() is the overlay inode whether the real
inode is upper or lower.

This fixes a regression in xfstest generic/158.

Fixes: 7c6893e3c9ab ("ovl: don't allow writing ioctl on lower layer")
Signed-off-by: Amir Goldstein
Signed-off-by: Miklos Szeredi

Amir Goldstein
2017-10-05 21:53:18 +0800

15 Sep, 2017

2 commits

9e0ce554b Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull misc leftovers from Al Viro.

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix the __user misannotations in asm-generic get_user/put_user
fput: Don't reinvent the wheel but use existing llist API
namespace.c: Don't reinvent the wheel but use existing llist API

Linus Torvalds
2017-09-15 11:01:41 +0800
0f0d12728 Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull mount flag updates from Al Viro:
"Another chunk of fmount preparations from dhowells; only trivial
conflicts for that part. It separates MS_... bits (very grotty
mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
only a small subset of MS_... stuff).

This does *not* convert the filesystems to new constants; only the
infrastructure is done here. The next step in that series is where the
conflicts would be; that's the conversion of filesystems. It's purely
mechanical and it's better done after the merge, so if you could run
something like

list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

sed -i -e 's/\/SB_RDONLY/g' \
-e 's/\/SB_NOSUID/g' \
-e 's/\/SB_NODEV/g' \
-e 's/\/SB_NOEXEC/g' \
-e 's/\/SB_SYNCHRONOUS/g' \
-e 's/\/SB_MANDLOCK/g' \
-e 's/\/SB_DIRSYNC/g' \
-e 's/\/SB_NOATIME/g' \
-e 's/\/SB_NODIRATIME/g' \
-e 's/\/SB_SILENT/g' \
-e 's/\/SB_POSIXACL/g' \
-e 's/\/SB_KERNMOUNT/g' \
-e 's/\/SB_I_VERSION/g' \
-e 's/\/SB_LAZYTIME/g' \
$list

and commit it with something along the lines of 'convert filesystems
away from use of MS_... constants' as commit message, it would save a
quite a bit of headache next cycle"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
VFS: Differentiate mount flags (MS_*) from internal superblock flags
VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags

Linus Torvalds
2017-09-15 09:54:01 +0800

05 Sep, 2017

1 commit

7c6893e3c ovl: don't allow writing ioctl on lower layer ... Browse Code »

Problem with ioctl() is that it's a file operation, yet often used as an
inode operation (i.e. modify the inode despite the file being opened for
read-only).

mnt_want_write_file() is used by filesystems in such cases to get write
access on an arbitrary open file.

Since overlayfs lets filesystems do all file operations, including ioctl,
this can lead to mnt_want_write_file() returning OK for a lower file and
modification of that lower file.

This patch prevents modification by checking if the file is from an
overlayfs lower layer and returning EPERM in that case.

Need to introduce a mnt_want_write_file_path() variant that still does the
old thing for inode operations that can do the copy up + modification
correctly in such cases (fchown, fsetxattr, fremovexattr).

This does not address the correctness of such ioctls on overlayfs (the
correct way would be to copy up and attempt to perform ioctl on upper
file).

In theory this could be a regression. We very much hope that nobody is
relying on such a hack in any sane setup.

While this patch meddles in VFS code, it has no effect on non-overlayfs
filesystems.

Reported-by: "zhangyi (F)"
Signed-off-by: Miklos Szeredi

Miklos Szeredi
2017-09-05 18:53:12 +0800

28 Aug, 2017

1 commit

297857357 namespace.c: Don't reinvent the wheel but use existing llist API ... Browse Code »

Although llist provides proper APIs, they are not used. Make them used.

Signed-off-by: Byungchul Park
Signed-off-by: Al Viro

Byungchul Park
2017-08-28 12:50:22 +0800

17 Jul, 2017

2 commits

e462ec50c VFS: Differentiate mount flags (MS_*) from internal superblock flags ... Browse Code »

Differentiate the MS_* flags passed to mount(2) from the internal flags set
in the super_block's s_flags. s_flags are now called SB_*, with the names
and the values for the moment mirroring the MS_* flags that they're
equivalent to.

In this patch, just the headers are altered and some kernel code where
blind automated conversion isn't necessarily correct.

Note that this shows up some interesting issues:

(1) Some MS_* flags get translated to MNT_* flags (such as MS_NODEV ->
MNT_NODEV) without passing this on to the filesystem, but some
filesystems set such flags anyway.

(2) The ->remount_fs() methods of some filesystems adjust the *flags
argument by setting MS_* flags in it, such as MS_NOATIME - but these
flags are then scrubbed by do_remount_sb() (only the occupants of
MS_RMT_MASK are permitted: MS_RDONLY, MS_SYNCHRONOUS, MS_MANDLOCK,
MS_I_VERSION and MS_LAZYTIME)

I'm not sure what's the best way to solve all these cases.

Suggested-by: Al Viro
Signed-off-by: David Howells

David Howells
2017-07-17 15:45:35 +0800
bc98a42c1 VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) ... Browse Code »

Firstly by applying the following with coccinelle's spatch:

@@ expression SB; @@
-SB->s_flags & MS_RDONLY
+sb_rdonly(SB)

to effect the conversion to sb_rdonly(sb), then by applying:

@@ expression A, SB; @@
(
-(!sb_rdonly(SB)) && A
+!sb_rdonly(SB) && A
|
-A != (sb_rdonly(SB))
+A != sb_rdonly(SB)
|
-A == (sb_rdonly(SB))
+A == sb_rdonly(SB)
|
-!(sb_rdonly(SB))
+!sb_rdonly(SB)
|
-A && (sb_rdonly(SB))
+A && sb_rdonly(SB)
|
-A || (sb_rdonly(SB))
+A || sb_rdonly(SB)
|
-(sb_rdonly(SB)) != A
+sb_rdonly(SB) != A
|
-(sb_rdonly(SB)) == A
+sb_rdonly(SB) == A
|
-(sb_rdonly(SB)) && A
+sb_rdonly(SB) && A
|
-(sb_rdonly(SB)) || A
+sb_rdonly(SB) || A
)

@@ expression A, B, SB; @@
(
-(sb_rdonly(SB)) ? 1 : 0
+sb_rdonly(SB)
|
-(sb_rdonly(SB)) ? A : B
+sb_rdonly(SB) ? A : B
)

to remove left over excess bracketage and finally by applying:

@@ expression A, SB; @@
(
-(A & MS_RDONLY) != sb_rdonly(SB)
+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
|
-(A & MS_RDONLY) == sb_rdonly(SB)
+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
)

to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.

Signed-off-by: David Howells

David Howells
2017-07-17 15:45:34 +0800

16 Jul, 2017

1 commit

78dcf7342 Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull ->s_options removal from Al Viro:
"Preparations for fsmount/fsopen stuff (coming next cycle). Everything
gets moved to explicit ->show_options(), killing ->s_options off +
some cosmetic bits around fs/namespace.c and friends. Basically, the
stuff needed to work with fsmount series with minimum of conflicts
with other work.

It's not strictly required for this merge window, but it would reduce
the PITA during the coming cycle, so it would be nice to have those
bits and pieces out of the way"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
isofs: Fix isofs_show_options()
VFS: Kill off s_options and helpers
orangefs: Implement show_options
9p: Implement show_options
isofs: Implement show_options
afs: Implement show_options
affs: Implement show_options
befs: Implement show_options
spufs: Implement show_options
bpf: Implement show_options
ramfs: Implement show_options
pstore: Implement show_options
omfs: Implement show_options
hugetlbfs: Implement show_options
VFS: Don't use save/replace_mount_options if not using generic_show_options
VFS: Provide empty name qstr
VFS: Make get_filesystem() return the affected filesystem
VFS: Clean up whitespace in fs/namespace.c and fs/super.c
Provide a function to create a NUL-terminated string from unterminated data

Linus Torvalds
2017-07-16 03:00:42 +0800

11 Jul, 2017

1 commit

1d278a879 VFS: Kill off s_options and helpers ... Browse Code »

Kill off s_options, save/replace_mount_options() and generic_show_options()
as all filesystems now implement ->show_options() for themselves. This
should make it easier to implement a context-based mount where the mount
options can be passed individually over a file descriptor.

Signed-off-by: David Howells
Signed-off-by: Al Viro

David Howells
2017-07-11 18:09:21 +0800

07 Jul, 2017

1 commit

3d375d785 mm: update callers to use HASH_ZERO flag ... Browse Code »

Update dcache, inode, pid, mountpoint, and mount hash tables to use
HASH_ZERO, and remove initialization after allocations. In case of
places where HASH_EARLY was used such as in __pv_init_lock_hash the
zeroed hash table was already assumed, because memblock zeroes the
memory.

CPU: SPARC M6, Memory: 7T
Before fix:
Dentry cache hash table entries: 1073741824
Inode-cache hash table entries: 536870912
Mount-cache hash table entries: 16777216
Mountpoint-cache hash table entries: 16777216
ftrace: allocating 20414 entries in 40 pages
Total time: 11.798s

After fix:
Dentry cache hash table entries: 1073741824
Inode-cache hash table entries: 536870912
Mount-cache hash table entries: 16777216
Mountpoint-cache hash table entries: 16777216
ftrace: allocating 20414 entries in 40 pages
Total time: 3.198s

CPU: Intel Xeon E5-2630, Memory: 2.2T:
Before fix:
Dentry cache hash table entries: 536870912
Inode-cache hash table entries: 268435456
Mount-cache hash table entries: 8388608
Mountpoint-cache hash table entries: 8388608
CPU: Physical Processor ID: 0
Total time: 3.245s

After fix:
Dentry cache hash table entries: 536870912
Inode-cache hash table entries: 268435456
Mount-cache hash table entries: 8388608
Mountpoint-cache hash table entries: 8388608
CPU: Physical Processor ID: 0
Total time: 3.244s

Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin
Reviewed-by: Babu Moger
Cc: David Miller
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Tatashin
2017-07-07 07:24:33 +0800

06 Jul, 2017

2 commits

dd111b31e VFS: Clean up whitespace in fs/namespace.c and fs/super.c ... Browse Code »

Clean up line terminal whitespace in fs/namespace.c and fs/super.c.

Signed-off-by: David Howells
Signed-off-by: Al Viro

David Howells
2017-07-06 15:27:09 +0800
e5f76a2e0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull mnt namespace updates from Eric Biederman:
"A big break-through came during this development cycle as a way was
found to maintain the existing umount -l semantics while allowing for
optimizations that improve the performance. That is represented by the
first change in this series moving the reparenting of mounts into
their own pass. This has allowed addressing the horrific performance
of umount -l on a carefully crafted tree of mounts with locks held
(0.06s vs 60s in my testing). What allowed this was not changing where
umounts propagate to while propgating umounts.

The next change fixes the case where the order of the mount whose
umount are being progated visits a tree where the mounts are stacked
upon each other in another order. This is weird but not hard to
implement.

The final change takes advantage of the unchanging mount propgation
tree to skip parts of the mount propgation tree that have already been
visited. Yielding a very nice speed up in the worst case.

There remains one outstanding question about the semantics of umount -l
that I am still discussiong with Ram Pai. In practice that area of the
semantics was changed by 1064f874abc0 ("mnt: Tuck mounts under others
instead of creating shadow/side mounts.") and no regressions have been
reported. Still I intend to finish talking that out with him to ensure
there is not something a more intense use of mount propagation in the
future will not cause to become significant"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
mnt: Make propagate_umount less slow for overlapping mount propagation trees
mnt: In propgate_umount handle visiting mounts in any order
mnt: In umount propagation reparent in a separate pass

Linus Torvalds
2017-07-06 08:00:56 +0800

15 Jun, 2017

1 commit

4068367c9 fs: don't forget to put old mntns in mntns_install ... Browse Code »

Fixes: 4f757f3cbf54 ("make sure that mntns_install() doesn't end up with referral for root")
Cc: Al Viro
Signed-off-by: Andrei Vagin
Signed-off-by: Al Viro

Andrei Vagin
2017-06-15 18:53:05 +0800

23 May, 2017

2 commits

99b19d164 mnt: In propgate_umount handle visiting mounts in any order ... Browse Code »

While investigating some poor umount performance I realized that in
the case of overlapping mount trees where some of the mounts are locked
the code has been failing to unmount all of the mounts it should
have been unmounting.

This failure to unmount all of the necessary
mounts can be reproduced with:

$ cat locked_mounts_test.sh

mount -t tmpfs test-base /mnt
mount --make-shared /mnt
mkdir -p /mnt/b

mount -t tmpfs test1 /mnt/b
mount --make-shared /mnt/b
mkdir -p /mnt/b/10

mount -t tmpfs test2 /mnt/b/10
mount --make-shared /mnt/b/10
mkdir -p /mnt/b/10/20

mount --rbind /mnt/b /mnt/b/10/20

unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
sleep 1
umount -l /mnt/b
wait %%

$ unshare -Urm ./locked_mounts_test.sh

This failure is corrected by removing the prepass that marks mounts
that may be umounted.

A first pass is added that umounts mounts if possible and if not sets
mount mark if they could be unmounted if they weren't locked and adds
them to a list to umount possibilities. This first pass reconsiders
the mounts parent if it is on the list of umount possibilities, ensuring
that information of umoutability will pass from child to mount parent.

A second pass then walks through all mounts that are umounted and processes
their children unmounting them or marking them for reparenting.

A last pass cleans up the state on the mounts that could not be umounted
and if applicable reparents them to their first parent that remained
mounted.

While a bit longer than the old code this code is much more robust
as it allows information to flow up from the leaves and down
from the trunk making the order in which mounts are encountered
in the umount propgation tree irrelevant.

Cc: stable@vger.kernel.org
Fixes: 0c56fe31420c ("mnt: Don't propagate unmounts to locked mounts")
Reviewed-by: Andrei Vagin
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2017-05-23 21:41:16 +0800
570487d3f mnt: In umount propagation reparent in a separate pass ... Browse Code »

It was observed that in some pathlogical cases that the current code
does not unmount everything it should. After investigation it
was determined that the issue is that mnt_change_mntpoint can
can change which mounts are available to be unmounted during mount
propagation which is wrong.

The trivial reproducer is:
$ cat ./pathological.sh

mount -t tmpfs test-base /mnt
cd /mnt
mkdir 1 2 1/1
mount --bind 1 1
mount --make-shared 1
mount --bind 1 2
mount --bind 1/1 1/1
mount --bind 1/1 1/1
echo
grep test-base /proc/self/mountinfo
umount 1/1
echo
grep test-base /proc/self/mountinfo

$ unshare -Urm ./pathological.sh

The expected output looks like:
46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

The output without the fix looks like:
46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

That last mount in the output was in the propgation tree to be unmounted but
was missed because the mnt_change_mountpoint changed it's parent before the walk
through the mount propagation tree observed it.

Cc: stable@vger.kernel.org
Fixes: 1064f874abc0 ("mnt: Tuck mounts under others instead of creating shadow/side mounts.")
Acked-by: Andrei Vagin
Reviewed-by: Ram Pai
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2017-05-23 21:40:32 +0800

13 May, 2017

1 commit

050453295 Merge branch 'work.sane_pwd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull misc vfs updates from Al Viro:
"Making sure that something like a referral point won't end up as pwd
or root.

The main part is the last commit (fixing mntns_install()); that one
fixes a hard-to-hit race. The fchdir() commit is making fchdir(2) a
bit more robust - it should be impossible to get opened files (even
O_PATH ones) for referral points in the first place, so the existing
checks are OK, but checking the same thing as in chdir(2) is just as
cheap.

The path_init() commit removes a redundant check that shouldn't have
been there in the first place"

* 'work.sane_pwd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
make sure that mntns_install() doesn't end up with referral for root
path_init(): don't bother with checking MAY_EXEC for LOOKUP_ROOT
make sure that fchdir() won't accept referral points, etc.

Linus Torvalds
2017-05-13 02:39:59 +0800

22 Apr, 2017

1 commit

4f757f3cb make sure that mntns_install() doesn't end up with referral for root ... Browse Code »

new flag: LOOKUP_DOWN. If the starting point is overmounted, cross
into whatever's mounted on top, triggering referrals et.al.

Use that instead of follow_down_one() loop in mntns_install(), handle
errors properly.

Signed-off-by: Al Viro

Al Viro
2017-04-22 02:05:36 +0800

10 Apr, 2017

2 commits

08991e83b fsnotify: Free fsnotify_mark_connector when there is no mark attached ... Browse Code »

Currently we free fsnotify_mark_connector structure only when inode /
vfsmount is getting freed. This can however impose noticeable memory
overhead when marks get attached to inodes only temporarily. So free the
connector structure once the last mark is detached from the object.
Since notification infrastructure can be working with the connector
under the protection of fsnotify_mark_srcu, we have to be careful and
free the fsnotify_mark_connector only after SRCU period passes.

Reviewed-by: Miklos Szeredi
Reviewed-by: Amir Goldstein
Signed-off-by: Jan Kara

Jan Kara
2017-04-10 23:37:35 +0800
9dd813c15 fsnotify: Move mark list head from object into dedicated structure ... Browse Code »

Currently notification marks are attached to object (inode or vfsmnt) by
a hlist_head in the object. The list is also protected by a spinlock in
the object. So while there is any mark attached to the list of marks,
the object must be pinned in memory (and thus e.g. last iput() deleting
inode cannot happen). Also for list iteration in fsnotify() to work, we
must hold fsnotify_mark_srcu lock so that mark itself and
mark->obj_list.next cannot get freed. Thus we are required to wait for
response to fanotify events from userspace process with
fsnotify_mark_srcu lock held. That causes issues when userspace process
is buggy and does not reply to some event - basically the whole
notification subsystem gets eventually stuck.

So to be able to drop fsnotify_mark_srcu lock while waiting for
response, we have to pin the mark in memory and make sure it stays in
the object list (as removing the mark waiting for response could lead to
lost notification events for groups later in the list). However we don't
want inode reclaim to block on such mark as that would lead to system
just locking up elsewhere.

This commit is the first in the series that paves way towards solving
these conflicting lifetime needs. Instead of anchoring the list of marks
directly in the object, we anchor it in a dedicated structure
(fsnotify_mark_connector) and just point to that structure from the
object. The following commits will also add spinlock protecting the list
and object pointer to the structure.

Reviewed-by: Miklos Szeredi
Reviewed-by: Amir Goldstein
Signed-off-by: Jan Kara

Jan Kara
2017-04-10 23:37:34 +0800

02 Mar, 2017

2 commits

9164bb4a1 sched/headers: Prepare to move 'init_task' and 'init_thread_union' from <linux/s… ... Browse Code »

…ched.h> to <linux/sched/task.h>

Update all usage sites first.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Ingo Molnar
2017-03-02 15:42:38 +0800
5b825c3af sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h> ... Browse Code »

Add #include dependencies to all .c files rely on sched.h
doing that for them.

Note that even if the count where we need to add extra headers seems high,
it's still a net win, because is included in over
2,200 files ...

Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

Ingo Molnar
2017-03-02 15:42:31 +0800

03 Feb, 2017

1 commit

1064f874a mnt: Tuck mounts under others instead of creating shadow/side mounts. ... Browse Code »

Ever since mount propagation was introduced in cases where a mount in
propagated to parent mount mountpoint pair that is already in use the
code has placed the new mount behind the old mount in the mount hash
table.

This implementation detail is problematic as it allows creating
arbitrary length mount hash chains.

Furthermore it invalidates the constraint maintained elsewhere in the
mount code that a parent mount and a mountpoint pair will have exactly
one mount upon them. Making it hard to deal with and to talk about
this special case in the mount code.

Modify mount propagation to notice when there is already a mount at
the parent mount and mountpoint where a new mount is propagating to
and place that preexisting mount on top of the new mount.

Modify unmount propagation to notice when a mount that is being
unmounted has another mount on top of it (and no other children), and
to replace the unmounted mount with the mount on top of it.

Move the MNT_UMUONT test from __lookup_mnt_last into
__propagate_umount as that is the only call of __lookup_mnt_last where
MNT_UMOUNT may be set on any mount visible in the mount hash table.

These modifications allow:
- __lookup_mnt_last to be removed.
- attach_shadows to be renamed __attach_mnt and its shadow
handling to be removed.
- commit_tree to be simplified
- copy_tree to be simplified

The result is an easier to understand tree of mounts that does not
allow creation of arbitrary length hash chains in the mount hash table.

The result is also a very slight userspace visible difference in semantics.
The following two cases now behave identically, where before order
mattered:

case 1: (explicit user action)
B is a slave of A
mount something on A/a , it will propagate to B/a
and than mount something on B/a

case 2: (tucked mount)
B is a slave of A
mount something on B/a
and than mount something on A/a

Histroically umount A/a would fail in case 1 and succeed in case 2.
Now umount A/a succeeds in both configurations.

This very small change in semantics appears if anything to be a bug
fix to me and my survey of userspace leads me to believe that no programs
will notice or care of this subtle semantic change.

v2: Updated to mnt_change_mountpoint to not call dput or mntput
and instead to decrement the counts directly. It is guaranteed
that there will be other references when mnt_change_mountpoint is
called so this is safe.

v3: Moved put_mountpoint under mount_lock in attach_recursive_mnt
As the locking in fs/namespace.c changed between v2 and v3.

v4: Reworked the logic in propagate_mount_busy and __propagate_umount
that detects when a mount completely covers another mount.

v5: Removed unnecessary tests whose result is alwasy true in
find_topper and attach_recursive_mnt.

v6: Document the user space visible semantic difference.

Cc: stable@vger.kernel.org
Fixes: b90fa9ae8f51 ("[PATCH] shared mount handling: bind and rbind")
Tested-by: Andrei Vagin
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2017-02-03 19:01:06 +0800

01 Feb, 2017

1 commit

93faccbbf fs: Better permission checking for submounts ... Browse Code »

To support unprivileged users mounting filesystems two permission
checks have to be performed: a test to see if the user allowed to
create a mount in the mount namespace, and a test to see if
the user is allowed to access the specified filesystem.

The automount case is special in that mounting the original filesystem
grants permission to mount the sub-filesystems, to any user who
happens to stumble across the their mountpoint and satisfies the
ordinary filesystem permission checks.

Attempting to handle the automount case by using override_creds
almost works. It preserves the idea that permission to mount
the original filesystem is permission to mount the sub-filesystem.
Unfortunately using override_creds messes up the filesystems
ordinary permission checks.

Solve this by being explicit that a mount is a submount by introducing
vfs_submount, and using it where appropriate.

vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let
sget and friends know that a mount is a submount so they can take appropriate
action.

sget and sget_userns are modified to not perform any permission checks
on submounts.

follow_automount is modified to stop using override_creds as that
has proven problemantic.

do_mount is modified to always remove the new MS_SUBMOUNT flag so
that we know userspace will never by able to specify it.

autofs4 is modified to stop using current_real_cred that was put in
there to handle the previous version of submount permission checking.

cifs is modified to pass the mountpoint all of the way down to vfs_submount.

debugfs is modified to pass the mountpoint all of the way down to
trace_automount by adding a new parameter. To make this change easier
a new typedef debugfs_automount_t is introduced to capture the type of
the debugfs automount function.

Cc: stable@vger.kernel.org
Fixes: 069d5ac9ae0d ("autofs: Fix automounts by using current_real_cred()->uid")
Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds")
Reviewed-by: Trond Myklebust
Reviewed-by: Seth Forshee
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2017-02-01 23:36:12 +0800

10 Jan, 2017

1 commit

3895dbf89 mnt: Protect the mountpoint hashtable with mount_lock ... Browse Code »

Protecting the mountpoint hashtable with namespace_sem was sufficient
until a call to umount_mnt was added to mntput_no_expire. At which
point it became possible for multiple calls of put_mountpoint on
the same hash chain to happen on the same time.

Kristen Johansen reported:
> This can cause a panic when simultaneous callers of put_mountpoint
> attempt to free the same mountpoint. This occurs because some callers
> hold the mount_hash_lock, while others hold the namespace lock. Some
> even hold both.
>
> In this submitter's case, the panic manifested itself as a GP fault in
> put_mountpoint() when it called hlist_del() and attempted to dereference
> a m_hash.pprev that had been poisioned by another thread.

Al Viro observed that the simple fix is to switch from using the namespace_sem
to the mount_lock to protect the mountpoint hash table.

I have taken Al's suggested patch moved put_mountpoint in pivot_root
(instead of taking mount_lock an additional time), and have replaced
new_mountpoint with get_mountpoint a function that does the hash table
lookup and addition under the mount_lock. The introduction of get_mounptoint
ensures that only the mount_lock is needed to manipulate the mountpoint
hashtable.

d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
already set. This allows get_mountpoint to use the setting of
DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
happens exactly once.

Cc: stable@vger.kernel.org
Fixes: ce07d891a089 ("mnt: Honor MNT_LOCKED when detaching mounts")
Reported-by: Krister Johansen
Suggested-by: Al Viro
Acked-by: Al Viro
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2017-01-10 08:34:43 +0800

23 Dec, 2016

1 commit

faf0dcebd Merge branch 'work.namespace' into for-linus Browse Code »

Al Viro
2016-12-23 12:04:31 +0800

17 Dec, 2016

4 commits

9763f7a4a Merge branch 'work.autofs' into for-linus ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2016-12-17 05:34:52 +0800
5235d448c reorganize do_make_slave() ... Browse Code »

Make sure that clone_mnt() never returns a mount with MNT_SHARED in
flags, but without a valid ->mnt_group_id. That allows to demystify
do_make_slave() quite a bit, among other things.

Signed-off-by: Al Viro

Al Viro
2016-12-17 05:30:49 +0800
066715d3f clone_private_mount() doesn't need to touch namespace_sem ... Browse Code »

not for CL_PRIVATE clone_mnt()

Signed-off-by: Al Viro

Al Viro
2016-12-17 05:30:49 +0800
f4cc1c381 remove a bogus claim about namespace_sem being held by callers of mnt_alloc_id() ... Browse Code »

Hadn't been true for quite a while

Signed-off-by: Al Viro

Al Viro
2016-12-17 05:30:48 +0800