Eric Lee / smarc-fsl-linux-kernel

20 Jan, 2017

1 commit

1a62a0f76 mnt: Protect the mountpoint hashtable with mount_lock ... Browse Code »

commit 3895dbf8985f656675b5bde610723a29cbce3fa7 upstream.

Protecting the mountpoint hashtable with namespace_sem was sufficient
until a call to umount_mnt was added to mntput_no_expire. At which
point it became possible for multiple calls of put_mountpoint on
the same hash chain to happen on the same time.

Kristen Johansen reported:
> This can cause a panic when simultaneous callers of put_mountpoint
> attempt to free the same mountpoint. This occurs because some callers
> hold the mount_hash_lock, while others hold the namespace lock. Some
> even hold both.
>
> In this submitter's case, the panic manifested itself as a GP fault in
> put_mountpoint() when it called hlist_del() and attempted to dereference
> a m_hash.pprev that had been poisioned by another thread.

Al Viro observed that the simple fix is to switch from using the namespace_sem
to the mount_lock to protect the mountpoint hash table.

I have taken Al's suggested patch moved put_mountpoint in pivot_root
(instead of taking mount_lock an additional time), and have replaced
new_mountpoint with get_mountpoint a function that does the hash table
lookup and addition under the mount_lock. The introduction of get_mounptoint
ensures that only the mount_lock is needed to manipulate the mountpoint
hashtable.

d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
already set. This allows get_mountpoint to use the setting of
DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
happens exactly once.

Fixes: ce07d891a089 ("mnt: Honor MNT_LOCKED when detaching mounts")
Reported-by: Krister Johansen
Suggested-by: Al Viro
Acked-by: Al Viro
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2017-01-20 03:18:03 +0800

16 Oct, 2016

1 commit

9ffc66941 Merge tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux ... Browse Code »

Pull gcc plugins update from Kees Cook:
"This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot
time as possible, hoping to capitalize on any possible variation in
CPU operation (due to runtime data differences, hardware differences,
SMP ordering, thermal timing variation, cache behavior, etc).

At the very least, this plugin is a much more comprehensive example
for how to manipulate kernel code using the gcc plugin internals"

* tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
latent_entropy: Mark functions with __latent_entropy
gcc-plugins: Add latent_entropy plugin

Linus Torvalds
2016-10-16 01:03:15 +0800

11 Oct, 2016

2 commits

0766f788e latent_entropy: Mark functions with __latent_entropy ... Browse Code »

The __latent_entropy gcc attribute can be used only on functions and
variables. If it is on a function then the plugin will instrument it for
gathering control-flow entropy. If the attribute is on a variable then
the plugin will initialize it with random contents. The variable must
be an integer, an integer array type or a structure with integer fields.

These specific functions have been selected because they are init
functions (to help gather boot-time entropy), are called at unpredictable
times, or they have variable loops, each of which provide some level of
latent entropy.

Signed-off-by: Emese Revfy
[kees: expanded commit message]
Signed-off-by: Kees Cook

Emese Revfy
2016-10-11 05:51:45 +0800
abb5a14fa Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull misc vfs updates from Al Viro:
"Assorted misc bits and pieces.

There are several single-topic branches left after this (rename2
series from Miklos, current_time series from Deepa Dinamani, xattr
series from Andreas, uaccess stuff from from me) and I'd prefer to
send those separately"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
proc: switch auxv to use of __mem_open()
hpfs: support FIEMAP
cifs: get rid of unused arguments of CIFSSMBWrite()
posix_acl: uapi header split
posix_acl: xattr representation cleanups
fs/aio.c: eliminate redundant loads in put_aio_ring_file
fs/internal.h: add const to ns_dentry_operations declaration
compat: remove compat_printk()
fs/buffer.c: make __getblk_slow() static
proc: unsigned file descriptors
fs/file: more unsigned file descriptors
fs: compat: remove redundant check of nr_segs
cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
cifs: don't use memcpy() to copy struct iov_iter
get rid of separate multipage fault-in primitives
fs: Avoid premature clearing of capabilities
fs: Give dentry to inode_change_ok() instead of inode
fuse: Propagate dentry down to inode_change_ok()
ceph: Propagate dentry down to inode_change_ok()
xfs: Propagate dentry down to inode_change_ok()
...

Linus Torvalds
2016-10-11 04:04:49 +0800

01 Oct, 2016

1 commit

d29216842 mnt: Add a per mount namespace limit on the number of mounts ... Browse Code »

CAI Qian pointed out that the semantics
of shared subtrees make it possible to create an exponentially
increasing number of mounts in a mount namespace.

mkdir /tmp/1 /tmp/2
mount --make-rshared /
for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

Will create create 2^20 or 1048576 mounts, which is a practical problem
as some people have managed to hit this by accident.

As such CVE-2016-6213 was assigned.

Ian Kent described the situation for autofs users
as follows:

> The number of mounts for direct mount maps is usually not very large because of
> the way they are implemented, large direct mount maps can have performance
> problems. There can be anywhere from a few (likely case a few hundred) to less
> than 10000, plus mounts that have been triggered and not yet expired.
>
> Indirect mounts have one autofs mount at the root plus the number of mounts that
> have been triggered and not yet expired.
>
> The number of autofs indirect map entries can range from a few to the common
> case of several thousand and in rare cases up to between 30000 and 50000. I've
> not heard of people with maps larger than 50000 entries.
>
> The larger the number of map entries the greater the possibility for a large
> number of active mounts so it's not hard to expect cases of a 1000 or somewhat
> more active mounts.

So I am setting the default number of mounts allowed per mount
namespace at 100,000. This is more than enough for any use case I
know of, but small enough to quickly stop an exponential increase
in mounts. Which should be perfect to catch misconfigurations and
malfunctioning programs.

For anyone who needs a higher limit this can be changed by writing
to the new /proc/sys/fs/mount-max sysctl.

Tested-by: CAI Qian
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-10-01 01:46:48 +0800

23 Sep, 2016

3 commits

787255966 Merge branch 'nsfs-ioctls' into HEAD ... Browse Code »

From: Andrey Vagin

Each namespace has an owning user namespace and now there is not way
to discover these relationships.

Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships too.

Why we may want to know relationships between namespaces?

One use would be visualization, in order to understand the running
system. Another would be to answer the question: what capability does
process X have to perform operations on a resource governed by namespace
Y?

One more use-case (which usually called abnormal) is checkpoint/restart.
In CRIU we are going to dump and restore nested namespaces.

There [1] was a discussion about which interface to choose to determing
relationships between namespaces.

Eric suggested to add two ioctl-s [2]:
> Grumble, Grumble. I think this may actually a case for creating ioctls
> for these two cases. Now that random nsfs file descriptors are bind
> mountable the original reason for using proc files is not as pressing.
>
> One ioctl for the user namespace that owns a file descriptor.
> One ioctl for the parent namespace of a namespace file descriptor.

Here is an implementaions of these ioctl-s.

$ man man7/namespaces.7
...
Since Linux 4.X, the following ioctl(2) calls are supported for
namespace file descriptors. The correct syntax is:

fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

NS_GET_USERNS
Returns a file descriptor that refers to an owning user names‐
pace.

NS_GET_PARENT
Returns a file descriptor that refers to a parent namespace.
This ioctl(2) can be used for pid and user namespaces. For
user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
meaning.

In addition to generic ioctl(2) errors, the following specific ones
can occur:

EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

EPERM The requested namespace is outside of the current namespace
scope.

[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101

Changes for v2:
* don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
outside of the init namespace, so we can return EPERM in this case too.
> The fewer special cases the easier the code is to get
> correct, and the easier it is to read. // Eric

Changes for v3:
* rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.

Cc: "Eric W. Biederman"
Cc: James Bottomley
Cc: "Michael Kerrisk (man-pages)"
Cc: "W. Trevor King"
Cc: Alexander Viro
Cc: Serge Hallyn

Eric W. Biederman
2016-09-23 09:00:36 +0800
bcac25a58 kernel: add a helper to get an owning user namespace for a namespace ... Browse Code »

Return -EPERM if an owning user namespace is outside of a process
current user namespace.

v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
This special cases was removed from this version. There is nothing
outside of init_user_ns, so we can return EPERM.
v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.

Acked-by: Serge Hallyn
Signed-off-by: Andrei Vagin
Signed-off-by: Eric W. Biederman

Andrey Vagin
2016-09-23 08:59:39 +0800
df75e7748 userns: When the per user per user namespace limit is reached return ENOSPC ... Browse Code »

The current error codes returned when a the per user per user
namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong. I
asked for advice on linux-api and it we made clear that those were
the wrong error code, but a correct effor code was not suggested.

The best general error code I have found for hitting a resource limit
is ENOSPC. It is not perfect but as it is unambiguous it will serve
until someone comes up with a better error code.

Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-09-23 02:25:56 +0800

16 Sep, 2016

1 commit

c568d6834 locks: fix file locking on overlayfs ... Browse Code »

This patch allows flock, posix locks, ofd locks and leases to work
correctly on overlayfs.

Instead of using the underlying inode for storing lock context use the
overlay inode. This allows locks to be persistent across copy-up.

This is done by introducing locks_inode() helper and using it instead of
file_inode() to get the inode in locking code. For non-overlayfs the two
are equivalent, except for an extra pointer dereference in locks_inode().

Since lock operations are in "struct file_operations" we must also make
sure not to call underlying filesystem's lock operations. Introcude a
super block flag MS_NOREMOTELOCK to this effect.

Signed-off-by: Miklos Szeredi
Acked-by: Jeff Layton
Cc: "J. Bruce Fields"

Miklos Szeredi
2016-09-16 18:44:20 +0800

31 Aug, 2016

1 commit

537f7ccb3 mntns: Add a limit on the number of mount namespaces. ... Browse Code »

v2: Fixed the very obvious lack of setting ucounts
on struct mnt_ns reported by Andrei Vagin, and the kbuild
test report.

Reported-by: Andrei Vagin
Acked-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-08-31 20:28:35 +0800

30 Jul, 2016

1 commit

a867d7349 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull userns vfs updates from Eric Biederman:
"This tree contains some very long awaited work on generalizing the
user namespace support for mounting filesystems to include filesystems
with a backing store. The real world target is fuse but the goal is
to update the vfs to allow any filesystem to be supported. This
patchset is based on a lot of code review and testing to approach that
goal.

While looking at what is needed to support the fuse filesystem it
became clear that there were things like xattrs for security modules
that needed special treatment. That the resolution of those concerns
would not be fuse specific. That sorting out these general issues
made most sense at the generic level, where the right people could be
drawn into the conversation, and the issues could be solved for
everyone.

At a high level what this patchset does a couple of simple things:

- Add a user namespace owner (s_user_ns) to struct super_block.

- Teach the vfs to handle filesystem uids and gids not mapping into
to kuids and kgids and being reported as INVALID_UID and
INVALID_GID in vfs data structures.

By assigning a user namespace owner filesystems that are mounted with
only user namespace privilege can be detected. This allows security
modules and the like to know which mounts may not be trusted. This
also allows the set of uids and gids that are communicated to the
filesystem to be capped at the set of kuids and kgids that are in the
owning user namespace of the filesystem.

One of the crazier corner casees this handles is the case of inodes
whose i_uid or i_gid are not mapped into the vfs. Most of the code
simply doesn't care but it is easy to confuse the inode writeback path
so no operation that could cause an inode write-back is permitted for
such inodes (aka only reads are allowed).

This set of changes starts out by cleaning up the code paths involved
in user namespace permirted mounts. Then when things are clean enough
adds code that cleanly sets s_user_ns. Then additional restrictions
are added that are possible now that the filesystem superblock
contains owner information.

These changes should not affect anyone in practice, but there are some
parts of these restrictions that are changes in behavior.

- Andy's restriction on suid executables that does not honor the
suid bit when the path is from another mount namespace (think
/proc/[pid]/fd/) or when the filesystem was mounted by a less
privileged user.

- The replacement of the user namespace implicit setting of MNT_NODEV
with implicitly setting SB_I_NODEV on the filesystem superblock
instead.

Using SB_I_NODEV is a stronger form that happens to make this state
user invisible. The user visibility can be managed but it caused
problems when it was introduced from applications reasonably
expecting mount flags to be what they were set to.

There is a little bit of work remaining before it is safe to support
mounting filesystems with backing store in user namespaces, beyond
what is in this set of changes.

- Verifying the mounter has permission to read/write the block device
during mount.

- Teaching the integrity modules IMA and EVM to handle filesystems
mounted with only user namespace root and to reduce trust in their
security xattrs accordingly.

- Capturing the mounters credentials and using that for permission
checks in d_automount and the like. (Given that overlayfs already
does this, and we need the work in d_automount it make sense to
generalize this case).

Furthermore there are a few changes that are on the wishlist:

- Get all filesystems supporting posix acls using the generic posix
acls so that posix_acl_fix_xattr_from_user and
posix_acl_fix_xattr_to_user may be removed. [Maintainability]

- Reducing the permission checks in places such as remount to allow
the superblock owner to perform them.

- Allowing the superblock owner to chown files with unmapped uids and
gids to something that is mapped so the files may be treated
normally.

I am not considering even obvious relaxations of permission checks
until it is clear there are no more corner cases that need to be
locked down and handled generically.

Many thanks to Seth Forshee who kept this code alive, and putting up
with me rewriting substantial portions of what he did to handle more
corner cases, and for his diligent testing and reviewing of my
changes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
fs: Call d_automount with the filesystems creds
fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
evm: Translate user/group ids relative to s_user_ns when computing HMAC
dquot: For now explicitly don't support filesystems outside of init_user_ns
quota: Handle quota data stored in s_user_ns in quota_setxquota
quota: Ensure qids map to the filesystem
vfs: Don't create inodes with a uid or gid unknown to the vfs
vfs: Don't modify inodes with a uid or gid unknown to the vfs
cred: Reject inodes with invalid ids in set_create_file_as()
fs: Check for invalid i_uid in may_follow_link()
vfs: Verify acls are valid within superblock's s_user_ns.
userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
fs: Refuse uid/gid changes which don't map into s_user_ns
selinux: Add support for unprivileged mounts from user namespaces
Smack: Handle labels consistently in untrusted mounts
Smack: Add support for unprivileged mounts from user namespaces
fs: Treat foreign mounts as nosuid
fs: Limit file caps to the user namespace of the super block
userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
userns: Remove implicit MNT_NODEV fragility.
...

Linus Torvalds
2016-07-30 06:54:19 +0800

02 Jul, 2016

1 commit

48c4565ed Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs fixes from Al Viro:
"Tmpfs readdir throughput regression fix (this cycle) + some -stable
fodder all over the place.

One missing bit is Miklos' tonight locks.c fix - NFS folks had already
grabbed that one by the time I woke up ;-)"

[ The locks.c fix came through the nfsd tree just moments ago ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
namespace: update event counter when umounting a deleted dentry
9p: use file_dentry()
ceph: fix d_obtain_alias() misuses
lockless next_positive()
libfs.c: new helper - next_positive()
dcache_{readdir,dir_lseek}(): don't bother with nested ->d_lock

Linus Torvalds
2016-07-02 06:20:11 +0800

01 Jul, 2016

1 commit

e06b933e6 namespace: update event counter when umounting a deleted dentry ... Browse Code »

- m_start() in fs/namespace.c expects that ns->event is incremented each
time a mount added or removed from ns->list.
- umount_tree() removes items from the list but does not increment event
counter, expecting that it's done before the function is called.
- There are some codepaths that call umount_tree() without updating
"event" counter. e.g. from __detach_mounts().
- When this happens m_start may reuse a cached mount structure that no
longer belongs to ns->list (i.e. use after free which usually leads
to infinite loop).

This change fixes the above problem by incrementing global event counter
before invoking umount_tree().

Change-Id: I622c8e84dcb9fb63542372c5dbf0178ee86bb589
Cc: stable@vger.kernel.org
Signed-off-by: Andrey Ulanov
Signed-off-by: Al Viro

Andrey Ulanov
2016-07-01 11:28:30 +0800

24 Jun, 2016

5 commits

380cf5ba6 fs: Treat foreign mounts as nosuid ... Browse Code »

If a process gets access to a mount from a different user
namespace, that process should not be able to take advantage of
setuid files or selinux entrypoints from that filesystem. Prevent
this by treating mounts from other mount namespaces and those not
owned by current_user_ns() or an ancestor as nosuid.

This will make it safer to allow more complex filesystems to be
mounted in non-root user namespaces.

This does not remove the need for MNT_LOCK_NOSUID. The setuid,
setgid, and file capability bits can no longer be abused if code in
a user namespace were to clear nosuid on an untrusted filesystem,
but this patch, by itself, is insufficient to protect the system
from abuse of files that, when execed, would increase MAC privilege.

As a more concrete explanation, any task that can manipulate a
vfsmount associated with a given user namespace already has
capabilities in that namespace and all of its descendents. If they
can cause a malicious setuid, setgid, or file-caps executable to
appear in that mount, then that executable will only allow them to
elevate privileges in exactly the set of namespaces in which they
are already privileges.

On the other hand, if they can cause a malicious executable to
appear with a dangerous MAC label, running it could change the
caller's security context in a way that should not have been
possible, even inside the namespace in which the task is confined.

As a hardening measure, this would have made CVE-2014-5207 much
more difficult to exploit.

Signed-off-by: Andy Lutomirski
Signed-off-by: Seth Forshee
Acked-by: James Morris
Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman

Andy Lutomirski
2016-06-24 23:40:41 +0800
67690f937 userns: Remove implicit MNT_NODEV fragility. ... Browse Code »

Replace the implict setting of MNT_NODEV on mounts that happen with
just user namespace permissions with an implicit setting of SB_I_NODEV
in s_iflags. The visibility of the implicit MNT_NODEV has caused
problems in the past.

With this change the fragile case where an implicit MNT_NODEV needs to
be preserved in do_remount is removed. Using SB_I_NODEV is much less
fragile as s_iflags are set during the original mount and never
changed.

In do_new_mount with the implicit setting of MNT_NODEV gone, the only
code that can affect mnt_flags is fs_fully_visible so simplify the if
statement and reduce the indentation of the code to make that clear.

Acked-by: Seth Forshee
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-06-24 04:47:23 +0800
a1935c173 mnt: Simplify mount_too_revealing ... Browse Code »

Verify all filesystems that we check in mount_too_revealing set
SB_I_NOEXEC and SB_I_NODEV in sb->s_iflags. That is true for today
and it should remain true in the future.

Remove the now unnecessary checks from mnt_already_visibile that
ensure MNT_LOCK_NOSUID, MNT_LOCK_NOEXEC, and MNT_LOCK_NODEV are
preserved. Making the code shorter and easier to read.

Relying on SB_I_NOEXEC and SB_I_NODEV instead of the user visible
MNT_NOSUID, MNT_NOEXEC, and MNT_NODEV ensures the many current
systems where proc and sysfs are mounted with "nosuid, nodev, noexec"
and several slightly buggy container applications don't bother to
set those flags continue to work.

Acked-by: Seth Forshee
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-06-24 04:41:57 +0800
a001e74ce mnt: Move the FS_USERNS_MOUNT check into sget_userns ... Browse Code »

Allowing a filesystem to be mounted by other than root in the initial
user namespace is a filesystem property not a mount namespace property
and as such should be checked in filesystem specific code. Move the
FS_USERNS_MOUNT test into super.c:sget_userns().

Acked-by: Seth Forshee
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-06-24 04:41:55 +0800
8654df4e2 mnt: Refactor fs_fully_visible into mount_too_revealing ... Browse Code »

Replace the call of fs_fully_visible in do_new_mount from before the
new superblock is allocated with a call of mount_too_revealing after
the superblock is allocated. This winds up being a much better location
for maintainability of the code.

The first change this enables is the replacement of FS_USERNS_VISIBLE
with SB_I_USERNS_VISIBLE. Moving the flag from struct filesystem_type
to sb_iflags on the superblock.

Unfortunately mount_too_revealing fundamentally needs to touch
mnt_flags adding several MNT_LOCKED_XXX flags at the appropriate
times. If the mnt_flags did not need to be touched the code
could be easily moved into the filesystem specific mount code.

Acked-by: Seth Forshee
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-06-24 04:41:46 +0800

15 Jun, 2016

1 commit

695e9df01 mnt: Account for MS_RDONLY in fs_fully_visible ... Browse Code »

In rare cases it is possible for s_flags & MS_RDONLY to be set but
MNT_READONLY to be clear. This starting combination can cause
fs_fully_visible to fail to ensure that the new mount is readonly.
Therefore force MNT_LOCK_READONLY in the new mount if MS_RDONLY
is set on the source filesystem of the mount.

In general both MS_RDONLY and MNT_READONLY are set at the same for
mounts so I don't expect any programs to care. Nor do I expect
MS_RDONLY to be set on proc or sysfs in the initial user namespace,
which further decreases the likelyhood of problems.

Which means this change should only affect system configurations by
paranoid sysadmins who should welcome the additional protection
as it keeps people from wriggling out of their policies.

Cc: stable@vger.kernel.org
Fixes: 8c6cf9cc829f ("mnt: Modify fs_fully_visible to deal with locked ro nodev and atime")
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-06-15 19:52:23 +0800

07 Jun, 2016

2 commits

d71ed6c93 mnt: fs_fully_visible test the proper mount for MNT_LOCKED ... Browse Code »

MNT_LOCKED implies on a child mount implies the child is locked to the
parent. So while looping through the children the children should be
tested (not their parent).

Typically an unshare of a mount namespace locks all mounts together
making both the parent and the slave as locked but there are a few
corner cases where other things work.

Cc: stable@vger.kernel.org
Fixes: ceeb0e5d39fc ("vfs: Ignore unlocked mounts in fs_fully_visible")
Reported-by: Seth Forshee
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-06-07 09:52:03 +0800
97c1df3e5 mnt: If fs_fully_visible fails call put_filesystem. ... Browse Code »

Add this trivial missing error handling.

Cc: stable@vger.kernel.org
Fixes: 1b852bceb0d1 ("mnt: Refactor the logic for mounting sysfs and proc in a user namespace")
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-06-07 09:48:31 +0800

23 Jan, 2016

1 commit

5955102c9 wrappers for ->i_mutex access ... Browse Code »

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro

Al Viro
2016-01-23 07:04:28 +0800

13 Jan, 2016

1 commit

33caf82ac Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull misc vfs updates from Al Viro:
"All kinds of stuff. That probably should've been 5 or 6 separate
branches, but by the time I'd realized how large and mixed that bag
had become it had been too close to -final to play with rebasing.

Some fs/namei.c cleanups there, memdup_user_nul() introduction and
switching open-coded instances, burying long-dead code, whack-a-mole
of various kinds, several new helpers for ->llseek(), assorted
cleanups and fixes from various people, etc.

One piece probably deserves special mention - Neil's
lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets
called without ->i_mutex and tries to avoid ever taking it. That, of
course, means that it's not useful for any directory modifications,
but things like getting inode attributes in nfds readdirplus are fine
with that. I really should've asked for moratorium on lookup-related
changes this cycle, but since I hadn't done that early enough... I
*am* asking for that for the coming cycle, though - I'm going to try
and get conversion of i_mutex to rwsem with ->lookup() done under lock
taken shared.

There will be a patch closer to the end of the window, along the lines
of the one Linus had posted last May - mechanical conversion of
->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
inode_is_locked()/inode_lock_nested(). To quote Linus back then:

-----
| This is an automated patch using
|
| sed 's/mutex_lock(&$.*$->i_mutex)/inode_lock(\1)/'
| sed 's/mutex_unlock(&$.*$->i_mutex)/inode_unlock(\1)/'
| sed 's/mutex_lock_nested(&$.*$->i_mutex,[ ]*I_MUTEX_$[A-Z0-9_]*$)/inode_lock_nested(\1, I_MUTEX_\2)/'
| sed 's/mutex_is_locked(&$.*$->i_mutex)/inode_is_locked(\1)/'
| sed 's/mutex_trylock(&$.*$->i_mutex)/inode_trylock(\1)/'
|
| with a very few manual fixups
-----

I'm going to send that once the ->i_mutex-affecting stuff in -next
gets mostly merged (or when Linus says he's about to stop taking
merges)"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
nfsd: don't hold i_mutex over userspace upcalls
fs:affs:Replace time_t with time64_t
fs/9p: use fscache mutex rather than spinlock
proc: add a reschedule point in proc_readfd_common()
logfs: constify logfs_block_ops structures
fcntl: allow to set O_DIRECT flag on pipe
fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
fs: xattr: Use kvfree()
[s390] page_to_phys() always returns a multiple of PAGE_SIZE
nbd: use ->compat_ioctl()
fs: use block_device name vsprintf helper
lib/vsprintf: add %*pg format specifier
fs: use gendisk->disk_name where possible
poll: plug an unused argument to do_poll
amdkfd: don't open-code memdup_user()
cdrom: don't open-code memdup_user()
rsxx: don't open-code memdup_user()
mtip32xx: don't open-code memdup_user()
[um] mconsole: don't open-code memdup_user_nul()
[um] hostaudio: don't open-code memdup_user()
...

Linus Torvalds
2016-01-13 09:11:47 +0800

04 Jan, 2016

1 commit

b40ef8696 saner calling conventions for copy_mount_options() ... Browse Code »

let it just return NULL, pointer to kernel copy or ERR_PTR().

Signed-off-by: Al Viro

Al Viro
2016-01-04 23:28:32 +0800

07 Dec, 2015

1 commit

25ab4c9b1 fs/namespace.c: path_is_under can be boolean ... Browse Code »

This patch makes path_is_under return bool to improve
readability due to this particular function only using either
one or zero as its return value.

No functional change.

Signed-off-by: Yaowei Bai
Signed-off-by: Al Viro

Yaowei Bai
2015-12-07 10:17:13 +0800

16 Nov, 2015

2 commits

95ace7541 locks: Don't allow mounts in user namespaces to enable mandatory locking ... Browse Code »

Since no one uses mandatory locking and files with mandatory locks can
cause problems don't allow them in user namespaces.

Signed-off-by: "Eric W. Biederman"
Signed-off-by: Jeff Layton

Eric W. Biederman
2015-11-16 23:01:34 +0800
9e8925b67 locks: Allow disabling mandatory locking at compile time ... Browse Code »

Mandatory locking appears to be almost unused and buggy and there
appears no real interest in doing anything with it. Since effectively
no one uses the code and since the code is buggy let's allow it to be
disabled at compile time. I would just suggest removing the code but
undoubtedly that will break some piece of userspace code somewhere.

For the distributions that don't care about this piece of code
this gives a nice starting point to make mandatory locking go away.

Cc: Benjamin Coddington
Cc: Dmitry Vyukov
Cc: Jeff Layton
Cc: J. Bruce Fields
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Jeff Layton

Jeff Layton
2015-11-16 22:49:34 +0800

02 Sep, 2015

1 commit

73b6fa8e4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull user namespace updates from Eric Biederman:
"This finishes up the changes to ensure proc and sysfs do not start
implementing executable files, as the there are application today that
are only secure because such files do not exist.

It akso fixes a long standing misfeature of /proc//mountinfo that
did not show the proper source for files bind mounted from
/proc//ns/*.

It also straightens out the handling of clone flags related to user
namespaces, fixing an unnecessary failure of unshare(CLONE_NEWUSER)
when files such as /proc//environ are read while is calling
unshare. This winds up fixing a minor bug in unshare flag handling
that dates back to the first version of unshare in the kernel.

Finally, this fixes a minor regression caused by the introduction of
sysfs_create_mount_point, which broke someone's in house application,
by restoring the size of /sys/fs/cgroup to 0 bytes. Apparently that
application uses the directory size to determine if a tmpfs is mounted
on /sys/fs/cgroup.

The bind mount escape fixes are present in Al Viros for-next branch.
and I expect them to come from there. The bind mount escape is the
last of the user namespace related security bugs that I am aware of"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
fs: Set the size of empty dirs to 0.
userns,pidns: Force thread group sharing, not signal handler sharing.
unshare: Unsharing a thread does not require unsharing a vm
nsfs: Add a show_path method to fix mountinfo
mnt: fs_fully_visible enforce noexec and nosuid if !SB_I_NOEXEC
vfs: Commit to never having exectuables on proc and sysfs.

Linus Torvalds
2015-09-02 07:13:25 +0800

24 Jul, 2015

1 commit

fe78fcc85 mnt: In detach_mounts detach the appropriate unmounted mount ... Browse Code »

The handling of in detach_mounts of unmounted but connected mounts is
buggy and can lead to an infinite loop.

Correct the handling of unmounted mounts in detach_mount. When the
mountpoint of an unmounted but connected mount is connected to a
dentry, and that dentry is deleted we need to disconnect that mount
from the parent mount and the deleted dentry.

Nothing changes for the unmounted and connected children. They can be
safely ignored.

Cc: stable@vger.kernel.org
Fixes: ce07d891a0891d3c0d0c2d73d577490486b809e1 mnt: Honor MNT_LOCKED when detaching mounts
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2015-07-24 00:31:15 +0800

23 Jul, 2015

1 commit

f2d0a123b mnt: Clarify and correct the disconnect logic in umount_tree ... Browse Code »

rmdir mntpoint will result in an infinite loop when there is
a mount locked on the mountpoint in another mount namespace.

This is because the logic to test to see if a mount should
be disconnected in umount_tree is buggy.

Move the logic to decide if a mount should remain connected to
it's mountpoint into it's own function disconnect_mount so that
clarity of expression instead of terseness of expression becomes
a virtue.

When the conditions where it is invalid to leave a mount connected
are first ruled out, the logic for deciding if a mount should
be disconnected becomes much clearer and simpler.

Fixes: e0c9c0afd2fc958ffa34b697972721d81df8a56f mnt: Update detach_mounts to leave mounts connected
Fixes: ce07d891a0891d3c0d0c2d73d577490486b809e1 mnt: Honor MNT_LOCKED when detaching mounts
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2015-07-23 09:33:27 +0800

10 Jul, 2015

1 commit

77b1a97d2 mnt: fs_fully_visible enforce noexec and nosuid if !SB_I_NOEXEC ... Browse Code »

The filesystems proc and sysfs do not have executable files do not
have exectuable files today and portions of userspace break if we do
enforce nosuid and noexec consistency of nosuid and noexec flags
between previous mounts and new mounts of proc and sysfs.

Add the code to enforce consistency of the nosuid and noexec flags,
and use the presence of SB_I_NOEXEC to signal that there is no need to
bother.

This results in a completely userspace invisible change that makes it
clear fs_fully_visible can only skip the enforcement of noexec and
nosuid because it is known the filesystems in question do not support
executables.

Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2015-07-10 23:41:13 +0800

04 Jul, 2015

1 commit

0cbee9926 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull user namespace updates from Eric Biederman:
"Long ago and far away when user namespaces where young it was realized
that allowing fresh mounts of proc and sysfs with only user namespace
permissions could violate the basic rule that only root gets to decide
if proc or sysfs should be mounted at all.

Some hacks were put in place to reduce the worst of the damage could
be done, and the common sense rule was adopted that fresh mounts of
proc and sysfs should allow no more than bind mounts of proc and
sysfs. Unfortunately that rule has not been fully enforced.

There are two kinds of gaps in that enforcement. Only filesystems
mounted on empty directories of proc and sysfs should be ignored but
the test for empty directories was insufficient. So in my tree
directories on proc, sysctl and sysfs that will always be empty are
created specially. Every other technique is imperfect as an ordinary
directory can have entries added even after a readdir returns and
shows that the directory is empty. Special creation of directories
for mount points makes the code in the kernel a smidge clearer about
it's purpose. I asked container developers from the various container
projects to help test this and no holes were found in the set of mount
points on proc and sysfs that are created specially.

This set of changes also starts enforcing the mount flags of fresh
mounts of proc and sysfs are consistent with the existing mount of
proc and sysfs. I expected this to be the boring part of the work but
unfortunately unprivileged userspace winds up mounting fresh copies of
proc and sysfs with noexec and nosuid clear when root set those flags
on the previous mount of proc and sysfs. So for now only the atime,
read-only and nodev attributes which userspace happens to keep
consistent are enforced. Dealing with the noexec and nosuid
attributes remains for another time.

This set of changes also addresses an issue with how open file
descriptors from /proc//ns/* are displayed. Recently readlink of
/proc//fd has been triggering a WARN_ON that has not been
meaningful since it was added (as all of the code in the kernel was
converted) and is not now actively wrong.

There is also a short list of issues that have not been fixed yet that
I will mention briefly.

It is possible to rename a directory from below to above a bind mount.
At which point any directory pointers below the renamed directory can
be walked up to the root directory of the filesystem. With user
namespaces enabled a bind mount of the bind mount can be created
allowing the user to pick a directory whose children they can rename
to outside of the bind mount. This is challenging to fix and doubly
so because all obvious solutions must touch code that is in the
performance part of pathname resolution.

As mentioned above there is also a question of how to ensure that
developers by accident or with purpose do not introduce exectuable
files on sysfs and proc and in doing so introduce security regressions
in the current userspace that will not be immediately obvious and as
such are likely to require breaking userspace in painful ways once
they are recognized"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
vfs: Remove incorrect debugging WARN in prepend_path
mnt: Update fs_fully_visible to test for permanently empty directories
sysfs: Create mountpoints with sysfs_create_mount_point
sysfs: Add support for permanently empty directories to serve as mount points.
kernfs: Add support for always empty directories.
proc: Allow creating permanently empty directories that serve as mount points
sysctl: Allow creating permanently empty directories that serve as mountpoints.
fs: Add helper functions for permanently empty directories.
vfs: Ignore unlocked mounts in fs_fully_visible
mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
mnt: Refactor the logic for mounting sysfs and proc in a user namespace

Linus Torvalds
2015-07-04 06:20:57 +0800

01 Jul, 2015

3 commits

7236c85e1 mnt: Update fs_fully_visible to test for permanently empty directories ... Browse Code »

fs_fully_visible attempts to make fresh mounts of proc and sysfs give
the mounter no more access to proc and sysfs than if they could have
by creating a bind mount. One aspect of proc and sysfs that makes
this particularly tricky is that there are other filesystems that
typically mount on top of proc and sysfs. As those filesystems are
mounted on empty directories in practice it is safe to ignore them.
However testing to ensure filesystems are mounted on empty directories
has not been something the in kernel data structures have supported so
the current test for an empty directory which checks to see
if nlink i_mode) && i_nlink

Eric W. Biederman
2015-07-01 23:36:49 +0800
ceeb0e5d3 vfs: Ignore unlocked mounts in fs_fully_visible ... Browse Code »

Limit the mounts fs_fully_visible considers to locked mounts.
Unlocked can always be unmounted so considering them adds hassle
but no security benefit.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2015-07-01 23:36:35 +0800
ede1bf0dc fs: use seq_open_private() for proc_mounts ... Browse Code »

A patchset to remove support for passing pre-allocated struct seq_file to
seq_open(). Such feature is undocumented and prone to error.

In particular, if seq_release() is used in release handler, it will
kfree() a pointer which was not allocated by seq_open().

So this patchset drops support for pre-allocated struct seq_file: it's
only of use in proc_namespace.c and can be easily replaced by using
seq_open_private()/seq_release_private().

Additionally, it documents the use of file->private_data to hold pointer
to struct seq_file by seq_open().

This patch (of 3):

Since patch described below, from v2.6.15-rc1, seq_open() could use a
struct seq_file already allocated by the caller if the pointer to the
structure is stored in file->private_data before calling the function.

Commit 1abe77b0fc4b485927f1f798ae81a752677e1d05
Author: Al Viro
Date: Mon Nov 7 17:15:34 2005 -0500

[PATCH] allow callers of seq_open do allocation themselves

Allow caller of seq_open() to kmalloc() seq_file + whatever else they
want and set ->private_data to it. seq_open() will then abstain from
doing allocation itself.

Such behavior is only used by mounts_open_common().

In order to drop support for such uncommon feature, proc_mounts is
converted to use seq_open_private(), which take care of allocating the
proc_mounts structure, making it available through ->private in struct
seq_file.

Conversely, proc_mounts is converted to use seq_release_private(), in
order to release the private structure allocated by seq_open_private().

Then, ->private is used directly instead of proc_mounts() macro to access
to the proc_mounts structure.

Link: http://lkml.kernel.org/r/cover.1433193673.git.ydroneaud@opteya.com
Signed-off-by: Yann Droneaud
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yann Droneaud
2015-07-01 10:44:56 +0800

04 Jun, 2015

1 commit

8c6cf9cc8 mnt: Modify fs_fully_visible to deal with locked ro nodev and atime ... Browse Code »

Ignore an existing mount if the locked readonly, nodev or atime
attributes are less permissive than the desired attributes
of the new mount.

On success ensure the new mount locks all of the same readonly, nodev and
atime attributes as the old mount.

The nosuid and noexec attributes are not checked here as this change
is destined for stable and enforcing those attributes causes a
regression in lxc and libvirt-lxc where those applications will not
start and there are no known executables on sysfs or proc and no known
way to create exectuables without code modifications

Cc: stable@vger.kernel.org
Fixes: e51db73532955 ("userns: Better restrictions on when proc and sysfs can be mounted")
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2015-06-04 23:29:25 +0800

14 May, 2015

1 commit

1b852bceb mnt: Refactor the logic for mounting sysfs and proc in a user namespace ... Browse Code »

Fresh mounts of proc and sysfs are a very special case that works very
much like a bind mount. Unfortunately the current structure can not
preserve the MNT_LOCK... mount flags. Therefore refactor the logic
into a form that can be modified to preserve those lock bits.

Add a new filesystem flag FS_USERNS_VISIBLE that requires some mount
of the filesystem be fully visible in the current mount namespace,
before the filesystem may be mounted.

Move the logic for calling fs_fully_visible from proc and sysfs into
fs/namespace.c where it has greater access to mount namespace state.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2015-05-14 10:44:11 +0800

11 May, 2015

1 commit

294d71ff2 new helper: __legitimize_mnt() ... Browse Code »

same as legitimize_mnt(), except that it does *not* drop and regain
rcu_read_lock; return values are
0 => grabbed a reference, we are fine
1 => failed, just go away
-1 => failed, go away and mntput(bastard) when outside of rcu_read_lock

Signed-off-by: Al Viro

Al Viro
2015-05-11 20:13:14 +0800

10 May, 2015

1 commit

7e96c1b0e mnt: Fix fs_fully_visible to verify the root directory is visible ... Browse Code »

This fixes a dumb bug in fs_fully_visible that allows proc or sys to
be mounted if there is a bind mount of part of /proc/ or /sys/ visible.

Cc: stable@vger.kernel.org
Reported-by: Eric Windisch
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2015-05-10 00:55:50 +0800

10 Apr, 2015

1 commit

e0c9c0afd mnt: Update detach_mounts to leave mounts connected ... Browse Code »

Now that it is possible to lazily unmount an entire mount tree and
leave the individual mounts connected to each other add a new flag
UMOUNT_CONNECTED to umount_tree to force this behavior and use
this flag in detach_mounts.

This closes a bug where the deletion of a file or directory could
trigger an unmount and reveal data under a mount point.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2015-04-10 00:39:57 +0800