20 Jan, 2017

1 commit

  • commit 3895dbf8985f656675b5bde610723a29cbce3fa7 upstream.

    Protecting the mountpoint hashtable with namespace_sem was sufficient
    until a call to umount_mnt was added to mntput_no_expire, at which
    point it became possible for multiple calls of put_mountpoint on
    the same hash chain to happen at the same time.

    Krister Johansen reported:
    > This can cause a panic when simultaneous callers of put_mountpoint
    > attempt to free the same mountpoint. This occurs because some callers
    > hold the mount_hash_lock, while others hold the namespace lock. Some
    > even hold both.
    >
    > In this submitter's case, the panic manifested itself as a GP fault in
    > put_mountpoint() when it called hlist_del() and attempted to dereference
    > a m_hash.pprev that had been poisoned by another thread.

    Al Viro observed that the simple fix is to switch from using the namespace_sem
    to the mount_lock to protect the mountpoint hash table.

    I have taken Al's suggested patch, moved put_mountpoint in pivot_root
    (instead of taking mount_lock an additional time), and replaced
    new_mountpoint with get_mountpoint, a function that does the hash table
    lookup and addition under the mount_lock. The introduction of get_mountpoint
    ensures that only the mount_lock is needed to manipulate the mountpoint
    hashtable.

    d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
    already set. This allows get_mountpoint to use the setting of
    DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
    happens exactly once.
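
    The lookup-or-add pattern described above can be sketched in user
    space. This is a minimal illustration, not the kernel code: struct mp,
    get_mp, put_mp and table_lock are hypothetical names, a pthread mutex
    stands in for mount_lock, and the DCACHE_MOUNTED handling is left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 16

/* Hypothetical stand-in for struct mountpoint: one refcounted entry per key. */
struct mp {
    struct mp *next;
    const void *key;    /* plays the role of the dentry */
    int count;
};

static struct mp *buckets[NBUCKETS];
/* Plays the role of mount_lock: the only lock guarding the table. */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned int bucket_of(const void *key)
{
    return (unsigned int)(((uintptr_t)key >> 4) % NBUCKETS);
}

/* Find the entry for key or add one; both steps share one critical section. */
struct mp *get_mp(const void *key)
{
    unsigned int b = bucket_of(key);
    struct mp *m;

    pthread_mutex_lock(&table_lock);
    for (m = buckets[b]; m; m = m->next)
        if (m->key == key)
            break;
    if (!m) {
        m = malloc(sizeof(*m));
        if (m) {
            m->key = key;
            m->count = 0;
            m->next = buckets[b];
            buckets[b] = m;
        }
    }
    if (m)
        m->count++;
    pthread_mutex_unlock(&table_lock);
    return m;
}

/* Drop a reference; the last put unhashes and frees under the same lock. */
void put_mp(struct mp *m)
{
    struct mp **pp;

    pthread_mutex_lock(&table_lock);
    if (--m->count == 0) {
        for (pp = &buckets[bucket_of(m->key)]; *pp; pp = &(*pp)->next) {
            if (*pp == m) {
                *pp = m->next;
                break;
            }
        }
        free(m);
    }
    pthread_mutex_unlock(&table_lock);
}
```

    Because every reader and writer of the table takes the same lock, two
    concurrent put_mp calls cannot both unlink the same entry, which is the
    failure mode the report describes.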

    Fixes: ce07d891a089 ("mnt: Honor MNT_LOCKED when detaching mounts")
    Reported-by: Krister Johansen
    Suggested-by: Al Viro
    Acked-by: Al Viro
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

16 Oct, 2016

1 commit

  • Pull gcc plugins update from Kees Cook:
    "This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much possible uncertainty from a running system at boot
    time as possible, hoping to capitalize on any possible variation in
    CPU operation (due to runtime data differences, hardware differences,
    SMP ordering, thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example
    for how to manipulate kernel code using the gcc plugin internals"

    * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    latent_entropy: Mark functions with __latent_entropy
    gcc-plugins: Add latent_entropy plugin

    Linus Torvalds
     

11 Oct, 2016

2 commits

  • The __latent_entropy gcc attribute can be used only on functions and
    variables. If it is on a function then the plugin will instrument it for
    gathering control-flow entropy. If the attribute is on a variable then
    the plugin will initialize it with random contents. The variable must
    be an integer, an integer array type or a structure with integer fields.

    These specific functions have been selected because they are init
    functions (to help gather boot-time entropy), are called at unpredictable
    times, or they have variable loops, each of which provide some level of
    latent entropy.

    Signed-off-by: Emese Revfy
    [kees: expanded commit message]
    Signed-off-by: Kees Cook

    Emese Revfy
     
  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     

01 Oct, 2016

1 commit

  • CAI Qian pointed out that the semantics
    of shared subtrees make it possible to create an exponentially
    increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

    will create 2^20 or 1048576 mounts, which is a practical problem
    as some people have managed to hit this by accident.

    As such CVE-2016-6213 was assigned.

    Ian Kent described the situation for autofs users
    as follows:

    > The number of mounts for direct mount maps is usually not very large because of
    > the way they are implemented, large direct mount maps can have performance
    > problems. There can be anywhere from a few (likely case a few hundred) to less
    > than 10000, plus mounts that have been triggered and not yet expired.
    >
    > Indirect mounts have one autofs mount at the root plus the number of mounts that
    > have been triggered and not yet expired.
    >
    > The number of autofs indirect map entries can range from a few to the common
    > case of several thousand and in rare cases up to between 30000 and 50000. I've
    > not heard of people with maps larger than 50000 entries.
    >
    > The larger the number of map entries the greater the possibility for a large
    > number of active mounts so it's not hard to expect cases of a 1000 or somewhat
    > more active mounts.

    So I am setting the default number of mounts allowed per mount
    namespace at 100,000. This is more than enough for any use case I
    know of, but small enough to quickly stop an exponential increase
    in mounts. Which should be perfect to catch misconfigurations and
    malfunctioning programs.

    For anyone who needs a higher limit this can be changed by writing
    to the new /proc/sys/fs/mount-max sysctl.
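
    The effect of a cap on the doubling can be checked with a toy model.
    This is only an arithmetic sketch under stated assumptions: it starts
    from a single mount, treats each bind mount in the loop as doubling the
    total, and mounts_after is a hypothetical helper, not a kernel function.

```c
/*
 * Toy model (an assumption, not kernel code): start from one mount and let
 * every bind mount in the loop double the total. A doubling that would push
 * the total past `limit` is refused; limit == 0 means "no limit".
 */
unsigned long mounts_after(unsigned int steps, unsigned long limit)
{
    unsigned long mounts = 1;
    unsigned int i;

    for (i = 0; i < steps; i++) {
        if (limit != 0 && mounts * 2 > limit)
            break;          /* the next mount would be refused here */
        mounts *= 2;
    }
    return mounts;
}
```

    In this model mounts_after(20, 0) yields 1048576, while
    mounts_after(20, 100000) stops at 65536 after 16 doublings.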

    Tested-by: CAI Qian
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

23 Sep, 2016

3 commits

  • From: Andrey Vagin

    Each namespace has an owning user namespace, and currently there is no
    way to discover these relationships.

    Pid and user namespaces are hierarchical. There is no way to discover
    parent-child relationships either.

    Why might we want to know the relationships between namespaces?

    One use would be visualization, in order to understand the running
    system. Another would be to answer the question: what capability does
    process X have to perform operations on a resource governed by namespace
    Y?

    One more use case (which is usually called abnormal) is checkpoint/restart.
    In CRIU we are going to dump and restore nested namespaces.

    There [1] was a discussion about which interface to choose to determine
    relationships between namespaces.

    Eric suggested adding two ioctls [2]:
    > Grumble, Grumble. I think this may actually be a case for creating ioctls
    > for these two cases. Now that random nsfs file descriptors are bind
    > mountable the original reason for using proc files is not as pressing.
    >
    > One ioctl for the user namespace that owns a file descriptor.
    > One ioctl for the parent namespace of a namespace file descriptor.

    Here is an implementation of these ioctls.

    $ man man7/namespaces.7
    ...
    Since Linux 4.X, the following ioctl(2) calls are supported for
    namespace file descriptors. The correct syntax is:

    fd = ioctl(ns_fd, ioctl_type);

    where ioctl_type is one of the following:

    NS_GET_USERNS
    Returns a file descriptor that refers to an owning user namespace.

    NS_GET_PARENT
    Returns a file descriptor that refers to a parent namespace.
    This ioctl(2) can be used for pid and user namespaces. For
    user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
    meaning.

    In addition to generic ioctl(2) errors, the following specific ones
    can occur:

    EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

    EPERM The requested namespace is outside of the current namespace
    scope.

    [1] https://lkml.org/lkml/2016/7/6/158
    [2] https://lkml.org/lkml/2016/7/9/101
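
    A user-space caller of the interface described above might look like
    the following sketch. ns_query is a hypothetical helper; the request
    numbers mirror the definitions in <linux/nsfs.h> and are repeated here
    only so the sketch builds against older headers.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Same values as <linux/nsfs.h>; defined locally in case that header is absent. */
#ifndef NS_GET_USERNS
#define NSIO          0xb7
#define NS_GET_USERNS _IO(NSIO, 0x1)
#define NS_GET_PARENT _IO(NSIO, 0x2)
#endif

/*
 * Hypothetical helper: open the namespace file at `path` and apply `request`.
 * Returns a new file descriptor, or -1 with errno set (for example EINVAL for
 * NS_GET_PARENT on a nonhierarchical namespace, EPERM when the answer lies
 * outside the caller's namespace scope).
 */
int ns_query(const char *path, unsigned long request)
{
    int ns_fd = open(path, O_RDONLY);
    int fd, saved_errno;

    if (ns_fd < 0)
        return -1;
    fd = ioctl(ns_fd, request);
    saved_errno = errno;
    close(ns_fd);           /* the returned fd, if any, is independent */
    errno = saved_errno;
    return fd;
}
```

    For example, ns_query("/proc/self/ns/pid", NS_GET_USERNS) asks for the
    user namespace that owns the caller's pid namespace; the caller is
    responsible for closing the descriptor that comes back.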

    Changes for v2:
    * don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
    outside of the init namespace, so we can return EPERM in this case too.
    > The fewer special cases the easier the code is to get
    > correct, and the easier it is to read. // Eric

    Changes for v3:
    * rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Cc: "Eric W. Biederman"
    Cc: James Bottomley
    Cc: "Michael Kerrisk (man-pages)"
    Cc: "W. Trevor King"
    Cc: Alexander Viro
    Cc: Serge Hallyn

    Eric W. Biederman
     
  • Return -EPERM if an owning user namespace is outside of a process's
    current user namespace.

    v2: In the first version ns_get_owner returned ENOENT for init_user_ns.
    This special case was removed in this version. There is nothing
    outside of init_user_ns, so we can return EPERM.
    v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Acked-by: Serge Hallyn
    Signed-off-by: Andrei Vagin
    Signed-off-by: Eric W. Biederman

    Andrey Vagin
     
  • The current error codes returned when the per-user per-user-namespace
    limits are hit (EINVAL, EUSERS, and ENFILE) are wrong. I
    asked for advice on linux-api and it was made clear that those were
    the wrong error codes, but a correct error code was not suggested.

    The best general error code I have found for hitting a resource limit
    is ENOSPC. It is not perfect but as it is unambiguous it will serve
    until someone comes up with a better error code.
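
    The convention can be illustrated with a small sketch. struct ucount
    and ucount_inc are hypothetical names for illustration; the kernel's
    own counters are atomic and hierarchical, which this single-threaded
    model ignores.

```c
#include <errno.h>

/* Hypothetical per-user counter with a configurable ceiling. */
struct ucount {
    long count;
    long max;
};

/* Take one unit; report a hit limit as -ENOSPC rather than EINVAL/EUSERS/ENFILE. */
int ucount_inc(struct ucount *uc)
{
    if (uc->count >= uc->max)
        return -ENOSPC;
    uc->count++;
    return 0;
}
```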

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

16 Sep, 2016

1 commit

  • This patch allows flock, posix locks, ofd locks and leases to work
    correctly on overlayfs.

    Instead of using the underlying inode for storing lock context use the
    overlay inode. This allows locks to be persistent across copy-up.

    This is done by introducing locks_inode() helper and using it instead of
    file_inode() to get the inode in locking code. For non-overlayfs the two
    are equivalent, except for an extra pointer dereference in locks_inode().

    Since lock operations are in "struct file_operations" we must also make
    sure not to call underlying filesystem's lock operations. Introduce a
    super block flag MS_NOREMOTELOCK to this effect.

    Signed-off-by: Miklos Szeredi
    Acked-by: Jeff Layton
    Cc: "J. Bruce Fields"

    Miklos Szeredi
     

31 Aug, 2016

1 commit


30 Jul, 2016

1 commit

  • Pull userns vfs updates from Eric Biederman:
    "This tree contains some very long awaited work on generalizing the
    user namespace support for mounting filesystems to include filesystems
    with a backing store. The real world target is fuse but the goal is
    to update the vfs to allow any filesystem to be supported. This
    patchset is based on a lot of code review and testing to approach that
    goal.

    While looking at what is needed to support the fuse filesystem it
    became clear that there were things like xattrs for security modules
    that needed special treatment. That the resolution of those concerns
    would not be fuse specific. That sorting out these general issues
    made most sense at the generic level, where the right people could be
    drawn into the conversation, and the issues could be solved for
    everyone.

    At a high level this patchset does a couple of simple things:

    - Add a user namespace owner (s_user_ns) to struct super_block.

    - Teach the vfs to handle filesystem uids and gids not mapping into
    kuids and kgids and being reported as INVALID_UID and
    INVALID_GID in vfs data structures.

    By assigning a user namespace owner filesystems that are mounted with
    only user namespace privilege can be detected. This allows security
    modules and the like to know which mounts may not be trusted. This
    also allows the set of uids and gids that are communicated to the
    filesystem to be capped at the set of kuids and kgids that are in the
    owning user namespace of the filesystem.

    One of the crazier corner cases this handles is the case of inodes
    whose i_uid or i_gid are not mapped into the vfs. Most of the code
    simply doesn't care but it is easy to confuse the inode writeback path
    so no operation that could cause an inode write-back is permitted for
    such inodes (aka only reads are allowed).

    This set of changes starts out by cleaning up the code paths involved
    in user namespace permitted mounts. Then when things are clean enough
    adds code that cleanly sets s_user_ns. Then additional restrictions
    are added that are possible now that the filesystem superblock
    contains owner information.

    These changes should not affect anyone in practice, but there are some
    parts of these restrictions that are changes in behavior.

    - Andy's restriction on suid executables that does not honor the
    suid bit when the path is from another mount namespace (think
    /proc/[pid]/fd/) or when the filesystem was mounted by a less
    privileged user.

    - The replacement of the user namespace implicit setting of MNT_NODEV
    with implicitly setting SB_I_NODEV on the filesystem superblock
    instead.

    Using SB_I_NODEV is a stronger form that happens to make this state
    user invisible. The user visibility can be managed but it caused
    problems when it was introduced from applications reasonably
    expecting mount flags to be what they were set to.

    There is a little bit of work remaining before it is safe to support
    mounting filesystems with backing store in user namespaces, beyond
    what is in this set of changes.

    - Verifying the mounter has permission to read/write the block device
    during mount.

    - Teaching the integrity modules IMA and EVM to handle filesystems
    mounted with only user namespace root and to reduce trust in their
    security xattrs accordingly.

    - Capturing the mounters credentials and using that for permission
    checks in d_automount and the like. (Given that overlayfs already
    does this, and we need the work in d_automount it makes sense to
    generalize this case).

    Furthermore there are a few changes that are on the wishlist:

    - Get all filesystems supporting posix acls using the generic posix
    acls so that posix_acl_fix_xattr_from_user and
    posix_acl_fix_xattr_to_user may be removed. [Maintainability]

    - Reducing the permission checks in places such as remount to allow
    the superblock owner to perform them.

    - Allowing the superblock owner to chown files with unmapped uids and
    gids to something that is mapped so the files may be treated
    normally.

    I am not considering even obvious relaxations of permission checks
    until it is clear there are no more corner cases that need to be
    locked down and handled generically.

    Many thanks to Seth Forshee who kept this code alive, and put up
    with me rewriting substantial portions of what he did to handle more
    corner cases, and for his diligent testing and reviewing of my
    changes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
    fs: Call d_automount with the filesystems creds
    fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
    evm: Translate user/group ids relative to s_user_ns when computing HMAC
    dquot: For now explicitly don't support filesystems outside of init_user_ns
    quota: Handle quota data stored in s_user_ns in quota_setxquota
    quota: Ensure qids map to the filesystem
    vfs: Don't create inodes with a uid or gid unknown to the vfs
    vfs: Don't modify inodes with a uid or gid unknown to the vfs
    cred: Reject inodes with invalid ids in set_create_file_as()
    fs: Check for invalid i_uid in may_follow_link()
    vfs: Verify acls are valid within superblock's s_user_ns.
    userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
    fs: Refuse uid/gid changes which don't map into s_user_ns
    selinux: Add support for unprivileged mounts from user namespaces
    Smack: Handle labels consistently in untrusted mounts
    Smack: Add support for unprivileged mounts from user namespaces
    fs: Treat foreign mounts as nosuid
    fs: Limit file caps to the user namespace of the super block
    userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
    userns: Remove implicit MNT_NODEV fragility.
    ...

    Linus Torvalds
     

02 Jul, 2016

1 commit

  • Pull vfs fixes from Al Viro:
    "Tmpfs readdir throughput regression fix (this cycle) + some -stable
    fodder all over the place.

    One missing bit is Miklos' tonight locks.c fix - NFS folks had already
    grabbed that one by the time I woke up ;-)"

    [ The locks.c fix came through the nfsd tree just moments ago ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    namespace: update event counter when umounting a deleted dentry
    9p: use file_dentry()
    ceph: fix d_obtain_alias() misuses
    lockless next_positive()
    libfs.c: new helper - next_positive()
    dcache_{readdir,dir_lseek}(): don't bother with nested ->d_lock

    Linus Torvalds
     

01 Jul, 2016

1 commit

  • - m_start() in fs/namespace.c expects that ns->event is incremented each
    time a mount is added to or removed from ns->list.
    - umount_tree() removes items from the list but does not increment the event
    counter, expecting that it's done before the function is called.
    - There are some code paths that call umount_tree() without updating the
    "event" counter, e.g. from __detach_mounts().
    - When this happens m_start may reuse a cached mount structure that no
    longer belongs to ns->list (i.e. use after free, which usually leads
    to an infinite loop).

    This change fixes the above problem by incrementing global event counter
    before invoking umount_tree().
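
    The caching scheme that depends on the event counter can be modelled
    as follows. This is a simplified sketch: struct node, struct ns_model,
    struct cursor, cursor_start and remove_head are hypothetical names, and
    a singly linked list replaces the kernel's list of mounts.

```c
#include <stddef.h>

struct node {
    struct node *next;
    int id;
};

/* Hypothetical namespace model: a list plus a change counter. */
struct ns_model {
    struct node *head;
    unsigned long event;
};

/* Iterator state that remembers where the previous read stopped. */
struct cursor {
    unsigned long cached_event;
    struct node *cached;
    size_t cached_index;
};

/* Return the element at `pos`, trusting the cache only if the list is unchanged. */
struct node *cursor_start(struct ns_model *ns, struct cursor *c, size_t pos)
{
    struct node *n;
    size_t i;

    if (c->cached && c->cached_event == ns->event && c->cached_index == pos)
        return c->cached;

    /* Cache is stale or absent: walk the list from the head. */
    for (n = ns->head, i = 0; n && i < pos; n = n->next, i++)
        ;
    c->cached_event = ns->event;
    c->cached = n;
    c->cached_index = pos;
    return n;
}

/* Unlink the first element; bumping `event` is what invalidates old cursors. */
struct node *remove_head(struct ns_model *ns)
{
    struct node *n = ns->head;

    if (n) {
        ns->event++;
        ns->head = n->next;
        n->next = NULL;
    }
    return n;
}
```

    If remove_head skipped the increment, a later cursor_start could hand
    back the unlinked node, which corresponds to the stale cached mount
    described in the list above.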

    Change-Id: I622c8e84dcb9fb63542372c5dbf0178ee86bb589
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrey Ulanov
    Signed-off-by: Al Viro

    Andrey Ulanov
     

24 Jun, 2016

5 commits

  • If a process gets access to a mount from a different user
    namespace, that process should not be able to take advantage of
    setuid files or selinux entrypoints from that filesystem. Prevent
    this by treating mounts from other mount namespaces and those not
    owned by current_user_ns() or an ancestor as nosuid.

    This will make it safer to allow more complex filesystems to be
    mounted in non-root user namespaces.

    This does not remove the need for MNT_LOCK_NOSUID. The setuid,
    setgid, and file capability bits can no longer be abused if code in
    a user namespace were to clear nosuid on an untrusted filesystem,
    but this patch, by itself, is insufficient to protect the system
    from abuse of files that, when execed, would increase MAC privilege.

    As a more concrete explanation, any task that can manipulate a
    vfsmount associated with a given user namespace already has
    capabilities in that namespace and all of its descendants. If they
    can cause a malicious setuid, setgid, or file-caps executable to
    appear in that mount, then that executable will only allow them to
    elevate privileges in exactly the set of namespaces in which they
    are already privileged.

    On the other hand, if they can cause a malicious executable to
    appear with a dangerous MAC label, running it could change the
    caller's security context in a way that should not have been
    possible, even inside the namespace in which the task is confined.

    As a hardening measure, this would have made CVE-2014-5207 much
    more difficult to exploit.
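
    The rule stated in the first paragraph can be written down as a small
    predicate. This is a user-space sketch with hypothetical types
    (struct user_ns_model, struct mount_model, may_suid, M_NOSUID); it only
    mirrors the three conditions named in the text and is not the kernel
    implementation.

```c
#include <stdbool.h>
#include <stddef.h>

#define M_NOSUID 0x1u   /* hypothetical flag value for this sketch */

struct user_ns_model {
    struct user_ns_model *parent;   /* NULL for the initial namespace */
};

struct mount_model {
    unsigned int flags;
    const void *mnt_ns;                     /* mount namespace the mount lives in */
    const struct user_ns_model *owner;      /* user namespace owning the filesystem */
};

/* True if `target` is `cur` itself or one of its ancestors. */
static bool is_self_or_ancestor(const struct user_ns_model *cur,
                                const struct user_ns_model *target)
{
    for (; cur; cur = cur->parent)
        if (cur == target)
            return true;
    return false;
}

/* Honor setuid bits only for mounts in our mount namespace whose owner we trust. */
bool may_suid(const struct mount_model *m, const void *cur_mnt_ns,
              const struct user_ns_model *cur_user_ns)
{
    return !(m->flags & M_NOSUID) &&
           m->mnt_ns == cur_mnt_ns &&
           is_self_or_ancestor(cur_user_ns, m->owner);
}
```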

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Seth Forshee
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Andy Lutomirski
     
  • Replace the implicit setting of MNT_NODEV on mounts that happen with
    just user namespace permissions with an implicit setting of SB_I_NODEV
    in s_iflags. The visibility of the implicit MNT_NODEV has caused
    problems in the past.

    With this change the fragile case where an implicit MNT_NODEV needs to
    be preserved in do_remount is removed. Using SB_I_NODEV is much less
    fragile as s_iflags are set during the original mount and never
    changed.

    In do_new_mount with the implicit setting of MNT_NODEV gone, the only
    code that can affect mnt_flags is fs_fully_visible so simplify the if
    statement and reduce the indentation of the code to make that clear.

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Verify all filesystems that we check in mount_too_revealing set
    SB_I_NOEXEC and SB_I_NODEV in sb->s_iflags. That is true for today
    and it should remain true in the future.

    Remove the now unnecessary checks from mnt_already_visible that
    ensure MNT_LOCK_NOSUID, MNT_LOCK_NOEXEC, and MNT_LOCK_NODEV are
    preserved, making the code shorter and easier to read.

    Relying on SB_I_NOEXEC and SB_I_NODEV instead of the user visible
    MNT_NOSUID, MNT_NOEXEC, and MNT_NODEV ensures that the many current
    systems where proc and sysfs are mounted with "nosuid, nodev, noexec",
    and where several slightly buggy container applications don't bother to
    set those flags, continue to work.

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Allowing a filesystem to be mounted by other than root in the initial
    user namespace is a filesystem property not a mount namespace property
    and as such should be checked in filesystem specific code. Move the
    FS_USERNS_MOUNT test into super.c:sget_userns().

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Replace the call of fs_fully_visible in do_new_mount from before the
    new superblock is allocated with a call of mount_too_revealing after
    the superblock is allocated. This winds up being a much better location
    for maintainability of the code.

    The first change this enables is the replacement of FS_USERNS_VISIBLE
    with SB_I_USERNS_VISIBLE, moving the flag from struct file_system_type
    to s_iflags on the superblock.

    Unfortunately mount_too_revealing fundamentally needs to touch
    mnt_flags adding several MNT_LOCKED_XXX flags at the appropriate
    times. If the mnt_flags did not need to be touched the code
    could be easily moved into the filesystem specific mount code.

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

15 Jun, 2016

1 commit

  • In rare cases it is possible for s_flags & MS_RDONLY to be set but
    MNT_READONLY to be clear. This starting combination can cause
    fs_fully_visible to fail to ensure that the new mount is readonly.
    Therefore force MNT_LOCK_READONLY in the new mount if MS_RDONLY
    is set on the source filesystem of the mount.

    In general both MS_RDONLY and MNT_READONLY are set at the same time for
    mounts so I don't expect any programs to care. Nor do I expect
    MS_RDONLY to be set on proc or sysfs in the initial user namespace,
    which further decreases the likelihood of problems.

    Which means this change should only affect system configurations by
    paranoid sysadmins who should welcome the additional protection
    as it keeps people from wriggling out of their policies.

    Cc: stable@vger.kernel.org
    Fixes: 8c6cf9cc829f ("mnt: Modify fs_fully_visible to deal with locked ro nodev and atime")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

07 Jun, 2016

2 commits

  • MNT_LOCKED on a child mount implies the child is locked to the
    parent. So while looping through the children the children should be
    tested (not their parent).

    Typically an unshare of a mount namespace locks all mounts together,
    marking both the parent and the slave as locked, but there are a few
    corner cases where other things work.

    Cc: stable@vger.kernel.org
    Fixes: ceeb0e5d39fc ("vfs: Ignore unlocked mounts in fs_fully_visible")
    Reported-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Add this trivial missing error handling.

    Cc: stable@vger.kernel.org
    Fixes: 1b852bceb0d1 ("mnt: Refactor the logic for mounting sysfs and proc in a user namespace")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.
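
    The correspondence between the wrappers and the mutex calls can be
    modelled in user space. This sketch is an assumption-laden stand-in:
    struct inode here is a hypothetical one-field type and a pthread mutex
    plays the role of ->i_mutex; only three of the five wrappers have a
    direct pthread counterpart and are shown.

```c
#include <pthread.h>

/* Hypothetical stand-in for the kernel's struct inode. */
struct inode {
    pthread_mutex_t i_mutex;
};

/* inode_foo(inode) is mutex_foo(&inode->i_mutex). */
static inline void inode_lock(struct inode *inode)
{
    pthread_mutex_lock(&inode->i_mutex);
}

static inline void inode_unlock(struct inode *inode)
{
    pthread_mutex_unlock(&inode->i_mutex);
}

/* Returns 1 if the lock was taken and 0 otherwise, like the kernel's mutex_trylock(). */
static inline int inode_trylock(struct inode *inode)
{
    return pthread_mutex_trylock(&inode->i_mutex) == 0;
}
```

    Callers that go through such wrappers do not name the lock type, so
    the field behind them can later change (for instance to an rwsem)
    without touching every call site.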

    Signed-off-by: Al Viro

    Al Viro
     

13 Jan, 2016

1 commit

  • Pull misc vfs updates from Al Viro:
    "All kinds of stuff. That probably should've been 5 or 6 separate
    branches, but by the time I'd realized how large and mixed that bag
    had become it had been too close to -final to play with rebasing.

    Some fs/namei.c cleanups there, memdup_user_nul() introduction and
    switching open-coded instances, burying long-dead code, whack-a-mole
    of various kinds, several new helpers for ->llseek(), assorted
    cleanups and fixes from various people, etc.

    One piece probably deserves special mention - Neil's
    lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets
    called without ->i_mutex and tries to avoid ever taking it. That, of
    course, means that it's not useful for any directory modifications,
    but things like getting inode attributes in nfsd readdirplus are fine
    with that. I really should've asked for moratorium on lookup-related
    changes this cycle, but since I hadn't done that early enough... I
    *am* asking for that for the coming cycle, though - I'm going to try
    and get conversion of i_mutex to rwsem with ->lookup() done under lock
    taken shared.

    There will be a patch closer to the end of the window, along the lines
    of the one Linus had posted last May - mechanical conversion of
    ->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
    inode_is_locked()/inode_lock_nested(). To quote Linus back then:

    -----
    | This is an automated patch using
    |
    | sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
    | sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
    | sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[ ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
    | sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
    | sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
    |
    | with a very few manual fixups
    -----

    I'm going to send that once the ->i_mutex-affecting stuff in -next
    gets mostly merged (or when Linus says he's about to stop taking
    merges)"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    nfsd: don't hold i_mutex over userspace upcalls
    fs:affs:Replace time_t with time64_t
    fs/9p: use fscache mutex rather than spinlock
    proc: add a reschedule point in proc_readfd_common()
    logfs: constify logfs_block_ops structures
    fcntl: allow to set O_DIRECT flag on pipe
    fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
    fs: xattr: Use kvfree()
    [s390] page_to_phys() always returns a multiple of PAGE_SIZE
    nbd: use ->compat_ioctl()
    fs: use block_device name vsprintf helper
    lib/vsprintf: add %*pg format specifier
    fs: use gendisk->disk_name where possible
    poll: plug an unused argument to do_poll
    amdkfd: don't open-code memdup_user()
    cdrom: don't open-code memdup_user()
    rsxx: don't open-code memdup_user()
    mtip32xx: don't open-code memdup_user()
    [um] mconsole: don't open-code memdup_user_nul()
    [um] hostaudio: don't open-code memdup_user()
    ...

    Linus Torvalds
     

04 Jan, 2016

1 commit


07 Dec, 2015

1 commit


16 Nov, 2015

2 commits

  • Since no one uses mandatory locking and files with mandatory locks can
    cause problems don't allow them in user namespaces.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Jeff Layton

    Eric W. Biederman
     
  • Mandatory locking appears to be almost unused and buggy and there
    appears no real interest in doing anything with it. Since effectively
    no one uses the code and since the code is buggy let's allow it to be
    disabled at compile time. I would just suggest removing the code but
    undoubtedly that will break some piece of userspace code somewhere.

    For the distributions that don't care about this piece of code
    this gives a nice starting point to make mandatory locking go away.

    Cc: Benjamin Coddington
    Cc: Dmitry Vyukov
    Cc: Jeff Layton
    Cc: J. Bruce Fields
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Jeff Layton

    Jeff Layton
     

02 Sep, 2015

1 commit

  • Pull user namespace updates from Eric Biederman:
    "This finishes up the changes to ensure proc and sysfs do not start
    implementing executable files, as there are applications today that
    are only secure because such files do not exist.

    It also fixes a long standing misfeature of /proc/<pid>/mountinfo that
    did not show the proper source for files bind mounted from
    /proc/<pid>/ns/*.

    It also straightens out the handling of clone flags related to user
    namespaces, fixing an unnecessary failure of unshare(CLONE_NEWUSER)
    when files such as /proc/<pid>/environ are read while <pid> is calling
    unshare. This winds up fixing a minor bug in unshare flag handling
    that dates back to the first version of unshare in the kernel.

    Finally, this fixes a minor regression caused by the introduction of
    sysfs_create_mount_point, which broke someone's in house application,
    by restoring the size of /sys/fs/cgroup to 0 bytes. Apparently that
    application uses the directory size to determine if a tmpfs is mounted
    on /sys/fs/cgroup.

    The bind mount escape fixes are present in Al Viro's for-next branch,
    and I expect them to come from there. The bind mount escape is the
    last of the user namespace related security bugs that I am aware of"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    fs: Set the size of empty dirs to 0.
    userns,pidns: Force thread group sharing, not signal handler sharing.
    unshare: Unsharing a thread does not require unsharing a vm
    nsfs: Add a show_path method to fix mountinfo
    mnt: fs_fully_visible enforce noexec and nosuid if !SB_I_NOEXEC
    vfs: Commit to never having exectuables on proc and sysfs.

    Linus Torvalds
     

24 Jul, 2015

1 commit

  • The handling of in detach_mounts of unmounted but connected mounts is
    buggy and can lead to an infinite loop.

    Correct the handling of unmounted mounts in detach_mounts. When the
    mountpoint of an unmounted but connected mount is connected to a
    dentry, and that dentry is deleted, we need to disconnect that mount
    from the parent mount and the deleted dentry.

    Nothing changes for the unmounted and connected children. They can be
    safely ignored.

    Cc: stable@vger.kernel.org
    Fixes: ce07d891a0891d3c0d0c2d73d577490486b809e1 mnt: Honor MNT_LOCKED when detaching mounts
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

23 Jul, 2015

1 commit

  • rmdir mntpoint will result in an infinite loop when there is
    a mount locked on the mountpoint in another mount namespace.

    This is because the logic to test to see if a mount should
    be disconnected in umount_tree is buggy.

    Move the logic to decide if a mount should remain connected to
    its mountpoint into its own function, disconnect_mount, so that
    clarity of expression instead of terseness of expression becomes
    a virtue.

    When the conditions where it is invalid to leave a mount connected
    are first ruled out, the logic for deciding if a mount should
    be disconnected becomes much clearer and simpler.
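
    The "rule out the invalid cases first" structure can be sketched in
    user space as below. This is a toy model, not the kernel function:
    struct toy_mount, the HOW_* flags and the specific conditions are
    assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags and fields, for illustration only. */
enum umount_how { HOW_SYNC = 1, HOW_CONNECTED = 2 };

struct toy_mount {
    struct toy_mount *parent;   /* NULL when there is no parent */
    bool unmounted;             /* already marked as unmounted */
    bool locked;                /* locked onto its mountpoint */
};

/* Returns true when the mount must be disconnected from its mountpoint. */
static bool toy_disconnect_mount(const struct toy_mount *m, int how)
{
    assert(m != NULL);

    /* First rule out the cases where staying connected is invalid. */
    if (how & HOW_SYNC)
        return true;            /* assumed: only lazy unmounts may stay connected */
    if (m->parent == NULL)
        return true;            /* nothing to stay connected to */
    if (!m->parent->unmounted)
        return true;            /* assumed: parent is still mounted */

    /* With those gone, the remaining decision is short. */
    if (how & HOW_CONNECTED)
        return false;           /* caller asked to keep it connected */
    if (m->locked)
        return false;           /* locked mounts stay on their mountpoint */
    return true;                /* default: disconnect */
}
```

    Each early return handles one case on its own, so no single compound
    condition has to encode every combination.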

    Fixes: e0c9c0afd2fc958ffa34b697972721d81df8a56f mnt: Update detach_mounts to leave mounts connected
    Fixes: ce07d891a0891d3c0d0c2d73d577490486b809e1 mnt: Honor MNT_LOCKED when detaching mounts
    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

10 Jul, 2015

1 commit

  • The filesystems proc and sysfs do not have executable files today,
    and portions of userspace break if we enforce consistency of the
    nosuid and noexec flags between previous mounts and new mounts of
    proc and sysfs.

    Add the code to enforce consistency of the nosuid and noexec flags,
    and use the presence of SB_I_NOEXEC to signal that there is no need to
    bother.

    This results in a completely userspace invisible change that makes it
    clear fs_fully_visible can only skip the enforcement of noexec and
    nosuid because it is known the filesystems in question do not support
    executables.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

04 Jul, 2015

1 commit

  • Pull user namespace updates from Eric Biederman:
    "Long ago and far away when user namespaces were young it was realized
    that allowing fresh mounts of proc and sysfs with only user namespace
    permissions could violate the basic rule that only root gets to decide
    if proc or sysfs should be mounted at all.

    Some hacks were put in place to reduce the worst of the damage that could
    be done, and the common sense rule was adopted that fresh mounts of
    proc and sysfs should allow no more than bind mounts of proc and
    sysfs. Unfortunately that rule has not been fully enforced.

    There are two kinds of gaps in that enforcement. Only filesystems
    mounted on empty directories of proc and sysfs should be ignored but
    the test for empty directories was insufficient. So in my tree
    directories on proc, sysctl and sysfs that will always be empty are
    created specially. Every other technique is imperfect as an ordinary
    directory can have entries added even after a readdir returns and
    shows that the directory is empty. Special creation of directories
    for mount points makes the code in the kernel a smidge clearer about
    its purpose. I asked container developers from the various container
    projects to help test this and no holes were found in the set of mount
    points on proc and sysfs that are created specially.

    This set of changes also starts enforcing the mount flags of fresh
    mounts of proc and sysfs are consistent with the existing mount of
    proc and sysfs. I expected this to be the boring part of the work but
    unfortunately unprivileged userspace winds up mounting fresh copies of
    proc and sysfs with noexec and nosuid clear when root set those flags
    on the previous mount of proc and sysfs. So for now only the atime,
    read-only and nodev attributes which userspace happens to keep
    consistent are enforced. Dealing with the noexec and nosuid
    attributes remains for another time.

    This set of changes also addresses an issue with how open file
    descriptors from /proc//ns/* are displayed. Recently readlink of
    /proc//fd has been triggering a WARN_ON that has not been
    meaningful since it was added (as all of the code in the kernel was
    converted) and is now actively wrong.

    There is also a short list of issues that have not been fixed yet that
    I will mention briefly.

    It is possible to rename a directory from below to above a bind mount.
    At which point any directory pointers below the renamed directory can
    be walked up to the root directory of the filesystem. With user
    namespaces enabled a bind mount of the bind mount can be created
    allowing the user to pick a directory whose children they can rename
    to outside of the bind mount. This is challenging to fix and doubly
    so because all obvious solutions must touch code that is in the
    performance part of pathname resolution.

    As mentioned above there is also a question of how to ensure that
    developers by accident or with purpose do not introduce executable
    files on sysfs and proc and in doing so introduce security regressions
    in the current userspace that will not be immediately obvious and as
    such are likely to require breaking userspace in painful ways once
    they are recognized"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    vfs: Remove incorrect debugging WARN in prepend_path
    mnt: Update fs_fully_visible to test for permanently empty directories
    sysfs: Create mountpoints with sysfs_create_mount_point
    sysfs: Add support for permanently empty directories to serve as mount points.
    kernfs: Add support for always empty directories.
    proc: Allow creating permanently empty directories that serve as mount points
    sysctl: Allow creating permanently empty directories that serve as mountpoints.
    fs: Add helper functions for permanently empty directories.
    vfs: Ignore unlocked mounts in fs_fully_visible
    mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
    mnt: Refactor the logic for mounting sysfs and proc in a user namespace

    Linus Torvalds
     

01 Jul, 2015

3 commits

  • fs_fully_visible attempts to make fresh mounts of proc and sysfs give
    the mounter no more access to proc and sysfs than they could have
    obtained by creating a bind mount. One aspect of proc and sysfs that makes
    this particularly tricky is that there are other filesystems that
    typically mount on top of proc and sysfs. As those filesystems are
    mounted on empty directories in practice it is safe to ignore them.
    However testing to ensure filesystems are mounted on empty directories
    has not been something the in kernel data structures have supported so
    the current test for an empty directory which checks to see
    if nlink i_mode) && i_nlink

    Eric W. Biederman
     
  • Limit the mounts fs_fully_visible considers to locked mounts.
    Unlocked mounts can always be unmounted so considering them adds hassle
    but no security benefit.

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • A patchset to remove support for passing pre-allocated struct seq_file to
    seq_open(). This feature is undocumented and prone to error.

    In particular, if seq_release() is used in release handler, it will
    kfree() a pointer which was not allocated by seq_open().

    So this patchset drops support for pre-allocated struct seq_file: it's
    only of use in proc_namespace.c and can be easily replaced by using
    seq_open_private()/seq_release_private().

    Additionally, it documents the use of file->private_data to hold pointer
    to struct seq_file by seq_open().

    This patch (of 3):

    Since the patch described below, from v2.6.15-rc1, seq_open() could use a
    struct seq_file already allocated by the caller if the pointer to the
    structure is stored in file->private_data before calling the function.

    Commit 1abe77b0fc4b485927f1f798ae81a752677e1d05
    Author: Al Viro
    Date: Mon Nov 7 17:15:34 2005 -0500

    [PATCH] allow callers of seq_open do allocation themselves

    Allow caller of seq_open() to kmalloc() seq_file + whatever else they
    want and set ->private_data to it. seq_open() will then abstain from
    doing allocation itself.

    Such behavior is only used by mounts_open_common().

    In order to drop support for such an uncommon feature, proc_mounts is
    converted to use seq_open_private(), which takes care of allocating the
    proc_mounts structure, making it available through ->private in struct
    seq_file.

    Conversely, proc_mounts is converted to use seq_release_private(), in
    order to release the private structure allocated by seq_open_private().

    Then, ->private is used directly instead of the proc_mounts() macro to
    access the proc_mounts structure.
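
    The ownership rule this conversion restores can be sketched in user
    space: the open helper always allocates both the handle and the
    private data, so the release helper can free both without guessing
    who allocated what. The toy_* names and types below are assumptions
    for illustration and only stand in for struct file, struct seq_file,
    seq_open_private() and seq_release_private().

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins for struct file and struct seq_file; not the kernel types. */
struct toy_seq  { void *private_data; };
struct toy_file { struct toy_seq *seq; };

/* Always allocates the seq handle itself, plus psize bytes of private data. */
static void *toy_open_private(struct toy_file *f, size_t psize)
{
    assert(f != NULL && f->seq == NULL);

    struct toy_seq *s = calloc(1, sizeof(*s));
    if (s == NULL)
        return NULL;

    s->private_data = calloc(1, psize);
    if (s->private_data == NULL) {
        free(s);
        return NULL;
    }

    f->seq = s;
    return s->private_data;
}

/* Frees exactly what toy_open_private() allocated, and nothing else. */
static void toy_release_private(struct toy_file *f)
{
    assert(f != NULL && f->seq != NULL);

    free(f->seq->private_data);
    free(f->seq);
    f->seq = NULL;
}
```

    Because no caller-provided handle is ever accepted, the release path
    cannot end up freeing a pointer it did not allocate.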

    Link: http://lkml.kernel.org/r/cover.1433193673.git.ydroneaud@opteya.com
    Signed-off-by: Yann Droneaud
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yann Droneaud
     

04 Jun, 2015

1 commit

  • Ignore an existing mount if the locked readonly, nodev or atime
    attributes are less permissive than the desired attributes
    of the new mount.

    On success ensure the new mount locks all of the same readonly, nodev and
    atime attributes as the old mount.
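
    As a rough illustration of the rule in the two paragraphs above, the
    check can be modelled as a bitmask comparison. The ATTR_* values and
    toy_attrs_compatible() are hypothetical; the kernel uses its own
    MNT_* constants, and atime handling is reduced to a single bit here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values; the kernel's MNT_* constants differ. */
enum {
    ATTR_READONLY = 1 << 0,
    ATTR_NODEV    = 1 << 1,
    ATTR_NOATIME  = 1 << 2,     /* simplification of the atime modes */
    ATTR_CHECKED  = ATTR_READONLY | ATTR_NODEV | ATTR_NOATIME,
};

/*
 * old_locked: restrictive attributes locked on the existing mount.
 * new_attrs:  restrictive attributes requested for the new mount.
 * On success, *new_locked receives the attributes the new mount must lock.
 */
static bool toy_attrs_compatible(unsigned old_locked, unsigned new_attrs,
                                 unsigned *new_locked)
{
    assert(new_locked != NULL);

    unsigned required = old_locked & ATTR_CHECKED;

    /* Ignore the old mount if the new mount would drop a locked restriction. */
    if ((required & new_attrs) != required)
        return false;

    /* Carry the same locks over to the new mount. */
    *new_locked = required;
    return true;
}
```

    A new mount may be stricter than the old one, but never looser on an
    attribute the old mount has locked.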

    The nosuid and noexec attributes are not checked here as this change
    is destined for stable and enforcing those attributes causes a
    regression in lxc and libvirt-lxc where those applications will not
    start, and there are no known executables on sysfs or proc and no known
    way to create executables without code modifications.

    Cc: stable@vger.kernel.org
    Fixes: e51db73532955 ("userns: Better restrictions on when proc and sysfs can be mounted")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

14 May, 2015

1 commit

  • Fresh mounts of proc and sysfs are a very special case that works very
    much like a bind mount. Unfortunately the current structure can not
    preserve the MNT_LOCK... mount flags. Therefore refactor the logic
    into a form that can be modified to preserve those lock bits.

    Add a new filesystem flag FS_USERNS_VISIBLE that requires some mount
    of the filesystem be fully visible in the current mount namespace,
    before the filesystem may be mounted.

    Move the logic for calling fs_fully_visible from proc and sysfs into
    fs/namespace.c where it has greater access to mount namespace state.

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

11 May, 2015

1 commit

  • same as legitimize_mnt(), except that it does *not* drop and regain
    rcu_read_lock; return values are
    0 => grabbed a reference, we are fine
    1 => failed, just go away
    -1 => failed, go away and mntput(bastard) when outside of rcu_read_lock
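
    A caller of a helper with this three-way return convention might
    handle the result as sketched below. Everything here is a user-space
    toy: toy_ref, toy_put() and toy_handle_result() are hypothetical
    names, and toy_put() only stands in for mntput().

```c
#include <assert.h>
#include <stddef.h>

/* Toy outcome codes mirroring the 0 / 1 / -1 convention described above. */
enum { GRABBED = 0, FAILED = 1, FAILED_NEEDS_PUT = -1 };

struct toy_ref { int count; };

/* Stand-in for mntput(): drops one reference. */
static void toy_put(struct toy_ref *r)
{
    assert(r != NULL && r->count > 0);
    r->count--;
}

/*
 * Interprets a result obtained inside a read-side critical section.
 * Returns 1 if the caller now holds a reference, 0 otherwise.
 * *deferred is set to the object that must be put once the critical
 * section has been left, or NULL if there is nothing to put.
 */
static int toy_handle_result(int res, struct toy_ref *r,
                             struct toy_ref **deferred)
{
    *deferred = NULL;

    switch (res) {
    case GRABBED:
        return 1;               /* reference held, carry on */
    case FAILED_NEEDS_PUT:
        *deferred = r;          /* put it later, outside the section */
        return 0;
    default:
        return 0;               /* plain failure, nothing to clean up */
    }
}
```

    The put is deferred rather than done in place because, per the
    description above, it has to happen outside of rcu_read_lock.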

    Signed-off-by: Al Viro

    Al Viro
     

10 May, 2015

1 commit


10 Apr, 2015

1 commit

  • Now that it is possible to lazily unmount an entire mount tree and
    leave the individual mounts connected to each other, add a new flag
    UMOUNT_CONNECTED to umount_tree to force this behavior and use
    this flag in detach_mounts.

    This closes a bug where the deletion of a file or directory could
    trigger an unmount and reveal data under a mount point.

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman