02 Jul, 2022

1 commit

  • commit 705191b03d507744c7e097f78d583621c14988ac upstream.

    Last cycle we extended the idmapped mounts infrastructure to support
    idmapped mounts of idmapped filesystems (no such filesystem exists yet).
    Since then, the meaning of an idmapped mount is a mount whose idmapping
    is different from the filesystem's idmapping.

    While doing that work we failed to adapt the acl translation helpers.
    They still assume that checking for the identity mapping is enough. But
    they need to use the no_idmapping() helper instead.

    Note, POSIX ACLs are always translated right at the userspace-kernel
    boundary using the caller's current idmapping and the initial idmapping.
    The order depends on whether we're coming from or going to userspace.
    The filesystem's idmapping doesn't matter at the border.

    Consequently, if a non-idmapped mount is passed we need to make sure to
    always pass the initial idmapping as the mount's idmapping and not the
    filesystem idmapping. Since it's irrelevant here it would yield invalid
    ids and prevent setting acls for filesystems that are mountable in a
    userns and support posix acls (tmpfs and fuse).

    I verified the regression reported in [1] and verified that this patch
    fixes it. A regression test will be added to xfstests in parallel.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=215849 [1]
    Fixes: bd303368b776 ("fs: support mapped mounts of mapped filesystems")
    Cc: Seth Forshee
    Cc: Christoph Hellwig
    Cc: # 5.15+
    Signed-off-by: Christian Brauner (Microsoft)
    Signed-off-by: Linus Torvalds
    Signed-off-by: Christian Brauner (Microsoft)
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
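
The translation described in this commit can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `toy_idmap`, `toy_map_down()` and `toy_acl_id_from_user()` are hypothetical names, each idmapping is reduced to a single extent, and none of this is the kernel's actual helper code. It shows why passing the initial idmapping for a non-idmapped mount leaves the id intact, while passing the filesystem's idmapping in its place yields an invalid id.

```c
#include <stdint.h>

#define TOY_INVALID_ID UINT32_MAX

/* Hypothetical one-extent idmapping: ids [first, first + count) on the
 * upper side correspond to [lower_first, lower_first + count) below. */
struct toy_idmap {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* The initial idmapping maps every valid id to itself. */
static const struct toy_idmap toy_initial = { 0, 0, UINT32_MAX };

/* Example: a user namespace whose ids 0..65535 map to 100000..165535. */
static const struct toy_idmap toy_userns = { 0, 100000, 65536 };

/* Map an id down through one extent; ids outside it have no mapping. */
static uint32_t toy_map_down(const struct toy_idmap *m, uint32_t id)
{
	if (id < m->first || id - m->first >= m->count)
		return TOY_INVALID_ID;
	return m->lower_first + (id - m->first);
}

/* Translate an ACL id arriving from userspace: first through the caller's
 * idmapping, then through whatever is passed as the mount's idmapping. */
static uint32_t toy_acl_id_from_user(const struct toy_idmap *caller,
				     const struct toy_idmap *mnt, uint32_t id)
{
	uint32_t kid = toy_map_down(caller, id);

	if (kid == TOY_INVALID_ID)
		return TOY_INVALID_ID;
	return toy_map_down(mnt, kid);
}
```

In this model, a caller inside `toy_userns` setting an ACL for id 1000 on a tmpfs mounted in that same namespace gets 101000 when the initial idmapping is passed as the mount's idmapping, and an invalid id when the filesystem's idmapping (here also `toy_userns`) is passed instead.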
     

23 Mar, 2021

1 commit

  • Fix kernel-doc warnings in xattr.c:

    ../fs/xattr.c:257: warning: Function parameter or member 'mnt_userns' not described in '__vfs_setxattr_locked'
    ../fs/xattr.c:485: warning: Function parameter or member 'mnt_userns' not described in '__vfs_removexattr_locked'

    and fix one function whose kernel-doc was not in the correct format.

    Link: https://lore.kernel.org/r/20210216042929.8931-4-rdunlap@infradead.org
    Fixes: 71bc356f93a1 ("commoncap: handle idmapped mounts")
    Fixes: b1ab7e4b2a88 ("VFS: Factor out part of vfs_setxattr so it can be called from the SELinux hook for inode_setsecctx.")
    Signed-off-by: Randy Dunlap
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Christian Brauner
    Cc: David P. Quigley
    Cc: James Morris
    Reviewed-by: Christoph Hellwig
    Acked-by: Christian Brauner
    Signed-off-by: Christian Brauner

    Randy Dunlap
     

24 Jan, 2021

6 commits

  • The may_follow_link(), may_linkat(), may_lookup(), may_open(),
    may_o_create(), may_create_in_sticky(), may_delete(), and may_create()
    helpers determine whether the caller is privileged enough to perform the
    associated operations. Let them handle idmapped mounts by mapping the
    inode or fsids according to the mount's user namespace. Afterwards the
    checks are identical to non-idmapped inodes. The patch takes care to
    retrieve the mount's user namespace right before performing permission
    checks and passing it down into the filesystem so the user namespace
    can't change in between by someone idmapping a mount that is currently
    not idmapped. If the initial user namespace is passed nothing changes so
    non-idmapped mounts will see identical behavior as before.

    Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Morris
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • When interacting with user namespace and non-user namespace aware
    filesystem capabilities the vfs will perform various security checks to
    determine whether or not the filesystem capabilities can be used by the
    caller, whether they need to be removed and so on. The main
    infrastructure for this resides in the capability codepaths but they are
    called through the LSM security infrastructure even though they are not
    technically an LSM or optional. This extends the existing security hooks
    security_inode_removexattr(), security_inode_killpriv(),
    security_inode_getsecurity() to pass down the mount's user namespace and
    makes them aware of idmapped mounts.

    In order to actually get filesystem capabilities from disk the
    capability infrastructure exposes the get_vfs_caps_from_disk() helper.
    For user namespace aware filesystem capabilities a root uid is stored
    alongside the capabilities.

    In order to determine whether the caller can make use of the filesystem
    capability or whether it needs to be ignored it is translated according
    to the superblock's user namespace. If it can be translated to uid 0
    according to that id mapping the caller can use the filesystem
    capabilities stored on disk. If we are accessing the inode that holds
    the filesystem capabilities through an idmapped mount we map the root
    uid according to the mount's user namespace. Afterwards the checks are
    identical to non-idmapped mounts: reading filesystem caps from disk
    enforces that the root uid associated with the filesystem capability
    must have a mapping in the superblock's user namespace and that the
    caller is either in the same user namespace or is a descendant of the
    superblock's user namespace. For filesystems that are mountable inside
    user namespaces the caller can just mount the filesystem and won't
    usually need to idmap it. If they do want to idmap it they can create an
    idmapped mount and mark it with a user namespace they created and which
    is thus a descendant of s_user_ns. For filesystems that are not
    mountable inside user namespaces the descendant rule is trivially true
    because the s_user_ns will be the initial user namespace.

    If the initial user namespace is passed nothing changes so non-idmapped
    mounts will see identical behavior as before.

    Link: https://lore.kernel.org/r/20210121131959.646623-11-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Acked-by: James Morris
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • When interacting with extended attributes the vfs verifies that the
    caller is privileged over the inode with which the extended attribute is
    associated. For posix access and posix default extended attributes a uid
    or gid can be stored on-disk. Let the functions handle posix extended
    attributes on idmapped mounts. If the inode is accessed through an
    idmapped mount we need to map it according to the mount's user
    namespace. Afterwards the checks are identical to non-idmapped mounts.
    This has no effect for e.g. security xattrs since they don't store uids
    or gids and don't perform permission checks on them like posix acls do.

    Link: https://lore.kernel.org/r/20210121131959.646623-10-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Morris
    Signed-off-by: Tycho Andersen
    Signed-off-by: Christian Brauner

    Tycho Andersen
     
  • The posix acl permission checking helpers determine whether a caller is
    privileged over an inode according to the acls associated with the
    inode. Add helpers that make it possible to handle acls on idmapped
    mounts.

    The vfs and the filesystems targeted by this first iteration make use of
    posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
    translate basic posix access and default permissions such as the
    ACL_USER and ACL_GROUP type according to the initial user namespace (or
    the superblock's user namespace) to and from the caller's current user
    namespace. Adapt these two helpers to handle idmapped mounts whereby we
    either map from or into the mount's user namespace depending on in which
    direction we're translating.
    Similarly, cap_convert_nscap() is used by the vfs to translate user
    namespace and non-user namespace aware filesystem capabilities from the
    superblock's user namespace to the caller's user namespace. Enable it to
    handle idmapped mounts by accounting for the mount's user namespace.

    In addition the filesystems targeted in the first iteration of this patch
    series make use of the posix_acl_chmod() and posix_acl_update_mode()
    helpers. Both helpers perform permission checks on the target inode. Let
    them handle idmapped mounts. These two helpers are called when posix
    acls are set by the respective filesystems. To handle this case we extend
    the ->set() method to take an additional user namespace argument to pass
    the mount's user namespace down.

    Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner

    Christian Brauner
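
As a rough illustration of what translating "the ACL_USER and ACL_GROUP type" means, the sketch below walks a list of ACL entries and remaps only the entries that carry a real uid or gid. The entry layout and tag values follow the POSIX ACL xattr format; `toy_idmap`, `toy_map_id()` and `toy_acl_fix_ids()` are hypothetical names, and the single-extent mapping is a simplifying assumption, not the kernel's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Tag values used by POSIX ACL xattr entries. */
#define TOY_ACL_USER_OBJ  0x01
#define TOY_ACL_USER      0x02
#define TOY_ACL_GROUP_OBJ 0x04
#define TOY_ACL_GROUP     0x08
#define TOY_ACL_MASK      0x10
#define TOY_ACL_OTHER     0x20

struct toy_acl_entry {
	uint16_t e_tag;
	uint16_t e_perm;
	uint32_t e_id;
};

/* Hypothetical single-extent idmapping. */
struct toy_idmap {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Ids outside the extent have no mapping and become UINT32_MAX. */
static uint32_t toy_map_id(const struct toy_idmap *m, uint32_t id)
{
	if (id < m->first || id - m->first >= m->count)
		return UINT32_MAX;
	return m->lower_first + (id - m->first);
}

/* Only ACL_USER and ACL_GROUP entries name a specific uid or gid, so only
 * those entries are rewritten; all other tags are left untouched. */
static void toy_acl_fix_ids(struct toy_acl_entry *e, size_t n,
			    const struct toy_idmap *m)
{
	for (size_t i = 0; i < n; i++) {
		switch (e[i].e_tag) {
		case TOY_ACL_USER:
		case TOY_ACL_GROUP:
			e[i].e_id = toy_map_id(m, e[i].e_id);
			break;
		default:
			break;
		}
	}
}
```

The direction of the translation (to or from the caller) would be chosen by which mapping is passed in; the sketch models one direction only.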
     
  • The inode_owner_or_capable() helper determines whether the caller is the
    owner of the inode or is capable with respect to that inode. Allow it to
    handle idmapped mounts. If the inode is accessed through an idmapped
    mount, map it according to the mount's user namespace. Afterwards the checks
    are identical to non-idmapped mounts. If the initial user namespace is
    passed nothing changes so non-idmapped mounts will see identical
    behavior as before.

    Similarly, allow the inode_init_owner() helper to handle idmapped
    mounts. It initializes a new inode on idmapped mounts by mapping the
    fsuid and fsgid of the caller from the mount's user namespace. If the
    initial user namespace is passed nothing changes so non-idmapped mounts
    will see identical behavior as before.

    Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Morris
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • The two helpers inode_permission() and generic_permission() are used by
    the vfs to perform basic permission checking by verifying that the
    caller is privileged over an inode. In order to handle idmapped mounts
    we extend the two helpers with an additional user namespace argument.
    On idmapped mounts the two helpers will make sure to map the inode
    according to the mount's user namespace and then perform identical
    permission checks to inode_permission() and generic_permission(). If the
    initial user namespace is passed nothing changes so non-idmapped mounts
    will see identical behavior as before.

    Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: Christian Brauner

    Christian Brauner
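
The idea of "map the inode, then run the usual checks" can be sketched in userspace as follows. This is a deliberately reduced model, assuming a single-extent mapping and only the owner and other permission classes; `toy_idmap`, `toy_map_id()` and `toy_permission()` are hypothetical names and do not reproduce generic_permission().

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_INVALID_ID UINT32_MAX

/* Hypothetical single-extent idmapping attached to a mount. */
struct toy_idmap {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

static uint32_t toy_map_id(const struct toy_idmap *m, uint32_t id)
{
	if (id < m->first || id - m->first >= m->count)
		return TOY_INVALID_ID;
	return m->lower_first + (id - m->first);
}

/* mask uses the classic rwx bits: 4 = read, 2 = write, 1 = execute.
 * The inode owner is mapped through the mount first; after that the
 * check is the same as it would be without any idmapping. */
static bool toy_permission(const struct toy_idmap *mnt, uint32_t i_uid,
			   unsigned int i_mode, uint32_t fsuid,
			   unsigned int mask)
{
	uint32_t owner = toy_map_id(mnt, i_uid);
	unsigned int bits;

	if (owner != TOY_INVALID_ID && owner == fsuid)
		bits = (i_mode >> 6) & 7;	/* owner class */
	else
		bits = i_mode & 7;		/* other class */
	return (bits & mask) == mask;
}
```

Passing an identity mapping such as `{ 0, 0, UINT32_MAX }` corresponds to the "initial user namespace is passed, nothing changes" case from the commit message.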
     

14 Dec, 2020

1 commit


14 Oct, 2020

1 commit

  • Fix kernel-doc warnings in fs/xattr.c:

    ../fs/xattr.c:251: warning: Function parameter or member 'dentry' not described in '__vfs_setxattr_locked'
    ../fs/xattr.c:251: warning: Function parameter or member 'name' not described in '__vfs_setxattr_locked'
    ../fs/xattr.c:251: warning: Function parameter or member 'value' not described in '__vfs_setxattr_locked'
    ../fs/xattr.c:251: warning: Function parameter or member 'size' not described in '__vfs_setxattr_locked'
    ../fs/xattr.c:251: warning: Function parameter or member 'flags' not described in '__vfs_setxattr_locked'
    ../fs/xattr.c:251: warning: Function parameter or member 'delegated_inode' not described in '__vfs_setxattr_locked'
    ../fs/xattr.c:458: warning: Function parameter or member 'dentry' not described in '__vfs_removexattr_locked'
    ../fs/xattr.c:458: warning: Function parameter or member 'name' not described in '__vfs_removexattr_locked'
    ../fs/xattr.c:458: warning: Function parameter or member 'delegated_inode' not described in '__vfs_removexattr_locked'

    Fixes: 08b5d5014a27 ("xattr: break delegations in {set,remove}xattr")
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Cc: Al Viro
    Cc: Frank van der Linden
    Cc: Chuck Lever
    Link: http://lkml.kernel.org/r/7a3dd5a2-5787-adf3-d525-c203f9910ec4@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

14 Jul, 2020

2 commits

  • Add a function that checks if an extended attribute namespace is
    supported for an inode, meaning that a handler must be present
    for either the whole namespace, or at least one synthetic
    xattr in the namespace.

    To be used by the nfs server code when being queried for extended
    attributes support.

    Cc: linux-fsdevel@vger.kernel.org
    Cc: Al Viro
    Signed-off-by: Frank van der Linden
    Signed-off-by: Chuck Lever

    Frank van der Linden
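
A minimal sketch of such a namespace check, assuming a handler is described either by a namespace prefix or by one full attribute name. The `toy_xattr_handler` struct, the example table and `toy_namespace_supported()` are hypothetical and only model the matching rule stated above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A handler covers either a whole namespace (prefix, e.g. "user.") or a
 * single synthetic attribute (name, e.g. "system.posix_acl_access"). */
struct toy_xattr_handler {
	const char *prefix;
	const char *name;
};

/* Example handler table for a hypothetical filesystem. */
static const struct toy_xattr_handler toy_handlers[] = {
	{ "user.", NULL },
	{ NULL, "system.posix_acl_access" },
};

/* A namespace counts as supported if any handler's prefix or name starts
 * with the namespace prefix being queried. */
static bool toy_namespace_supported(const struct toy_xattr_handler *h,
				    size_t n, const char *ns_prefix)
{
	size_t len = strlen(ns_prefix);

	for (size_t i = 0; i < n; i++) {
		const char *s = h[i].prefix ? h[i].prefix : h[i].name;

		if (s && strncmp(s, ns_prefix, len) == 0)
			return true;
	}
	return false;
}
```

With the example table, "user." is supported through a whole-namespace handler and "system." through a single synthetic attribute, while "trusted." is not supported.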
     
  • set/removexattr on an exported filesystem should break NFS delegations.
    This is true in general, but also for the upcoming support for
    RFC 8276 (NFSv4 extended attribute support). Make sure that they do.

    Additionally, they need to grow a _locked variant, since callers might
    call this with i_rwsem held (like the NFS server code).

    Cc: stable@vger.kernel.org # v4.9+
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Al Viro
    Signed-off-by: Frank van der Linden
    Signed-off-by: Chuck Lever

    Frank van der Linden
     

10 Apr, 2020

1 commit

  • `removed_size` isn't correctly initialized (as the doc comment
    suggests) on memory allocation failures. Fix by moving initialization up
    a bit.

    Fixes: 0c47383ba3bd ("kernfs: Add option to enable user xattrs")
    Reported-by: Dan Carpenter
    Signed-off-by: Daniel Xu
    Signed-off-by: Tejun Heo

    Daniel Xu
     

17 Mar, 2020

2 commits

  • This helps set up size accounting in the next commit. Without this out
    param, it's difficult to find out the removed xattr size without taking
    a lock for longer and walking the xattr linked list twice.

    Signed-off-by: Daniel Xu
    Acked-by: Chris Down
    Reviewed-by: Greg Kroah-Hartman
    Signed-off-by: Tejun Heo

    Daniel Xu
     
  • xattr values have a 64k maximum size. This can result in an order 4
    kmalloc request which can be difficult to fulfill. Since xattrs do not
    need physically contiguous memory, we can switch to kvmalloc and not
    have to worry about higher order allocations failing.

    Signed-off-by: Daniel Xu
    Acked-by: Chris Down
    Reviewed-by: Andreas Dilger
    Reviewed-by: Greg Kroah-Hartman
    Signed-off-by: Tejun Heo

    Daniel Xu
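
The "order 4" figure follows from simple arithmetic: with 4 KiB pages (an assumption; page size depends on the architecture and configuration), a 64 KiB value spans 16 pages, and 16 = 2^4. A small helper makes this explicit; `toy_alloc_order()` is a hypothetical name, not a kernel function.

```c
#include <stddef.h>

/* Smallest order such that (1 << order) pages of page_size bytes can hold
 * size bytes in one physically contiguous block. */
static unsigned int toy_alloc_order(size_t size, size_t page_size)
{
	size_t pages = (size + page_size - 1) / page_size;
	unsigned int order = 0;

	while (((size_t)1 << order) < pages)
		order++;
	return order;
}
```

For example, `toy_alloc_order(65536, 4096)` evaluates to 4, while a request of one page or less evaluates to 0.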
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

18 Sep, 2018

1 commit

  • Commit 786534b92f3c introduced a regression that caused listxattr to
    return the POSIX ACL attribute names even though sysfs doesn't support
    POSIX ACLs. This happens because simple_xattr_list checks for NULL
    i_acl / i_default_acl, but inode_init_always initializes those fields
    to ACL_NOT_CACHED ((void *)-1). For example:
    $ getfattr -m- -d /sys
    /sys: system.posix_acl_access: Operation not supported
    /sys: system.posix_acl_default: Operation not supported
    Fix this in simple_xattr_list by checking if the filesystem supports POSIX ACLs.

    Fixes: 786534b92f3c ("tmpfs: listxattr should include POSIX ACL xattrs")
    Reported-by: Marc Aurèle La France
    Tested-by: Marc Aurèle La France
    Signed-off-by: Andreas Gruenbacher
    Cc: stable@vger.kernel.org # v4.5+
    Signed-off-by: Al Viro

    Andreas Gruenbacher
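
The pitfall here is a sentinel pointer that is neither NULL nor a valid object. The sketch below contrasts a NULL-only check with one that first asks whether the filesystem supports POSIX ACLs at all. `toy_inode`, its `fs_supports_acl` field and both functions are hypothetical simplifications of the situation described above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sentinel meaning "ACL not looked up yet", as quoted in the commit. */
#define TOY_ACL_NOT_CACHED ((void *)-1)

struct toy_inode {
	void *i_acl;
	bool fs_supports_acl;
};

/* Buggy variant: a freshly initialized inode holds the sentinel, which is
 * not NULL, so the ACL attribute name would be listed. */
static bool toy_lists_acl_buggy(const struct toy_inode *inode)
{
	return inode->i_acl != NULL;
}

/* Fixed variant: only consider i_acl when the filesystem supports ACLs. */
static bool toy_lists_acl_fixed(const struct toy_inode *inode)
{
	return inode->fs_supports_acl && inode->i_acl != NULL;
}
```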
     

25 Aug, 2018

1 commit

  • Pull namespace fixes from Eric Biederman:
    "This is a set of four fairly obvious bug fixes:

    - a switch from d_find_alias to d_find_any_alias because the xattr
    code perversely takes a dentry

    - two mutex vs copy_to_user fixes from Jann Horn

    - a fix to use a sanitized size not the size userspace passed in from
    Christian Brauner"

    * 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    getxattr: use correct xattr length
    sys: don't hold uts_sem while accessing userspace memory
    userns: move user access out of the mutex
    cap_inode_getsecurity: use d_find_any_alias() instead of d_find_alias()

    Linus Torvalds
     

24 Aug, 2018

1 commit

  • When running in a container with a user namespace, if you call getxattr
    with name = "system.posix_acl_access" and size % 8 != 4, then getxattr
    silently skips the user namespace fixup that it normally does resulting in
    un-fixed-up data being returned.
    This is caused by posix_acl_fix_xattr_to_user() being passed the total
    buffer size and not the actual size of the xattr as returned by
    vfs_getxattr().
    This commit passes the actual length of the xattr as returned by
    vfs_getxattr() down.

    A reproducer for the issue is:

    touch acl_posix

    setfacl -m user:0:rwx acl_posix

    and then compile:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/xattr.h>
    #include <unistd.h>

    /* Run in user namespace with nsuid 0 mapped to uid != 0 on the host. */
    int main(int argc, char **argv)
    {
        ssize_t ret1, ret2;
        /* Two buffer sizes: 128 % 8 == 0 triggers the bug, 132 % 8 == 4 does not. */
        char buf1[128], buf2[132];
        int fret = EXIT_SUCCESS;
        char *file;

        if (argc < 2) {
            fprintf(stderr,
                    "Please specify a file with "
                    "\"system.posix_acl_access\" permissions set\n");
            _exit(EXIT_FAILURE);
        }
        file = argv[1];

        ret1 = getxattr(file, "system.posix_acl_access",
                        buf1, sizeof(buf1));
        if (ret1 < 0) {
            fprintf(stderr, "%s - Failed to retrieve "
                    "\"system.posix_acl_access\" "
                    "from \"%s\"\n", strerror(errno), file);
            _exit(EXIT_FAILURE);
        }

        ret2 = getxattr(file, "system.posix_acl_access",
                        buf2, sizeof(buf2));
        if (ret2 < 0) {
            fprintf(stderr, "%s - Failed to retrieve "
                    "\"system.posix_acl_access\" "
                    "from \"%s\"\n", strerror(errno), file);
            _exit(EXIT_FAILURE);
        }

        if (ret1 != ret2) {
            fprintf(stderr, "The value of \"system.posix_acl_"
                    "access\" for file \"%s\" changed "
                    "between two successive calls\n", file);
            _exit(EXIT_FAILURE);
        }

        /* Compare both results byte by byte and report every mismatch. */
        for (ssize_t i = 0; i < ret2; i++) {
            if (buf1[i] == buf2[i])
                continue;

            fprintf(stderr,
                    "Unexpected different in byte %zd: "
                    "%02x != %02x\n", i, buf1[i], buf2[i]);
            fret = EXIT_FAILURE;
        }

        if (fret == EXIT_SUCCESS)
            fprintf(stderr, "Test passed\n");
        else
            fprintf(stderr, "Test failed\n");

        _exit(fret);
    }
    and run:

    ./tester acl_posix

    On a non-fixed up kernel this should return something like:

    root@c1:/# ./t
    Unexpected different in byte 16: ffffffa0 != 00
    Unexpected different in byte 17: ffffff86 != 00
    Unexpected different in byte 18: 01 != 00

    and on a fixed kernel:

    root@c1:~# ./t
    Test passed

    Cc: stable@vger.kernel.org
    Fixes: 2f6f0654ab61 ("userns: Convert vfs posix_acl support to use kuids and kgids")
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=199945
    Reported-by: Colin Watson
    Signed-off-by: Christian Brauner
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Christian Brauner
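
The "size % 8 != 4" condition comes from the layout of the POSIX ACL xattr: a 4-byte header followed by 8-byte entries. A size check along the following lines rejects any length that is not 4 + 8 * n, which is why passing the buffer size instead of the actual xattr length could cause the fixup to be skipped. The struct and function names are hypothetical mirrors of that layout, offered as a sketch rather than the kernel's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the POSIX ACL xattr layout: a version header, then entries. */
struct toy_acl_xattr_header {
	uint32_t a_version;
};

struct toy_acl_xattr_entry {
	uint16_t e_tag;
	uint16_t e_perm;
	uint32_t e_id;
};

/* Returns the number of entries encoded in an xattr of the given size, or
 * -1 if the size cannot correspond to a well-formed ACL xattr. */
static int toy_acl_xattr_count(size_t size)
{
	if (size < sizeof(struct toy_acl_xattr_header))
		return -1;
	size -= sizeof(struct toy_acl_xattr_header);
	if (size % sizeof(struct toy_acl_xattr_entry))
		return -1;
	return (int)(size / sizeof(struct toy_acl_xattr_entry));
}
```

In the reproducer, a buffer size of 128 is rejected by this kind of check while 132 is accepted. The five-entry ACL created by the setfacl call occupies 44 bytes, and byte 16 is where the id of its second entry starts, which matches the mismatch reported in the output.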
     

18 Jul, 2018

1 commit


30 May, 2018

1 commit


14 May, 2018

1 commit


04 Oct, 2017

1 commit

  • security_inode_getsecurity() provides the text string value
    of a security attribute. It does not provide a "secctx".
    The code in xattr_getsecurity() that calls security_inode_getsecurity()
    and then calls security_release_secctx() happened to work because
    SELinux and Smack treat the attribute and the secctx the same way.
    It fails for cap_inode_getsecurity(), because that module has no
    secctx that ever needs releasing. It turns out that Smack is the
    one that's doing things wrong by not allocating memory when instructed
    to do so by the "alloc" parameter.

    The fix is simple enough. Change the security_release_secctx() to
    kfree() because it isn't a secctx being returned by
    security_inode_getsecurity(). Change Smack to allocate the string when
    told to do so.

    Note: this also fixes memory leaks for LSMs which implement
    inode_getsecurity but not release_secctx, such as capabilities.

    Signed-off-by: Casey Schaufler
    Reported-by: Konstantin Khlebnikov
    Cc: stable@vger.kernel.org
    Signed-off-by: James Morris

    Casey Schaufler
     

14 Sep, 2017

1 commit

  • Pull overlayfs updates from Miklos Szeredi:
    "This fixes d_ino correctness in readdir, which brings overlayfs on par
    with normal filesystems regarding inode number semantics, as long as
    all layers are on the same filesystem.

    There are also some bug fixes, one in particular (random ioctl's
    shouldn't be able to modify lower layers) that touches some vfs code,
    but of course no-op for non-overlay fs"

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ovl: fix false positive ESTALE on lookup
    ovl: don't allow writing ioctl on lower layer
    ovl: fix relatime for directories
    vfs: add flags to d_real()
    ovl: cleanup d_real for negative
    ovl: constant d_ino for non-merge dirs
    ovl: constant d_ino across copy up
    ovl: fix readdir error value
    ovl: check snprintf return

    Linus Torvalds
     

05 Sep, 2017

1 commit

  • Problem with ioctl() is that it's a file operation, yet often used as an
    inode operation (i.e. modify the inode despite the file being opened for
    read-only).

    mnt_want_write_file() is used by filesystems in such cases to get write
    access on an arbitrary open file.

    Since overlayfs lets filesystems do all file operations, including ioctl,
    this can lead to mnt_want_write_file() returning OK for a lower file and
    modification of that lower file.

    This patch prevents modification by checking if the file is from an
    overlayfs lower layer and returning EPERM in that case.

    Need to introduce a mnt_want_write_file_path() variant that still does the
    old thing for inode operations that can do the copy up + modification
    correctly in such cases (fchown, fsetxattr, fremovexattr).

    This does not address the correctness of such ioctls on overlayfs (the
    correct way would be to copy up and attempt to perform ioctl on upper
    file).

    In theory this could be a regression. We very much hope that nobody is
    relying on such a hack in any sane setup.

    While this patch meddles in VFS code, it has no effect on non-overlayfs
    filesystems.

    Reported-by: "zhangyi (F)"
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

02 Sep, 2017

1 commit

  • Root in a non-initial user ns cannot be trusted to write a traditional
    security.capability xattr. If it were allowed to do so, then any
    unprivileged user on the host could map his own uid to root in a private
    namespace, write the xattr, and execute the file with privilege on the
    host.

    However supporting file capabilities in a user namespace is very
    desirable. Not doing so means that any programs designed to run with
    limited privilege must continue to support other methods of gaining and
    dropping privilege. For instance a program installer must detect
    whether file capabilities can be assigned, and assign them if so but set
    setuid-root otherwise. The program in turn must know how to drop
    partial capabilities, and do so only if setuid-root.

    This patch introduces v3 of the security.capability xattr. It builds a
    vfs_ns_cap_data struct by appending a uid_t rootid to struct
    vfs_cap_data. This is the absolute uid_t (that is, the uid_t in user
    namespace which mounted the filesystem, usually init_user_ns) of the
    root id in whose namespaces the file capabilities may take effect.

    When a task asks to write a v2 security.capability xattr, if it is
    privileged with respect to the userns which mounted the filesystem, then
    nothing should change. Otherwise, the kernel will transparently rewrite
    the xattr as a v3 with the appropriate rootid. This is done during the
    execution of setxattr() to catch user-space-initiated capability writes.
    Subsequently, any task executing the file which has the noted kuid as
    its root uid, or which is in a descendent user_ns of such a user_ns,
    will run the file with capabilities.

    Similarly when asking to read file capabilities, a v3 capability will
    be presented as v2 if it applies to the caller's namespace.

    If a task writes a v3 security.capability, then it can provide a uid for
    the xattr so long as the uid is valid in its own user namespace, and it
    is privileged with CAP_SETFCAP over its namespace. The kernel will
    translate that rootid to an absolute uid, and write that to disk. After
    this, a task in the writer's namespace will not be able to use those
    capabilities (unless rootid was 0), but a task in a namespace where the
    given uid is root will.

    Only a single security.capability xattr may exist at a time for a given
    file. A task may overwrite an existing xattr so long as it is
    privileged over the inode. Note this is a departure from previous
    semantics, which required privilege to remove a security.capability
    xattr. This check can be re-added if deemed useful.

    This allows a simple setxattr to work, allows tar/untar to work, and
    allows us to tar in one namespace and untar in another while preserving
    the capability, without risking leaking privilege into a parent
    namespace.

    Example using tar:

    $ cp /bin/sleep sleepx
    $ mkdir b1 b2
    $ lxc-usernsexec -m b:0:100000:1 -m b:1:$(id -u):1 -- chown 0:0 b1
    $ lxc-usernsexec -m b:0:100001:1 -m b:1:$(id -u):1 -- chown 0:0 b2
    $ lxc-usernsexec -m b:0:100000:1000 -- tar --xattrs-include=security.capability --xattrs -cf b1/sleepx.tar sleepx
    $ lxc-usernsexec -m b:0:100001:1000 -- tar --xattrs-include=security.capability --xattrs -C b2 -xf b1/sleepx.tar
    $ lxc-usernsexec -m b:0:100001:1000 -- getcap b2/sleepx
    b2/sleepx = cap_sys_admin+ep
    # /opt/ltp/testcases/bin/getv3xattr b2/sleepx
    v3 xattr, rootid is 100001

    A patch to linux-test-project adding a new set of tests for this
    functionality is in the nsfscaps branch at github.com/hallyn/ltp

    Changelog:
    Nov 02 2016: fix invalid check at refuse_fcap_overwrite()
    Nov 07 2016: convert rootid from and to fs user_ns
    (From ebiederm: mar 28 2017)
    commoncap.c: fix typos - s/v4/v3
    get_vfs_caps_from_disk: clarify the fs_ns root access check
    nsfscaps: change the code split for cap_inode_setxattr()
    Apr 09 2017:
    don't return v3 cap for caps owned by current root.
    return a v2 cap for a true v2 cap in non-init ns
    Apr 18 2017:
    . Change the flow of fscap writing to support s_user_ns writing.
    . Remove refuse_fcap_overwrite(). The value of the previous
    xattr doesn't matter.
    Apr 24 2017:
    . incorporate Eric's incremental diff
    . move cap_convert_nscap to setxattr and simplify its usage
    May 8, 2017:
    . fix leaking dentry refcount in cap_inode_getsecurity

    Signed-off-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Serge E. Hallyn
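
The v2 to v3 rewrite can be pictured with plain structs. The sketch below mirrors the layout described above (a v2 capability blob with a root id appended) using hypothetical `toy_` names; it ignores the on-disk byte order and the kernel's privilege checks, and the revision constants are the values used in the security.capability magic word.

```c
#include <stdint.h>

#define TOY_VFS_CAP_REVISION_MASK 0xFF000000u
#define TOY_VFS_CAP_REVISION_2    0x02000000u
#define TOY_VFS_CAP_REVISION_3    0x03000000u

struct toy_cap_words {
	uint32_t permitted;
	uint32_t inheritable;
};

/* v2 layout: magic word plus two 32-bit capability word pairs. */
struct toy_vfs_cap_data {
	uint32_t magic_etc;
	struct toy_cap_words data[2];
};

/* v3 layout: the same fields with the root id appended. */
struct toy_vfs_ns_cap_data {
	uint32_t magic_etc;
	struct toy_cap_words data[2];
	uint32_t rootid;
};

static int toy_cap_revision(uint32_t magic_etc)
{
	return (int)((magic_etc & TOY_VFS_CAP_REVISION_MASK) >> 24);
}

/* Rewrite a v2 blob as v3: keep the flag bits and capability words, swap
 * the revision, and record the root id the capabilities apply to. */
static struct toy_vfs_ns_cap_data
toy_cap_v2_to_v3(const struct toy_vfs_cap_data *v2, uint32_t rootid)
{
	struct toy_vfs_ns_cap_data v3;

	v3.magic_etc = (v2->magic_etc & ~TOY_VFS_CAP_REVISION_MASK) |
		       TOY_VFS_CAP_REVISION_3;
	v3.data[0] = v2->data[0];
	v3.data[1] = v2->data[1];
	v3.rootid = rootid;
	return v3;
}
```

Applied to the tar example, a v2 blob granting cap_sys_admin written by root of the second namespace would be stored as v3 with rootid 100001.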
     

09 May, 2017

2 commits

  • There are many code paths opencoding kvmalloc. Let's use the helper
    instead. The main difference to kvmalloc is that those users are
    usually not considering all the aspects of the memory allocator. E.g.
    allocation requests
    Reviewed-by: Boris Ostrovsky # Xen bits
    Acked-by: Kees Cook
    Acked-by: Vlastimil Babka
    Acked-by: Andreas Dilger # Lustre
    Acked-by: Christian Borntraeger # KVM/s390
    Acked-by: Dan Williams # nvdim
    Acked-by: David Sterba # btrfs
    Acked-by: Ilya Dryomov # Ceph
    Acked-by: Tariq Toukan # mlx4
    Acked-by: Leon Romanovsky # mlx5
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Herbert Xu
    Cc: Anton Vorontsov
    Cc: Colin Cross
    Cc: Tony Luck
    Cc: "Rafael J. Wysocki"
    Cc: Ben Skeggs
    Cc: Kent Overstreet
    Cc: Santosh Raspatur
    Cc: Hariprasad S
    Cc: Yishai Hadas
    Cc: Oleg Drokin
    Cc: "Yan, Zheng"
    Cc: Alexander Viro
    Cc: Alexei Starovoitov
    Cc: Eric Dumazet
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • getxattr uses vmalloc to allocate memory if kzalloc fails. This is
    filled by vfs_getxattr and then copied to userspace. vmalloc,
    however, doesn't zero out the memory, so if the specific implementation
    of the xattr handler is sloppy we can theoretically expose kernel
    memory. There is no real sign this is really the case but let's make
    sure this will not happen and use vzalloc instead.

    Fixes: 779302e67835 ("fs/xattr.c:getxattr(): improve handling of allocation failures")
    Link: http://lkml.kernel.org/r/20170306103327.2766-1-mhocko@kernel.org
    Acked-by: Kees Cook
    Reported-by: Vlastimil Babka
    Signed-off-by: Michal Hocko
    Cc: [3.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 Dec, 2016

1 commit


17 Nov, 2016

1 commit

  • The IOP_XATTR flag is set on sockfs because sockfs supports getting the
    "system.sockprotoname" xattr. Since commit 6c6ef9f2, this flag is checked for
    setxattr support as well. This is wrong on sockfs because security xattr
    support there is supposed to be provided by security_inode_setsecurity. The
    smack security module relies on socket labels (xattrs).

    Fix this by adding a security xattr handler on sockfs that returns
    -EAGAIN, and by checking for -EAGAIN in setxattr.

    We cannot simply check for -EOPNOTSUPP in setxattr because there are
    filesystems that neither have direct security xattr support nor support
    via security_inode_setsecurity. A more proper fix might be to move the
    call to security_inode_setsecurity into sockfs, but it's not clear to me
    if that is safe: we would end up calling security_inode_post_setxattr after
    that as well.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Al Viro

    Andreas Gruenbacher
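
The control flow of this fix can be sketched as a handler that signals "not handled here, try the security module" with -EAGAIN, and a caller that falls back only on that specific code. All names below are hypothetical stand-ins; the real code paths take more arguments and perform additional checks.

```c
#include <errno.h>

typedef int (*toy_set_fn)(const char *name);

/* Stand-in for the sockfs security xattr handler: always defers. */
static int toy_sockfs_security_set(const char *name)
{
	(void)name;
	return -EAGAIN;
}

/* Stand-in for a filesystem handler without security xattr support. */
static int toy_unsupported_set(const char *name)
{
	(void)name;
	return -EOPNOTSUPP;
}

/* Stand-in for the security module accepting the label. */
static int toy_lsm_setsecurity(const char *name)
{
	(void)name;
	return 0;
}

/* Fall back to the security module only when the handler asked for it
 * with -EAGAIN; any other error, including -EOPNOTSUPP, is returned. */
static int toy_setxattr(toy_set_fn handler, toy_set_fn lsm, const char *name)
{
	int err = handler(name);

	if (err == -EAGAIN)
		err = lsm(name);
	return err;
}
```

Using a dedicated code keeps the fallback from triggering on filesystems that return -EOPNOTSUPP and have no security module support either.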
     

08 Oct, 2016

7 commits

  • These inode operations are no longer used; remove them.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     
  • All filesystems that support xattrs by now do so via xattr handlers.
    They all define sb->s_xattr, and their getxattr, setxattr, and
    removexattr inode operations use the generic inode operations. On
    filesystems that don't support xattrs, the xattr inode operations are
    all NULL, and sb->s_xattr is also NULL.

    This means that we can remove the getxattr, setxattr, and removexattr
    inode operations and directly call the generic handlers, or better,
    inline expand those handlers into fs/xattr.c.

    Filesystems that do not support xattrs on some inodes should clear the
    IOP_XATTR i_opflags flag in those inodes. (Right now, some filesystems
    have checks to disable xattrs on some inodes in the ->list, ->get, and
    ->set xattr handler operations instead.) The IOP_XATTR flag is
    automatically cleared in inodes of filesystems that don't have xattr
    support.

    In orangefs, symlinks do have a setxattr iop but no getxattr iop. Add a
    check for symlinks to orangefs_inode_getxattr to preserve the current,
    weird behavior; that check may not be necessary though.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     
  • When an inode doesn't support xattrs, turn listxattr off as well.

    (When xattrs are "turned off", the VFS still passes security xattr
    operations through to security modules, which can still expose inode
    security labels that way.)

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     
  • Right now, various places in the kernel check for the existence of
    getxattr, setxattr, and removexattr inode operations and directly call
    those operations. Switch to helper functions and test for the IOP_XATTR
    flag instead.

    Signed-off-by: Andreas Gruenbacher
    Acked-by: James Morris
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     
  • With this change, all the xattr handler based operations will produce an
    -EIO result for bad inodes, and we no longer depend solely on inode->i_op
    being set to bad_inode_ops.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     
  • The IOP_XATTR inode operations flag in inode->i_opflags indicates that
    the inode has xattr support. The flag is automatically set by
    new_inode() on filesystems with xattr support (where sb->s_xattr is
    defined), and cleared otherwise. Filesystems can explicitly clear it
    for inodes that should not have xattr support.
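
    A minimal userspace sketch of the flag's life cycle, assuming
    simplified stand-in types. The FLAG_MODEL_* value, the struct
    layouts, and the function names are hypothetical; only the rule
    itself (set when sb->s_xattr is defined, cleared otherwise,
    clearable per inode) comes from the description above.

```c
#include <errno.h>
#include <stddef.h>

#define FLAG_MODEL_IOP_XATTR 0x0008u       /* hypothetical bit value */

/* Hypothetical stand-ins for struct super_block and struct inode. */
struct flag_model_sb {
    const void *s_xattr;                   /* non-NULL when xattr handlers exist */
};

struct flag_model_inode {
    unsigned int i_opflags;
};

/* Models inode creation: the flag mirrors whether sb->s_xattr is defined. */
void flag_model_new_inode(struct flag_model_inode *inode,
                          const struct flag_model_sb *sb)
{
    inode->i_opflags = 0;
    if (sb->s_xattr != NULL)
        inode->i_opflags |= FLAG_MODEL_IOP_XATTR;
}

/* A filesystem opting a single inode out of xattr support. */
void flag_model_disable_xattr(struct flag_model_inode *inode)
{
    inode->i_opflags &= ~FLAG_MODEL_IOP_XATTR;
}

/* Callers test the flag instead of probing for inode operations. */
int flag_model_check(const struct flag_model_inode *inode)
{
    return (inode->i_opflags & FLAG_MODEL_IOP_XATTR) ? 0 : -EOPNOTSUPP;
}
```

    The -EOPNOTSUPP return in flag_model_check is a choice made for this
    sketch; it mirrors the usual "operation not supported" result, but
    the commit message itself does not name an error code.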

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     
  • Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     

07 Oct, 2016

1 commit


06 Jul, 2016

1 commit

  • When a filesystem outside of init_user_ns is mounted it could have
    uids and gids stored in it that do not map to init_user_ns.

    The plan is to allow those filesystems to set i_uid to INVALID_UID and
    i_gid to INVALID_GID for unmapped uids and gids and then to handle
    that strange case in the vfs to ensure there is consistent robust
    handling of the weirdness.

    Upon a careful review of the vfs and filesystems, about the only case
    where there is any possibility of confusion or trouble is when the
    inode is written back to disk. In that case filesystems typically
    read the inode->i_uid and inode->i_gid and write them to disk even
    when just an inode timestamp is being updated.

    This leads to a rule that is very simple to implement and understand:
    inodes whose i_uid or i_gid is not valid may not be written.

    In dealing with access times this means treat those inodes as if the
    inode flag S_NOATIME was set. Reads of the inodes appear safe and
    useful, but any write or modification is disallowed. The only inode
    write that is allowed is a chown that sets the uid and gid on the
    inode to valid values. After such a chown the inode is normal and may
    be treated as such.

    Denying all writes to inodes with uids or gids unknown to the vfs also
    prevents several oddball cases where corruption would have occurred
    because the vfs does not have complete information.

    One problem case that is prevented is attempting to use the gid of a
    directory for new inodes where the directory's sgid bit is set but the
    directory's gid is not mapped.

    Another problem case avoided is attempting to update the evm hash
    after setxattr, removexattr, and setattr. As the evm hash includes
    the inode->i_uid or inode->i_gid, not knowing the uid or gid prevents
    a correct evm hash from being computed. evm hash verification also
    fails when i_uid or i_gid is unknown, but that is essentially harmless
    as it does not cause filesystem corruption.
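
    The rule can be illustrated with a small userspace model. Everything
    here is an assumption made for the sketch: the IDMAP_MODEL_INVALID_ID
    sentinel, the struct, the function names, and the specific error
    codes are hypothetical and are not taken from the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define IDMAP_MODEL_INVALID_ID ((uint32_t)-1)   /* models INVALID_UID / INVALID_GID */

/* Hypothetical inode carrying only the ownership fields. */
struct idmap_model_inode {
    uint32_t i_uid;
    uint32_t i_gid;
};

/* True when either id could not be mapped. */
bool idmap_model_has_unmapped_id(const struct idmap_model_inode *inode)
{
    return inode->i_uid == IDMAP_MODEL_INVALID_ID ||
           inode->i_gid == IDMAP_MODEL_INVALID_ID;
}

/* Any ordinary modification (setxattr, removexattr, atime update, ...). */
int idmap_model_may_write(const struct idmap_model_inode *inode)
{
    /* -EPERM is an assumed error code for this sketch. */
    return idmap_model_has_unmapped_id(inode) ? -EPERM : 0;
}

/* The one permitted write: a chown that leaves both ids valid. */
int idmap_model_chown(struct idmap_model_inode *inode, uint32_t uid, uint32_t gid)
{
    if (uid == IDMAP_MODEL_INVALID_ID || gid == IDMAP_MODEL_INVALID_ID)
        return -EINVAL;                         /* assumed error code */
    inode->i_uid = uid;
    inode->i_gid = gid;
    return 0;
}
```

    In this model an inode with an unmapped gid rejects every write
    until idmap_model_chown() gives it valid ids, after which
    idmap_model_may_write() succeeds and the inode behaves normally.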

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

28 May, 2016

1 commit

  • smack ->d_instantiate() uses ->setxattr(), so to be able to call it before
    we've hashed the new dentry and attached it to the inode, we need
    ->setxattr() instances to take the inode as an explicit argument rather
    than obtaining it from the dentry.

    A similar change for ->getxattr() was made in commit ce23e64. Unlike
    ->getxattr() (which is used by both the selinux and smack instances of
    ->d_instantiate()), ->setxattr() is used only by the smack one, and
    unfortunately it was missed back then.
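
    A rough userspace illustration of why the explicit argument matters.
    The types and both function signatures below are hypothetical
    simplifications, not the kernel prototypes; the point is only that a
    dentry which has not yet been attached has no inode to look up.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified inode and dentry. */
struct inst_model_inode {
    char label[32];
};

struct inst_model_dentry {
    struct inst_model_inode *d_inode;      /* NULL until the dentry is attached */
};

/* Old shape: the inode is derived from the dentry. */
int inst_model_setxattr_old(struct inst_model_dentry *dentry, const char *value)
{
    struct inst_model_inode *inode = dentry->d_inode;

    if (inode == NULL)
        return -EINVAL;                    /* assumed error; nothing to write to yet */
    if (strlen(value) >= sizeof(inode->label))
        return -ERANGE;
    strcpy(inode->label, value);
    return 0;
}

/* New shape: the caller passes the inode, so attachment order is irrelevant. */
int inst_model_setxattr_new(struct inst_model_dentry *dentry,
                            struct inst_model_inode *inode, const char *value)
{
    (void)dentry;                          /* still available for handlers that need it */
    if (strlen(value) >= sizeof(inode->label))
        return -ERANGE;
    strcpy(inode->label, value);
    return 0;
}
```

    With an unattached dentry, the old shape has no inode to operate on,
    while the new shape can label the inode before the two are linked.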

    Reported-by: Seung-Woo Kim
    Tested-by: Casey Schaufler
    Signed-off-by: Al Viro

    Al Viro