11 Aug, 2018

1 commit

  • The code in cap_inode_getsecurity(), introduced by commit 8db6c34f1dbc
    ("Introduce v3 namespaced file capabilities"), should use
    d_find_any_alias() instead of d_find_alias() do handle unhashed dentry
    correctly. This is needed, for example, if execveat() is called with an
    open but unlinked overlayfs file, because overlayfs unhashes dentry on
    unlink.
    This is a regression of real life application, first reported at
    https://www.spinics.net/lists/linux-unionfs/msg05363.html

    Below reproducer and setup can reproduce the case.
    const char* exec="echo";
    const char *newargv[] = { "echo", "hello", NULL};
    const char *newenviron[] = { NULL };
    int fd, err;

    fd = open(exec, O_PATH);
    unlink(exec);
    err = syscall(322/*SYS_execveat*/, fd, "", newargv, newenviron,
    AT_EMPTY_PATH);
    if(err
    Acked-by: Amir Goldstein
    Acked-by: Serge E. Hallyn
    Fixes: 8db6c34f1dbc ("Introduce v3 namespaced file capabilities")
    Cc: # v4.14
    Signed-off-by: Eddie Horng
    Signed-off-by: Eric W. Biederman

    Eddie.Horng
     

25 May, 2018

1 commit

  • A privileged user in s_user_ns will generally have the ability to
    manipulate the backing store and insert security.* xattrs into
    the filesystem directly. Therefore the kernel must be prepared to
    handle these xattrs from unprivileged mounts, and it makes little
    sense for commoncap to prevent writing these xattrs to the
    filesystem. The capability and LSM code have already been updated
    to appropriately handle xattrs from unprivileged mounts, so it
    is safe to loosen this restriction on setting xattrs.

    The exception to this logic is that writing xattrs to a mounted
    filesystem may also cause the LSM inode_post_setxattr or
    inode_setsecurity callbacks to be invoked. SELinux will deny the
    xattr update by virtue of applying mountpoint labeling to
    unprivileged userns mounts, and Smack will deny the writes for
    any user without global CAP_MAC_ADMIN, so loosening the
    capability check in commoncap is safe in this respect as well.

    Signed-off-by: Seth Forshee
    Acked-by: Serge Hallyn
    Acked-by: Christian Brauner
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

11 Apr, 2018

1 commit

  • syzbot is reporting NULL pointer dereference at xattr_getsecurity() [1],
    for cap_inode_getsecurity() is returning sizeof(struct vfs_cap_data) when
    memory allocation failed. Return -ENOMEM if memory allocation failed.

    [1] https://syzkaller.appspot.com/bug?id=a55ba438506fe68649a5f50d2d82d56b365e0107

    Signed-off-by: Tetsuo Handa
    Fixes: 8db6c34f1dbc8e06 ("Introduce v3 namespaced file capabilities")
    Reported-by: syzbot
    Cc: stable # 4.14+
    Acked-by: Serge E. Hallyn
    Acked-by: James Morris
    Signed-off-by: Eric W. Biederman

    Tetsuo Handa
     

02 Jan, 2018

1 commit

  • If userspace attempted to set a "security.capability" xattr shorter than
    4 bytes (e.g. 'setfattr -n security.capability -v x file'), then
    cap_convert_nscap() read past the end of the buffer containing the xattr
    value because it accessed the ->magic_etc field without verifying that
    the xattr value is long enough to contain that field.

    Fix it by validating the xattr value size first.

    This bug was found using syzkaller with KASAN. The KASAN report was as
    follows (cleaned up slightly):

    BUG: KASAN: slab-out-of-bounds in cap_convert_nscap+0x514/0x630 security/commoncap.c:498
    Read of size 4 at addr ffff88002d8741c0 by task syz-executor1/2852

    CPU: 0 PID: 2852 Comm: syz-executor1 Not tainted 4.15.0-rc6-00200-gcc0aac99d977 #253
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
    Call Trace:
    __dump_stack lib/dump_stack.c:17 [inline]
    dump_stack+0xe3/0x195 lib/dump_stack.c:53
    print_address_description+0x73/0x260 mm/kasan/report.c:252
    kasan_report_error mm/kasan/report.c:351 [inline]
    kasan_report+0x235/0x350 mm/kasan/report.c:409
    cap_convert_nscap+0x514/0x630 security/commoncap.c:498
    setxattr+0x2bd/0x350 fs/xattr.c:446
    path_setxattr+0x168/0x1b0 fs/xattr.c:472
    SYSC_setxattr fs/xattr.c:487 [inline]
    SyS_setxattr+0x36/0x50 fs/xattr.c:483
    entry_SYSCALL_64_fastpath+0x18/0x85

    Fixes: 8db6c34f1dbc ("Introduce v3 namespaced file capabilities")
    Cc: # v4.14+
    Signed-off-by: Eric Biggers
    Reviewed-by: Serge Hallyn
    Signed-off-by: James Morris

    Eric Biggers
     

14 Nov, 2017

1 commit

  • Pull general security subsystem updates from James Morris:
    "TPM (from Jarkko):
    - essential clean up for tpm_crb so that ARM64 and x86 versions do
    not distract each other as much as before

    - /dev/tpm0 rejects now too short writes (shorter buffer than
    specified in the command header

    - use DMA-safe buffer in tpm_tis_spi

    - otherwise mostly minor fixes.

    Smack:
    - base support for overlafs

    Capabilities:
    - BPRM_FCAPS fixes, from Richard Guy Briggs:

    The audit subsystem is adding a BPRM_FCAPS record when auditing
    setuid application execution (SYSCALL execve). This is not expected
    as it was supposed to be limited to when the file system actually
    had capabilities in an extended attribute. It lists all
    capabilities making the event really ugly to parse what is
    happening. The PATH record correctly records the setuid bit and
    owner. Suppress the BPRM_FCAPS record on set*id.

    TOMOYO:
    - Y2038 timestamping fixes"

    * 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (28 commits)
    MAINTAINERS: update the IMA, EVM, trusted-keys, encrypted-keys entries
    Smack: Base support for overlayfs
    MAINTAINERS: remove David Safford as maintainer for encrypted+trusted keys
    tomoyo: fix timestamping for y2038
    capabilities: audit log other surprising conditions
    capabilities: fix logic for effective root or real root
    capabilities: invert logic for clarity
    capabilities: remove a layer of conditional logic
    capabilities: move audit log decision to function
    capabilities: use intuitive names for id changes
    capabilities: use root_priveleged inline to clarify logic
    capabilities: rename has_cap to has_fcap
    capabilities: intuitive names for cap gain status
    capabilities: factor out cap_bprm_set_creds privileged root
    tpm, tpm_tis: use ARRAY_SIZE() to define TPM_HID_USR_IDX
    tpm: fix duplicate inline declaration specifier
    tpm: fix type of a local variables in tpm_tis_spi.c
    tpm: fix type of a local variable in tpm2_map_command()
    tpm: fix type of a local variable in tpm2_get_cc_attrs_tbl()
    tpm-dev-common: Reject too short writes
    ...

    Linus Torvalds
     

20 Oct, 2017

10 commits

  • The existing condition tested for process effective capabilities set by
    file attributes but intended to ignore the change if the result was
    unsurprisingly an effective full set in the case root is special with a
    setuid root executable file and we are root.

    Stated again:
    - When you execute a setuid root application, it is no surprise and
    expected that it got all capabilities, so we do not want capabilities
    recorded.
    if (pE_grew && !(pE_fullset && (eff_root || real_root) && root_priveleged) )

    Now make sure we cover other cases:
    - If something prevented a setuid root app getting all capabilities and
    it wound up with one capability only, then it is a surprise and should
    be logged. When it is a setuid root file, we only want capabilities
    when the process does not get full capabilities..
    root_priveleged && setuid_root && !pE_fullset

    - Similarly if a non-setuid program does pick up capabilities due to
    file system based capabilities, then we want to know what capabilities
    were picked up. When it has file system based capabilities we want
    the capabilities.
    !is_setuid && (has_fcap && pP_gained)

    - If it is a non-setuid file and it gets ambient capabilities, we want
    the capabilities.
    !is_setuid && pA_gained

    - These last two are combined into one due to the common first parameter.

    Related: https://github.com/linux-audit/audit-kernel/issues/16

    Signed-off-by: Richard Guy Briggs
    Reviewed-by: Serge Hallyn
    Acked-by: James Morris
    Acked-by: Kees Cook
    Acked-by: Paul Moore
    Signed-off-by: James Morris

    Richard Guy Briggs
     
  • Now that the logic is inverted, it is much easier to see that both real
    root and effective root conditions had to be met to avoid printing the
    BPRM_FCAPS record with audit syscalls. This meant that any setuid root
    applications would print a full BPRM_FCAPS record when it wasn't
    necessary, cluttering the event output, since the SYSCALL and PATH
    records indicated the presence of the setuid bit and effective root user
    id.

    Require only one of effective root or real root to avoid printing the
    unnecessary record.

    Ref: commit 3fc689e96c0c ("Add audit_log_bprm_fcaps/AUDIT_BPRM_FCAPS")
    See: https://github.com/linux-audit/audit-kernel/issues/16

    Signed-off-by: Richard Guy Briggs
    Reviewed-by: Serge Hallyn
    Acked-by: James Morris
    Acked-by: Kees Cook
    Acked-by: Paul Moore
    Signed-off-by: James Morris

    Richard Guy Briggs
     
  • The way the logic was presented, it was awkward to read and verify.
    Invert the logic using DeMorgan's Law to be more easily able to read and
    understand.

    Signed-off-by: Richard Guy Briggs
    Reviewed-by: Serge Hallyn
    Acked-by: James Morris
    Acked-by: Kees Cook
    Okay-ished-by: Paul Moore
    Signed-off-by: James Morris

    Richard Guy Briggs
     
  • Remove a layer of conditional logic to make the use of conditions
    easier to read and analyse.

    Signed-off-by: Richard Guy Briggs
    Reviewed-by: Serge Hallyn
    Acked-by: James Morris
    Acked-by: Kees Cook
    Okay-ished-by: Paul Moore
    Signed-off-by: James Morris

    Richard Guy Briggs
     
  • Move the audit log decision logic to its own function to isolate the
    complexity in one place.

    Suggested-by: Serge Hallyn
    Signed-off-by: Richard Guy Briggs
    Reviewed-by: Serge Hallyn
    Acked-by: James Morris
    Acked-by: Kees Cook
    Okay-ished-by: Paul Moore
    Signed-off-by: James Morris

    Richard Guy Briggs
     
  • Introduce a number of inlines to make the use of the negation of
    uid_eq() easier to read and analyse.

    Signed-off-by: Richard Guy Briggs
    Reviewed-by: Serge Hallyn
    Acked-by: James Morris
    Acked-by: Kees Cook
    Okay-ished-by: Paul Moore
    Signed-off-by: James Morris

    Richard Guy Briggs
     
  • Introduce inline root_privileged() to make use of SECURE_NONROOT
    easier to read.

    Suggested-by: Serge Hallyn
    Signed-off-by: Richard Guy Briggs
    Reviewed-by: Serge Hallyn
    Acked-by: James Morris
    Acked-by: Kees Cook
    Okay-ished-by: Paul Moore
    Signed-off-by: James Morris

    Richard Guy Briggs
     
  • Rename has_cap to has_fcap to clarify it applies to file capabilities
    since the entire source file is about capabilities.

    Signed-off-by: Richard Guy Briggs
    Reviewed-by: Serge Hallyn
    Acked-by: James Morris
    Acked-by: Kees Cook
    Okay-ished-by: Paul Moore
    Signed-off-by: James Morris

    Richard Guy Briggs
     
  • Introduce macros cap_gained, cap_grew, cap_full to make the use of the
    negation of is_subset() easier to read and analyse.

    Signed-off-by: Richard Guy Briggs
    Reviewed-by: Serge Hallyn
    Acked-by: James Morris
    Acked-by: Kees Cook
    Okay-ished-by: Paul Moore
    Signed-off-by: James Morris

    Richard Guy Briggs
     
  • Factor out the case of privileged root from the function
    cap_bprm_set_creds() to make the latter easier to read and analyse.

    Suggested-by: Serge Hallyn
    Signed-off-by: Richard Guy Briggs
    Reviewed-by: Serge Hallyn
    Acked-by: James Morris
    Acked-by: Kees Cook
    Okay-ished-by: Paul Moore
    Signed-off-by: James Morris

    Richard Guy Briggs
     

19 Oct, 2017

1 commit

  • The pointer fs_ns is assigned from inode->i_ib->s_user_ns before
    a null pointer check on inode, hence if inode is actually null we
    will get a null pointer dereference on this assignment. Fix this
    by only dereferencing inode after the null pointer check on
    inode.

    Detected by CoverityScan CID#1455328 ("Dereference before null check")

    Fixes: 8db6c34f1dbc ("Introduce v3 namespaced file capabilities")
    Signed-off-by: Colin Ian King
    Cc: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    Colin Ian King
     

25 Sep, 2017

1 commit

  • Pull misc security layer update from James Morris:
    "This is the remaining 'general' change in the security tree for v4.14,
    following the direct merging of SELinux (+ TOMOYO), AppArmor, and
    seccomp.

    That's everything now for the security tree except IMA, which will
    follow shortly (I've been traveling for the past week with patchy
    internet)"

    * 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
    security: fix description of values returned by cap_inode_need_killpriv

    Linus Torvalds
     

24 Sep, 2017

1 commit


12 Sep, 2017

1 commit

  • Pull namespace updates from Eric Biederman:
    "Life has been busy and I have not gotten half as much done this round
    as I would have liked. I delayed it so that a minor conflict
    resolution with the mips tree could spend a little time in linux-next
    before I sent this pull request.

    This includes two long delayed user namespace changes from Kirill
    Tkhai. It also includes a very useful change from Serge Hallyn that
    allows the security capability attribute to be used inside of user
    namespaces. The practical effect of this is people can now untar
    tarballs and install rpms in user namespaces. It had been suggested to
    generalize this and encode some of the namespace information
    information in the xattr name. Upon close inspection that makes the
    things that should be hard easy and the things that should be easy
    more expensive.

    Then there is my bugfix/cleanup for signal injection that removes the
    magic encoding of the siginfo union member from the kernel internal
    si_code. The mips folks reported the case where I had used FPE_FIXME
    me is impossible so I have remove FPE_FIXME from mips, while at the
    same time including a return statement in that case to keep gcc from
    complaining about unitialized variables.

    I almost finished the work to get make copy_siginfo_to_user a trivial
    copy to user. The code is available at:

    git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git neuter-copy_siginfo_to_user-v3

    But I did not have time/energy to get the code posted and reviewed
    before the merge window opened.

    I was able to see that the security excuse for just copying fields
    that we know are initialized doesn't work in practice there are buggy
    initializations that don't initialize the proper fields in siginfo. So
    we still sometimes copy unitialized data to userspace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    Introduce v3 namespaced file capabilities
    mips/signal: In force_fcr31_sig return in the impossible case
    signal: Remove kernel interal si_code magic
    fcntl: Don't use ambiguous SIG_POLL si_codes
    prctl: Allow local CAP_SYS_ADMIN changing exe_file
    security: Use user_namespace::level to avoid redundant iterations in cap_capable()
    userns,pidns: Verify the userns for new pid namespaces
    signal/testing: Don't look for __SI_FAULT in userspace
    signal/mips: Document a conflict with SI_USER with SIGFPE
    signal/sparc: Document a conflict with SI_USER with SIGFPE
    signal/ia64: Document a conflict with SI_USER with SIGFPE
    signal/alpha: Document a conflict with SI_USER for SIGTRAP

    Linus Torvalds
     

02 Sep, 2017

1 commit

  • Root in a non-initial user ns cannot be trusted to write a traditional
    security.capability xattr. If it were allowed to do so, then any
    unprivileged user on the host could map his own uid to root in a private
    namespace, write the xattr, and execute the file with privilege on the
    host.

    However supporting file capabilities in a user namespace is very
    desirable. Not doing so means that any programs designed to run with
    limited privilege must continue to support other methods of gaining and
    dropping privilege. For instance a program installer must detect
    whether file capabilities can be assigned, and assign them if so but set
    setuid-root otherwise. The program in turn must know how to drop
    partial capabilities, and do so only if setuid-root.

    This patch introduces v3 of the security.capability xattr. It builds a
    vfs_ns_cap_data struct by appending a uid_t rootid to struct
    vfs_cap_data. This is the absolute uid_t (that is, the uid_t in user
    namespace which mounted the filesystem, usually init_user_ns) of the
    root id in whose namespaces the file capabilities may take effect.

    When a task asks to write a v2 security.capability xattr, if it is
    privileged with respect to the userns which mounted the filesystem, then
    nothing should change. Otherwise, the kernel will transparently rewrite
    the xattr as a v3 with the appropriate rootid. This is done during the
    execution of setxattr() to catch user-space-initiated capability writes.
    Subsequently, any task executing the file which has the noted kuid as
    its root uid, or which is in a descendent user_ns of such a user_ns,
    will run the file with capabilities.

    Similarly when asking to read file capabilities, a v3 capability will
    be presented as v2 if it applies to the caller's namespace.

    If a task writes a v3 security.capability, then it can provide a uid for
    the xattr so long as the uid is valid in its own user namespace, and it
    is privileged with CAP_SETFCAP over its namespace. The kernel will
    translate that rootid to an absolute uid, and write that to disk. After
    this, a task in the writer's namespace will not be able to use those
    capabilities (unless rootid was 0), but a task in a namespace where the
    given uid is root will.

    Only a single security.capability xattr may exist at a time for a given
    file. A task may overwrite an existing xattr so long as it is
    privileged over the inode. Note this is a departure from previous
    semantics, which required privilege to remove a security.capability
    xattr. This check can be re-added if deemed useful.

    This allows a simple setxattr to work, allows tar/untar to work, and
    allows us to tar in one namespace and untar in another while preserving
    the capability, without risking leaking privilege into a parent
    namespace.

    Example using tar:

    $ cp /bin/sleep sleepx
    $ mkdir b1 b2
    $ lxc-usernsexec -m b:0:100000:1 -m b:1:$(id -u):1 -- chown 0:0 b1
    $ lxc-usernsexec -m b:0:100001:1 -m b:1:$(id -u):1 -- chown 0:0 b2
    $ lxc-usernsexec -m b:0:100000:1000 -- tar --xattrs-include=security.capability --xattrs -cf b1/sleepx.tar sleepx
    $ lxc-usernsexec -m b:0:100001:1000 -- tar --xattrs-include=security.capability --xattrs -C b2 -xf b1/sleepx.tar
    $ lxc-usernsexec -m b:0:100001:1000 -- getcap b2/sleepx
    b2/sleepx = cap_sys_admin+ep
    # /opt/ltp/testcases/bin/getv3xattr b2/sleepx
    v3 xattr, rootid is 100001

    A patch to linux-test-project adding a new set of tests for this
    functionality is in the nsfscaps branch at github.com/hallyn/ltp

    Changelog:
    Nov 02 2016: fix invalid check at refuse_fcap_overwrite()
    Nov 07 2016: convert rootid from and to fs user_ns
    (From ebiederm: mar 28 2017)
    commoncap.c: fix typos - s/v4/v3
    get_vfs_caps_from_disk: clarify the fs_ns root access check
    nsfscaps: change the code split for cap_inode_setxattr()
    Apr 09 2017:
    don't return v3 cap for caps owned by current root.
    return a v2 cap for a true v2 cap in non-init ns
    Apr 18 2017:
    . Change the flow of fscap writing to support s_user_ns writing.
    . Remove refuse_fcap_overwrite(). The value of the previous
    xattr doesn't matter.
    Apr 24 2017:
    . incorporate Eric's incremental diff
    . move cap_convert_nscap to setxattr and simplify its usage
    May 8, 2017:
    . fix leaking dentry refcount in cap_inode_getsecurity

    Signed-off-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Serge E. Hallyn
     

02 Aug, 2017

2 commits

  • Instead of a separate function, open-code the cap_elevated test, which
    lets us entirely remove bprm->cap_effective (to use the local "effective"
    variable instead), and more accurately examine euid/egid changes via the
    existing local "is_setid".

    The following LTP tests were run to validate the changes:

    # ./runltp -f syscalls -s cap
    # ./runltp -f securebits
    # ./runltp -f cap_bounds
    # ./runltp -f filecaps

    All kernel selftests for capabilities and exec continue to pass as well.

    Signed-off-by: Kees Cook
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Reviewed-by: Andy Lutomirski

    Kees Cook
     
  • The commoncap implementation of the bprm_secureexec hook is the only LSM
    that depends on the final call to its bprm_set_creds hook (since it may
    be called for multiple files, it ignores bprm->called_set_creds). As a
    result, it cannot safely _clear_ bprm->secureexec since other LSMs may
    have set it. Instead, remove the bprm_secureexec hook by introducing a
    new flag to bprm specific to commoncap: cap_elevated. This is similar to
    cap_effective, but that is used for a specific subset of elevated
    privileges, and exists solely to track state from bprm_set_creds to
    bprm_secureexec. As such, it will be removed in the next patch.

    Here, set the new bprm->cap_elevated flag when setuid/setgid has happened
    from bprm_fill_uid() or fscapabilities have been prepared. This temporarily
    moves the bprm_secureexec hook to a static inline. The helper will be
    removed in the next patch; this makes the step easier to review and bisect,
    since this does not introduce any changes to inputs nor outputs to the
    "elevated privileges" calculation.

    The new flag is merged with the bprm->secureexec flag in setup_new_exec()
    since this marks the end of any further prepare_binprm() calls.

    Cc: Andy Lutomirski
    Signed-off-by: Kees Cook
    Reviewed-by: Andy Lutomirski
    Acked-by: James Morris
    Acked-by: Serge Hallyn

    Kees Cook
     

20 Jul, 2017

1 commit


06 Mar, 2017

1 commit


24 Feb, 2017

1 commit

  • Pull namespace updates from Eric Biederman:
    "There is a lot here. A lot of these changes result in subtle user
    visible differences in kernel behavior. I don't expect anything will
    care but I will revert/fix things immediately if any regressions show
    up.

    From Seth Forshee there is a continuation of the work to make the vfs
    ready for unpriviled mounts. We had thought the previous changes
    prevented the creation of files outside of s_user_ns of a filesystem,
    but it turns we missed the O_CREAT path. Ooops.

    Pavel Tikhomirov and Oleg Nesterov worked together to fix a long
    standing bug in the implemenation of PR_SET_CHILD_SUBREAPER where only
    children that are forked after the prctl are considered and not
    children forked before the prctl. The only known user of this prctl
    systemd forks all children after the prctl. So no userspace
    regressions will occur. Holding earlier forked children to the same
    rules as later forked children creates a semantic that is sane enough
    to allow checkpoing of processes that use this feature.

    There is a long delayed change by Nikolay Borisov to limit inotify
    instances inside a user namespace.

    Michael Kerrisk extends the API for files used to maniuplate
    namespaces with two new trivial ioctls to allow discovery of the
    hierachy and properties of namespaces.

    Konstantin Khlebnikov with the help of Al Viro adds code that when a
    network namespace exits purges it's sysctl entries from the dcache. As
    in some circumstances this could use a lot of memory.

    Vivek Goyal fixed a bug with stacked filesystems where the permissions
    on the wrong inode were being checked.

    I continue previous work on ptracing across exec. Allowing a file to
    be setuid across exec while being ptraced if the tracer has enough
    credentials in the user namespace, and if the process has CAP_SETUID
    in it's own namespace. Proc files for setuid or otherwise undumpable
    executables are now owned by the root in the user namespace of their
    mm. Allowing debugging of setuid applications in containers to work
    better.

    A bug I introduced with permission checking and automount is now
    fixed. The big change is to mark the mounts that the kernel initiates
    as a result of an automount. This allows the permission checks in sget
    to be safely suppressed for this kind of mount. As the permission
    check happened when the original filesystem was mounted.

    Finally a special case in the mount namespace is removed preventing
    unbounded chains in the mount hash table, and making the semantics
    simpler which benefits CRIU.

    The vfs fix along with related work in ima and evm I believe makes us
    ready to finish developing and merge fully unprivileged mounts of the
    fuse filesystem. The cleanups of the mount namespace makes discussing
    how to fix the worst case complexity of umount. The stacked filesystem
    fixes pave the way for adding multiple mappings for the filesystem
    uids so that efficient and safer containers can be implemented"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc/sysctl: Don't grab i_lock under sysctl_lock.
    vfs: Use upper filesystem inode in bprm_fill_uid()
    proc/sysctl: prune stale dentries during unregistering
    mnt: Tuck mounts under others instead of creating shadow/side mounts.
    prctl: propagate has_child_subreaper flag to every descendant
    introduce the walk_process_tree() helper
    nsfs: Add an ioctl() to return owner UID of a userns
    fs: Better permission checking for submounts
    exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
    vfs: open() with O_CREAT should not create inodes with unknown ids
    nsfs: Add an ioctl() to return the namespace type
    proc: Better ownership of files for non-dumpable tasks in user namespaces
    exec: Remove LSM_UNSAFE_PTRACE_CAP
    exec: Test the ptracer's saved cred to see if the tracee can gain caps
    exec: Don't reset euid and egid when the tracee has CAP_SETUID
    inotify: Convert to using per-namespace limits

    Linus Torvalds
     

24 Jan, 2017

3 commits


19 Jan, 2017

1 commit

  • I am still tired of having to find indirect ways to determine
    what security modules are active on a system. I have added
    /sys/kernel/security/lsm, which contains a comma separated
    list of the active security modules. No more groping around
    in /proc/filesystems or other clever hacks.

    Unchanged from previous versions except for being updated
    to the latest security next branch.

    Signed-off-by: Casey Schaufler
    Acked-by: John Johansen
    Acked-by: Paul Moore
    Acked-by: Kees Cook
    Signed-off-by: James Morris

    Casey Schaufler
     

08 Oct, 2016

1 commit

  • Right now, various places in the kernel check for the existence of
    getxattr, setxattr, and removexattr inode operations and directly call
    those operations. Switch to helper functions and test for the IOP_XATTR
    flag instead.

    Signed-off-by: Andreas Gruenbacher
    Acked-by: James Morris
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     

24 Jun, 2016

2 commits

  • If a process gets access to a mount from a different user
    namespace, that process should not be able to take advantage of
    setuid files or selinux entrypoints from that filesystem. Prevent
    this by treating mounts from other mount namespaces and those not
    owned by current_user_ns() or an ancestor as nosuid.

    This will make it safer to allow more complex filesystems to be
    mounted in non-root user namespaces.

    This does not remove the need for MNT_LOCK_NOSUID. The setuid,
    setgid, and file capability bits can no longer be abused if code in
    a user namespace were to clear nosuid on an untrusted filesystem,
    but this patch, by itself, is insufficient to protect the system
    from abuse of files that, when execed, would increase MAC privilege.

    As a more concrete explanation, any task that can manipulate a
    vfsmount associated with a given user namespace already has
    capabilities in that namespace and all of its descendents. If they
    can cause a malicious setuid, setgid, or file-caps executable to
    appear in that mount, then that executable will only allow them to
    elevate privileges in exactly the set of namespaces in which they
    are already privileges.

    On the other hand, if they can cause a malicious executable to
    appear with a dangerous MAC label, running it could change the
    caller's security context in a way that should not have been
    possible, even inside the namespace in which the task is confined.

    As a hardening measure, this would have made CVE-2014-5207 much
    more difficult to exploit.

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Seth Forshee
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Andy Lutomirski
     
  • Capability sets attached to files must be ignored except in the
    user namespaces where the mounter is privileged, i.e. s_user_ns
    and its descendants. Otherwise a vector exists for gaining
    privileges in namespaces where a user is not already privileged.

    Add a new helper function, current_in_user_ns(), to test whether a user
    namespace is the same as or a descendant of another namespace.
    Use this helper to determine whether a file's capability set
    should be applied to the caps constructed during exec.

    --EWB Replaced in_userns with the simpler current_in_userns.

    Acked-by: Serge Hallyn
    Signed-off-by: Seth Forshee
    Signed-off-by: Eric W. Biederman

    Seth Forshee
     

18 May, 2016

1 commit

  • Pull parallel filesystem directory handling update from Al Viro.

    This is the main parallel directory work by Al that makes the vfs layer
    able to do lookup and readdir in parallel within a single directory.
    That's a big change, since this used to be all protected by the
    directory inode mutex.

    The inode mutex is replaced by an rwsem, and serialization of lookups of
    a single name is done by a "in-progress" dentry marker.

    The series begins with xattr cleanups, and then ends with switching
    filesystems over to actually doing the readdir in parallel (switching to
    the "iterate_shared()" that only takes the read lock).

    A more detailed explanation of the process from Al Viro:
    "The xattr work starts with some acl fixes, then switches ->getxattr to
    passing inode and dentry separately. This is the point where the
    things start to get tricky - that got merged into the very beginning
    of the -rc3-based #work.lookups, to allow untangling the
    security_d_instantiate() mess. The xattr work itself proceeds to
    switch a lot of filesystems to generic_...xattr(); no complications
    there.

    After that initial xattr work, the series then does the following:

    - untangle security_d_instantiate()

    - convert a bunch of open-coded lookup_one_len_unlocked() to calls of
    that thing; one such place (in overlayfs) actually yields a trivial
    conflict with overlayfs fixes later in the cycle - overlayfs ended
    up switching to a variant of lookup_one_len_unlocked() sans the
    permission checks. I would've dropped that commit (it gets
    overridden on merge from #ovl-fixes in #for-next; proper resolution
    is to use the variant in mainline fs/overlayfs/super.c), but I
    didn't want to rebase the damn thing - it was fairly late in the
    cycle...

    - some filesystems had managed to depend on lookup/lookup exclusion
    for *fs-internal* data structures in a way that would break if we
    relaxed the VFS exclusion. Fixing hadn't been hard, fortunately.

    - core of that series - parallel lookup machinery, replacing
    ->i_mutex with rwsem, making lookup_slow() take it only shared. At
    that point lookups happen in parallel; lookups on the same name
    wait for the in-progress one to be done with that dentry.

    Surprisingly little code, at that - almost all of it is in
    fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
    making it use the new primitive and actually switching to locking
    shared.

    - parallel readdir stuff - first of all, we provide the exclusion on
    per-struct file basis, same as we do for read() vs lseek() for
    regular files. That takes care of most of the needed exclusion in
    readdir/readdir; however, these guys are trickier than lookups, so
    I went for switching them one-by-one. To do that, a new method
    '->iterate_shared()' is added and filesystems are switched to it
    as they are either confirmed to be OK with shared lock on directory
    or fixed to be OK with that. I hope to kill the original method
    come next cycle (almost all in-tree filesystems are switched
    already), but it's still not quite finished.

    - several filesystems get switched to parallel readdir. The
    interesting part here is dealing with dcache preseeding by readdir;
    that needs minor adjustment to be safe with directory locked only
    shared.

    Most of the filesystems doing that got switched to in those
    commits. Important exception: NFS. Turns out that NFS folks, with
    their, er, insistence on VFS getting the fuck out of the way of the
    Smart Filesystem Code That Knows How And What To Lock(tm) have
    grown the locking of their own. They had their own homegrown
    rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
    is the reader there). Of course, with VFS getting the fuck out of
    the way, as requested, the actual smarts of the smart filesystem
    code etc. had become exposed...

    - do_last/lookup_open/atomic_open cleanups. As the result, open()
    without O_CREAT locks the directory only shared. Including the
    ->atomic_open() case. Backmerge from #for-linus in the middle of
    that - atomic_open() fix got brought in.

    - then comes NFS switch to saner (VFS-based ;-) locking, killing the
    homegrown "lookup and readdir are writers" kinda-sorta rwsem. All
    exclusion for sillyunlink/lookup is done by the parallel lookups
    mechanism. Exclusion between sillyunlink and rmdir is a real rwsem
    now - rmdir being the writer.

    Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
    now.

    - the rest of the series consists of switching a lot of filesystems
    to parallel readdir; in a lot of cases ->llseek() gets simplified
    as well. One backmerge in there (again, #for-linus - rockridge
    fix)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
    ext4: switch to ->iterate_shared()
    hfs: switch to ->iterate_shared()
    hfsplus: switch to ->iterate_shared()
    hostfs: switch to ->iterate_shared()
    hpfs: switch to ->iterate_shared()
    hpfs: handle allocation failures in hpfs_add_pos()
    gfs2: switch to ->iterate_shared()
    f2fs: switch to ->iterate_shared()
    afs: switch to ->iterate_shared()
    befs: switch to ->iterate_shared()
    befs: constify stuff a bit
    isofs: switch to ->iterate_shared()
    get_acorn_filename(): deobfuscate a bit
    btrfs: switch to ->iterate_shared()
    logfs: no need to lock directory in lseek
    switch ecryptfs to ->iterate_shared
    9p: switch to ->iterate_shared()
    fat: switch to ->iterate_shared()
    romfs, squashfs: switch to ->iterate_shared()
    more trivial ->iterate_shared conversions
    ...

    Linus Torvalds
     

23 Apr, 2016

1 commit

  • security_settime() uses a timespec, which is not year 2038 safe
    on 32bit systems. Thus this patch introduces the security_settime64()
    function with timespec64 type. We also convert the cap_settime() helper
    function to use the 64bit types.

    This patch then moves security_settime() to the header file as an
    inline helper function so that existing users can be iteratively
    converted.

    None of the existing hooks is using the timespec argument and therefor
    the patch is not making any functional changes.

    Cc: Serge Hallyn ,
    Cc: James Morris ,
    Cc: "Serge E. Hallyn" ,
    Cc: Paul Moore
    Cc: Stephen Smalley
    Cc: Kees Cook
    Cc: Prarit Bhargava
    Cc: Richard Cochran
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Reviewed-by: James Morris
    Signed-off-by: Baolin Wang
    [jstultz: Reworded commit message]
    Signed-off-by: John Stultz

    Baolin Wang
     

11 Apr, 2016

1 commit


21 Jan, 2016

1 commit

  • By checking the effective credentials instead of the real UID / permitted
    capabilities, ensure that the calling process actually intended to use its
    credentials.

    To ensure that all ptrace checks use the correct caller credentials (e.g.
    in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
    flag), use two new flags and require one of them to be set.

    The problem was that when a privileged task had temporarily dropped its
    privileges, e.g. by calling setreuid(0, user_uid), with the intent to
    perform following syscalls with the credentials of a user, it still passed
    ptrace access checks that the user would not be able to pass.

    While an attacker should not be able to convince the privileged task to
    perform a ptrace() syscall, this is a problem because the ptrace access
    check is reused for things in procfs.

    In particular, the following somewhat interesting procfs entries only rely
    on ptrace access checks:

    /proc/$pid/stat - uses the check for determining whether pointers
    should be visible, useful for bypassing ASLR
    /proc/$pid/maps - also useful for bypassing ASLR
    /proc/$pid/cwd - useful for gaining access to restricted
    directories that contain files with lax permissions, e.g. in
    this scenario:
    lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
    drwx------ root root /root
    drwxr-xr-x root root /root/foobar
    -rw-r--r-- root root /root/foobar/secret

    Therefore, on a system where a root-owned mode 6755 binary changes its
    effective credentials as described and then dumps a user-specified file,
    this could be used by an attacker to reveal the memory layout of root's
    processes or reveal the contents of files he is not allowed to access
    (through /proc/$pid/cwd).

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Casey Schaufler
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Andy Shevchenko
    Cc: Andy Lutomirski
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     

05 Sep, 2015

2 commits

  • Per Andrew Morgan's request, add a securebit to allow admins to disable
    PR_CAP_AMBIENT_RAISE. This securebit will prevent processes from adding
    capabilities to their ambient set.

    For simplicity, this disables PR_CAP_AMBIENT_RAISE entirely rather than
    just disabling setting previously cleared bits.

    Signed-off-by: Andy Lutomirski
    Acked-by: Andrew G. Morgan
    Acked-by: Serge Hallyn
    Cc: Kees Cook
    Cc: Christoph Lameter
    Cc: Serge Hallyn
    Cc: Jonathan Corbet
    Cc: Aaron Jones
    Cc: Ted Ts'o
    Cc: Andrew G. Morgan
    Cc: Mimi Zohar
    Cc: Austin S Hemmelgarn
    Cc: Markku Savela
    Cc: Jarkko Sakkinen
    Cc: Michael Kerrisk
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Credit where credit is due: this idea comes from Christoph Lameter with
    a lot of valuable input from Serge Hallyn. This patch is heavily based
    on Christoph's patch.

    ===== The status quo =====

    On Linux, there are a number of capabilities defined by the kernel. To
    perform various privileged tasks, processes can wield capabilities that
    they hold.

    Each task has four capability masks: effective (pE), permitted (pP),
    inheritable (pI), and a bounding set (X). When the kernel checks for a
    capability, it checks pE. The other capability masks serve to modify
    what capabilities can be in pE.

    Any task can remove capabilities from pE, pP, or pI at any time. If a
    task has a capability in pP, it can add that capability to pE and/or pI.
    If a task has CAP_SETPCAP, then it can add any capability to pI, and it
    can remove capabilities from X.

    Tasks are not the only things that can have capabilities; files can also
    have capabilities. A file can have no capabilty information at all [1].
    If a file has capability information, then it has a permitted mask (fP)
    and an inheritable mask (fI) as well as a single effective bit (fE) [2].
    File capabilities modify the capabilities of tasks that execve(2) them.

    A task that successfully calls execve has its capabilities modified for
    the file ultimately being excecuted (i.e. the binary itself if that
    binary is ELF or for the interpreter if the binary is a script.) [3] In
    the capability evolution rules, for each mask Z, pZ represents the old
    value and pZ' represents the new value. The rules are:

    pP' = (X & fP) | (pI & fI)
    pI' = pI
    pE' = (fE ? pP' : 0)
    X is unchanged

    For setuid binaries, fP, fI, and fE are modified by a moderately
    complicated set of rules that emulate POSIX behavior. Similarly, if
    euid == 0 or ruid == 0, then fP, fI, and fE are modified differently
    (primary, fP and fI usually end up being the full set). For nonroot
    users executing binaries with neither setuid nor file caps, fI and fP
    are empty and fE is false.

    As an extra complication, if you execute a process as nonroot and fE is
    set, then the "secure exec" rules are in effect: AT_SECURE gets set,
    LD_PRELOAD doesn't work, etc.

    This is rather messy. We've learned that making any changes is
    dangerous, though: if a new kernel version allows an unprivileged
    program to change its security state in a way that persists cross
    execution of a setuid program or a program with file caps, this
    persistent state is surprisingly likely to allow setuid or file-capped
    programs to be exploited for privilege escalation.

    ===== The problem =====

    Capability inheritance is basically useless.

    If you aren't root and you execute an ordinary binary, fI is zero, so
    your capabilities have no effect whatsoever on pP'. This means that you
    can't usefully execute a helper process or a shell command with elevated
    capabilities if you aren't root.

    On current kernels, you can sort of work around this by setting fI to
    the full set for most or all non-setuid executable files. This causes
    pP' = pI for nonroot, and inheritance works. No one does this because
    it's a PITA and it isn't even supported on most filesystems.

    If you try this, you'll discover that every nonroot program ends up with
    secure exec rules, breaking many things.

    This is a problem that has bitten many people who have tried to use
    capabilities for anything useful.

    ===== The proposed change =====

    This patch adds a fifth capability mask called the ambient mask (pA).
    pA does what most people expect pI to do.

    pA obeys the invariant that no bit can ever be set in pA if it is not
    set in both pP and pI. Dropping a bit from pP or pI drops that bit from
    pA. This ensures that existing programs that try to drop capabilities
    still do so, with a complication. Because capability inheritance is so
    broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and
    then calling execve effectively drops capabilities. Therefore,
    setresuid from root to nonroot conditionally clears pA unless
    SECBIT_NO_SETUID_FIXUP is set. Processes that don't like this can
    re-add bits to pA afterwards.

    The capability evolution rules are changed:

    pA' = (file caps or setuid or setgid ? 0 : pA)
    pP' = (X & fP) | (pI & fI) | pA'
    pI' = pI
    pE' = (fE ? pP' : pA')
    X is unchanged

    If you are nonroot but you have a capability, you can add it to pA. If
    you do so, your children get that capability in pA, pP, and pE. For
    example, you can set pA = CAP_NET_BIND_SERVICE, and your children can
    automatically bind low-numbered ports. Hallelujah!

    Unprivileged users can create user namespaces, map themselves to a
    nonzero uid, and create both privileged (relative to their namespace)
    and unprivileged process trees. This is currently more or less
    impossible. Hallelujah!

    You cannot use pA to try to subvert a setuid, setgid, or file-capped
    program: if you execute any such program, pA gets cleared and the
    resulting evolution rules are unchanged by this patch.

    Users with nonzero pA are unlikely to unintentionally leak that
    capability. If they run programs that try to drop privileges, dropping
    privileges will still work.

    It's worth noting that the degree of paranoia in this patch could
    possibly be reduced without causing serious problems. Specifically, if
    we allowed pA to persist across executing non-pA-aware setuid binaries
    and across setresuid, then, naively, the only capabilities that could
    leak as a result would be the capabilities in pA, and any attacker
    *already* has those capabilities. This would make me nervous, though --
    setuid binaries that tried to privilege-separate might fail to do so,
    and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have
    unexpected side effects. (Whether these unexpected side effects would
    be exploitable is an open question.) I've therefore taken the more
    paranoid route. We can revisit this later.

    An alternative would be to require PR_SET_NO_NEW_PRIVS before setting
    ambient capabilities. I think that this would be annoying and would
    make granting otherwise unprivileged users minor ambient capabilities
    (CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than
    it is with this patch.

    ===== Footnotes =====

    [1] Files that are missing the "security.capability" xattr or that have
    unrecognized values for that xattr end up with has_cap set to false.
    The code that does that appears to be complicated for no good reason.

    [2] The libcap capability mask parsers and formatters are dangerously
    misleading and the documentation is flat-out wrong. fE is *not* a mask;
    it's a single bit. This has probably confused every single person who
    has tried to use file capabilities.

    [3] Linux very confusingly processes both the script and the interpreter
    if applicable, for reasons that elude me. The results from thinking
    about a script's file capabilities and/or setuid bits are mostly
    discarded.

    Preliminary userspace code is here, but it needs updating:
    https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=7f5afbd175d2

    Here is a test program that can be used to verify the functionality
    (from Christoph):

    /*
    * Test program for the ambient capabilities. This program spawns a shell
    * that allows running processes with a defined set of capabilities.
    *
    * (C) 2015 Christoph Lameter
    * Released under: GPL v3 or later.
    *
    *
    * Compile using:
    *
    * gcc -o ambient_test ambient_test.o -lcap-ng
    *
    * This program must have the following capabilities to run properly:
    * Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
    *
    * A command to equip the binary with the right caps is:
    *
    * setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test
    *
    *
    * To get a shell with additional caps that can be inherited by other processes:
    *
    * ./ambient_test /bin/bash
    *
    *
    * Verifying that it works:
    *
    * From the bash spawed by ambient_test run
    *
    * cat /proc/$$/status
    *
    * and have a look at the capabilities.
    */

    #include
    #include
    #include
    #include
    #include
    #include

    /*
    * Definitions from the kernel header files. These are going to be removed
    * when the /usr/include files have these defined.
    */
    #define PR_CAP_AMBIENT 47
    #define PR_CAP_AMBIENT_IS_SET 1
    #define PR_CAP_AMBIENT_RAISE 2
    #define PR_CAP_AMBIENT_LOWER 3
    #define PR_CAP_AMBIENT_CLEAR_ALL 4

    static void set_ambient_cap(int cap)
    {
    int rc;

    capng_get_caps_process();
    rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap);
    if (rc) {
    printf("Cannot add inheritable cap\n");
    exit(2);
    }
    capng_apply(CAPNG_SELECT_CAPS);

    /* Note the two 0s at the end. Kernel checks for these */
    if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) {
    perror("Cannot set cap");
    exit(1);
    }
    }

    int main(int argc, char **argv)
    {
    int rc;

    set_ambient_cap(CAP_NET_RAW);
    set_ambient_cap(CAP_NET_ADMIN);
    set_ambient_cap(CAP_SYS_NICE);

    printf("Ambient_test forking shell\n");
    if (execv(argv[1], argv + 1))
    perror("Cannot exec");

    return 0;
    }

    Signed-off-by: Christoph Lameter # Original author
    Signed-off-by: Andy Lutomirski
    Acked-by: Serge E. Hallyn
    Acked-by: Kees Cook
    Cc: Jonathan Corbet
    Cc: Aaron Jones
    Cc: Ted Ts'o
    Cc: Andrew G. Morgan
    Cc: Mimi Zohar
    Cc: Austin S Hemmelgarn
    Cc: Markku Savela
    Cc: Jarkko Sakkinen
    Cc: Michael Kerrisk
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

12 May, 2015

1 commit

  • Instead of using a vector of security operations
    with explicit, special case stacking of the capability
    and yama hooks use lists of hooks with capability and
    yama hooks included as appropriate.

    The security_operations structure is no longer required.
    Instead, there is a union of the function pointers that
    allows all the hooks lists to use a common mechanism for
    list management while retaining typing. Each module
    supplies an array describing the hooks it provides instead
    of a sparsely populated security_operations structure.
    The description includes the element that gets put on
    the hook list, avoiding the issues surrounding individual
    element allocation.

    The method for registering security modules is changed to
    reflect the information available. The method for removing
    a module, currently only used by SELinux, has also changed.
    It should be generic now, however if there are potential
    race conditions based on ordering of hook removal that needs
    to be addressed by the calling module.

    The security hooks are called from the lists and the first
    failure is returned.

    Signed-off-by: Casey Schaufler
    Acked-by: John Johansen
    Acked-by: Kees Cook
    Acked-by: Paul Moore
    Acked-by: Stephen Smalley
    Acked-by: Tetsuo Handa
    Signed-off-by: James Morris

    Casey Schaufler
     

16 Apr, 2015

1 commit