08 Oct, 2016

1 commit

  • Right now, various places in the kernel check for the existence of
    getxattr, setxattr, and removexattr inode operations and directly call
    those operations. Switch to helper functions and test for the IOP_XATTR
    flag instead.

    Signed-off-by: Andreas Gruenbacher
    Acked-by: James Morris
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     

24 Jun, 2016

2 commits

  • If a process gets access to a mount from a different user
    namespace, that process should not be able to take advantage of
    setuid files or selinux entrypoints from that filesystem. Prevent
    this by treating mounts from other mount namespaces and those not
    owned by current_user_ns() or an ancestor as nosuid.

    This will make it safer to allow more complex filesystems to be
    mounted in non-root user namespaces.

    This does not remove the need for MNT_LOCK_NOSUID. The setuid,
    setgid, and file capability bits can no longer be abused if code in
    a user namespace were to clear nosuid on an untrusted filesystem,
    but this patch, by itself, is insufficient to protect the system
    from abuse of files that, when execed, would increase MAC privilege.

    As a more concrete explanation, any task that can manipulate a
    vfsmount associated with a given user namespace already has
    capabilities in that namespace and all of its descendants. If they
    can cause a malicious setuid, setgid, or file-caps executable to
    appear in that mount, then that executable will only allow them to
    elevate privileges in exactly the set of namespaces in which they
    are already privileged.

    On the other hand, if they can cause a malicious executable to
    appear with a dangerous MAC label, running it could change the
    caller's security context in a way that should not have been
    possible, even inside the namespace in which the task is confined.

    As a hardening measure, this would have made CVE-2014-5207 much
    more difficult to exploit.

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Seth Forshee
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Andy Lutomirski
     
  • Capability sets attached to files must be ignored except in the
    user namespaces where the mounter is privileged, i.e. s_user_ns
    and its descendants. Otherwise a vector exists for gaining
    privileges in namespaces where a user is not already privileged.

    Add a new helper function, current_in_userns(), to test whether the
    current task's user namespace is the same as or a descendant of another namespace.
    Use this helper to determine whether a file's capability set
    should be applied to the caps constructed during exec.

    --EWB Replaced in_userns with the simpler current_in_userns.
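
    The namespace test described above can be sketched in user space as follows. This is only an illustration under stated assumptions: struct demo_user_ns, demo_in_userns() and demo_file_caps_to_apply() are hypothetical names, a 64-bit mask stands in for a capability set, and the real kernel structures carry much more state.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a user namespace: only the parent link matters here. */
struct demo_user_ns {
    const struct demo_user_ns *parent;   /* NULL for the initial namespace */
};

/* True if ns is target itself or one of target's descendants. */
static bool demo_in_userns(const struct demo_user_ns *ns,
                           const struct demo_user_ns *target)
{
    for (; ns != NULL; ns = ns->parent)
        if (ns == target)
            return true;
    return false;
}

/*
 * File capabilities count only when the executing task lives in the
 * namespace that owns the superblock (s_user_ns) or below it.
 * Everywhere else the file is treated as having no capabilities.
 */
static uint64_t demo_file_caps_to_apply(const struct demo_user_ns *current_ns,
                                        const struct demo_user_ns *s_user_ns,
                                        uint64_t file_permitted)
{
    return demo_in_userns(current_ns, s_user_ns) ? file_permitted : 0;
}
```

    Note the direction of the walk: a task in the initial namespace that executes a file from a filesystem owned by a child namespace gets no file capabilities from it.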

    Acked-by: Serge Hallyn
    Signed-off-by: Seth Forshee
    Signed-off-by: Eric W. Biederman

    Seth Forshee
     

18 May, 2016

1 commit

  • Pull parallel filesystem directory handling update from Al Viro.

    This is the main parallel directory work by Al that makes the vfs layer
    able to do lookup and readdir in parallel within a single directory.
    That's a big change, since this used to be all protected by the
    directory inode mutex.

    The inode mutex is replaced by an rwsem, and serialization of lookups of
    a single name is done by an "in-progress" dentry marker.

    The series begins with xattr cleanups, and then ends with switching
    filesystems over to actually doing the readdir in parallel (switching to
    the "iterate_shared()" that only takes the read lock).

    A more detailed explanation of the process from Al Viro:
    "The xattr work starts with some acl fixes, then switches ->getxattr to
    passing inode and dentry separately. This is the point where the
    things start to get tricky - that got merged into the very beginning
    of the -rc3-based #work.lookups, to allow untangling the
    security_d_instantiate() mess. The xattr work itself proceeds to
    switch a lot of filesystems to generic_...xattr(); no complications
    there.

    After that initial xattr work, the series then does the following:

    - untangle security_d_instantiate()

    - convert a bunch of open-coded lookup_one_len_unlocked() to calls of
    that thing; one such place (in overlayfs) actually yields a trivial
    conflict with overlayfs fixes later in the cycle - overlayfs ended
    up switching to a variant of lookup_one_len_unlocked() sans the
    permission checks. I would've dropped that commit (it gets
    overridden on merge from #ovl-fixes in #for-next; proper resolution
    is to use the variant in mainline fs/overlayfs/super.c), but I
    didn't want to rebase the damn thing - it was fairly late in the
    cycle...

    - some filesystems had managed to depend on lookup/lookup exclusion
    for *fs-internal* data structures in a way that would break if we
    relaxed the VFS exclusion. Fixing hadn't been hard, fortunately.

    - core of that series - parallel lookup machinery, replacing
    ->i_mutex with rwsem, making lookup_slow() take it only shared. At
    that point lookups happen in parallel; lookups on the same name
    wait for the in-progress one to be done with that dentry.

    Surprisingly little code, at that - almost all of it is in
    fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
    making it use the new primitive and actually switching to locking
    shared.

    - parallel readdir stuff - first of all, we provide the exclusion on
    per-struct file basis, same as we do for read() vs lseek() for
    regular files. That takes care of most of the needed exclusion in
    readdir/readdir; however, these guys are trickier than lookups, so
    I went for switching them one-by-one. To do that, a new method
    '->iterate_shared()' is added and filesystems are switched to it
    as they are either confirmed to be OK with shared lock on directory
    or fixed to be OK with that. I hope to kill the original method
    come next cycle (almost all in-tree filesystems are switched
    already), but it's still not quite finished.

    - several filesystems get switched to parallel readdir. The
    interesting part here is dealing with dcache preseeding by readdir;
    that needs minor adjustment to be safe with directory locked only
    shared.

    Most of the filesystems doing that got switched to in those
    commits. Important exception: NFS. Turns out that NFS folks, with
    their, er, insistence on VFS getting the fuck out of the way of the
    Smart Filesystem Code That Knows How And What To Lock(tm) have
    grown the locking of their own. They had their own homegrown
    rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
    is the reader there). Of course, with VFS getting the fuck out of
    the way, as requested, the actual smarts of the smart filesystem
    code etc. had become exposed...

    - do_last/lookup_open/atomic_open cleanups. As the result, open()
    without O_CREAT locks the directory only shared. Including the
    ->atomic_open() case. Backmerge from #for-linus in the middle of
    that - atomic_open() fix got brought in.

    - then comes NFS switch to saner (VFS-based ;-) locking, killing the
    homegrown "lookup and readdir are writers" kinda-sorta rwsem. All
    exclusion for sillyunlink/lookup is done by the parallel lookups
    mechanism. Exclusion between sillyunlink and rmdir is a real rwsem
    now - rmdir being the writer.

    Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
    now.

    - the rest of the series consists of switching a lot of filesystems
    to parallel readdir; in a lot of cases ->llseek() gets simplified
    as well. One backmerge in there (again, #for-linus - rockridge
    fix)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
    ext4: switch to ->iterate_shared()
    hfs: switch to ->iterate_shared()
    hfsplus: switch to ->iterate_shared()
    hostfs: switch to ->iterate_shared()
    hpfs: switch to ->iterate_shared()
    hpfs: handle allocation failures in hpfs_add_pos()
    gfs2: switch to ->iterate_shared()
    f2fs: switch to ->iterate_shared()
    afs: switch to ->iterate_shared()
    befs: switch to ->iterate_shared()
    befs: constify stuff a bit
    isofs: switch to ->iterate_shared()
    get_acorn_filename(): deobfuscate a bit
    btrfs: switch to ->iterate_shared()
    logfs: no need to lock directory in lseek
    switch ecryptfs to ->iterate_shared
    9p: switch to ->iterate_shared()
    fat: switch to ->iterate_shared()
    romfs, squashfs: switch to ->iterate_shared()
    more trivial ->iterate_shared conversions
    ...

    Linus Torvalds
     

23 Apr, 2016

1 commit

  • security_settime() uses a timespec, which is not year 2038 safe
    on 32bit systems. Thus this patch introduces the security_settime64()
    function with timespec64 type. We also convert the cap_settime() helper
    function to use the 64bit types.

    This patch then moves security_settime() to the header file as an
    inline helper function so that existing users can be iteratively
    converted.

    None of the existing hooks uses the timespec argument, and therefore
    the patch does not make any functional changes.
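
    The wrapper pattern described above might look roughly like this outside the kernel. All demo_* names are hypothetical, and the 32-bit seconds field only models what a timespec looks like on a 32-bit system; the real hooks take additional arguments.

```c
#include <stdint.h>

/* Models a timespec on a 32-bit system: seconds overflow in 2038. */
struct demo_timespec32 {
    int32_t tv_sec;
    long    tv_nsec;
};

/* Models timespec64: seconds are 64 bits wide everywhere. */
struct demo_timespec64 {
    int64_t tv_sec;
    long    tv_nsec;
};

/* Records the last value seen, so the effect of a call can be observed. */
static int64_t demo_last_sec;

/* The "real" hook entry point now takes the 64-bit type. */
static int demo_settime64(const struct demo_timespec64 *ts)
{
    demo_last_sec = ts->tv_sec;
    return 0;
}

/* The old entry point becomes a thin inline wrapper around the new one. */
static inline int demo_settime(const struct demo_timespec32 *ts)
{
    struct demo_timespec64 ts64 = { ts->tv_sec, ts->tv_nsec };

    return demo_settime64(&ts64);
}
```

    Existing callers keep compiling against the old name, while converted callers can pass times past the 32-bit limit directly to the 64-bit variant.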

    Cc: Serge Hallyn
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Paul Moore
    Cc: Stephen Smalley
    Cc: Kees Cook
    Cc: Prarit Bhargava
    Cc: Richard Cochran
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Reviewed-by: James Morris
    Signed-off-by: Baolin Wang
    [jstultz: Reworded commit message]
    Signed-off-by: John Stultz

    Baolin Wang
     

11 Apr, 2016

1 commit


21 Jan, 2016

1 commit

  • By checking the effective credentials instead of the real UID / permitted
    capabilities, ensure that the calling process actually intended to use its
    credentials.

    To ensure that all ptrace checks use the correct caller credentials (e.g.
    in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
    flag), use two new flags and require one of them to be set.

    The problem was that when a privileged task had temporarily dropped its
    privileges, e.g. by calling setreuid(0, user_uid), with the intent to
    perform following syscalls with the credentials of a user, it still passed
    ptrace access checks that the user would not be able to pass.
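
    A simplified model of the credential selection is sketched below. It is not the kernel's check: the flag values and demo_* names are made up for illustration, and the capability override and LSM hooks of the real access check are left out.

```c
#include <stdbool.h>

/* Hypothetical flag values; exactly one of the two must be passed. */
#define DEMO_MODE_FSCREDS   0x08u   /* check with filesystem (effective) IDs */
#define DEMO_MODE_REALCREDS 0x10u   /* check with real IDs */

struct demo_cred {
    unsigned uid, euid, suid, fsuid;
};

static bool demo_ptrace_may_access(const struct demo_cred *caller,
                                   const struct demo_cred *target,
                                   unsigned mode)
{
    bool fs = (mode & DEMO_MODE_FSCREDS) != 0;
    bool real = (mode & DEMO_MODE_REALCREDS) != 0;
    unsigned caller_uid;

    /* Reject callers that pass neither flag or both flags. */
    if (fs == real)
        return false;

    caller_uid = fs ? caller->fsuid : caller->uid;

    /* The caller must match all of the target's user IDs. */
    return caller_uid == target->uid &&
           caller_uid == target->euid &&
           caller_uid == target->suid;
}
```

    With a caller that has real uid 0 but has dropped its effective and filesystem uid to an ordinary user, the REALCREDS check against a root-owned target still passes while the FSCREDS check fails, which is the distinction the procfs users need.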

    While an attacker should not be able to convince the privileged task to
    perform a ptrace() syscall, this is a problem because the ptrace access
    check is reused for things in procfs.

    In particular, the following somewhat interesting procfs entries only rely
    on ptrace access checks:

    /proc/$pid/stat - uses the check for determining whether pointers
    should be visible, useful for bypassing ASLR
    /proc/$pid/maps - also useful for bypassing ASLR
    /proc/$pid/cwd - useful for gaining access to restricted
    directories that contain files with lax permissions, e.g. in
    this scenario:
    lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
    drwx------ root root /root
    drwxr-xr-x root root /root/foobar
    -rw-r--r-- root root /root/foobar/secret

    Therefore, on a system where a root-owned mode 6755 binary changes its
    effective credentials as described and then dumps a user-specified file,
    this could be used by an attacker to reveal the memory layout of root's
    processes or reveal the contents of files he is not allowed to access
    (through /proc/$pid/cwd).

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Casey Schaufler
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Andy Shevchenko
    Cc: Andy Lutomirski
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     

05 Sep, 2015

2 commits

  • Per Andrew Morgan's request, add a securebit to allow admins to disable
    PR_CAP_AMBIENT_RAISE. This securebit will prevent processes from adding
    capabilities to their ambient set.

    For simplicity, this disables PR_CAP_AMBIENT_RAISE entirely rather than
    just disabling setting previously cleared bits.
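
    The effect of the securebit on raising ambient capabilities can be modelled as below. This is a sketch with hypothetical names (struct demo_caps, demo_ambient_raise()); a plain boolean stands in for the securebit, and 64-bit masks stand in for the kernel's capability sets.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_caps {
    uint64_t permitted;
    uint64_t inheritable;
    uint64_t ambient;
    bool     no_ambient_raise;   /* models the new securebit */
};

/* Models PR_CAP_AMBIENT_RAISE for capability number cap. */
static int demo_ambient_raise(struct demo_caps *c, unsigned cap)
{
    uint64_t bit = UINT64_C(1) << cap;

    /* The securebit blocks every raise, even of a bit that is already set. */
    if (c->no_ambient_raise)
        return -EPERM;

    /* A capability may enter the ambient set only if it is both
     * permitted and inheritable. */
    if (!(c->permitted & bit) || !(c->inheritable & bit))
        return -EPERM;

    c->ambient |= bit;
    return 0;
}
```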

    Signed-off-by: Andy Lutomirski
    Acked-by: Andrew G. Morgan
    Acked-by: Serge Hallyn
    Cc: Kees Cook
    Cc: Christoph Lameter
    Cc: Serge Hallyn
    Cc: Jonathan Corbet
    Cc: Aaron Jones
    Cc: Ted Ts'o
    Cc: Andrew G. Morgan
    Cc: Mimi Zohar
    Cc: Austin S Hemmelgarn
    Cc: Markku Savela
    Cc: Jarkko Sakkinen
    Cc: Michael Kerrisk
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Credit where credit is due: this idea comes from Christoph Lameter with
    a lot of valuable input from Serge Hallyn. This patch is heavily based
    on Christoph's patch.

    ===== The status quo =====

    On Linux, there are a number of capabilities defined by the kernel. To
    perform various privileged tasks, processes can wield capabilities that
    they hold.

    Each task has four capability masks: effective (pE), permitted (pP),
    inheritable (pI), and a bounding set (X). When the kernel checks for a
    capability, it checks pE. The other capability masks serve to modify
    what capabilities can be in pE.

    Any task can remove capabilities from pE, pP, or pI at any time. If a
    task has a capability in pP, it can add that capability to pE and/or pI.
    If a task has CAP_SETPCAP, then it can add any capability to pI, and it
    can remove capabilities from X.

    Tasks are not the only things that can have capabilities; files can also
    have capabilities. A file can have no capability information at all [1].
    If a file has capability information, then it has a permitted mask (fP)
    and an inheritable mask (fI) as well as a single effective bit (fE) [2].
    File capabilities modify the capabilities of tasks that execve(2) them.

    A task that successfully calls execve has its capabilities modified for
    the file ultimately being executed (i.e. the binary itself if that
    binary is ELF or for the interpreter if the binary is a script.) [3] In
    the capability evolution rules, for each mask Z, pZ represents the old
    value and pZ' represents the new value. The rules are:

    pP' = (X & fP) | (pI & fI)
    pI' = pI
    pE' = (fE ? pP' : 0)
    X is unchanged

    For setuid binaries, fP, fI, and fE are modified by a moderately
    complicated set of rules that emulate POSIX behavior. Similarly, if
    euid == 0 or ruid == 0, then fP, fI, and fE are modified differently
    (primarily, fP and fI usually end up being the full set). For nonroot
    users executing binaries with neither setuid nor file caps, fI and fP
    are empty and fE is false.

    As an extra complication, if you execute a process as nonroot and fE is
    set, then the "secure exec" rules are in effect: AT_SECURE gets set,
    LD_PRELOAD doesn't work, etc.

    This is rather messy. We've learned that making any changes is
    dangerous, though: if a new kernel version allows an unprivileged
    program to change its security state in a way that persists across
    execution of a setuid program or a program with file caps, this
    persistent state is surprisingly likely to allow setuid or file-capped
    programs to be exploited for privilege escalation.

    ===== The problem =====

    Capability inheritance is basically useless.

    If you aren't root and you execute an ordinary binary, fI is zero, so
    your capabilities have no effect whatsoever on pP'. This means that you
    can't usefully execute a helper process or a shell command with elevated
    capabilities if you aren't root.

    On current kernels, you can sort of work around this by setting fI to
    the full set for most or all non-setuid executable files. This causes
    pP' = pI for nonroot, and inheritance works. No one does this because
    it's a PITA and it isn't even supported on most filesystems.

    If you try this, you'll discover that every nonroot program ends up with
    secure exec rules, breaking many things.

    This is a problem that has bitten many people who have tried to use
    capabilities for anything useful.

    ===== The proposed change =====

    This patch adds a fifth capability mask called the ambient mask (pA).
    pA does what most people expect pI to do.

    pA obeys the invariant that no bit can ever be set in pA if it is not
    set in both pP and pI. Dropping a bit from pP or pI drops that bit from
    pA. This ensures that existing programs that try to drop capabilities
    still do so, with a complication. Because capability inheritance is so
    broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and
    then calling execve effectively drops capabilities. Therefore,
    setresuid from root to nonroot conditionally clears pA unless
    SECBIT_NO_SETUID_FIXUP is set. Processes that don't like this can
    re-add bits to pA afterwards.
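
    The invariant can be illustrated with a small model in which every drop from pP or pI is followed by trimming pA. The names are hypothetical and the masks are plain 64-bit integers; the setresuid handling mentioned above is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

struct demo_caps {
    uint64_t permitted;     /* pP */
    uint64_t inheritable;   /* pI */
    uint64_t ambient;       /* pA */
};

/* pA may only contain bits that are set in both pP and pI. */
static bool demo_invariant_holds(const struct demo_caps *c)
{
    return (c->ambient & ~(c->permitted & c->inheritable)) == 0;
}

/* Dropping bits from pP also drops them from pA. */
static void demo_drop_permitted(struct demo_caps *c, uint64_t drop)
{
    c->permitted &= ~drop;
    c->ambient &= c->permitted & c->inheritable;
}

/* Dropping bits from pI also drops them from pA. */
static void demo_drop_inheritable(struct demo_caps *c, uint64_t drop)
{
    c->inheritable &= ~drop;
    c->ambient &= c->permitted & c->inheritable;
}
```

    A program written before ambient capabilities existed only knows how to clear pP and pI, and in this model that is enough to clear pA as well.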

    The capability evolution rules are changed:

    pA' = (file caps or setuid or setgid ? 0 : pA)
    pP' = (X & fP) | (pI & fI) | pA'
    pI' = pI
    pE' = (fE ? pP' : pA')
    X is unchanged
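
    The rules above translate almost directly into code. The following sketch uses hypothetical names (struct demo_task, struct demo_file, demo_exec()) and collapses "file caps or setuid or setgid" into a single flag; with pA equal to zero it reduces to the older rules listed under the status quo.

```c
#include <stdbool.h>
#include <stdint.h>

struct demo_task {
    uint64_t pP, pI, pE, pA, X;
};

struct demo_file {
    uint64_t fP, fI;
    bool     fE;
    bool     privileged;   /* has file caps, or is setuid or setgid */
};

/* Returns the task's capability state after a successful execve. */
static struct demo_task demo_exec(struct demo_task old, struct demo_file f)
{
    struct demo_task n = old;   /* pI and X carry over unchanged */

    n.pA = f.privileged ? 0 : old.pA;
    n.pP = (old.X & f.fP) | (old.pI & f.fI) | n.pA;
    n.pE = f.fE ? n.pP : n.pA;
    return n;
}
```

    For a nonroot task running an ordinary binary (fP and fI empty, fE false), the ambient bits are the only ones that survive into pP' and pE'.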

    If you are nonroot but you have a capability, you can add it to pA. If
    you do so, your children get that capability in pA, pP, and pE. For
    example, you can set pA = CAP_NET_BIND_SERVICE, and your children can
    automatically bind low-numbered ports. Hallelujah!

    Unprivileged users can create user namespaces, map themselves to a
    nonzero uid, and create both privileged (relative to their namespace)
    and unprivileged process trees. This is currently more or less
    impossible. Hallelujah!

    You cannot use pA to try to subvert a setuid, setgid, or file-capped
    program: if you execute any such program, pA gets cleared and the
    resulting evolution rules are unchanged by this patch.

    Users with nonzero pA are unlikely to unintentionally leak that
    capability. If they run programs that try to drop privileges, dropping
    privileges will still work.

    It's worth noting that the degree of paranoia in this patch could
    possibly be reduced without causing serious problems. Specifically, if
    we allowed pA to persist across executing non-pA-aware setuid binaries
    and across setresuid, then, naively, the only capabilities that could
    leak as a result would be the capabilities in pA, and any attacker
    *already* has those capabilities. This would make me nervous, though --
    setuid binaries that tried to privilege-separate might fail to do so,
    and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have
    unexpected side effects. (Whether these unexpected side effects would
    be exploitable is an open question.) I've therefore taken the more
    paranoid route. We can revisit this later.

    An alternative would be to require PR_SET_NO_NEW_PRIVS before setting
    ambient capabilities. I think that this would be annoying and would
    make granting otherwise unprivileged users minor ambient capabilities
    (CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than
    it is with this patch.

    ===== Footnotes =====

    [1] Files that are missing the "security.capability" xattr or that have
    unrecognized values for that xattr end up with has_cap set to false.
    The code that does that appears to be complicated for no good reason.

    [2] The libcap capability mask parsers and formatters are dangerously
    misleading and the documentation is flat-out wrong. fE is *not* a mask;
    it's a single bit. This has probably confused every single person who
    has tried to use file capabilities.

    [3] Linux very confusingly processes both the script and the interpreter
    if applicable, for reasons that elude me. The results from thinking
    about a script's file capabilities and/or setuid bits are mostly
    discarded.

    Preliminary userspace code is here, but it needs updating:
    https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=7f5afbd175d2

    Here is a test program that can be used to verify the functionality
    (from Christoph):

    /*
    * Test program for the ambient capabilities. This program spawns a shell
    * that allows running processes with a defined set of capabilities.
    *
    * (C) 2015 Christoph Lameter
    * Released under: GPL v3 or later.
    *
    *
    * Compile using:
    *
    * gcc -o ambient_test ambient_test.o -lcap-ng
    *
    * This program must have the following capabilities to run properly:
    * Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
    *
    * A command to equip the binary with the right caps is:
    *
    * setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test
    *
    *
    * To get a shell with additional caps that can be inherited by other processes:
    *
    * ./ambient_test /bin/bash
    *
    *
    * Verifying that it works:
    *
    * From the bash spawned by ambient_test run
    *
    * cat /proc/$$/status
    *
    * and have a look at the capabilities.
    */

    #include <stdlib.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <cap-ng.h>
    #include <sys/prctl.h>
    #include <linux/capability.h>

    /*
    * Definitions from the kernel header files. These are going to be removed
    * when the /usr/include files have these defined.
    */
    #define PR_CAP_AMBIENT 47
    #define PR_CAP_AMBIENT_IS_SET 1
    #define PR_CAP_AMBIENT_RAISE 2
    #define PR_CAP_AMBIENT_LOWER 3
    #define PR_CAP_AMBIENT_CLEAR_ALL 4

    static void set_ambient_cap(int cap)
    {
        int rc;

        capng_get_caps_process();
        rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap);
        if (rc) {
            printf("Cannot add inheritable cap\n");
            exit(2);
        }
        capng_apply(CAPNG_SELECT_CAPS);

        /* Note the two 0s at the end. Kernel checks for these */
        if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) {
            perror("Cannot set cap");
            exit(1);
        }
    }

    int main(int argc, char **argv)
    {
        int rc;

        set_ambient_cap(CAP_NET_RAW);
        set_ambient_cap(CAP_NET_ADMIN);
        set_ambient_cap(CAP_SYS_NICE);

        printf("Ambient_test forking shell\n");
        if (execv(argv[1], argv + 1))
            perror("Cannot exec");

        return 0;
    }

    Signed-off-by: Christoph Lameter # Original author
    Signed-off-by: Andy Lutomirski
    Acked-by: Serge E. Hallyn
    Acked-by: Kees Cook
    Cc: Jonathan Corbet
    Cc: Aaron Jones
    Cc: Ted Ts'o
    Cc: Andrew G. Morgan
    Cc: Mimi Zohar
    Cc: Austin S Hemmelgarn
    Cc: Markku Savela
    Cc: Jarkko Sakkinen
    Cc: Michael Kerrisk
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

12 May, 2015

1 commit

  • Instead of using a vector of security operations
    with explicit, special case stacking of the capability
    and yama hooks use lists of hooks with capability and
    yama hooks included as appropriate.

    The security_operations structure is no longer required.
    Instead, there is a union of the function pointers that
    allows all the hooks lists to use a common mechanism for
    list management while retaining typing. Each module
    supplies an array describing the hooks it provides instead
    of a sparsely populated security_operations structure.
    The description includes the element that gets put on
    the hook list, avoiding the issues surrounding individual
    element allocation.

    The method for registering security modules is changed to
    reflect the information available. The method for removing
    a module, currently only used by SELinux, has also changed.
    It should be generic now; however, any potential race
    conditions arising from the ordering of hook removal need
    to be addressed by the calling module.

    The security hooks are called from the lists and the first
    failure is returned.
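
    A rough user-space model of the list-based dispatch follows. It is a simplification under stated assumptions: there is a single hook type instead of a union of function pointers, one list instead of one list per hook, and all demo_* names (including the two sample modules) are invented for illustration.

```c
#include <stddef.h>

typedef int (*demo_hook_fn)(int arg);

/* Each module supplies an array of these; the list linkage is embedded,
 * so registering a hook needs no separate allocation. */
struct demo_hook_entry {
    demo_hook_fn fn;
    struct demo_hook_entry *next;
};

static struct demo_hook_entry *demo_hook_head;
static int demo_calls;   /* counts hook invocations, for observation only */

/* Append a module's hooks to the list, preserving registration order. */
static void demo_add_hooks(struct demo_hook_entry *hooks, size_t count)
{
    struct demo_hook_entry **tail = &demo_hook_head;

    while (*tail)
        tail = &(*tail)->next;
    for (size_t i = 0; i < count; i++) {
        hooks[i].next = NULL;
        *tail = &hooks[i];
        tail = &hooks[i].next;
    }
}

/* Call every hook in order and return the first failure, if any. */
static int demo_call_hooks(int arg)
{
    for (struct demo_hook_entry *h = demo_hook_head; h; h = h->next) {
        int rc = h->fn(arg);

        if (rc != 0)
            return rc;
    }
    return 0;
}

/* Two sample "modules", each describing its hooks in an array. */
static int demo_cap_check(int arg)  { demo_calls++; return arg < 0 ? -1 : 0; }
static int demo_yama_check(int arg) { demo_calls++; return arg > 100 ? -13 : 0; }

static struct demo_hook_entry demo_cap_hooks[]  = { { demo_cap_check, NULL } };
static struct demo_hook_entry demo_yama_hooks[] = { { demo_yama_check, NULL } };
```

    Registering demo_cap_hooks before demo_yama_hooks makes the capability check run first; once it fails, later hooks on the list are not called.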

    Signed-off-by: Casey Schaufler
    Acked-by: John Johansen
    Acked-by: Kees Cook
    Acked-by: Paul Moore
    Acked-by: Stephen Smalley
    Acked-by: Tetsuo Handa
    Signed-off-by: James Morris

    Casey Schaufler
     

16 Apr, 2015

1 commit


26 Jan, 2015

1 commit


20 Nov, 2014

1 commit


24 Jul, 2014

2 commits

  • This is effectively a revert of 7b9a7ec565505699f503b4fcf61500dceb36e744
    plus fixing it a different way...

    We found, when trying to run an application from an application which
    had dropped privs, that the kernel does security checks on undefined
    capability bits. This was ESPECIALLY difficult to debug as those
    undefined bits are hidden from /proc/$PID/status.

    Consider a root application which drops all capabilities from ALL 4
    capability sets. We assume, since the application is going to set
    eff/perm/inh from an array that it will clear not only the defined caps
    less than CAP_LAST_CAP, but also the higher 28ish bits which are
    undefined future capabilities.

    The BSET gets cleared differently. Instead it is cleared one bit at a
    time. The problem here is that in security/commoncap.c::cap_task_prctl()
    we actually check the validity of a capability being read. So any task
    which attempts to 'read all things set in bset' followed by 'unset all
    things set in bset' will not even attempt to unset the undefined bits
    higher than CAP_LAST_CAP.

    So the 'parent' will look something like:
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000
    CapBnd: ffffffc000000000

    All of this 'should' be fine. Given that these are undefined bits that
    aren't supposed to have anything to do with permissions. But they do...

    So let's now consider a task which cleared the eff/perm/inh completely
    and cleared all of the valid caps in the bset (but not the invalid caps
    it couldn't read out of the kernel). We know that this is exactly what
    the libcap-ng library does and what the go capabilities library does.
    They both leave you in that above situation if you try to clear all of
    your capabilities from all 4 sets. If that root task calls execve()
    the child task will pick up all caps not blocked by the bset. The bset
    however does not block bits higher than CAP_LAST_CAP. So now the child
    task has bits in eff which are not in the parent. These are
    'meaningless' undefined bits, but still bits which the parent doesn't
    have.

    The problem is now in cred_cap_issubset() (or any operation which does a
    subset test) as the child, while a subset for valid cap bits, is not a
    subset for invalid cap bits! So now we set during commit creds that
    the child is not dumpable. Given it is 'more priv' than its parent. It
    also means the parent cannot ptrace the child and other stupidity.

    The solution here:
    1) stop hiding capability bits in status
    This makes debugging easier!

    2) stop giving any task undefined capability bits. It's simple: if you
    don't put those invalid bits in CAP_FULL_SET you won't get them in init
    and you won't get them in any other task either.
    This fixes the cap_issubset() tests and resulting fallout (which
    made the init task in a docker container untraceable among other
    things)

    3) mask out undefined bits when sys_capset() is called as it might use
    ~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
    This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.

    4) mask out undefined bits when we read a file capability off of disk as
    again likely all bits are set in the xattr for forward/backward
    compatibility.
    This lets 'setcap all+pe /bin/bash; /bin/bash' run
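
    The subset failure and the masking fix can be reproduced with plain 64-bit masks. In this sketch the highest defined capability number is assumed to be 37, which is inferred from the CapBnd value shown earlier; the demo_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed highest defined capability number (see lead-in). */
#define DEMO_LAST_CAP   37
#define DEMO_VALID_MASK ((UINT64_C(1) << (DEMO_LAST_CAP + 1)) - 1)

/* True if every bit in a is also set in set. */
static bool demo_cap_issubset(uint64_t a, uint64_t set)
{
    return (a & ~set) == 0;
}

/* Simplified exec for root: the new permitted set is X & fP. */
static uint64_t demo_exec_permitted(uint64_t bset, uint64_t file_permitted)
{
    return bset & file_permitted;
}

/* The fix: strip undefined bits wherever a full mask may enter. */
static uint64_t demo_mask_valid(uint64_t caps)
{
    return caps & DEMO_VALID_MASK;
}
```

    Before the fix, a parent with empty sets and a bounding set of only undefined bits execs a child whose permitted set contains those bits, so the child is not a subset of the parent. With the bounding set and the file mask both limited to DEMO_VALID_MASK, the child ends up with an empty set and the subset test passes.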

    Signed-off-by: Eric Paris
    Reviewed-by: Kees Cook
    Cc: Andrew Vagin
    Cc: Andrew G. Morgan
    Cc: Serge E. Hallyn
    Cc: Kees Cook
    Cc: Steve Grubb
    Cc: Dan Walsh
    Cc: stable@vger.kernel.org
    Signed-off-by: James Morris

    Eric Paris
     
  • In function cap_task_prctl(), we would allocate a credential
    unconditionally and then check if we support the requested function.
    If not, we would release this credential with abort_creds(), which
    frees it via RCU. But on some archs such as powerpc, sys_prctl is heavily
    used to get/set the floating point exception mode. So the unnecessary
    allocating/releasing of credentials not only introduces runtime overhead
    but can also cause OOM due to the RCU implementation.

    This patch removes abort_creds() from cap_task_prctl() by calling
    prepare_creds() only when we need to modify it.
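
    The shape of the change can be shown with a toy dispatcher that allocates only on the paths that modify state. Everything here is hypothetical: the option numbers, the demo_* names, and the use of malloc()/free() in place of prepare_creds() and commit_creds().

```c
#include <errno.h>
#include <stdlib.h>

struct demo_cred {
    unsigned securebits;
};

static struct demo_cred demo_current;
static int demo_alloc_count;   /* how many credential copies were made */

enum { DEMO_GET_SECUREBITS = 1, DEMO_SET_SECUREBITS = 2 };

static int demo_task_prctl(int option, unsigned long arg)
{
    struct demo_cred *new;

    switch (option) {
    case DEMO_GET_SECUREBITS:
        /* Read-only path: no copy is needed. */
        return (int)demo_current.securebits;

    case DEMO_SET_SECUREBITS:
        /* Only a modifying path pays for the allocation. */
        new = malloc(sizeof(*new));
        if (!new)
            return -ENOMEM;
        demo_alloc_count++;
        *new = demo_current;
        new->securebits = (unsigned)arg;
        demo_current = *new;   /* stands in for committing the new creds */
        free(new);
        return 0;

    default:
        /* Unsupported option: return without having allocated anything. */
        return -ENOSYS;
    }
}
```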

    Reported-by: Kevin Hao
    Signed-off-by: Tetsuo Handa
    Reviewed-by: Paul Moore
    Acked-by: Serge E. Hallyn
    Reviewed-by: Kees Cook
    Signed-off-by: James Morris

    Tetsuo Handa
     

31 Aug, 2013

2 commits

    We allow task A to change B's nice level if it has a superset of
    B's privileges, or if it has CAP_SYS_NICE. Also allow it if A has
    CAP_SYS_NICE with respect to B - meaning it is root in the same
    namespace, or it created B's namespace.

    Signed-off-by: Serge Hallyn
    Reviewed-by: "Eric W. Biederman"
    Signed-off-by: Eric W. Biederman

    Serge Hallyn
     
    As the capabilities and capability bounding set are per user namespace
    properties it is safe to allow changing them with just CAP_SETPCAP
    permission in the user namespace.

    Acked-by: Serge Hallyn
    Tested-by: Richard Weinberger
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

26 Feb, 2013

1 commit


15 Dec, 2012

1 commit

  • Andy Lutomirski pointed out that the current behavior of allowing the
    owner of a user namespace to have all caps when that owner is not in a
    parent user namespace is wrong. Add a test to ensure the owner of a user
    namespace is in the parent of the user namespace to fix this bug.

    Thankfully this bug did not apply to the initial user namespace, keeping
    the mischief that can be caused by this bug quite small.

    This bug was introduced in v3.5 by commit 783291e6900
    "Simplify the user_namespace by making userns->creator a kuid."
    but did not matter until the permissions required to create
    a user namespace were relaxed, allowing a user namespace to be created
    inside of a user namespace.

    The bug made it possible for the owner of a user namespace to be
    present in a child user namespace. Since the owner of a user namespace
    is granted all capabilities it became possible for users in a
    grandchild user namespace to have all privileges over their parent user
    namespace.

    Reorder the checks in cap_capable. This should make the common case
    faster and make it clear that nothing magic happens in the initial
    user namespace. The reordering is safe because cred->user_ns
    can only be in targ_ns or targ_ns->parent but not both.

    Add a comment at the top of the loop to make the logic of
    the code clear.

    Add a distinct variable ns that changes as we walk up
    the user namespace hierarchy to make it clear which variable
    is changing.
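
    The reordered walk might be modelled as below. This is a sketch rather than the kernel function: struct demo_ns and struct demo_cred are hypothetical, a namespace with a NULL parent plays the role of the initial user namespace, and owner IDs are compared as plain integers.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct demo_ns {
    const struct demo_ns *parent;   /* NULL for the initial namespace */
    unsigned owner;                 /* uid that created this namespace */
};

struct demo_cred {
    const struct demo_ns *user_ns;
    unsigned euid;
    uint64_t effective;
};

/* Returns 0 if cred has capability cap with respect to targ_ns. */
static int demo_cap_capable(const struct demo_cred *cred,
                            const struct demo_ns *targ_ns, unsigned cap)
{
    /* ns walks from the target namespace up towards the initial one. */
    const struct demo_ns *ns = targ_ns;

    for (;;) {
        /* Same namespace: the effective set decides. */
        if (ns == cred->user_ns)
            return ((cred->effective >> cap) & 1) ? 0 : -EPERM;

        /* Reached the top without meeting the caller's namespace. */
        if (ns->parent == NULL)
            return -EPERM;

        /* The owner gets all capabilities, but only if it lives in
         * the parent of the namespace it owns. */
        if (ns->parent == cred->user_ns && ns->owner == cred->euid)
            return 0;

        ns = ns->parent;
    }
}
```

    The parent check in the third test is the fix described in this commit: a task whose euid matches the owner but which sits in a grandchild namespace no longer gains capabilities over the namespace above it.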

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

01 Jun, 2012

2 commits


24 May, 2012

1 commit

  • Pull user namespace enhancements from Eric Biederman:
    "This is a course correction for the user namespace, so that we can
    reach an inexpensive, maintainable, and reasonably complete
    implementation.

    Highlights:
    - Config guards make it impossible to enable the user namespace and
    code that has not been converted to be user namespace safe.

    - Use of the new kuid_t type ensures that if you somehow get past the
    config guards the kernel will encounter type errors if you enable
    user namespaces and attempt to compile in code whose permission
    checks have not been updated to be user namespace safe.

    - All uids from child user namespaces are mapped into the initial
    user namespace before they are processed. Removing the need to add
    an additional check to see if the user namespace of the compared
    uids remains the same.

    - With the user namespaces compiled out the performance is as good or
    better than it is today.

    - For most operations absolutely nothing changes, in performance or
    in operation, with the user namespace enabled.

    - The worst case performance I could come up with was timing 1
    billion cache cold stat operations with the user namespace code
    enabled. This went from 156s to 164s on my laptop (or 156ns to
    164ns per stat operation).

    - (uid_t)-1 and (gid_t)-1 are reserved as internal error values.
    Most uid/gid setting system calls treat these values specially
    anyway so attempting to use -1 as a uid would likely cause
    entertaining failures in userspace.

    - If setuid is called with a uid that can not be mapped setuid fails.
    I have looked at sendmail, login, ssh and every other program I
    could think of that would call setuid and they all check for and
    handle the case where setuid fails.

    - If stat or a similar system call is called from a context in which
    we can not map a uid we lie and return overflowuid. The LFS
    experience suggests not lying and returning an error code might be
    better, but the historical precedent with uids is different and I
    can not think of anything that would break by lying about a uid we
    can't map.

    - Capabilities are localized to the current user namespace making it
    safe to give the initial user in a user namespace all capabilities.

    My git tree covers all of the modifications needed to convert the core
    kernel and enough changes to make a system bootable to runlevel 1."

    Fix up trivial conflicts due to nearby independent changes in fs/stat.c

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
    userns: Silence silly gcc warning.
    cred: use correct cred accessor with regards to rcu read lock
    userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
    userns: Convert cgroup permission checks to use uid_eq
    userns: Convert tmpfs to use kuid and kgid where appropriate
    userns: Convert sysfs to use kgid/kuid where appropriate
    userns: Convert sysctl permission checks to use kuid and kgids.
    userns: Convert proc to use kuid/kgid where appropriate
    userns: Convert ext4 to user kuid/kgid where appropriate
    userns: Convert ext3 to use kuid/kgid where appropriate
    userns: Convert ext2 to use kuid/kgid where appropriate.
    userns: Convert devpts to use kuid/kgid where appropriate
    userns: Convert binary formats to use kuid/kgid where appropriate
    userns: Add negative depends on entries to avoid building code that is userns unsafe
    userns: signal remove unnecessary map_cred_ns
    userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
    userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
    userns: Convert stat to return values mapped from kuids and kgids
    userns: Convert user specfied uids and gids in chown into kuids and kgid
    userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
    ...

    Linus Torvalds
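
The kuid_t and overflowuid highlights in the pull message above can be illustrated with a small userspace model. This is a sketch under stated assumptions: the single-extent map, the function signatures, and the constant 65534 for the overflow uid are simplifications chosen here, and the helper names only echo the kernel's naming rather than reproduce its API.

```c
#include <stdbool.h>

/* Wrapping the value in a struct makes kuid_t a distinct type, so code
 * that still compares it against a raw integer fails to compile. */
typedef struct { unsigned int val; } kuid_t;

#define INVALID_UID_VAL ((unsigned int)-1)  /* reserved internal error value */
#define OVERFLOW_UID    65534u              /* assumed default overflowuid */

/* One contiguous mapping range (assumed model of a namespace's uid map). */
struct uid_extent {
    unsigned int first;        /* first uid inside the namespace */
    unsigned int lower_first;  /* matching first uid in the initial namespace */
    unsigned int count;        /* length of the range */
};

static bool uid_eq(kuid_t a, kuid_t b) { return a.val == b.val; }
static bool uid_valid(kuid_t k) { return k.val != INVALID_UID_VAL; }

/* Map a namespace-local uid into the initial namespace. */
static kuid_t make_kuid(const struct uid_extent *map, unsigned int uid)
{
    kuid_t k = { INVALID_UID_VAL };

    if (uid >= map->first && uid - map->first < map->count)
        k.val = map->lower_first + (uid - map->first);
    return k;
}

/* Map back for reporting (e.g. stat); unmappable ids become OVERFLOW_UID. */
static unsigned int from_kuid_munged(const struct uid_extent *map, kuid_t k)
{
    if (uid_valid(k) && k.val >= map->lower_first &&
        k.val - map->lower_first < map->count)
        return map->first + (k.val - map->lower_first);
    return OVERFLOW_UID;
}
```

In this model a setuid-style caller would check uid_valid() on the result of make_kuid() and fail if the uid cannot be mapped, while a stat-style caller would report whatever from_kuid_munged() returns.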
     

04 May, 2012

1 commit


03 May, 2012

2 commits


26 Apr, 2012

1 commit

  • - Transform userns->creator from a user_struct reference to a simple
    kuid_t, kgid_t pair.

    In cap_capable this allows the check to see if we are the creator of
    a namespace to become the classic suser style euid permission check.

    This allows us to remove the need for a struct cred in the mapping
    functions and still be able to display the user namespace creator's
    uid and gid as 0.

    - Remove the now unnecessary delayed_work in free_user_ns.

    All that is left for free_user_ns to do is to call kmem_cache_free
    and put_user_ns. Those functions can be called in any context
    so call them directly from free_user_ns removing the need for delayed work.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

19 Apr, 2012

1 commit

  • Add missing "personality.h"
    security/commoncap.c: In function 'cap_bprm_set_creds':
    security/commoncap.c:510: error: 'PER_CLEAR_ON_SETID' undeclared (first use in this function)
    security/commoncap.c:510: error: (Each undeclared identifier is reported only once
    security/commoncap.c:510: error: for each function it appears in.)

    Signed-off-by: Jonghwan Choi
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    Jonghwan Choi
     

18 Apr, 2012

1 commit

  • If a process increases permissions using fcaps, all of the dangerous
    personality flags which are cleared for suid apps should also be cleared.
    Thus programs given privilege with fcaps will continue to have address space
    randomization enabled even if the parent tried to disable it to make it
    easier to attack.

    Signed-off-by: Eric Paris
    Reviewed-by: Serge Hallyn
    Signed-off-by: James Morris

    Eric Paris
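
The effect described in the commit above can be sketched as a pure function over a personality value. This is an illustrative model, not kernel code: model_exec_personality and MODEL_CLEAR_ON_SETID are hypothetical names, and the set of flags is an assumption based on the kernel's PER_CLEAR_ON_SETID definition that should be checked against the headers of the kernel in use.

```c
#include <stdbool.h>
#include <sys/personality.h>

/* Flags this sketch treats as dangerous across a privilege-gaining exec
 * (assumed to match PER_CLEAR_ON_SETID). */
#define MODEL_CLEAR_ON_SETID \
    (READ_IMPLIES_EXEC | ADDR_NO_RANDOMIZE | ADDR_COMPAT_LAYOUT | MMAP_PAGE_ZERO)

/* Returns the personality the new program would run with. Gaining
 * privilege through file capabilities is handled like a setuid gain. */
unsigned long model_exec_personality(unsigned long persona,
                                     bool setuid_gain, bool fcaps_gain)
{
    if (setuid_gain || fcaps_gain)
        persona &= ~(unsigned long)MODEL_CLEAR_ON_SETID;
    return persona;
}
```

A parent that sets ADDR_NO_RANDOMIZE before exec would, in this model, see the flag survive for an ordinary binary and be cleared for one that gains privilege through file capabilities.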
     

14 Apr, 2012

1 commit

  • With this change, calling
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
    disables privilege granting operations at execve-time. For example, a
    process will not be able to execute a setuid binary to change their uid
    or gid if this bit is set. The same is true for file capabilities.

    Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
    LSMs respect the requested behavior.

    To determine if the NO_NEW_PRIVS bit is set, a task may call
    prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
    It returns 1 if set and 0 if it is not set. If any of the arguments are
    non-zero, it will return -1 and set errno to EINVAL.
    (PR_SET_NO_NEW_PRIVS behaves similarly.)

    This functionality is desired for the proposed seccomp filter patch
    series. By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the
    system call behavior for itself and its child tasks without being
    able to impact the behavior of a more privileged task.

    Another potential use is making certain privileged operations
    unprivileged. For example, chroot may be considered "safe" if it cannot
    affect privileged tasks.

    Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
    set and AppArmor is in use. It is fixed in a subsequent patch.

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Will Drewry
    Acked-by: Eric Paris
    Acked-by: Kees Cook

    v18: updated change desc
    v17: using new define values as per 3.4
    Signed-off-by: James Morris

    Andy Lutomirski
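
The two prctl calls quoted in the commit above can be wrapped as follows. This is a minimal sketch: the wrapper names set_no_new_privs and get_no_new_privs are hypothetical, and the fallback constants are only used when the installed headers predate the feature. It assumes a Linux kernel that includes this change.

```c
#include <sys/prctl.h>

/* Fallback definitions for older userspace headers. */
#ifndef PR_SET_NO_NEW_PRIVS
#define PR_SET_NO_NEW_PRIVS 38
#endif
#ifndef PR_GET_NO_NEW_PRIVS
#define PR_GET_NO_NEW_PRIVS 39
#endif

/* Set the bit for the calling task. It is inherited by children and
 * cannot be cleared again. Returns 0 on success, -1 with errno set. */
int set_no_new_privs(void)
{
    return prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
}

/* Returns 1 if the bit is set, 0 if not, -1 with errno set on error. */
int get_no_new_privs(void)
{
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```

A sandboxing program would typically call set_no_new_privs() once, early, before installing any filter, and treat a non-zero return as fatal.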
     

08 Apr, 2012

2 commits


14 Feb, 2012

1 commit


15 Jan, 2012

1 commit

  • * 'for-linus' of git://selinuxproject.org/~jmorris/linux-security:
    capabilities: remove __cap_full_set definition
    security: remove the security_netlink_recv hook as it is equivalent to capable()
    ptrace: do not audit capability check when outputing /proc/pid/stat
    capabilities: remove task_ns_* functions
    capabitlies: ns_capable can use the cap helpers rather than lsm call
    capabilities: style only - move capable below ns_capable
    capabilites: introduce new has_ns_capabilities_noaudit
    capabilities: call has_ns_capability from has_capability
    capabilities: remove all _real_ interfaces
    capabilities: introduce security_capable_noaudit
    capabilities: reverse arguments to security_capable
    capabilities: remove the task from capable LSM hook entirely
    selinux: sparse fix: fix several warnings in the security server cod
    selinux: sparse fix: fix warnings in netlink code
    selinux: sparse fix: eliminate warnings for selinuxfs
    selinux: sparse fix: declare selinux_disable() in security.h
    selinux: sparse fix: move selinux_complete_init
    selinux: sparse fix: make selinux_secmark_refcount static
    SELinux: Fix RCU deref check warning in sel_netport_insert()

    Manually fix up a semantic mis-merge wrt security_netlink_recv():

    - the interface was removed in commit fd7784615248 ("security: remove
    the security_netlink_recv hook as it is equivalent to capable()")

    - a new user of it appeared in commit a38f7907b926 ("crypto: Add
    userspace configuration API")

    causing no automatic merge conflict, but Eric Paris pointed out the
    issue.

    Linus Torvalds
     

06 Jan, 2012

2 commits

  • Once upon a time netlink was not sync and we had to get the effective
    capabilities from the skb that was being received. Today we instead get
    the capabilities from the current task. This has rendered the entire
    purpose of the hook moot as it is now functionally equivalent to the
    capable() call.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • The capabilities framework is based around credentials, not necessarily the
    current task. Yet we still passed the current task down into LSMs from the
    security_capable() LSM hook as if it was a meaningful portion of the security
    decision. This patch removes the 'generic' passing of current and instead
    forces individual LSMs to use current explicitly if they think it is
    appropriate. In our case those LSMs are SELinux and AppArmor.

    I believe the AppArmor use of current is incorrect, but that is wholly
    unrelated to this patch. This patch does not change what AppArmor does, it
    just makes it clear in the AppArmor code that it is doing it.

    The SELinux code still uses current in its audit message, which may also be
    wrong and needs further investigation. Again this is NOT a change, it may
    have always been wrong, this patch just makes it clear what is happening.

    Signed-off-by: Eric Paris

    Eric Paris
     

16 Aug, 2011

1 commit


12 Aug, 2011

1 commit

  • A task (when !SECURE_NOROOT) which executes a setuid-root binary will
    obtain root privileges while executing that binary. If the binary also
    has effective capabilities set, then only those capabilities will be
    granted. The rationale is that the same binary can carry both setuid-root
    and the minimal file capability set, so that on a filesystem not
    supporting file caps the binary can still be executed with privilege,
    while on a filesystem supporting file caps it will run with minimal
    privilege.

    This special case currently does NOT happen if there are file capabilities
    but no effective capabilities. It should, since capability-aware programs
    can very well start with empty pE but populated pP and move those caps to
    pE when needed. In other words, if the file has file capabilities but NOT
    effective capabilities, then we should do the same thing as if there
    were file capabilities, and not grant full root privileges.

    This patchset does that.

    (Changelog by Serge Hallyn).

    Signed-off-by: Zhi Li
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    Zhi Li
     

04 Apr, 2011

1 commit

  • When the global init task is exec'd we have special case logic to make sure
    the pE is not reduced. There is no reason for this. If init wants to drop
    its pE it should be allowed to do so. Remove this special logic.

    Signed-off-by: Eric Paris
    Acked-by: Serge Hallyn
    Acked-by: David Howells
    Acked-by: Andrew G. Morgan
    Signed-off-by: James Morris

    Eric Paris
     

24 Mar, 2011

2 commits

  • ptrace is allowed to tasks in the same user namespace according to the
    usual rules (i.e. the same rules as for two tasks in the init user
    namespace). ptrace is also allowed to a user namespace to which the
    current task has the CAP_SYS_PTRACE capability.

    Changelog:
    Dec 31: Address feedback by Eric:
    . Correct ptrace uid check
    . Rename may_ptrace_ns to ptrace_capable
    . Also fix the cap_ptrace checks.
    Jan 1: Use const cred struct
    Jan 11: use task_ns_capable() in place of ptrace_capable().
    Feb 23: same_or_ancestore_user_ns() was not an appropriate
    check to constrain cap_issubset. Rather, cap_issubset()
    only is meaningful when both capsets are in the same
    user_ns.

    Signed-off-by: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • - Introduce ns_capable to test for a capability in a non-default
    user namespace.
    - Teach cap_capable to handle capabilities in a non-default
    user namespace.

    The motivation is to get to the unprivileged creation of new
    namespaces. It looks like this gets us 90% of the way there, with
    only potential uid confusion issues left.

    I still need to handle getting all caps after creation but otherwise I
    think I have a good starter patch that achieves all of your goals.

    Changelog:
    11/05/2010: [serge] add apparmor
    12/14/2010: [serge] fix capabilities to created user namespaces
    Without this, if user serge creates a user_ns, he won't have
    capabilities to the user_ns he created. This is because we
    were first checking whether his effective caps had the caps
    he needed and returning -EPERM if not, and THEN checking whether
    he was the creator. Reverse those checks.
    12/16/2010: [serge] security_real_capable needs ns argument in !security case
    01/11/2011: [serge] add task_ns_capable helper
    01/11/2011: [serge] add nsown_capable() helper per Bastian Blank suggestion
    02/16/2011: [serge] fix a logic bug: the root user is always creator of
    init_user_ns, but should not always have capabilities to
    it! Fix the check in cap_capable().
    02/21/2011: Add the required user_ns parameter to security_capable,
    fixing a compile failure.
    02/23/2011: Convert some macros to functions as per akpm comments. Some
    couldn't be converted because we can't easily forward-declare
    them (they are inline if !SECURITY, extern if SECURITY). Add
    a current_user_ns function so we can use it in capability.h
    without #including cred.h. Move all forward declarations
    together to the top of the #ifdef __KERNEL__ section, and use
    kernel-doc format.
    02/23/2011: Per dhowells, clean up comment in cap_capable().
    02/23/2011: Per akpm, remove unreachable 'return -EPERM' in cap_capable.

    (Original written and signed off by Eric; latest, modified version
    acked by him)

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: export current_user_ns() for ecryptfs]
    [serge.hallyn@canonical.com: remove unneeded extra argument in selinux's task_has_capability]
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn