14 Jan, 2020

1 commit

  • Time Namespace isolates clock values.

    The kernel provides access to several clocks CLOCK_REALTIME,
    CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.

    CLOCK_REALTIME
    System-wide clock that measures real (i.e., wall-clock) time.

    CLOCK_MONOTONIC
    Clock that cannot be set and represents monotonic time since
    some unspecified starting point.

    CLOCK_BOOTTIME
    Identical to CLOCK_MONOTONIC, except it also includes any time
    that the system is suspended.

    For many users, the time namespace means the ability to changes date and
    time in a container (CLOCK_REALTIME). Providing per namespace notions of
    CLOCK_REALTIME would be complex with a massive overhead, but has a dubious
    value.

    But in the context of checkpoint/restore functionality, monotonic and
    boottime clocks become interesting. Both clocks are monotonic with
    unspecified starting points. These clocks are widely used to measure time
    slices and set timers. After restoring or migrating processes, it has to be
    guaranteed that they never go backward. In an ideal case, the behavior of
    these clocks should be the same as for a case when a whole system is
    suspended. All this means that it is required to set CLOCK_MONOTONIC and
    CLOCK_BOOTTIME clocks, which can be achieved by adding per-namespace
    offsets for clocks.

    A time namespace is similar to a pid namespace in the way how it is
    created: unshare(CLONE_NEWTIME) system call creates a new time namespace,
    but doesn't set it to the current process. Then all children of the process
    will be born in the new time namespace, or a process can use the setns()
    system call to join a namespace.

    This scheme allows setting clock offsets for a namespace, before any
    processes appear in it.

    All available clone flags have been used, so CLONE_NEWTIME uses the highest
    bit of CSIGNAL. It means that it can be used only with the unshare() and
    the clone3() system calls.

    [ tglx: Adjusted paragraph about clone3() to reality and massaged the
    changelog a bit. ]

    Co-developed-by: Dmitry Safonov
    Signed-off-by: Andrei Vagin
    Signed-off-by: Dmitry Safonov
    Signed-off-by: Thomas Gleixner
    Link: https://criu.org/Time_namespace
    Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
    Link: https://lore.kernel.org/r/20191112012724.250792-4-dima@arista.com

    Andrei Vagin
     

27 Jun, 2019

2 commits

  • Move the user and user-session keyrings to the user_namespace struct rather
    than pinning them from the user_struct struct. This prevents these
    keyrings from propagating across user-namespaces boundaries with regard to
    the KEY_SPEC_* flags, thereby making them more useful in a containerised
    environment.

    The issue is that a single user_struct may be represent UIDs in several
    different namespaces.

    The way the patch does this is by attaching a 'register keyring' in each
    user_namespace and then sticking the user and user-session keyrings into
    that. It can then be searched to retrieve them.

    Signed-off-by: David Howells
    cc: Jann Horn

    David Howells
     
  • Keyring names are held in a single global list that any process can pick
    from by means of keyctl_join_session_keyring (provided the keyring grants
    Search permission). This isn't very container friendly, however.

    Make the following changes:

    (1) Make default session, process and thread keyring names begin with a
    '.' instead of '_'.

    (2) Keyrings whose names begin with a '.' aren't added to the list. Such
    keyrings are system specials.

    (3) Replace the global list with per-user_namespace lists. A keyring adds
    its name to the list for the user_namespace that it is currently in.

    (4) When a user_namespace is deleted, it just removes itself from the
    keyring name list.

    The global keyring_name_lock is retained for accessing the name lists.
    This allows (4) to work.

    This can be tested by:

    # keyctl newring foo @s
    995906392
    # unshare -U
    $ keyctl show
    ...
    995906392 --alswrv 65534 65534 \_ keyring: foo
    ...
    $ keyctl session foo
    Joined session keyring: 935622349

    As can be seen, a new session keyring was created.

    The capability bit KEYCTL_CAPS1_NS_KEYRING_NAME is set if the kernel is
    employing this feature.

    Signed-off-by: David Howells
    cc: Eric W. Biederman

    David Howells
     

17 Nov, 2017

1 commit

  • Pull user namespace update from Eric Biederman:
    "The only change that is production ready this round is the work to
    increase the number of uid and gid mappings a user namespace can
    support from 5 to 340.

    This code was carefully benchmarked and it was confirmed that in the
    existing cases the performance remains the same. In the worst case
    with 340 mappings an cache cold stat times go from 158ns to 248ns.
    That is noticable but still quite small, and only the people who are
    doing crazy things pay the cost.

    This work uncovered some documentation and cleanup opportunities in
    the mapping code, and patches to make those cleanups and improve the
    documentation will be coming in the next merge window"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Simplify insert_extent
    userns: Make map_id_down a wrapper for map_id_range_down
    userns: Don't read extents twice in m_start
    userns: Simplify the user and group mapping functions
    userns: Don't special case a count of 0
    userns: bump idmap limits to 340
    userns: use union in {g,u}idmap struct

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information it it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

01 Nov, 2017

2 commits

  • There are quite some use cases where users run into the current limit for
    {g,u}id mappings. Consider a user requesting us to map everything but 999, and
    1001 for a given range of 1000000000 with a sub{g,u}id layout of:

    some-user:100000:1000000000
    some-user:999:1
    some-user:1000:1
    some-user:1001:1
    some-user:1002:1

    This translates to:

    MAPPING-TYPE | CONTAINER | HOST | RANGE |
    -------------|-----------|---------|-----------|
    uid | 999 | 999 | 1 |
    uid | 1001 | 1001 | 1 |
    uid | 0 | 1000000 | 999 |
    uid | 1000 | 1001000 | 1 |
    uid | 1002 | 1001002 | 999998998 |
    ------------------------------------------------
    gid | 999 | 999 | 1 |
    gid | 1001 | 1001 | 1 |
    gid | 0 | 1000000 | 999 |
    gid | 1000 | 1001000 | 1 |
    gid | 1002 | 1001002 | 999998998 |

    which is already the current limit.

    As discussed at LPC simply bumping the number of limits is not going to work
    since this would mean that struct uid_gid_map won't fit into a single cache-line
    anymore thereby regressing performance for the base-cases. The same problem
    seems to arise when using a single pointer. So the idea is to use

    struct uid_gid_extent {
    u32 first;
    u32 lower_first;
    u32 count;
    };

    struct uid_gid_map { /* 64 bytes -- 1 cache line */
    u32 nr_extents;
    union {
    struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
    struct {
    struct uid_gid_extent *forward;
    struct uid_gid_extent *reverse;
    };
    };
    };

    For the base cases we will only use the struct uid_gid_extent extent member. If
    we go over UID_GID_MAP_MAX_BASE_EXTENTS mappings we perform a single 4k
    kmalloc() which means we can have a maximum of 340 mappings
    (340 * size(struct uid_gid_extent) = 4080). For the latter case we use two
    pointers "forward" and "reverse". The forward pointer points to an array sorted
    by "first" and the reverse pointer points to an array sorted by "lower_first".
    We can then perform binary search on those arrays.

    Performance Testing:
    When Eric introduced the extent-based struct uid_gid_map approach he measured
    the performanc impact of his idmap changes:

    > My benchmark consisted of going to single user mode where nothing else was
    > running. On an ext4 filesystem opening 1,000,000 files and looping through all
    > of the files 1000 times and calling fstat on the individuals files. This was
    > to ensure I was benchmarking stat times where the inodes were in the kernels
    > cache, but the inode values were not in the processors cache. My results:

    > v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
    > v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
    > v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)

    I used an identical approach on my laptop. Here's a thorough description of what
    I did. I built a 4.14.0-rc4 mainline kernel with my new idmap patches applied. I
    booted into single user mode and used an ext4 filesystem to open/create
    1,000,000 files. Then I looped through all of the files calling fstat() on each
    of them 1000 times and calculated the mean fstat() time for a single file. (The
    test program can be found below.)

    Here are the results. For fun, I compared the first version of my patch which
    scaled linearly with the new version of the patch:

    | # MAPPINGS | PATCH-V1 | PATCH-NEW |
    |--------------|------------|-----------|
    | 0 mappings | 158 ns | 158 ns |
    | 1 mappings | 164 ns | 157 ns |
    | 2 mappings | 170 ns | 158 ns |
    | 3 mappings | 175 ns | 161 ns |
    | 5 mappings | 187 ns | 165 ns |
    | 10 mappings | 218 ns | 199 ns |
    | 50 mappings | 528 ns | 218 ns |
    | 100 mappings | 980 ns | 229 ns |
    | 200 mappings | 1880 ns | 239 ns |
    | 300 mappings | 2760 ns | 240 ns |
    | 340 mappings | not tested | 248 ns |

    Here's the test program I used. I asked Eric what he did and this is a more
    "advanced" implementation of the idea. It's pretty straight-forward:

    #define __GNU_SOURCE
    #define __STDC_FORMAT_MACROS
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    int main(int argc, char *argv[])
    {
    int ret;
    size_t i, k;
    int fd[1000000];
    int times[1000];
    char pathname[4096];
    struct stat st;
    struct timeval t1, t2;
    uint64_t time_in_mcs;
    uint64_t sum = 0;

    if (argc != 2) {
    fprintf(stderr, "Please specify a directory where to create "
    "the test files\n");
    exit(EXIT_FAILURE);
    }

    for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
    sprintf(pathname, "%s/idmap_test_%zu", argv[1], i);
    fd[i]= open(pathname, O_RDWR | O_CREAT, S_IXUSR | S_IXGRP | S_IXOTH);
    if (fd[i] < 0) {
    ssize_t j;
    for (j = i; j >= 0; j--)
    close(fd[j]);
    exit(EXIT_FAILURE);
    }
    }

    for (k = 0; k < 1000; k++) {
    ret = gettimeofday(&t1, NULL);
    if (ret < 0)
    goto close_all;

    for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
    ret = fstat(fd[i], &st);
    if (ret < 0)
    goto close_all;
    }

    ret = gettimeofday(&t2, NULL);
    if (ret < 0)
    goto close_all;

    time_in_mcs = (1000000 * t2.tv_sec + t2.tv_usec) -
    (1000000 * t1.tv_sec + t1.tv_usec);
    printf("Total time in micro seconds: %" PRIu64 "\n",
    time_in_mcs);
    printf("Total time in nanoseconds: %" PRIu64 "\n",
    time_in_mcs * 1000);
    printf("Time per file in nanoseconds: %" PRIu64 "\n",
    (time_in_mcs * 1000) / 1000000);
    times[k] = (time_in_mcs * 1000) / 1000000;
    }

    close_all:
    for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++)
    close(fd[i]);

    if (ret < 0)
    exit(EXIT_FAILURE);

    for (k = 0; k < 1000; k++) {
    sum += times[k];
    }

    printf("Mean time per file in nanoseconds: %" PRIu64 "\n", sum / 1000);

    exit(EXIT_SUCCESS);;
    }

    Signed-off-by: Christian Brauner
    CC: Serge Hallyn
    CC: Eric Biederman
    Signed-off-by: Eric W. Biederman

    Christian Brauner
     
  • - Add a struct containing two pointer to extents and wrap both the static extent
    array and the struct into a union. This is done in preparation for bumping the
    {g,u}idmap limits for user namespaces.
    - Add brackets around anonymous union when using designated initializers to
    initialize members in order to please gcc
    Signed-off-by: Eric W. Biederman

    Christian Brauner
     

12 Sep, 2017

1 commit

  • Pull namespace updates from Eric Biederman:
    "Life has been busy and I have not gotten half as much done this round
    as I would have liked. I delayed it so that a minor conflict
    resolution with the mips tree could spend a little time in linux-next
    before I sent this pull request.

    This includes two long delayed user namespace changes from Kirill
    Tkhai. It also includes a very useful change from Serge Hallyn that
    allows the security capability attribute to be used inside of user
    namespaces. The practical effect of this is people can now untar
    tarballs and install rpms in user namespaces. It had been suggested to
    generalize this and encode some of the namespace information
    information in the xattr name. Upon close inspection that makes the
    things that should be hard easy and the things that should be easy
    more expensive.

    Then there is my bugfix/cleanup for signal injection that removes the
    magic encoding of the siginfo union member from the kernel internal
    si_code. The mips folks reported the case where I had used FPE_FIXME
    me is impossible so I have remove FPE_FIXME from mips, while at the
    same time including a return statement in that case to keep gcc from
    complaining about unitialized variables.

    I almost finished the work to get make copy_siginfo_to_user a trivial
    copy to user. The code is available at:

    git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git neuter-copy_siginfo_to_user-v3

    But I did not have time/energy to get the code posted and reviewed
    before the merge window opened.

    I was able to see that the security excuse for just copying fields
    that we know are initialized doesn't work in practice there are buggy
    initializations that don't initialize the proper fields in siginfo. So
    we still sometimes copy unitialized data to userspace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    Introduce v3 namespaced file capabilities
    mips/signal: In force_fcr31_sig return in the impossible case
    signal: Remove kernel interal si_code magic
    fcntl: Don't use ambiguous SIG_POLL si_codes
    prctl: Allow local CAP_SYS_ADMIN changing exe_file
    security: Use user_namespace::level to avoid redundant iterations in cap_capable()
    userns,pidns: Verify the userns for new pid namespaces
    signal/testing: Don't look for __SI_FAULT in userspace
    signal/mips: Document a conflict with SI_USER with SIGFPE
    signal/sparc: Document a conflict with SI_USER with SIGFPE
    signal/ia64: Document a conflict with SI_USER with SIGFPE
    signal/alpha: Document a conflict with SI_USER for SIGTRAP

    Linus Torvalds
     

20 Jul, 2017

1 commit

  • It is pointless and confusing to allow a pid namespace hierarchy and
    the user namespace hierarchy to get out of sync. The owner of a child
    pid namespace should be the owner of the parent pid namespace or
    a descendant of the owner of the parent pid namespace.

    Otherwise it is possible to construct scenarios where a process has a
    capability over a parent pid namespace but does not have the
    capability over a child pid namespace. Which confusingly makes
    permission checks non-transitive.

    It requires use of setns into a pid namespace (but not into a user
    namespace) to create such a scenario.

    Add the function in_userns to help in making this determination.

    v2: Optimized in_userns by using level as suggested
    by: Kirill Tkhai

    Ref: 49f4d8b93ccf ("pidns: Capture the user namespace and filter ns_last_pid")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

01 Jul, 2017

1 commit

  • This marks many critical kernel structures for randomization. These are
    structures that have been targeted in the past in security exploits, or
    contain functions pointers, pointers to function pointer tables, lists,
    workqueues, ref-counters, credentials, permissions, or are otherwise
    sensitive. This initial list was extracted from Brad Spengler/PaX Team's
    code in the last public patch of grsecurity/PaX based on my understanding
    of the code. Changes or omissions from the original code are mine and
    don't reflect the original grsecurity/PaX code.

    Left out of this list is task_struct, which requires special handling
    and will be covered in a subsequent patch.

    Signed-off-by: Kees Cook

    Kees Cook
     

07 Mar, 2017

1 commit

  • Always increment/decrement ucount->count under the ucounts_lock. The
    increments are there already and moving the decrements there means the
    locking logic of the code is simpler. This simplification in the
    locking logic fixes a race between put_ucounts and get_ucounts that
    could result in a use-after-free because the count could go zero then
    be found by get_ucounts and then be freed by put_ucounts.

    A bug presumably this one was found by a combination of syzkaller and
    KASAN. JongWhan Kim reported the syzkaller failure and Dmitry Vyukov
    spotted the race in the code.

    Cc: stable@vger.kernel.org
    Fixes: f6b2db1a3e8d ("userns: Make the count of user namespaces per user")
    Reported-by: JongHwan Kim
    Reported-by: Dmitry Vyukov
    Reviewed-by: Andrei Vagin
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

03 Mar, 2017

2 commits


02 Mar, 2017

1 commit

  • We are going to remove the following header inclusions from :

    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    Fix up a single .h file that got hold of via one of these headers.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

24 Jan, 2017

1 commit

  • This patchset converts inotify to using the newly introduced
    per-userns sysctl infrastructure.

    Currently the inotify instances/watches are being accounted in the
    user_struct structure. This means that in setups where multiple
    users in unprivileged containers map to the same underlying
    real user (i.e. pointing to the same user_struct) the inotify limits
    are going to be shared as well, allowing one user(or application) to exhaust
    all others limits.

    Fix this by switching the inotify sysctls to using the
    per-namespace/per-user limits. This will allow the server admin to
    set sensible global limits, which can further be tuned inside every
    individual user namespace. Additionally, in order to preserve the
    sysctl ABI make the existing inotify instances/watches sysctls
    modify the values of the initial user namespace.

    Signed-off-by: Nikolay Borisov
    Acked-by: Jan Kara
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Nikolay Borisov
     

23 Sep, 2016

2 commits

  • From: Andrey Vagin

    Each namespace has an owning user namespace and now there is not way
    to discover these relationships.

    Pid and user namepaces are hierarchical. There is no way to discover
    parent-child relationships too.

    Why we may want to know relationships between namespaces?

    One use would be visualization, in order to understand the running
    system. Another would be to answer the question: what capability does
    process X have to perform operations on a resource governed by namespace
    Y?

    One more use-case (which usually called abnormal) is checkpoint/restart.
    In CRIU we are going to dump and restore nested namespaces.

    There [1] was a discussion about which interface to choose to determing
    relationships between namespaces.

    Eric suggested to add two ioctl-s [2]:
    > Grumble, Grumble. I think this may actually a case for creating ioctls
    > for these two cases. Now that random nsfs file descriptors are bind
    > mountable the original reason for using proc files is not as pressing.
    >
    > One ioctl for the user namespace that owns a file descriptor.
    > One ioctl for the parent namespace of a namespace file descriptor.

    Here is an implementaions of these ioctl-s.

    $ man man7/namespaces.7
    ...
    Since Linux 4.X, the following ioctl(2) calls are supported for
    namespace file descriptors. The correct syntax is:

    fd = ioctl(ns_fd, ioctl_type);

    where ioctl_type is one of the following:

    NS_GET_USERNS
    Returns a file descriptor that refers to an owning user names‐
    pace.

    NS_GET_PARENT
    Returns a file descriptor that refers to a parent namespace.
    This ioctl(2) can be used for pid and user namespaces. For
    user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
    meaning.

    In addition to generic ioctl(2) errors, the following specific ones
    can occur:

    EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

    EPERM The requested namespace is outside of the current namespace
    scope.

    [1] https://lkml.org/lkml/2016/7/6/158
    [2] https://lkml.org/lkml/2016/7/9/101

    Changes for v2:
    * don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
    outside of the init namespace, so we can return EPERM in this case too.
    > The fewer special cases the easier the code is to get
    > correct, and the easier it is to read. // Eric

    Changes for v3:
    * rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Cc: "Eric W. Biederman"
    Cc: James Bottomley
    Cc: "Michael Kerrisk (man-pages)"
    Cc: "W. Trevor King"
    Cc: Alexander Viro
    Cc: Serge Hallyn

    Eric W. Biederman
     
  • Return -EPERM if an owning user namespace is outside of a process
    current user namespace.

    v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
    This special cases was removed from this version. There is nothing
    outside of init_user_ns, so we can return EPERM.
    v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
    grabs a reference.

    Acked-by: Serge Hallyn
    Signed-off-by: Andrei Vagin
    Signed-off-by: Eric W. Biederman

    Andrey Vagin
     

31 Aug, 2016

1 commit


09 Aug, 2016

9 commits


08 Aug, 2016

1 commit


24 Jun, 2016

1 commit

  • Capability sets attached to files must be ignored except in the
    user namespaces where the mounter is privileged, i.e. s_user_ns
    and its descendants. Otherwise a vector exists for gaining
    privileges in namespaces where a user is not already privileged.

    Add a new helper function, current_in_user_ns(), to test whether a user
    namespace is the same as or a descendant of another namespace.
    Use this helper to determine whether a file's capability set
    should be applied to the caps constructed during exec.

    --EWB Replaced in_userns with the simpler current_in_userns.

    Acked-by: Serge Hallyn
    Signed-off-by: Seth Forshee
    Signed-off-by: Eric W. Biederman

    Seth Forshee
     

18 Dec, 2014

1 commit

  • Pull user namespace related fixes from Eric Biederman:
    "As these are bug fixes almost all of thes changes are marked for
    backporting to stable.

    The first change (implicitly adding MNT_NODEV on remount) addresses a
    regression that was created when security issues with unprivileged
    remount were closed. I go on to update the remount test to make it
    easy to detect if this issue reoccurs.

    Then there are a handful of mount and umount related fixes.

    Then half of the changes deal with the a recently discovered design
    bug in the permission checks of gid_map. Unix since the beginning has
    allowed setting group permissions on files to less than the user and
    other permissions (aka ---rwx---rwx). As the unix permission checks
    stop as soon as a group matches, and setgroups allows setting groups
    that can not later be dropped, results in a situtation where it is
    possible to legitimately use a group to assign fewer privileges to a
    process. Which means dropping a group can increase a processes
    privileges.

    The fix I have adopted is that gid_map is now no longer writable
    without privilege unless the new file /proc/self/setgroups has been
    set to permanently disable setgroups.

    The bulk of user namespace using applications even the applications
    using applications using user namespaces without privilege remain
    unaffected by this change. Unfortunately this ix breaks a couple user
    space applications, that were relying on the problematic behavior (one
    of which was tools/selftests/mount/unprivileged-remount-test.c).

    To hopefully prevent needing a regression fix on top of my security
    fix I rounded folks who work with the container implementations mostly
    like to be affected and encouraged them to test the changes.

    > So far nothing broke on my libvirt-lxc test bed. :-)
    > Tested with openSUSE 13.2 and libvirt 1.2.9.
    > Tested-by: Richard Weinberger

    > Tested on Fedora20 with libvirt 1.2.11, works fine.
    > Tested-by: Chen Hanxiao

    > Ok, thanks - yes, unprivileged lxc is working fine with your kernels.
    > Just to be sure I was testing the right thing I also tested using
    > my unprivileged nsexec testcases, and they failed on setgroup/setgid
    > as now expected, and succeeded there without your patches.
    > Tested-by: Serge Hallyn

    > I tested this with Sandstorm. It breaks as is and it works if I add
    > the setgroups thing.
    > Tested-by: Andy Lutomirski # breaks things as designed :("

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Unbreak the unprivileged remount tests
    userns; Correct the comment in map_write
    userns: Allow setting gid_maps without privilege when setgroups is disabled
    userns: Add a knob to disable setgroups on a per user namespace basis
    userns: Rename id_map_mutex to userns_state_mutex
    userns: Only allow the creator of the userns unprivileged mappings
    userns: Check euid no fsuid when establishing an unprivileged uid mapping
    userns: Don't allow unprivileged creation of gid mappings
    userns: Don't allow setgroups until a gid mapping has been setablished
    userns: Document what the invariant required for safe unprivileged mappings.
    groups: Consolidate the setgroups permission checks
    mnt: Clear mnt_expire during pivot_root
    mnt: Carefully set CL_UNPRIVILEGED in clone_mnt
    mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers.
    umount: Do not allow unmounting rootfs.
    umount: Disallow unprivileged mount force
    mnt: Update unprivileged remount test
    mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount

    Linus Torvalds
     

12 Dec, 2014

1 commit

  • - Expose the knob to user space through a proc file /proc//setgroups

    A value of "deny" means the setgroups system call is disabled in the
    current processes user namespace and can not be enabled in the
    future in this user namespace.

    A value of "allow" means the segtoups system call is enabled.

    - Descendant user namespaces inherit the value of setgroups from
    their parents.

    - A proc file is used (instead of a sysctl) as sysctls currently do
    not allow checking the permissions at open time.

    - Writing to the proc file is restricted to before the gid_map
    for the user namespace is set.

    This ensures that disabling setgroups at a user namespace
    level will never remove the ability to call setgroups
    from a process that already has that ability.

    A process may opt in to the setgroups disable for itself by
    creating, entering and configuring a user namespace or by calling
    setns on an existing user namespace with setgroups disabled.
    Processes without privileges already can not call setgroups so this
    is a noop. Prodcess with privilege become processes without
    privilege when entering a user namespace and as with any other path
    to dropping privilege they would not have the ability to call
    setgroups. So this remains within the bounds of what is possible
    without a knob to disable setgroups permanently in a user namespace.

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

10 Dec, 2014

1 commit

  • setgroups is unique in not needing a valid mapping before it can be called,
    in the case of setgroups(0, NULL) which drops all supplemental groups.

    The design of the user namespace assumes that CAP_SETGID can not actually
    be used until a gid mapping is established. Therefore add a helper function
    to see if the user namespace gid mapping has been established and call
    that function in the setgroups permission check.

    This is part of the fix for CVE-2014-8989, being able to drop groups
    without privilege using user namespaces.

    Cc: stable@vger.kernel.org
    Reviewed-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

05 Dec, 2014

1 commit


09 Aug, 2014

1 commit

  • proc_uid_seq_operations, proc_gid_seq_operations and
    proc_projid_seq_operations are only called in proc_id_map_open with
    seq_open as const struct seq_operations so we can constify the 3
    structures and update proc_id_map_open prototype.

    text data bss dec hex filename
    6817 404 1984 9205 23f5 kernel/user_namespace.o-before
    6913 308 1984 9205 23f5 kernel/user_namespace.o-after

    Signed-off-by: Fabian Frederick
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

24 Sep, 2013

1 commit

  • Add support for per-user_namespace registers of persistent per-UID kerberos
    caches held within the kernel.

    This allows the kerberos cache to be retained beyond the life of all a user's
    processes so that the user's cron jobs can work.

    The kerberos cache is envisioned as a keyring/key tree looking something like:

    struct user_namespace
    \___ .krb_cache keyring - The register
    \___ _krb.0 keyring - Root's Kerberos cache
    \___ _krb.5000 keyring - User 5000's Kerberos cache
    \___ _krb.5001 keyring - User 5001's Kerberos cache
    \___ tkt785 big_key - A ccache blob
    \___ tkt12345 big_key - Another ccache blob

    Or possibly:

    struct user_namespace
    \___ .krb_cache keyring - The register
    \___ _krb.0 keyring - Root's Kerberos cache
    \___ _krb.5000 keyring - User 5000's Kerberos cache
    \___ _krb.5001 keyring - User 5001's Kerberos cache
    \___ tkt785 keyring - A ccache
    \___ krbtgt/REDHAT.COM@REDHAT.COM big_key
    \___ http/REDHAT.COM@REDHAT.COM user
    \___ afs/REDHAT.COM@REDHAT.COM user
    \___ nfs/REDHAT.COM@REDHAT.COM user
    \___ krbtgt/KERNEL.ORG@KERNEL.ORG big_key
    \___ http/KERNEL.ORG@KERNEL.ORG big_key

    What goes into a particular Kerberos cache is entirely up to userspace. Kernel
    support is limited to giving you the Kerberos cache keyring that you want.

    The user asks for their Kerberos cache by:

    krb_cache = keyctl_get_krbcache(uid, dest_keyring);

    The uid is -1 or the user's own UID for the user's own cache or the uid of some
    other user's cache (requires CAP_SETUID). This permits rpc.gssd or whatever to
    mess with the cache.

    The cache returned is a keyring named "_krb." that the possessor can read,
    search, clear, invalidate, unlink from and add links to. Active LSMs get a
    chance to rule on whether the caller is permitted to make a link.

    Each uid's cache keyring is created when it first accessed and is given a
    timeout that is extended each time this function is called so that the keyring
    goes away after a while. The timeout is configurable by sysctl but defaults to
    three days.

    Each user_namespace struct gets a lazily-created keyring that serves as the
    register. The cache keyrings are added to it. This means that standard key
    search and garbage collection facilities are available.

    The user_namespace struct's register goes away when it does and anything left
    in it is then automatically gc'd.

    Signed-off-by: David Howells
    Tested-by: Simo Sorce
    cc: Serge E. Hallyn
    cc: Eric W. Biederman

    David Howells
     

08 Sep, 2013

1 commit

  • Pull namespace changes from Eric Biederman:
    "This is an assorted mishmash of small cleanups, enhancements and bug
    fixes.

    The major theme is user namespace mount restrictions. nsown_capable
    is killed as it encourages not thinking about details that need to be
    considered. A very hard to hit pid namespace exiting bug was finally
    tracked and fixed. A couple of cleanups to the basic namespace
    infrastructure.

    Finally there is an enhancement that makes per user namespace
    capabilities usable as capabilities, and an enhancement that allows
    the per userns root to nice other processes in the user namespace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Kill nsown_capable it makes the wrong thing easy
    capabilities: allow nice if we are privileged
    pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
    userns: Allow PR_CAPBSET_DROP in a user namespace.
    namespaces: Simplify copy_namespaces so it is clear what is going on.
    pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
    sysfs: Restrict mounting sysfs
    userns: Better restrictions on when proc and sysfs can be mounted
    vfs: Don't copy mount bind mounts of /proc//ns/mnt between namespaces
    kernel/nsproxy.c: Improving a snippet of code.
    proc: Restrict mounting the proc filesystem
    vfs: Lock in place mounts from more privileged users

    Linus Torvalds
     

27 Aug, 2013

1 commit

  • Rely on the fact that another flavor of the filesystem is already
    mounted and do not rely on state in the user namespace.

    Verify that the mounted filesystem is not covered in any significant
    way. I would love to verify that the previously mounted filesystem
    has no mounts on top but there are at least the directories
    /proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
    for other filesystems to mount on top of.

    Refactor the test into a function named fs_fully_visible and call that
    function from the mount routines of proc and sysfs. This makes this
    test local to the filesystems involved and the results current of when
    the mounts take place, removing a weird threading of the user
    namespace, the mount namespace and the filesystems themselves.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

09 Aug, 2013

1 commit


27 Mar, 2013

1 commit

  • Only allow unprivileged mounts of proc and sysfs if they are already
    mounted when the user namespace is created.

    proc and sysfs are interesting because they have content that is
    per namespace, and so fresh mounts are needed when new namespaces
    are created while at the same time proc and sysfs have content that
    is shared between every instance.

    Respect the policy of who may see the shared content of proc and sysfs
    by only allowing new mounts if there was an existing mount at the time
    the user namespace was created.

    In practice there are only two interesting cases: proc and sysfs are
    mounted at their usual places, proc and sysfs are not mounted at all
    (some form of mount namespace jail).

    Cc: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

27 Jan, 2013

1 commit

  • When freeing a deeply nested user namespace free_user_ns calls
    put_user_ns on it's parent which may in turn call free_user_ns again.
    When -fno-optimize-sibling-calls is passed to gcc one stack frame per
    user namespace is left on the stack, potentially overflowing the
    kernel stack. CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls
    so we can't count on gcc to optimize this code.

    Remove struct kref and use a plain atomic_t. Making the code more
    flexible and easier to comprehend. Make the loop in free_user_ns
    explict to guarantee that the stack does not overflow with
    CONFIG_FRAME_POINTER enabled.

    I have tested this fix with a simple program that uses unshare to
    create a deeply nested user namespace structure and then calls exit.
    With 1000 nesteuser namespaces before this change running my test
    program causes the kernel to die a horrible death. With 10,000,000
    nested user namespaces after this change my test program runs to
    completion and causes no harm.

    Acked-by: Serge Hallyn
    Pointed-out-by: Vasily Kulikov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman